INRAE, MIAT
May 22, 2025
Get a rough idea of Python programming:
We will NOT be able to code in 1 day.
We will learn a very small subset of the programming functions: non vital methods will not be presented here.
Aim: grasp the “way of thinking” of a programmer, help you in the steepest phase of the learning curve.
There are many forums that can help you solving your bug/problems, even the most stupid ones, once you know how to ask.
Easy to start with Python: no need to compile.
One of the most used programming language now; can be run on all OS.
Easy to read and understand.
Many libraries: BioPython, pyTorch (deep learning), etc.
Pros:
You can type the code, and have the results in the same page.
Can also plot figures!
Cons:
Each cell (in grey) accepts Python code.
You can run the code with the “run” (triangle) button.
You can also copy, paste, and add cells.
Compute 15 times 13.
Type
Then press “run”.
+
, -
, *
, /
//
%
**
Strings of characters (a.k.a text) should be surrounded by '
(quote) or "
(double quote).
is not a valid string of characters.
Some strings cannot be (easily) printed:
"\t"
: tabulation"\n"
: carriage return"\""
: the double quote itselfVariables store information, which can be:
an integer,
a float (decimal numbers),
a string of characters,
and much more.
must start with a letter
cannot contain special characters (eg. .
, $
, etc.) except _
do not include quotes ("a_variable"
is a value, not a variable)
are case-sensitive (a
is different from A
)
should be meaningful (avoid i
, a
, x
, etc.)
can be assigned with =
Create a variable named my_first_variable
, with value 3.5.
Create a second variable, named my_second_variable
, such that its value is twice the previous value.
You can check the content of a variable typing its name in a cell.
Functions:
usually take one or several parameters
usually return one value
log
+
is also a special function, with 2 parameters
Some functions are included in Python because they are compulsory for any programming language (e.g. log
).
You can write your own because you want to compute the same operations several times.
function_name(param1, param2, ...)
print
functionThe function print(var)
prints the content of the variable var
.
It also appends a carriage return.
It is probably the most used function.
Some “special” functions (named methods) have a particular syntax:
They are related to object programming (out of the scope of this training).
Print the maximum of the variables my_first_variable
and my_second_variable
.
Hint: There is a function max
.
The open
function opens a file for reading:
f
is a file variable.
You cannot write to the file.
If you want to open the file for writing, type:
The last parameter, if not mentioned, is, by default "r"
.
The file should be closed when not used anymore (note the “object” syntax):
There is another (better) way to open/close file, which is out of the scope of this training.
You can read the whole text of the file with:
This holds iff f
is a file variable.
Read the file /home/mzytnicki/foo.txt
Print the content of the file.
Use the variable name input_file_name
to store the file name.
Use the variable name file_content
to store its content.
Some prefer writing input_file_name
, other inputFileName
. Matter of style!
You can write the content of text
to a file f
with:
Comments are part of the code which is not run.
They are for the reader only (you!).
They are crucial for remembering the meaning of your code.
They start with #
Read the file /home/mzytnicki/foo.txt
Put the content to foo_copy.txt
.
Use the variable name output_file_name
.
Describe meaning of each part in the comments.
input_file_name = "/home/mzytnicki/foo.txt"
output_file_name = "~/foo_copy.txt"
# We store the content of the input file to file_content
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
# We write file_content to the output file
output_file = open(output_file_name, "w")
output_file.write(file_content)
output_file.close()
There are many operations on strings.
Considering the string s
:
s.upper()
returns the upper case translation,
s.lower()
returns the lower case translation,
s.strip(chars)
returns a new string, with all the leading and ending characters included in chars
removed.
Note
If chars
is not mentioned, all the leading and ending whitespaces (space, tabulation, carriage return) are removed.
There are many other functions, which are out of the scope of this introduction.
Print hello, world
, using the file /home/mzytnicki/foo.txt
.
Texts that are read from files are strings, even though they contain numbers.
The following functions convert a string s
to:
a integer: int(s)
a float: float(s)
Print the double of the number in /home/mzytnicki/number.txt
.
Formatted string is an easy way to combine variables and constant string.
They are often used in the print
function.
f
before the quotes: f"string"
.f"string{variable}"
.Print number in /home/mzytnicki/number.txt
, and its double in the same line.
Lists are ordered lists of elements.
Elements can be integers, floats, strings, or even lists.
You can use the square brackets []
, and commas ,
as separators.
Strings of characters are lists!
You can get the size (number of elements) of the list l
with len(l)
.
a = l2[0] # the list should have at least one element!
l2[2] = 4 # the list should have at least 3 elements!
l1.append(1) # add one element
l1.extend(l2) # add all elements of l2 to l1
Indices start with zero in Python
Negative indices start from the end: l[-1]
is the last element.
Last element in not included in the sub-list.
If the first index is omitted (e.g. l[:3]
), it is by default 0.
If the last index is omitted (e.g. l[2:]
), it is by default -1
.
Let a string named myString
get the value Hello, world!
Print the 3rd letter.
Print the letter w
.
Print world
.
Print Hello
for
loopThe for
loop iterates over the elements of a list.
There is a colon :
at the end of the first line.
An indentation (preferably, 4 spaces) delimits the block where the variable can be used.
Print each letter of Hello, world!
.
Print each letter of Hello, world!
, together with its position (0: H
, 1: e
, etc.)
You have to use a new variable, say i
, which will store the position of each letter.
It should start with 0, and be incremented each time a letter is printed.
n = n + m
can be shortened as n += m
range
functionThe range(n)
function creates a list from 0 to n
(not included).
The range(m, n)
function creates a list from m
to n
.
The range(m, n, s)
function creates a list from m
to n
, with a step (incrementation) of s
.
The range
function is often used in a for
loop.
The list itself is not created, but you can still iterate on it.
Print each letter of Hello, world!
, together with its position (0: H
, 1: e
, etc.), using the range
function.
A string s
can be split into a list of several substrings:
s.split(separator, maxsplit)
separator
is the string that separates the substrings; if omitted, it is any whitespaces.
maxsplit
is the maximum number of splits (extra splits will be merged); it omitted, splits all.
Split the string 01.23.45.67.89
with the dot .
as a separator, and print the double of each token separately.
If f
is a file, the variable l
gets all the lines (with the carriage return) in
Print, line by line, the content of the file /home/mzytnicki/file1.gtf
.
The GTF file stores an annotation: genes, transcripts, exons, CDSs.
General:
#
.Annotations: they are formatted in a tabulation-separated format, with at most 9 fields.
chr1 havana gene 10000 11000 . + . gene_id "ENSG00000000001"; gene_name "ABC"; gene_source "havana"; gene_biotype "gene";
gene
, transcript
, mRNA
, exon
, CDS
, etc.).
if NA0
, 1
, or 2
) if a CDS, otherwise .
Print the first character of each line of /home/mzytnicki/file1.gtf
.
if
statementThe code do stuff
is executed only if the expression is true.
The code do stuff 1
is executed only if the expression is true. The code do stuff 2
is executed only if the expression is false.
if
/elif
statementThe code do stuff 1
is executed only if the expression1 is true. The code do stuff 2
is executed only if the expression1 is false and the expression2 is true. The code do stuff 3
is executed otherwise.
You can add as many elif
as necessary.
Expressions in the if
often are comparisons:
a == b
is true iff a
equals b
(note the double equal sign, the simple equal is the assignment).a != b
is true iff a
is different from b
.a > b
is true iff a
is greater than b
.a >= b
is true iff a
is no less than b
.<
and <=
)Booleans are variables that can take values True
or False
.
Given boolean a
and b
, the following functions are defined:
a and b
a or b
not a
a == b
a != b
If l
is a list:
a in l
is true iff a
is an element of l
a not in l
is true otherwisePrint the file, skipping the comments.
Print all the gene lines.
In the gene lines, the feature should be gene
. You should thus select these lines.
Compute the average gene size, in /home/mzytnicki/file2.gtf
.
In order to compute the average size, you should:
split them into 9 fields (separated by tabulations);
select the gene
lines;
get the start and end positions as int
;
compute the size of each gene;
sum them;
divide the sum by the number of genes.
This means that you should use two variables:
the sum of the sizes of the genes,
the number of genes.
file_name = "/home/mzytnicki/file2.gtf"
sum_of_sizes = 0
n_genes = 0
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Compute gene size
n_genes += 1
start = int(s[3])
end = int(s[4])
sum_of_sizes += end - start + 1
f.close()
print(sum_of_sizes / n_genes)
A list a
can be sorted:
In place: a.sort()
. a
is then sorted
In a function: b = sorted(a)
. a
is not modified, b
is sorted.
You can sort in decreasing order with reverse = False
.
There are more complex ways to sort data, but it is out of the scope of this training.
Compute the median gene size.
In order to compute the median size, you should:
compute the size of each gene;
add them into a list of sizes;
sort this list by increasing sizes;
get the middle one.
This means that you should use one list:
file_name = "/home/mzytnicki/file2.gtf"
gene_sizes = []
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Compute gene size
start = int(s[3])
end = int(s[4])
gene_sizes.append(end - start + 1)
f.close()
# Print the (approximate) median
gene_sizes.sort()
print(gene_sizes[len(gene_sizes) // 2])
Print the list of gene identifiers.
The gene identifiers can be found in the gene_id
tag of the genes.
You should:
select the gene
lines;
split the attributes, using the semicolon as separator;
iterate over the attributes;
split each attribute into key/value;
print the value if the key is gene_id
(you can strip the extra quotes).
file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Get the 8th field, which is the attributes
attributes = s[8]
attributes = attributes.split("; ")
# Look for the gene_id tag
for tag in attributes:
tag = tag.split(" ", 1)
if tag[0] == "gene_id":
gene_id = tag[1].strip("\"")
print(gene_id)
f.close()
The standard definition is
Create a function that prints Hello!
.
Add a parameter to the previous function that greets someone.
Create a function that add 1 to the input.
Create a function that reads the attributes (stored in one string), and returns the gene_id.
Update the code printing the gene_id using this function.
VCF
formatThe Variant Call Format stores the variants found with respect to a reference genome. The format is very flexible. It contains a header (lines starting with #
), and the rest is a tab-delimited file following:
.
).
)chr1 100 rs0001 G C . . AA=G
Print all the lines of /home/mzytnicki/variants.vcf
, except the non-SNPs.
Print all the lines of /home/mzytnicki/variants.vcf
, except the non-biallelic variants.
A dictionary is an extension of a list:
You can check whether key
is in your dict d
with:
You can iterate on the keys of the dictionary d
with
If d
is a dictionary, you can get the list of keys with:
If d
is a dictionary, you can get the number number of inserted elements with:
The file /home/mzytnicki/genes_ranked.txt
contains the genes ranked, from the most abundant to the least abundant (fictitious case).
Get the rank of each gene, on the alphabetically sorted list.
Example input:
gene_C
gene_A
gene_B
Output:
gene_A: 2
gene_B: 3
gene_C: 1
genes = {}
i = 1
input_file = open("/home/mzytnicki/genes_ranked.txt")
for l in input_file:
genes[l.strip()] = i
i += 1
for g in sorted(genes.keys()):
print(f"{g}: {genes[g]}")
Or
Print, for each gene, the number of isoforms.
The genes should be lexicographically ordered.
You should there use a dictionary, which stores, for each gene id, the number of isoforms.
In detail:
select the transcript
lines;
extract the gene id;
if the gene id is not in the dictionary, set the corresponding value to 0;
increment the value of the dictionary;
when the loop is over:
sort the keys of the dictionary;
for each key, print it, together with the corresponding value.
file_name = "/home/mzytnicki/file2.gtf"
n_transcripts = {}
f = open(file_name)
for l in f:
if l[0] != "#": # Skip comments
s = l.split("\t")
if s[2] == "transcript": # Keep transcript lines
gene_id = get_gene_id(s[8]) # The 8th field is the attributes
if gene_id not in n_transcripts:
n_transcripts[gene_id] = 0
n_transcripts[gene_id] += 1
f.close()
for gene in sorted(n_transcripts.keys()):
print(f"{gene}: {n_transcript[gene]}")
We will put the previous program into a file, which we will execute.
Create a new file (click on the +
blue button in the top left, then select Python file
) named n_isoforms.py
.
Open it with the text editor (double-click on in in the left panel).
Copy the previous code and the function, and paste it to the file.
Open a console (click on the +
blue button in the top left, then select `Console).
Run with:
You can now run the program with
We will give the input parameter as an argument of the program.
We have to add a library: additional code that can be useful in some applications.
Here is it sys
(for “system”).
Modify your file, so that it starts with:
The variable sys.argv
is a list, with all the arguments of the program (including the program itself).
Set the input file as an argument of the program.
The program should start with:
In Python (or other modern languages), it is best to use classes.
Classes:
model a concept,
can be accessed using a limited interface,
hide the implementation.
A string can be considered as a class.
A typical implementation is:
class Person:
def __init__(self, name, age):
self.name = name
self.age = age
def greet(self):
print(f"Hi {self.name}!")
A object of a class can be accessed this way:
When writing production code my_project
, the main Python script usually follows this pattern.
Write a Python file moduleA.py
with a function greet
that prints Hi!
.
Write a Python file printHello
with the structure of a main file, that calls the previous function.
moduleA.py
:
printHello
:
From now on, we will apply notions we learned until now.
If time permits.
Create a program that opens the file /home/mzytnicki/file1.fasta
, and prints, for each sequence, its name and its size.
size
, that records the size of the sequence (seen so far).name
, that records the current sequence name.>
, so compute the size of the line, and increment size
.>
line is seen, print the sequence name and the sequence size; store the new name, and reset the size
to 0.To summarize, if you read a line starting with >
it could be:
It is thus convenient to initially set the name
to False
: as soon as the name
is assigned to any string it is truthy
.
#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
sequence_name = False
sequence_size = 0
file = open(file_name)
for line in file:
line = line.strip()
if line[0] == '>':
if sequence_name:
print(f"{sequence_name}: {sequence_size}")
sequence_size = 0
sequence_name = line[1:]
else:
sequence_size += len(line)
f.close()
print(f"{sequence_name}: {sequence_size}")
Create a program that opens the file /home/mzytnicki/file1.fasta
, and prints, for each sequence, the total number of each nucleotides (A, C, G, T, or N).
numbers
, that records the number of each (seen so far).name
, that records the current sequence name.#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
def print_numbers (name, numbers):
print(f"{name}:)
for n in sorted(numbers.keys()):
print(f"\t{n}: {numbers[n]}")
sequence_name = False
sequence_sizes = {}
file = open(file_name)
for line in file:
line = line.strip()
if line[0] == '>':
if sequence_name:
print_numbers(sequence_name, sequence_sizes)
sequence_sizes = {}
sequence_name = line[1:]
else:
for c in line:
if c not in line:
sequence_sizes[c] = 0
sequence_sizes[c] += 1
f.close()
print_numbers(sequence_name, sequence_sizes)
The BED format also is a tabulation-separated format, and it comes with different flavours. The “BED-6” format is
chr start end name score strand
Where:
start
is 0-based,end
is 1-based,gene_id
as the name
,.
(equivalent to NA).#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
file = open(file_name)
for l in file:
l = l.strip()
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
chromosome = s[0]
start = int(s[3]) - 1
end = int(s[4])
strand = s[6]
name = get_gene_id(s[8])
print(f"{chromosome}\t{start}\t{end}\t{name}\t.\t{strand}")
f.close()
The GTF should have a structure, where the gene line is written before its transcript line(s). Any transcript line should refer to the last seen gene line. The transcript line should be written before the exon line(s). Any exon line should refer to the last seen transcript line.
The gene should have a gene_id
attribute, which is also present in the transcript lines. Similarly, the transcript should have a transcript_id
, also present in the exon lines.
If you spot a problem in the structure, you should indicate the line, and the problem.
The for
loop should increment the line number.
You should write a new function, similar to get_gene_id
, which will find the value of the given key.
It should return False
is the key is missing.
You should record current_gene_id
and current_transcript_id
.
In case, the first line does not start with a gene line, set the previous variable to False
.
#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
def get_attribute_value(attributes, key):
attributes = attributes.split(";")
for tag in attributes:
tag = tag.split(" ", 1)
if tag[0] == key:
return tag[1].strip("\"")
return False
file = open(file_name)
line_number = 1
gene_id = False
transcript_id = False
for l in file:
l = l.strip()
if l[0] != "#":
s = l.split("\t")
if s[2] == "gene":
current_gene_id = get_attribute_value(s[8], "gene_id")
if not current_gene_id:
print(f"Line {line_number}: Gene id is missing in gene line.")
elif s[2] == "transcript":
gene_id = get_attribute_value(s[8], "gene_id")
if not gene_id:
print(f"Line {line_number}: Gene id is missing in transcript line.")
elif gene_id != current_gene_id:
print(f"Line {line_number}: Gene id does not match in transcript line.")
current_transcript_id = get_attribute_value(s[8], "transcript_id")
if not current_transcript_id:
print(f"Line {line_number}: Transcript id id is missing in transcript line.")
elif s[2] == "exon":
transcript_id = get_attribute_value(s[8], "transcript_id")
if not transcript_id:
print(f"Line {line_number}: Transcript id is missing in exon line.")
elif transcript_id != current_transcript_id:
print(f"Line {line_number}: Transcript id does not match in exon line.")
line_number += 1
f.close()
variables
main control flow commands: if
, for
files: open, read, write
comments
simple Python programs
libraries
more programming primitives: enumerate
, items
, etc.
other general libraries: os
, etc.
dedicated libraries: BioPython
plots,
object programming,
version control system (git).
A full Python course: https://python.sdv.univ-paris-diderot.fr/cours-python.pdf
A Python CheatSheet: https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf
Sandra Dérozier & Thomas Duigou for “Introduction à Python”: https://sandraderozier.pages.mia.inra.fr/formation-python/supports/00-intro-general.html
The SouthGreen platform for their Python training page: https://southgreenplatform.github.io/trainings/python/
Christophe Klopp for his Python Course: http://genoweb.toulouse.inra.fr/~klopp/BioComp/3.pdf
Please fill in the satisfaction survey: https://sondages.inrae.fr/index.php/84236
This document is CC-BY!