INRAE, MIAT
May 22, 2024
Get a rough idea of Python programming:
We will NOT be able to code in 1 day.
We will learn a very small subset of the programming functions: non vital methods will not be presented here.
Aim: grasp the “way of thinking” of a programmer, help you in the steepest phase of the learning curve.
There are many forums that can help you solving your bug/problems, even the most stupid ones, once you know how to ask.
Easy to start with Python: no need to compile.
One of the most used programming language now; can be run on all OS.
Easy to read and understand.
Many libraries: BioPython, pyTorch (deep learning), etc.
Pros:
You can type the code, and have the results in the same page.
Can also plot figures!
Cons:
You need the visu
access (to be asked beforehand). Please find here how to ask.
Start the Jupyter Notebook.
Each cell (in grey) accepts Python code.
You can run the code with the “run” (triangle) button.
You can also copy, paste, and add cells.
Compute 15 times 13.
Type
Then press “run”.
+
, -
, *
, /
//
%
**
Strings of characters (a.k.a text) should be surrounded by '
(quote) or "
(double quote).
is not a valid string of characters.
Some strings cannot be (easily) printed:
"\t"
: tabulation"\n"
: carriage return"\""
: the double quote itselfVariables store information, which can be:
an integer,
a float (decimal numbers),
a string of characters,
and much more.
must start with a letter
cannot contain special characters (eg. .
, $
, etc.) except _
are case-sensitive (a
is different from A
)
should be meaningful (avoid i
, a
, x
, etc.)
can be assigned with =
Create a variable named my_first_variable
, with value 3.5.
Create a second variable, named my_second_variable
, such that its value is twice the previous value.
You can check the content of a variable typing its name in a cell.
Functions:
usually take one or several parameters
usually return one value
log
+
is also a special function, with 2 parameters
Some functions are included in Python because they are compulsory for any programming language (e.g. log
).
You can write your own because you want to compute the same operations several times.
function_name(param1, param2, ...)
print
functionThe function print(var)
prints the content of the variable var
.
It also appends a carriage return.
It is probably the most used function.
Some “special” functions (named methods) have a particular syntax:
They are related to object programming (out of the scope of this training).
Print the maximum of the variables my_first_variable
and my_second_variable
.
Hint: There is a function max
.
The open
function opens a file for reading:
f
is a file variable.
You cannot write to the file.
If you want to open the file for writing, type:
The last parameter, if not mentioned, is, by default "r"
.
The file should be closed when not used anymore (note the “object” syntax):
There is another (better) way to open/close file, which is out of the scope of this training.
You can read the whole text of the file with:
This holds iff f
is a file variable.
Read the file /home/mzytnicki/foo.txt
Print the content of the file.
Use the variable name input_file_name
to store the file name.
Use the variable name file_content
to store its content.
Some prefer writing input_file_name
, other inputFileName
. Matter of style!
You can write the content of text
to a file f
with:
Comments are part of the code which is not run.
They are for the reader only (you!).
They are crucial for remembering the meaning of your code.
They start with #
Read the file /home/mzytnicki/foo.txt
Put the content to ~/foo_copy.txt
.
Use the variable name output_file_name
.
Describe meaning of each part in the comments.
input_file_name = "/home/mzytnicki/foo.txt"
output_file_name = "~/foo_copy.txt"
# We store the content of the input file to file_content
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
# We write file_content to the output file
output_file = open(output_file_name, "w")
output_file.write(file_content)
output_file.close()
There are many operations on strings.
Considering the string s
:
s.upper()
returns the upper case translation,
s.lower()
returns the lower case translation,
s.strip(chars)
returns a new string, with all the leading and ending characters included in chars
removed.
Note
If chars
is not mentioned, all the leading and ending whitespaces (space, tabulation, carriage return) are removed.
There are many other functions, which are out of the scope of this introduction.
Print hello, world
, using the file /home/mzytnicki/foo.txt
.
Texts that are read from files are strings, even though they contain number.
The following functions convert a string s
to:
a integer: int(s)
a float: float(s)
Print the double of the number in /home/mzytnicki/number.txt
.
Formatted string is an easy way to combine variables and constant string.
They are often used in the print
function.
f
before the quotes: f"string"
.f"string{variable}"
.Print number in /home/mzytnicki/number.txt
, and its double in the same line.
Lists are ordered lists of elements.
Elements can be integers, floats, strings, or even lists.
You can use the square brackets []
, and commas ,
as separators.
Strings of characters are lists!
You can get the size (number of elements) of the list l
with len(l)
.
a = l2[0] # the list should have at least one element!
l2[2] = 4 # the list should have at least 3 elements!
l1.append(1) # add one element
l1.extend(l2) # add all elements of l2 to l1
Indices start with zero in Python
Negative indices start from the end: l[-1]
is the last element.
Last element in not included in the sub-list.
If the first index is omitted (e.g. l[:3]
), it is by default 0.
If the last index is omitted (e.g. l[2:]
), it is by default -1
.
Let a string named myString
get the value Hello, world!
Print the 3rd letter.
Print the letter w
.
Print world
.
Print Hello
for
loopThe for
loop iterates over the elements of a list.
There is a colon :
at the end of the first line.
An indentation (preferably, 4 spaces) delimits the block where the variable can be used.
Print each letters of Hello, world!
.
Print each letter of Hello, world!
, together with its position (0: H
, 1: e
, etc.)
You have to use a new variable, say i
, which will store the position of each letter.
It should start with 0, and be incremented each time a letter is printed.
n = n + m
can be shortened as n += m
A string s
can be split into a list of several substrings:
s.split(separator, maxsplit)
separator
is the string that separates the substrings; if omitted, it is any whitespaces.
maxsplit
is the maximum number of splits (extra splits will be merged); it omitted, splits all.
Split the string 01.23.45.67.89
with the dot .
as a separator, and print the double of each token separately.
If f
is a file, the variable l
gets all the lines (with the carriage return) in
Print, line by line, the content of the file /home/mzytnicki/file1.gtf
.
The GTF file stores an annotation: genes, transcripts, exons, CDSs.
General:
#
.Annotations: they are formatted in a tabulation-separated format, with at most 9 fields.
chr1 havana gene 10000 11000 . + . gene_id "ENSG00000000001"; gene_name "ABC"; gene_source "havana"; gene_biotype "gene";
gene
, transcript
, mRNA
, exon
, CDS
, etc.).
if NA0
, 1
, or 2
) if a CDS, otherwise .
Print the first character of each line of /home/mzytnicki/file1.gtf
.
if
statementThe code do stuff
is executed only if the expression is true.
The code do stuff 1
is executed only if the expression is true. The code do stuff 2
is executed only if the expression is false.
if
/elif
statementThe code do stuff 1
is executed only if the expression1 is true. The code do stuff 2
is executed only if the expression1 is false and the expression2 is true. The code do stuff 3
is executed otherwise.
You can add as many elif
as necessary.
Expressions in the if
often are comparisons:
a == b
is true iff a
equals b
(note the double equal sign, the simple equal is the assignment).a != b
is true iff a
is different from b
.a > b
is true iff a
is greater than b
.a >= b
is true iff a
is no less than b
.<
and <=
)Booleans are variables that can take values True
or False
.
Given boolean a
and b
, the following functions are defined:
a and b
a or b
not a
a == b
a != b
Print the file, skipping the comments.
Print all the gene lines.
In the gene lines, the feature should be gene
. You should thus select these lines.
Compute the average gene size, in /home/mzytnicki/file2.gtf
.
In order to compute the average size, you should:
select the gene
lines;
split them into 9 fields (separated by tabulations);
get the start and end positions;
compute the size of each gene;
sum them;
divide the sum by the number of genes.
This means that you should use two variables:
the sum of the sizes of the genes,
the number of genes.
file_name = "/home/mzytnicki/file2.gtf"
sum_of_sizes = 0
n_genes = 0
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Compute gene size
n_genes += 1
start = int(s[3])
end = int(s[4])
sum_of_sizes += end - start + 1
f.close()
print(sum_of_sizes / n_genes)
A list a
can be sorted:
In place: a.sort()
. a
is then sorted
In a function: b = sorted(a)
. a
is not modified, b
is sorted.
You can sort in decreasing order with reverse = False
.
There are more complex ways to sort data, but it is out of the scope of this training.
Compute the median gene size.
In order to compute the median size, you should:
compute the size of each gene;
add them into a list of sizes;
sort this list by increasing sizes;
get the middle one.
This means that you should use one list:
file_name = "/home/mzytnicki/file2.gtf"
gene_sizes = []
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Compute gene size
start = int(s[3])
end = int(s[4])
gene_sizes.append(end - start + 1)
f.close()
# Print the (approximate) median
gene_sizes.sort()
print(gene_sizes[len(gene_sizes) // 2])
Print the list of gene identifiers.
The gene identifiers can be found in the gene_id
tag of the genes.
You should:
select the gene
lines;
split the attributes, using the semicolon as separator;
iterate over the attributes;
split each attribute into key/value;
print the value if the key is gene_id
(you can strip the extra quotes).
file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
# Get the 8th field, which is the attributes
attributes = s[8]
attributes = attributes.split(";")
# Look for the gene_id tag
for tag in attributes:
tag = tag.split(" ", 1)
if tag[0] == "gene_id":
gene_id = tag[1].strip("\"")
print(gene_id)
f.close()
The standard definition is
Create a function that reads the attributes (stored in one string), and returns the gene_id.
Update the code printing the gene_id using this function.
A dictionary is an extension of a list:
You can check whether key
is in your dict d
with:
You can iterate on the keys of the dictionary d
with
If d
is a dictionary, you can get the list of keys with:
Print, for each gene, the number of isoforms.
The genes should be lexicographically ordered.
You should there use a dictionary, which stores, for each gene id, the number of isoforms.
In detail:
select the transcript
lines;
extract the gene id;
if the gene id is not in the dictionary, set the corresponding value to 0;
increment the value of the dictionary;
when the loop is over:
sort the keys of the dictionary;
for each key, print it, together with the corresponding value.
file_name = "/home/mzytnicki/file2.gtf"
n_transcripts = {}
f = open(file_name)
for l in f:
if l[0] != "#": # Skip comments
s = l.split("\t")
if s[2] == "transcript": # Keep transcript lines
gene_id = get_gene_id(s[8]) # The 8th field is the attributes
if gene_id not in n_transcripts:
n_transcripts[gene_id] = 0
n_transcripts[gene_id] += 1
f.close()
for gene in sorted(n_transcripts.keys()):
print(f"{gene}: {n_transcript[gene]}")
We will put the previous program into a file, which we will execute.
Connect to the Genotoul-Bioinfo cluster with ssh (or MobaXterm).
Go to you work
directory.
Call your favorite text editor (such as gedit
).
Copy the previous code and the function, and paste them to your text editor.
Save to n_isoforms.py
.
Run it:
You can now run the program with
We will give the input parameter as an argument of the program.
We have to add a library: additional code that can be useful in some applications.
Here is it sys
(for “system”).
Modify your file, so that it starts with:
The variable sys.argv
is a list, with all the arguments of the program (including the program itself).
Set the input file as an argument of the program.
The program should start with:
From now on, we will apply ntions we learnt until now.
If time permits.
Create a program that opens the file /home/mzytnicki/file1.fasta
, and prints, for each sequence, its name and its size.
size
, that records the size of the sequence (seen so far).name
, that records the current sequence name.>
, so compute the size of the line, and increment size
.>
line is seen, print the sequence name and the sequence size; store the new name, and reset the size
to 0.To summarize, if you read a line starting with >
it could be: - the first line: just store the name, - any other line: print the sequence name and size.
It is thus convenient to initially set the name
to False
: as soon as the name
is assigned to any string it is truthy
.
#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
sequence_name = False
sequence_size = 0
file = open(file_name)
for line in file:
line = line.strip()
if line[0] == '>':
if sequence_name:
print(f"{sequence_name}: {sequence_size}")
sequence_size = 0
sequence_name = line[1:]
else:
sequence_size += len(line)
f.close()
print(f"{sequence_name}: {sequence_size}")
The BED format also is a tabulation-separated format, and it comes with different flavours. The “BED-6” format is
chr start end name score strand
Where:
start
is 0-based,end
is 1-based,gene_id
as the name
,.
(equivalent to NA).#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
file = open(file_name)
for l in file:
l = l.strip()
# Skip comments
if l[0] != "#":
s = l.split("\t")
# Check that line is a gene
if s[2] == "gene":
chromosome = s[0]
start = int(s[3]) - 1
end = int(s[4])
strand = s[6]
name = get_gene_id(s[8])
print(f"{chromosome}\t{start}\t{end}\t{name}\t.\t{strand}")
f.close()
The GTF should have a structure, where the gene line is written before its transcript line(s). Any transcript line should refer to the last seen gene line. The transcript line should be written before the exon line(s). Any exon line should refer to the last seen transcript line.
The gene should have a gene_id
attribute, which is also present in the transcript lines. Similarly, the transcript should have a transcript_id
, also present in the exon lines.
If you spot a problem in the structure, you should indicate the line, and the problem.
The for
loop should increment the line number.
You should write a new function, similar to get_gene_id
, which will find the value of the given key.
It should return False
is the key is missing.
You should record current_gene_id
and current_transcript_id
.
In case, the first line does not start with a gene line, set the previous variable to False
.
#!/usr/bin/env python3
import sys
file_name = sys.argv[1]
def get_attribute_value(attributes, key):
attributes = attributes.split(";")
for tag in attributes:
tag = tag.split(" ", 1)
if tag[0] == key:
return tag[1].strip("\"")
return False
file = open(file_name)
line_number = 1
gene_id = False
transcript_id = False
for l in file:
l = l.strip()
if l[0] != "#":
s = l.split("\t")
if s[2] == "gene":
current_gene_id = get_attribute_value(s[8], "gene_id")
if not current_gene_id:
print(f"Line {line_number}: Gene id is missing in gene line.")
elif s[2] == "transcript":
gene_id = get_attribute_value(s[8], "gene_id")
if not gene_id:
print(f"Line {line_number}: Gene id is missing in transcript line.")
elif gene_id != current_gene_id:
print(f"Line {line_number}: Gene id does not match in transcript line.")
current_transcript_id = get_attribute_value(s[8], "transcript_id")
if not current_transcript_id:
print(f"Line {line_number}: Transcript id id is missing in transcript line.")
elif s[2] == "exon":
transcript_id = get_attribute_value(s[8], "transcript_id")
if not transcript_id:
print(f"Line {line_number}: Transcript id is missing in exon line.")
elif transcript_id != current_transcript_id:
print(f"Line {line_number}: Transcript id does not match in exon line.")
line_number += 1
f.close()
variables
main control flow commands: if
, for
files: open, read, write
comments
simple Python programs
libraries
more programming primitives: enumerate
, range
, etc.
other general libraries: os
, etc.
dedicated libraries: BioPython
plots,
object programming,
version control system (git).
A full Python course: https://python.sdv.univ-paris-diderot.fr/cours-python.pdf
A Python CheatSheet: https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf
Sandra Dérozier & Thomas Duigou for “Introduction à Python”: https://sandraderozier.pages.mia.inra.fr/formation-python/supports/00-intro-general.html
The SouthGreen platform for their Python training page: https://southgreenplatform.github.io/trainings/python/
Christophe Klopp for his Python Course: http://genoweb.toulouse.inra.fr/~klopp/BioComp/3.pdf
This document is CC-BY!