Python for Biology

Genotoul-Bioinfo

INRAE, MIAT

May 22, 2024

Introduction

Schedule

  • Start at 9AM, end at 5PM (roughly)
  • A break in the morning, one in the afternoon.
  • A lunch break of 1 hour around noon.

This training

  • Exercise-driven: we will apply most concepts right away.
  • Free to speak: do not hesitate to interrupt and ask/comment/complain.

Who are we?

  • What do you do?
  • Why do you need Python?

Objectives

  • Get a rough idea of Python programming:

    • main concepts: variables, structures;
    • open a file, read it, write to another file;
    • parse a long file, and extract information.
  • We will NOT be able to code in 1 day.

  • We will learn a very small subset of the programming functions: non vital methods will not be presented here.

  • Aim: grasp the “way of thinking” of a programmer, help you in the steepest phase of the learning curve.

  • There are many forums that can help you solving your bug/problems, even the most stupid ones, once you know how to ask.

Why Python?

  • Easy to start with Python: no need to compile.

  • One of the most used programming language now; can be run on all OS.

  • Easy to read and understand.

  • Many libraries: BioPython, pyTorch (deep learning), etc.

What is programming?

  • Data:
    • simple data: integers, decimals, letters…
    • complex data: vectors, sets, matrices, tables…
    • input/output data: read from file, write results
  • Computation:
    • mathematical operations
    • logical flow: “if this, then do that, otherwise do that”, “repeat this X times”, “repeat this until that happens”, etc.
  • Code:
    • transform concepts into words
    • use of a controlled hierarchy (functions, blocks) and vocabulary (keywords)

The Jupyter Notebook interface

Pros:

  • You can type the code, and have the results in the same page.

  • Can also plot figures!

Cons:

  • Not suitable for production code, though.

Connect to Jupyter Notebook

Jupyter

Using the Jupyter Notebook interface

  • Each cell (in grey) accepts Python code.

  • You can run the code with the “run” (triangle) button.

  • You can also copy, paste, and add cells.

Jupyter

Exercise Basic math operations

Compute 15 times 13.

Solution

Type

15 * 13

Then press “run”.

Basic arithmetic operations

  • Usual +, -, *, /
  • Floor division: //
  • Modulus: %
  • Exponentiation: **

String constants

Strings of characters (a.k.a text) should be surrounded by ' (quote) or " (double quote).

Examples (good)

"A long string"
'A very long string'

Example (bad)

A long string

is not a valid string of characters.

Special strings

Some strings cannot be (easily) printed:

  • "\t": tabulation
  • "\n": carriage return
  • "\"": the double quote itself

Variables

Variables store information, which can be:

  • an integer,

  • a float (decimal numbers),

  • a string of characters,

  • and much more.

Variable names

  • must start with a letter

  • cannot contain special characters (eg. ., $, etc.) except _

  • are case-sensitive (a is different from A)

  • should be meaningful (avoid i, a, x, etc.)

  • can be assigned with =

Exercise using variables

  • Create a variable named my_first_variable, with value 3.5.

  • Create a second variable, named my_second_variable, such that its value is twice the previous value.

Solution

my_first_variable = 3.5
my_second_variable = my_first_variable * 2

Tip

You can check the content of a variable typing its name in a cell.

Functions

Functions:

  • usually take one or several parameters

  • usually return one value

Examples

  • log

  • + is also a special function, with 2 parameters

Why using function?

  • Some functions are included in Python because they are compulsory for any programming language (e.g. log).

  • You can write your own because you want to compute the same operations several times.

How to use functions

  • Exemple: function_name(param1, param2, ...)

The print function

  • The function print(var) prints the content of the variable var.

  • It also appends a carriage return.

  • It is probably the most used function.

Note

Some “special” functions (named methods) have a particular syntax:

variable1.function(variable2, variable3, ...)

They are related to object programming (out of the scope of this training).

Exercise Use a function

Print the maximum of the variables my_first_variable and my_second_variable.

Hint: There is a function max.

Solution

print(max(my_first_variable, my_second_variable))

Opening a file

The open function opens a file for reading:

f = open(file_name, "r")

f is a file variable.

You cannot write to the file.

If you want to open the file for writing, type:

f = open(file_name, "w")

The last parameter, if not mentioned, is, by default "r".

Closing a file

The file should be closed when not used anymore (note the “object” syntax):

f.close()

Note

There is another (better) way to open/close file, which is out of the scope of this training.

Reading the whole content of a file

You can read the whole text of the file with:

text = f.read()

This holds iff f is a file variable.

Exercise Print the content of a file

  • Read the file /home/mzytnicki/foo.txt

  • Print the content of the file.

  • Use the variable name input_file_name to store the file name.

  • Use the variable name file_content to store its content.

Note

Some prefer writing input_file_name, other inputFileName. Matter of style!

Solution

input_file_name = "/home/mzytnicki/foo.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
print(file_content)

Writing to a file

You can write the content of text to a file f with:

f.write(text)

Comments

  • Comments are part of the code which is not run.

  • They are for the reader only (you!).

  • They are crucial for remembering the meaning of your code.

  • They start with #

# This is a comment

Exercise Copy the content of a file

  • Read the file /home/mzytnicki/foo.txt

  • Put the content to ~/foo_copy.txt.

  • Use the variable name output_file_name.

  • Describe meaning of each part in the comments.

Solution

input_file_name = "/home/mzytnicki/foo.txt"
output_file_name = "~/foo_copy.txt"
# We store the content of the input file to file_content
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
# We write file_content to the output file
output_file = open(output_file_name, "w")
output_file.write(file_content)
output_file.close()

String manipulation

There are many operations on strings.

Considering the string s:

  • s.upper() returns the upper case translation,

  • s.lower() returns the lower case translation,

  • s.strip(chars) returns a new string, with all the leading and ending characters included in chars removed.

Note

  • If chars is not mentioned, all the leading and ending whitespaces (space, tabulation, carriage return) are removed.

  • There are many other functions, which are out of the scope of this introduction.

Exercise String manipulation

Print hello, world, using the file /home/mzytnicki/foo.txt.

Solution

input_file_name = "/home/mzytnicki/foo.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
lower_case = file.content.lower()
no_mark = lower_case.strip("!")
print(no_mark)

String conversion

Texts that are read from files are strings, even though they contain number.

The following functions convert a string s to:

  • a integer: int(s)

  • a float: float(s)

Exercise String conversion

Print the double of the number in /home/mzytnicki/number.txt.

Solution

input_file_name = "/home/mzytnicki/number.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
number = int(file_content)
print(number * 2)

Formatted string

Formatted string is an easy way to combine variables and constant string.

They are often used in the print function.

Syntax

  • There is a f before the quotes: f"string".
  • The variables are surrounded by braces: f"string{variable}".

Examples

a = 3
print(f"The value of a is {a}.")

Exercise Formatted string

Print number in /home/mzytnicki/number.txt, and its double in the same line.

Solution

input_file_name = "/home/mzytnicki/number.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
number = int(file_content)
double = number * 2
print(f"The double of {number} is {double}.")

Lists

  • Lists are ordered lists of elements.

  • Elements can be integers, floats, strings, or even lists.

List definition

You can use the square brackets [], and commas , as separators.

l1 = []              # empty string
l2 = [ 2, 3, 3, 5 ]  # list with 4 elements

Notes

  • Strings of characters are lists!

  • You can get the size (number of elements) of the list l with len(l).

Selecting/assigning/adding elements

Syntax

a = l2[0]      # the list should have at least one element!
l2[2] = 4      # the list should have at least 3 elements!
l1.append(1)   # add one element
l1.extend(l2)  # add all elements of l2 to l1

Notes

  • Indices start with zero in Python

  • Negative indices start from the end: l[-1] is the last element.

Selecting a sub-list

b = l[1:3]  # equivalent to [ l[1], l[2] ]

Notes

  • Last element in not included in the sub-list.

  • If the first index is omitted (e.g. l[:3]), it is by default 0.

  • If the last index is omitted (e.g. l[2:]), it is by default -1.

Exercise Lists and strings

  • Let a string named myString get the value Hello, world!

  • Print the 3rd letter.

  • Print the letter w.

  • Print world.

  • Print Hello

Solution

myString = "Hello, world!"
print(myString[2])
print(myString[7])
print(myString[7:-2])
print(myString[:5])

The for loop

The for loop iterates over the elements of a list.

for variable in list:
    # do stuff with variable
    # do stuff again

Notes

  • There is a colon : at the end of the first line.

  • An indentation (preferably, 4 spaces) delimits the block where the variable can be used.

Exercise

Print each letters of Hello, world!.

Solution

    for c in myString:
        print(c)

Exercise

Print each letter of Hello, world!, together with its position (0: H, 1: e, etc.)

Method

You have to use a new variable, say i, which will store the position of each letter.

It should start with 0, and be incremented each time a letter is printed.

Solution

i = 0
for c in myString:
    print(f"{i}: c")
    i = i + 1

Note

n = n + m can be shortened as n += m

Splitting a string

A string s can be split into a list of several substrings:

s.split(separator, maxsplit)
  • separator is the string that separates the substrings; if omitted, it is any whitespaces.

  • maxsplit is the maximum number of splits (extra splits will be merged); it omitted, splits all.

Exercise Splitting a string

Split the string 01.23.45.67.89 with the dot . as a separator, and print the double of each token separately.

Solution

my_string = "01.23.45.67.89"
split_string = my_string.split(".")
for token in split_string:
    print(int(token) * 2)

Reading a file line after line

If f is a file, the variable l gets all the lines (with the carriage return) in

for l in f:
    # do stuff with l

Exercise

Print, line by line, the content of the file /home/mzytnicki/file1.gtf.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name) as f:
for l in f:
    print(l)
f.close()

The GTF file

Aim

The GTF file stores an annotation: genes, transcripts, exons, CDSs.

Format

General:

  • Comment lines start with the dash sign #.
  • Other lines are annotations.

The GTF file – annotation

Annotations: they are formatted in a tabulation-separated format, with at most 9 fields.

chr1  havana  gene  10000 11000  .  +  . gene_id "ENSG00000000001"; gene_name "ABC"; gene_source "havana"; gene_biotype "gene";
  1. the chromosome name
  2. the source of the annotation
  3. the feature (usually, gene, transcript, mRNA, exon, CDS, etc.)
  4. the start position (start with 1, includes the feature first position)
  5. the end position (start with 1, includes the feature first position)
  6. the score (float), or . if NA
  7. the strand
  8. the frame (0, 1, or 2) if a CDS, otherwise .
  9. attributes: semicolon-separated list of tag-value pairs, providing additional information about the feature

Exercise

Print the first character of each line of /home/mzytnicki/file1.gtf.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
for l in f:
    print(l[0])
f.close()

The if statement

The code do stuff is executed only if the expression is true.

    if expression:
        # do stuff

The code do stuff 1 is executed only if the expression is true. The code do stuff 2 is executed only if the expression is false.

    if expression:
        # do stuff 1
    else:
        # do stuff 2

The if/elif statement

The code do stuff 1 is executed only if the expression1 is true. The code do stuff 2 is executed only if the expression1 is false and the expression2 is true. The code do stuff 3 is executed otherwise.

    if expression1:
        # do stuff 1
    elif expression2:
        # do stuff 2
    else:
        # do stuff 3

Note

You can add as many elif as necessary.

Comparisons

Expressions in the if often are comparisons:

  • a == b is true iff a equals b (note the double equal sign, the simple equal is the assignment).
  • a != b is true iff a is different from b.
  • a > b is true iff a is greater than b.
  • a >= b is true iff a is no less than b.
  • (the same for < and <=)

Booleans

Booleans are variables that can take values True or False.

Given boolean a and b, the following functions are defined:

  • a and b
  • a or b
  • not a
  • a == b
  • a != b

Exercise

Print the file, skipping the comments.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
for l in f:
    if l[0] != "#":
        print(l[0])
f.close()

Exercise

Print all the gene lines.

Note

In the gene lines, the feature should be gene. You should thus select these lines.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        if s[2] == "gene":
            print(l)
f.close()

Exercise

Compute the average gene size, in /home/mzytnicki/file2.gtf.

Method

In order to compute the average size, you should:

  • select the gene lines;

  • split them into 9 fields (separated by tabulations);

  • get the start and end positions;

  • compute the size of each gene;

  • sum them;

  • divide the sum by the number of genes.

This means that you should use two variables:

  • the sum of the sizes of the genes,

  • the number of genes.

Solution

file_name = "/home/mzytnicki/file2.gtf"
sum_of_sizes = 0
n_genes      = 0
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Compute gene size
            n_genes      += 1
            start         = int(s[3])
            end           = int(s[4])
            sum_of_sizes += end - start + 1
f.close()
print(sum_of_sizes / n_genes)

List sorting

A list a can be sorted:

  • In place: a.sort(). a is then sorted

  • In a function: b = sorted(a). a is not modified, b is sorted.

Notes

  • You can sort in decreasing order with reverse = False.

  • There are more complex ways to sort data, but it is out of the scope of this training.

Exercise

Compute the median gene size.

Method

In order to compute the median size, you should:

  • compute the size of each gene;

  • add them into a list of sizes;

  • sort this list by increasing sizes;

  • get the middle one.

This means that you should use one list:

  • the distribution of gene sizes.

Solution

file_name  = "/home/mzytnicki/file2.gtf"
gene_sizes = []
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Compute gene size
            start = int(s[3])
            end   = int(s[4])
            gene_sizes.append(end - start + 1)
f.close()
# Print the (approximate) median
gene_sizes.sort()
print(gene_sizes[len(gene_sizes) // 2])

Exercise

Print the list of gene identifiers.

The gene identifiers can be found in the gene_id tag of the genes.

Method

You should:

  • select the gene lines;

  • split the attributes, using the semicolon as separator;

  • iterate over the attributes;

  • split each attribute into key/value;

  • print the value if the key is gene_id (you can strip the extra quotes).

Solution

file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Get the 8th field, which is the attributes
            attributes = s[8]
            attributes = attributes.split(";")
            # Look for the gene_id tag
            for tag in attributes:
                tag = tag.split(" ", 1)
                if tag[0] == "gene_id":
                    gene_id = tag[1].strip("\"")
                    print(gene_id)
f.close()

Functions

The standard definition is

def function_name (parameter1, parameter2):
   # do stuff
   return something

Exercise

Create a function that reads the attributes (stored in one string), and returns the gene_id.

Solution

def get_gene_id(attributes):
    attributes = attributes.split(";")
    for tag in attributes:
        tag = tag.split(" ", 1)
        if tag[0] == "gene_id":
            return tag[1].strip("\"")

Exercise

Update the code printing the gene_id using this function.

Solution

file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Get the 8th field, which is the attributes
            print(get_gene_id(s[8]))
f.close()

Dictionaries

Definition

A dictionary is an extension of a list:

  • the key can by a string, a number, a float, etc.
  • the value too

Syntax

d = {}                   # empty dictionary
d["foo"] = "bar"         # assignment
x = d["foo"]             # "foo" should be an element of d!
d = {"one": 1, "two": 2} # equivalent of d["one"] = 1; d["two"] = 2

Dictionaries — continued

Key check

You can check whether key is in your dict d with:

if key in d:
    # do stuff

Iterate on a dictionary

You can iterate on the keys of the dictionary d with

for k in d:
    # do stuff with k or d[k]

List of keys

If d is a dictionary, you can get the list of keys with:

d.keys()

Exercise Number of isoforms

Print, for each gene, the number of isoforms.

The genes should be lexicographically ordered.

Method

You should there use a dictionary, which stores, for each gene id, the number of isoforms.

In detail:

  • select the transcript lines;

  • extract the gene id;

  • if the gene id is not in the dictionary, set the corresponding value to 0;

  • increment the value of the dictionary;

  • when the loop is over:

    • sort the keys of the dictionary;

    • for each key, print it, together with the corresponding value.

Solution

file_name     = "/home/mzytnicki/file2.gtf"
n_transcripts = {}
f = open(file_name)
for l in f:
    if l[0] != "#":                             # Skip comments
        s = l.split("\t")
        if s[2] == "transcript":                # Keep transcript lines
            gene_id = get_gene_id(s[8])         # The 8th field is the attributes
            if gene_id not in n_transcripts:
                n_transcripts[gene_id] = 0
            n_transcripts[gene_id] += 1
f.close()
for gene in sorted(n_transcripts.keys()):
    print(f"{gene}: {n_transcript[gene]}")

Create/run a Python program

We will put the previous program into a file, which we will execute.

  • Connect to the Genotoul-Bioinfo cluster with ssh (or MobaXterm).

  • Go to you work directory.

  • Call your favorite text editor (such as gedit).

  • Copy the previous code and the function, and paste them to your text editor.

  • Save to n_isoforms.py.

  • Run it:

module load devel/python/Python-3.11.1
python3 n_isoforms.py

Create an executable

  • Add the “shebang” in the first line of your file
#!/usr/bin/env python3
  • Add executable rights in bash
chmod +x n_isoforms.py

You can now run the program with

./n_isoforms.py

Read parameters

We will give the input parameter as an argument of the program.

We have to add a library: additional code that can be useful in some applications.

Here is it sys (for “system”).

Modify your file, so that it starts with:

#!/usr/bin/env python3

import sys

The variable sys.argv is a list, with all the arguments of the program (including the program itself).

Exercise

Set the input file as an argument of the program.

Solution

The program should start with:

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

Other exercises

From now on, we will apply ntions we learnt until now.

If time permits.

Exercise Size of sequences

Create a program that opens the file /home/mzytnicki/file1.fasta, and prints, for each sequence, its name and its size.

Method

  • Use a variable size, that records the size of the sequence (seen so far).
  • Use a variable name, that records the current sequence name.
  • Read the file line by line.
  • If it is the first line, it should start with ‘>’, so just record the name (i.e. remove the first character).
  • The follwing line(s) should not start with a >, so compute the size of the line, and increment size.
  • When the next > line is seen, print the sequence name and the sequence size; store the new name, and reset the size to 0.
  • Also print the last sequence!

To summarize, if you read a line starting with > it could be: - the first line: just store the name, - any other line: print the sequence name and size.

It is thus convenient to initially set the name to False: as soon as the name is assigned to any string it is truthy.

Answer

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

sequence_name = False
sequence_size = 0

file = open(file_name)
for line in file:
    line = line.strip()
    if line[0] == '>':
        if sequence_name:
            print(f"{sequence_name}: {sequence_size}")
        sequence_size = 0
        sequence_name = line[1:]
    else:
        sequence_size += len(line)
f.close()
print(f"{sequence_name}: {sequence_size}")

Exercise GTF to BED

The BED format also is a tabulation-separated format, and it comes with different flavours. The “BED-6” format is

chr  start  end  name  score  strand

Where:

  • each line will be a gene,
  • start is 0-based,
  • end is 1-based,
  • we will use the gene_id as the name,
  • the score will written . (equivalent to NA).

Answer

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

file = open(file_name)
for l in file:
    l = l.strip()
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            chromosome = s[0]
            start      = int(s[3]) - 1
            end        = int(s[4])
            strand     = s[6]
            name       = get_gene_id(s[8])
            print(f"{chromosome}\t{start}\t{end}\t{name}\t.\t{strand}")
f.close()

Exercise Check the GTF structure

The GTF should have a structure, where the gene line is written before its transcript line(s). Any transcript line should refer to the last seen gene line. The transcript line should be written before the exon line(s). Any exon line should refer to the last seen transcript line.

The gene should have a gene_id attribute, which is also present in the transcript lines. Similarly, the transcript should have a transcript_id, also present in the exon lines.

If you spot a problem in the structure, you should indicate the line, and the problem.

Method

  • The for loop should increment the line number.

  • You should write a new function, similar to get_gene_id, which will find the value of the given key.

  • It should return False is the key is missing.

  • You should record current_gene_id and current_transcript_id.

  • In case, the first line does not start with a gene line, set the previous variable to False.

Solution

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

def get_attribute_value(attributes, key):
    attributes = attributes.split(";")
    for tag in attributes:
        tag = tag.split(" ", 1)
        if tag[0] == key:
            return tag[1].strip("\"")
    return False

file = open(file_name)
line_number   = 1
gene_id       = False
transcript_id = False
for l in file:
    l = l.strip()
    if l[0] != "#":
        s = l.split("\t")
        if s[2] == "gene":
            current_gene_id = get_attribute_value(s[8], "gene_id")
            if not current_gene_id:
                print(f"Line {line_number}: Gene id is missing in gene line.")
        elif s[2] == "transcript":
            gene_id = get_attribute_value(s[8], "gene_id")
            if not gene_id:
                print(f"Line {line_number}: Gene id is missing in transcript line.")
            elif gene_id != current_gene_id:
                print(f"Line {line_number}: Gene id does not match in transcript line.")
            current_transcript_id = get_attribute_value(s[8], "transcript_id")
            if not current_transcript_id:
                print(f"Line {line_number}: Transcript id id is missing in transcript line.")
        elif s[2] == "exon":
            transcript_id = get_attribute_value(s[8], "transcript_id")
            if not transcript_id:
                print(f"Line {line_number}: Transcript id is missing in exon line.")
            elif transcript_id != current_transcript_id:
                print(f"Line {line_number}: Transcript id does not match in exon line.")
    line_number += 1
f.close()

Summary — what we learned

  • variables

    • simple variables: integers, floats, strings, booleans
    • complex variables: lists, dictionaries
  • main control flow commands: if, for

  • files: open, read, write

  • comments

  • simple Python programs

  • libraries

What is next?

  • more programming primitives: enumerate, range, etc.

  • other general libraries: os, etc.

  • dedicated libraries: BioPython

  • plots,

  • object programming,

  • version control system (git).

Other useful resources

  • A full Python course: https://python.sdv.univ-paris-diderot.fr/cours-python.pdf

  • A Python CheatSheet: https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf

Thanks to

  • Sandra Dérozier & Thomas Duigou for “Introduction à Python”: https://sandraderozier.pages.mia.inra.fr/formation-python/supports/00-intro-general.html

  • The SouthGreen platform for their Python training page: https://southgreenplatform.github.io/trainings/python/

  • Christophe Klopp for his Python Course: http://genoweb.toulouse.inra.fr/~klopp/BioComp/3.pdf

This document is CC-BY!