Python for Biology

Introduction

Schedule

Start at 9AM, end at 5PM (roughly)
A break in the morning, one in the afternoon.
A lunch break of 1 hour around noon.

This training

Exercise-driven: we will apply most concepts right away.
Free to speak: do not hesitate to interrupt and ask/comment/complain.

Who are we?

What do you do?
Why do you need Python?

Objectives

Get a rough idea of Python programming:
- main concepts: variables, structures;
- open a file, read it, write to another file;
- parse a long file, and extract information.
We will NOT be able to code in 1 day.
We will learn a very small subset of the programming functions: non vital methods will not be presented here.
Aim: grasp the “way of thinking” of a programmer, help you in the steepest phase of the learning curve.
There are many forums that can help you solving your bug/problems, even the most stupid ones, once you know how to ask.

Why Python?

Easy to start with Python: no need to compile.
One of the most used programming language now; can be run on all OS.
Easy to read and understand.
Many libraries: BioPython, pyTorch (deep learning), etc.

What is programming?

Data:
- simple data: integers, decimals, letters…
- complex data: vectors, sets, matrices, tables…
- input/output data: read from file, write results
Computation:
- mathematical operations
- logical flow: “if this, then do that, otherwise do that”, “repeat this X times”, “repeat this until that happens”, etc.
Code:
- transform concepts into words
- use of a controlled hierarchy (functions, blocks) and vocabulary (keywords)

The Jupyter Notebook interface

Pros:

You can type the code, and have the results in the same page.
Can also plot figures!

Cons:

Not suitable for production code, though.

Connect to Jupyter Notebook

Start the Jupyter Notebook.

Jupyter

Using the Jupyter Notebook interface

Each cell (in grey) accepts Python code.
You can run the code with the “run” (triangle) button.
You can also copy, paste, and add cells.

Jupyter

Exercise Basic math operations

Compute 15 times 13.

Solution

Type

15 * 13

Then press “run”.

Basic arithmetic operations

Usual +, -, *, /
Floor division: //
Modulus: %
Exponentiation: **

String constants

Strings of characters (a.k.a text) should be surrounded by ' (quote) or " (double quote).

Examples (good)

"A long string"
'A very long string'

Example (bad)

A long string

is not a valid string of characters.

Special strings

Some strings cannot be (easily) printed:

"\t": tabulation
"\n": carriage return
"\"": the double quote itself

Variables

Variables store information, which can be:

an integer,
a float (decimal numbers),
a string of characters,
and much more.

Variable names

must start with a letter
cannot contain special characters (eg. ., $, etc.) except _
do not include quotes ("a_variable" is a value, not a variable)
are case-sensitive (a is different from A)
should be meaningful (avoid i, a, x, etc.)
can be assigned with =

Exercise using variables

Create a variable named my_first_variable, with value 3.5.
Create a second variable, named my_second_variable, such that its value is twice the previous value.

Solution

my_first_variable = 3.5
my_second_variable = my_first_variable * 2

Tip

You can check the content of a variable typing its name in a cell.

Functions

Functions:

usually take one or several parameters
usually return one value

Examples

log
+ is also a special function, with 2 parameters

Why using function?

Some functions are included in Python because they are compulsory for any programming language (e.g. log).
You can write your own because you want to compute the same operations several times.

How to use functions

Exemple: function_name(param1, param2, ...)

The `print` function

The function print(var) prints the content of the variable var.
It also appends a carriage return.
It is probably the most used function.

Note

Some “special” functions (named methods) have a particular syntax:

variable1.function(variable2, variable3, ...)

They are related to object programming (out of the scope of this training).

Exercise Use a function

Print the maximum of the variables my_first_variable and my_second_variable.

Hint: There is a function max.

Solution

print(max(my_first_variable, my_second_variable))

Opening a file

The open function opens a file for reading:

f = open(file_name, "r")

f is a file variable.

You cannot write to the file.

If you want to open the file for writing, type:

f = open(file_name, "w")

The last parameter, if not mentioned, is, by default "r".

Closing a file

The file should be closed when not used anymore (note the “object” syntax):

f.close()

Note

There is another (better) way to open/close file, which is out of the scope of this training.

Reading the whole content of a file

You can read the whole text of the file with:

text = f.read()

This holds iff f is a file variable.

Exercise Print the content of a file

Read the file /home/mzytnicki/foo.txt
Print the content of the file.
Use the variable name input_file_name to store the file name.
Use the variable name file_content to store its content.

Note

Some prefer writing input_file_name, other inputFileName. Matter of style!

Solution

input_file_name = "/home/mzytnicki/foo.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
print(file_content)

Writing to a file

You can write the content of text to a file f with:

f.write(text)

Comments

Comments are part of the code which is not run.
They are for the reader only (you!).
They are crucial for remembering the meaning of your code.
They start with #

# This is a comment

Exercise Copy the content of a file

Read the file /home/mzytnicki/foo.txt
Put the content to foo_copy.txt.
Use the variable name output_file_name.
Describe meaning of each part in the comments.

Solution

input_file_name = "/home/mzytnicki/foo.txt"
output_file_name = "~/foo_copy.txt"
# We store the content of the input file to file_content
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
# We write file_content to the output file
output_file = open(output_file_name, "w")
output_file.write(file_content)
output_file.close()

String manipulation

There are many operations on strings.

Considering the string s:

s.upper() returns the upper case translation,
s.lower() returns the lower case translation,
s.strip(chars) returns a new string, with all the leading and ending characters included in chars removed.

Note

If chars is not mentioned, all the leading and ending whitespaces (space, tabulation, carriage return) are removed.
There are many other functions, which are out of the scope of this introduction.

Exercise String manipulation

Print hello, world, using the file /home/mzytnicki/foo.txt.

Solution

input_file_name = "/home/mzytnicki/foo.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
lower_case = file.content.lower()
no_mark = lower_case.strip("!\n")
print(no_mark)

String conversion

Texts that are read from files are strings, even though they contain numbers.

The following functions convert a string s to:

a integer: int(s)
a float: float(s)

Exercise String conversion

Print the double of the number in /home/mzytnicki/number.txt.

Solution

input_file_name = "/home/mzytnicki/number.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
number = int(file_content)
print(number * 2)

Formatted string

Formatted string is an easy way to combine variables and constant string.

They are often used in the print function.

Syntax

There is a f before the quotes: f"string".
The variables are surrounded by braces: f"string{variable}".

Examples

a = 3
print(f"The value of a is {a}.")

Exercise Formatted string

Print number in /home/mzytnicki/number.txt, and its double in the same line.

Solution

input_file_name = "/home/mzytnicki/number.txt"
input_file = open(input_file_name)
file_content = input_file.read()
input_file.close()
number = int(file_content)
double = number * 2
print(f"The double of {number} is {double}.")

Lists

Lists are ordered lists of elements.
Elements can be integers, floats, strings, or even lists.

List definition

You can use the square brackets [], and commas , as separators.

l1 = []              # empty string
l2 = [ 2, 3, 3, 5 ]  # list with 4 elements

Notes

Strings of characters are lists!
You can get the size (number of elements) of the list l with len(l).

Selecting/assigning/adding elements

Syntax

a = l2[0]      # the list should have at least one element!
l2[2] = 4      # the list should have at least 3 elements!
l1.append(1)   # add one element
l1.extend(l2)  # add all elements of l2 to l1

Notes

Indices start with zero in Python
Negative indices start from the end: l[-1] is the last element.

Selecting a sub-list

b = l[1:3]  # equivalent to [ l[1], l[2] ]

Notes

Last element in not included in the sub-list.
If the first index is omitted (e.g. l[:3]), it is by default 0.
If the last index is omitted (e.g. l[2:]), it is by default -1.

Exercise Lists and strings

Let a string named myString get the value Hello, world!
Print the 3rd letter.
Print the letter w.
Print world.
Print Hello

Solution

myString = "Hello, world!"
print(myString[2])
print(myString[7])
print(myString[7:-1])
print(myString[:5])

The `for` loop

The for loop iterates over the elements of a list.

for variable in list:
    # do stuff with variable
    # do stuff again

Notes

There is a colon : at the end of the first line.
An indentation (preferably, 4 spaces) delimits the block where the variable can be used.

Exercise

Print each letter of Hello, world!.

Solution

    for c in myString:
        print(c)

Exercise

Print each letter of Hello, world!, together with its position (0: H, 1: e, etc.)

Method

You have to use a new variable, say i, which will store the position of each letter.

It should start with 0, and be incremented each time a letter is printed.

Solution

i = 0
for c in myString:
    print(f"{i}: {c}")
    i = i + 1

Note

n = n + m can be shortened as n += m

The `range` function

The range(n) function creates a list from 0 to n (not included).
The range(m, n) function creates a list from m to n.
The range(m, n, s) function creates a list from m to n, with a step (incrementation) of s.

Notes

The range function is often used in a for loop.
The list itself is not created, but you can still iterate on it.

Exercise

Print each letter of Hello, world!, together with its position (0: H, 1: e, etc.), using the range function.

Solution

for i in range(len(myString)):
    print(f"{i}: {myString[i]}")

Splitting a string

A string s can be split into a list of several substrings:

s.split(separator, maxsplit)

separator is the string that separates the substrings; if omitted, it is any whitespaces.
maxsplit is the maximum number of splits (extra splits will be merged); it omitted, splits all.

Exercise Splitting a string

Split the string 01.23.45.67.89 with the dot . as a separator, and print the double of each token separately.

Solution

my_string = "01.23.45.67.89"
split_string = my_string.split(".")
for token in split_string:
    print(int(token) * 2)

Reading a file line after line

If f is a file, the variable l gets all the lines (with the carriage return) in

for l in f:
    # do stuff with l

Exercise

Print, line by line, the content of the file /home/mzytnicki/file1.gtf.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name) as f:
for l in f:
    print(l)
f.close()

The GTF file

Aim

The GTF file stores an annotation: genes, transcripts, exons, CDSs.

Format

General:

Comment lines start with the hash sign #.
Other lines are annotations.

The GTF file – annotation

Annotations: they are formatted in a tabulation-separated format, with at most 9 fields.

chr1  havana  gene  10000 11000  .  +  . gene_id "ENSG00000000001"; gene_name "ABC"; gene_source "havana"; gene_biotype "gene";

the chromosome name
the source of the annotation
the feature (usually, gene, transcript, mRNA, exon, CDS, etc.)
the start position (start with 1, includes the feature first position)
the end position (start with 1, includes the feature first position)
the score (float), or . if NA
the strand
the frame (0, 1, or 2) if a CDS, otherwise .
attributes: semicolon-separated list of tag-value pairs, providing additional information about the feature

Exercise

Print the first character of each line of /home/mzytnicki/file1.gtf.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
for l in f:
    print(l[0])
f.close()

The `if` statement

The code do stuff is executed only if the expression is true.

    if expression:
        # do stuff

The code do stuff 1 is executed only if the expression is true. The code do stuff 2 is executed only if the expression is false.

    if expression:
        # do stuff 1
    else:
        # do stuff 2

The `if`/`elif` statement

The code do stuff 1 is executed only if the expression1 is true. The code do stuff 2 is executed only if the expression1 is false and the expression2 is true. The code do stuff 3 is executed otherwise.

    if expression1:
        # do stuff 1
    elif expression2:
        # do stuff 2
    else:
        # do stuff 3

Note

You can add as many elif as necessary.

Comparisons

Expressions in the if often are comparisons:

a == b is true iff a equals b (note the double equal sign, the simple equal is the assignment).
a != b is true iff a is different from b.
a > b is true iff a is greater than b.
a >= b is true iff a is no less than b.
(the same for < and <=)

Booleans

Booleans are variables that can take values True or False.

Given boolean a and b, the following functions are defined:

a and b
a or b
not a
a == b
a != b

If l is a list:

a in l is true iff a is an element of l
a not in l is true otherwise

Exercise

Print the file, skipping the comments.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
for l in f:
    if l[0] != "#":
        print(l[0])
f.close()

Exercise

Print all the gene lines.

Note

In the gene lines, the feature should be gene. You should thus select these lines.

Solution

file_name = "/home/mzytnicki/file1.gtf"
f = open(file_name)
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        if s[2] == "gene":
            print(l)
f.close()

Exercise

Compute the average gene size, in /home/mzytnicki/file2.gtf.

Method

In order to compute the average size, you should:

split them into 9 fields (separated by tabulations);
select the gene lines;
get the start and end positions as int;
compute the size of each gene;
sum them;
divide the sum by the number of genes.

This means that you should use two variables:

the sum of the sizes of the genes,
the number of genes.

Solution

file_name = "/home/mzytnicki/file2.gtf"
sum_of_sizes = 0
n_genes      = 0
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Compute gene size
            n_genes      += 1
            start         = int(s[3])
            end           = int(s[4])
            sum_of_sizes += end - start + 1
f.close()
print(sum_of_sizes / n_genes)

List sorting

A list a can be sorted:

In place: a.sort(). a is then sorted
In a function: b = sorted(a). a is not modified, b is sorted.

Notes

You can sort in decreasing order with reverse = False.
There are more complex ways to sort data, but it is out of the scope of this training.

Exercise

Compute the median gene size.

Method

In order to compute the median size, you should:

compute the size of each gene;
add them into a list of sizes;
sort this list by increasing sizes;
get the middle one.

This means that you should use one list:

the distribution of gene sizes.

Solution

file_name  = "/home/mzytnicki/file2.gtf"
gene_sizes = []
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Compute gene size
            start = int(s[3])
            end   = int(s[4])
            gene_sizes.append(end - start + 1)
f.close()
# Print the (approximate) median
gene_sizes.sort()
print(gene_sizes[len(gene_sizes) // 2])

Exercise

Print the list of gene identifiers.

The gene identifiers can be found in the gene_id tag of the genes.

Method

You should:

select the gene lines;
split the attributes, using the semicolon as separator;
iterate over the attributes;
split each attribute into key/value;
print the value if the key is gene_id (you can strip the extra quotes).

Solution

file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Get the 8th field, which is the attributes
            attributes = s[8]
            attributes = attributes.split("; ")
            # Look for the gene_id tag
            for tag in attributes:
                tag = tag.split(" ", 1)
                if tag[0] == "gene_id":
                    gene_id = tag[1].strip("\"")
                    print(gene_id)
f.close()

Functions

The standard definition is

def function_name (parameter1, parameter2):
   # do stuff
   return something

Exercise

Create a function that prints Hello!.

Solution

def printHello ():
    print("Hello!")

Exercise

Add a parameter to the previous function that greets someone.

Solution

def printHello (name):
    print(f"Hello {name}!")

Exercise

Create a function that add 1 to the input.

Solution

def plusOne (i):
    return i + 1

Exercise

Create a function that reads the attributes (stored in one string), and returns the gene_id.

Solution

def get_gene_id(attributes):
    attributes = attributes.split(";")
    for tag in attributes:
        tag = tag.split(" ", 1)
        if tag[0] == "gene_id":
            return tag[1].strip("\"")

Exercise

Update the code printing the gene_id using this function.

Solution

file_name = "/home/mzytnicki/file2.gtf"
f = open(file_name)
for l in f:
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            # Get the 8th field, which is the attributes
            print(get_gene_id(s[8]))
f.close()

The `VCF` format

The Variant Call Format stores the variants found with respect to a reference genome. The format is very flexible. It contains a header (lines starting with #), and the rest is a tab-delimited file following:

the chromosome
the position
an identifier
the reference allele
the alternative allele(s)
the quality of the alleles (or .)
a set of filters (or .)
other info
(optional) sample formats
(optional, several) sample values

chr1  100 rs0001    G       C       .       .       AA=G

Exercise Filter SNPs

Print all the lines of /home/mzytnicki/variants.vcf, except the non-SNPs.

Solution

file_name = "/home/mzytnicki/variants.vcf"
f = open(file_name)
for l in f:
    # Print comments
    if l[0] == "#":
        print(l)
    else:
        s = l.split("\t")
        if (len(s[3]) == 1) and (len(s[4]) == 1):
            print(l)
f.close()

Exercise Filter biallelic variants

Print all the lines of /home/mzytnicki/variants.vcf, except the non-biallelic variants.

Solution

file_name = "/home/mzytnicki/variants.vcf"
f = open(file_name)
for l in f:
    # Print comments
    if l[0] == "#":
        print(l)
    else:
        s = l.split("\t")
        if "," not in s[4]:
            print(l)
f.close()

Dictionaries

Definition

A dictionary is an extension of a list:

the key can by a string, a number, a float, etc.
the value too

Syntax

d = {}                   # empty dictionary
d["foo"] = "bar"         # assignment (may replace a previous value)
x = d["foo"]             # "foo" should be an element of d!
d = {"one": 1, "two": 2} # equivalent of d["one"] = 1; d["two"] = 2

Dictionaries — continued

Key check

You can check whether key is in your dict d with:

if key in d:
    # do stuff

Iterate on a dictionary

You can iterate on the keys of the dictionary d with

for k in d:
    # do stuff with k or d[k]

List of keys

If d is a dictionary, you can get the list of keys with:

d.keys()

Number of elements

If d is a dictionary, you can get the number number of inserted elements with:

len(d)

Exercise Ranked list of genes

The file /home/mzytnicki/genes_ranked.txt contains the genes ranked, from the most abundant to the least abundant (fictitious case).

Get the rank of each gene, on the alphabetically sorted list.

Example input:

gene_C
gene_A
gene_B

Output:

gene_A: 2
gene_B: 3
gene_C: 1

Solution

genes = {}
i     = 1
input_file = open("/home/mzytnicki/genes_ranked.txt")
for l in input_file:
    genes[l.strip()] = i
    i += 1
for g in sorted(genes.keys()):
    print(f"{g}: {genes[g]}")

Or

genes = {}
input_file = open("/home/mzytnicki/genes_ranked.txt")
for l in input_file:
    genes[l.strip()] = len(genes) + 1
for g in sorted(genes.keys()):
    print(f"{g}: {genes[g]}")

Exercise Number of isoforms

Print, for each gene, the number of isoforms.

The genes should be lexicographically ordered.

Method

You should there use a dictionary, which stores, for each gene id, the number of isoforms.

In detail:

select the transcript lines;
extract the gene id;
if the gene id is not in the dictionary, set the corresponding value to 0;
increment the value of the dictionary;
when the loop is over:
- sort the keys of the dictionary;
- for each key, print it, together with the corresponding value.

Solution

file_name     = "/home/mzytnicki/file2.gtf"
n_transcripts = {}
f = open(file_name)
for l in f:
    if l[0] != "#":                             # Skip comments
        s = l.split("\t")
        if s[2] == "transcript":                # Keep transcript lines
            gene_id = get_gene_id(s[8])         # The 8th field is the attributes
            if gene_id not in n_transcripts:
                n_transcripts[gene_id] = 0
            n_transcripts[gene_id] += 1
f.close()
for gene in sorted(n_transcripts.keys()):
    print(f"{gene}: {n_transcript[gene]}")

Create/run a Python program

We will put the previous program into a file, which we will execute.

Create a new file (click on the + blue button in the top left, then select Python file) named n_isoforms.py.
Open it with the text editor (double-click on in in the left panel).
Copy the previous code and the function, and paste it to the file.
Open a console (click on the + blue button in the top left, then select `Console).
Run with:

cd
python3 n_isoforms.py

Create an executable

Add the “shebang” in the first line of your file

#!/usr/bin/env python3

Add executable rights in bash

chmod +x n_isoforms.py

You can now run the program with

./n_isoforms.py

Read parameters

We will give the input parameter as an argument of the program.

We have to add a library: additional code that can be useful in some applications.

Here is it sys (for “system”).

Modify your file, so that it starts with:

#!/usr/bin/env python3

import sys

The variable sys.argv is a list, with all the arguments of the program (including the program itself).

Exercise

Set the input file as an argument of the program.

Solution

The program should start with:

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

Understanding the structure of a Python script/project

Objects

In Python (or other modern languages), it is best to use classes.

Classes:

model a concept,
can be accessed using a limited interface,
hide the implementation.

A string can be considered as a class.

Understanding the structure of a Python script/project

Objects

A typical implementation is:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def greet(self):
        print(f"Hi {self.name}!")

A object of a class can be accessed this way:

p1 = Person("Matthias", 44)
p1.greet()

Understanding the structure of a Python script/project

The main script

When writing production code my_project, the main Python script usually follows this pattern.

#!/usr/bin/env python3

import sys         # load typical external libraries
import os
import argparse

import moduleA  # load typical internal libraries in the moduleA file
from moduleB import functionB

# Functions and variables for parsing parameters
# ...

if __name__ == "__main__":
    # real stuff starts here

Exercise A Python structure

Write a Python file moduleA.py with a function greet that prints Hi!.
Write a Python file printHello with the structure of a main file, that calls the previous function.

Solution

moduleA.py:

def greet():
    print("Hi!")

printHello:

#!/usr/bin/env python3

import moduleA

if __name__ == "__main__":
    moduleA.greet()

Other exercises

From now on, we will apply notions we learned until now.

If time permits.

Exercise Size of sequences

Create a program that opens the file /home/mzytnicki/file1.fasta, and prints, for each sequence, its name and its size.

Method

Use a variable size, that records the size of the sequence (seen so far).
Use a variable name, that records the current sequence name.
Read the file line by line.
If it is the first line, it should start with ‘>’, so just record the name (i.e. remove the first character).
The follwing line(s) should not start with a >, so compute the size of the line, and increment size.
When the next > line is seen, print the sequence name and the sequence size; store the new name, and reset the size to 0.
Also print the last sequence!

Method

To summarize, if you read a line starting with > it could be:

the first line: just store the name,
any other line: print the sequence name and size.

It is thus convenient to initially set the name to False: as soon as the name is assigned to any string it is truthy.

Answer

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

sequence_name = False
sequence_size = 0

file = open(file_name)
for line in file:
    line = line.strip()
    if line[0] == '>':
        if sequence_name:
            print(f"{sequence_name}: {sequence_size}")
        sequence_size = 0
        sequence_name = line[1:]
    else:
        sequence_size += len(line)
f.close()
print(f"{sequence_name}: {sequence_size}")

Exercise Number of different nucleotides per sequence

Create a program that opens the file /home/mzytnicki/file1.fasta, and prints, for each sequence, the total number of each nucleotides (A, C, G, T, or N).

Method

Use a dict numbers, that records the number of each (seen so far).
Use a variable name, that records the current sequence name.
Conveniently use a function to print the number of nucleotides.

Answer

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

def print_numbers (name, numbers):
    print(f"{name}:)
    for n in sorted(numbers.keys()):
        print(f"\t{n}: {numbers[n]}") 

sequence_name = False
sequence_sizes = {}

file = open(file_name)
for line in file:
    line = line.strip()
    if line[0] == '>':
        if sequence_name:
            print_numbers(sequence_name, sequence_sizes)
        sequence_sizes = {}
        sequence_name = line[1:]
    else:
        for c in line:
            if c not in line:
                sequence_sizes[c] = 0
            sequence_sizes[c] += 1
f.close()
print_numbers(sequence_name, sequence_sizes)

Exercise GTF to BED

The BED format also is a tabulation-separated format, and it comes with different flavours. The “BED-6” format is

chr  start  end  name  score  strand

Where:

each line will be a gene,
start is 0-based,
end is 1-based,
we will use the gene_id as the name,
the score will written . (equivalent to NA).

Answer

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

file = open(file_name)
for l in file:
    l = l.strip()
    # Skip comments
    if l[0] != "#":
        s = l.split("\t")
        # Check that line is a gene
        if s[2] == "gene":
            chromosome = s[0]
            start      = int(s[3]) - 1
            end        = int(s[4])
            strand     = s[6]
            name       = get_gene_id(s[8])
            print(f"{chromosome}\t{start}\t{end}\t{name}\t.\t{strand}")
f.close()

Exercise Check the GTF structure

The GTF should have a structure, where the gene line is written before its transcript line(s). Any transcript line should refer to the last seen gene line. The transcript line should be written before the exon line(s). Any exon line should refer to the last seen transcript line.

The gene should have a gene_id attribute, which is also present in the transcript lines. Similarly, the transcript should have a transcript_id, also present in the exon lines.

If you spot a problem in the structure, you should indicate the line, and the problem.

Method

The for loop should increment the line number.
You should write a new function, similar to get_gene_id, which will find the value of the given key.
It should return False is the key is missing.
You should record current_gene_id and current_transcript_id.
In case, the first line does not start with a gene line, set the previous variable to False.

Solution

#!/usr/bin/env python3

import sys

file_name = sys.argv[1]

def get_attribute_value(attributes, key):
    attributes = attributes.split(";")
    for tag in attributes:
        tag = tag.split(" ", 1)
        if tag[0] == key:
            return tag[1].strip("\"")
    return False

file = open(file_name)
line_number   = 1
gene_id       = False
transcript_id = False
for l in file:
    l = l.strip()
    if l[0] != "#":
        s = l.split("\t")
        if s[2] == "gene":
            current_gene_id = get_attribute_value(s[8], "gene_id")
            if not current_gene_id:
                print(f"Line {line_number}: Gene id is missing in gene line.")
        elif s[2] == "transcript":
            gene_id = get_attribute_value(s[8], "gene_id")
            if not gene_id:
                print(f"Line {line_number}: Gene id is missing in transcript line.")
            elif gene_id != current_gene_id:
                print(f"Line {line_number}: Gene id does not match in transcript line.")
            current_transcript_id = get_attribute_value(s[8], "transcript_id")
            if not current_transcript_id:
                print(f"Line {line_number}: Transcript id id is missing in transcript line.")
        elif s[2] == "exon":
            transcript_id = get_attribute_value(s[8], "transcript_id")
            if not transcript_id:
                print(f"Line {line_number}: Transcript id is missing in exon line.")
            elif transcript_id != current_transcript_id:
                print(f"Line {line_number}: Transcript id does not match in exon line.")
    line_number += 1
f.close()

Summary — what we learned

variables
- simple variables: integers, floats, strings, booleans
- complex variables: lists, dictionaries
main control flow commands: if, for
files: open, read, write
comments
simple Python programs
libraries

What is next?

more programming primitives: enumerate, items, etc.
other general libraries: os, etc.
dedicated libraries: BioPython
plots,
object programming,
version control system (git).

Other useful resources

A full Python course: https://python.sdv.univ-paris-diderot.fr/cours-python.pdf
A Python CheatSheet: https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf

Thanks to

Sandra Dérozier & Thomas Duigou for “Introduction à Python”: https://sandraderozier.pages.mia.inra.fr/formation-python/supports/00-intro-general.html
The SouthGreen platform for their Python training page: https://southgreenplatform.github.io/trainings/python/
Christophe Klopp for his Python Course: http://genoweb.toulouse.inra.fr/~klopp/BioComp/3.pdf
Please fill in the satisfaction survey: https://sondages.inrae.fr/index.php/84236

This document is CC-BY!