Five things I hate about teaching Python

I’ve just finished teaching an intensive Python programming course and, as usual, spending a week thinking about how best to introduce my students to programming has given me something to write about. I realised that, while I’ve spent a lot of time talking about why Python is a great language, I have a number of pet peeves that I’ve never written down.

I’m not talking about the usual problems, like Python’s relative lack of performance or lack of compile-time type checking – these things are deliberate design trade-offs and changing them would involve making Python not-Python. I’m talking about the small things that cause friction, especially in a teaching environment.

Note: I realize that there are good reasons for all these things to be the way they are, so don’t take this too seriously….

1. Floating point vs. integer division

Anyone who’s written in Python for any length of time probably types this line automatically without really thinking about it:

from __future__ import division

but take a moment to consider how you would explain what’s going on in this piece of code to a beginner. In order to really understand what’s happening here, you have to know about:

  • Python’s system for importing modules
  • Python’s system for grouping modules into packages
  • the fact that there are different versions of Python with slightly different behaviour
  • the difference between floating-point and integer numbers
  • the mechanisms of operator overloading, whereby we can define the behaviour of things like + and / for different types
  • the concept of polymorphic functions and operators, which allow us to treat different classes the same, some of the time

Explaining all this to someone who has never written a line of code before is unlikely to be productive, but none of the alternatives are particularly attractive either. We can just present this as a magic piece of code and save the explanation for later (this is normally what I do). We can instruct students to use explicit floating point numbers:

answer = float(4)/3
answer = 4.0/3

but eventually they will forget, use integers, and find that it works some of the time. We can carefully craft our examples and exercises to avoid the need for floating point division, but this is setting students up for pain further down the line. We can run Python 2 with the command-line argument -Qnew to force true division, or just use Python 3 for teaching, but both of these options will cause confusion once the student goes back to their own environment.
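
To see exactly what we’re protecting students from, here’s the behaviour at the interactive prompt (a quick sketch; the exact float formatting varies between Python versions):

>>> 4 / 3   # Python 2 without the import: integer division truncates
1
>>> from __future__ import division
>>> 4 / 3   # now we get true division, as in Python 3
1.3333333333333333
>>> 4 // 3  # integer division is still available, explicitly
1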

2. split() vs. join()

“OK class, this is how we take a string and split it up into a list of strings using a fixed delimiter:”

sentence = "The all-England summarize Proust competition"
words = sentence.split(" ")

“So I guess, logically, to put the words back together again we just say:

sentence = words.join(" ")

right? Look at that elegant symmetry…… Wait a minute, you’re telling me it doesn’t work like that? The list and the delimiter actually go the other way around, so that we have to write this ugly line?

sentence = " ".join(words)

Wow, that just looks wrong.”

Yes, I know that there are good reasons for collection classes to only have methods that are type-agnostic, but would it really be so bad to just str() everything?
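
In fairness, the inverted syntax makes more sense once you notice that join works on any iterable of strings, not just lists – a quick sketch:

words = ['The', 'all-England', 'summarize', 'Proust', 'competition']
sentence = " ".join(words)   # any iterable of strings will do

numbers = [1, 2, 3]
# " ".join(numbers) raises a TypeError – join won't guess how to turn
# non-strings into strings, so we have to be explicit:
sentence = " ".join(str(n) for n in numbers)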

3. Exhaustible files

It’s perfectly logical that you shouldn’t be able to iterate through a file object twice without re-opening it….. once you know a fair bit about how iteration is actually implemented in Python. As a beginner, though, it’s a bit like Python is giving with one hand and taking away with the other – you can use an opened file object just like a list, except in this one specific but very important way:

my_list = [1,2,3,4]
for number in my_list:
    do_something(number)
# second loop works just as you'd expect
for number in my_list:
    do_something_else(number)

my_file = open("some.input")
for line in my_file:
    do_something(line)
# second loop silently never runs
for line in my_file:
    do_something_else(line)

This problem also rears its ugly head when students try to iterate over a file having already consumed its contents using read():

my_file = open("some.input")
my_contents = my_file.read()
....
# this loop silently never runs
for line in my_file:
    do_something(line)

That second line can be difficult to spot for student and teacher alike when there are many intervening lines between it and the loop.
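
There’s no neat fix to teach here, only workarounds: rewind the file object with seek(), or read the lines into a real list first (a minimal sketch, re-using the hypothetical do_something() from above):

my_file = open("some.input")
my_contents = my_file.read()
my_file.seek(0)   # rewind to the start so the loop below sees the lines
for line in my_file:
    do_something(line)

# alternatively, slurp the lines into a list, which we can loop over repeatedly
lines = open("some.input").readlines()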

4. Lambda expressions

OK, this one is more annoying when writing code than when teaching it, since I rarely get round to talking about functional programming in introductory courses. I totally get why there should be a big, obvious flag when we are doing something clever (which lambda expressions generally are). Nevertheless, it seems a shame to have a style of coding that lends itself to elegant brevity marred by so many unnecessary keystrokes.

I think that the reason this bugs me so much is that I first got into functional programming by way of Groovy, which has (to me) a very pleasing syntax for anonymous functions (actually closures):

{x,y -> x**y}

compared to Python:

lambda x,y : x**y

Of course, Python lessens the sting of having to type lambda with its various comprehensions:

squares = map(lambda x : x**2, range(10))
squares = [x**2 for x in range(10)]

so I can’t complain too loudly.
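
The place the keystrokes bite most in everyday code is one-off key functions – a small sketch, assuming a list of sequence strings called sequences:

sequences = ['ATCGTGA', 'ACG', 'ATCG']
# sort the sequences by length – the word lambda is most of the typing
by_length = sorted(sequences, key=lambda seq: len(seq))

(Though in this particular case key=len would do the same job with no lambda at all.)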

5. Variables aren’t declared

It’s just way too easy for beginners to make a typo that brings their progress to a screeching halt. Consider this real-life example from my most recent course:

positions = [0]
for pos in [12,54,76,103]:
    postions  = positions + [pos]
print(positions) # prints [0] rather than [0,12,54,76,103]

Leaving aside that this particular example could have been salvaged by using positions.append(), it took way too long for us to track down the typo. In real-life code, this is the kind of thing that would ideally be caught by unit testing. This is one (rare!) case in which I pine for the old days of teaching Perl – use strict and my would have taken care of this type of problem.
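
For completeness, here’s the appending version, which sidesteps the problem by never re-typing the variable name on the left-hand side of an assignment:

positions = [0]
for pos in [12,54,76,103]:
    positions.append(pos)   # modifying the list in place – no assignment to mistype
print(positions) # prints [0,12,54,76,103]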



Manipulating tabular output with command-line tools

Introduction

Surprisingly, given that this is a website about programming, this article is going to be about how to avoid programming. Specifically: how to use simple existing command-line tools to process tabular data.

Even for folks who are comfortable writing programs to process tabular data, being able to use command-line tools to slice, dice and generally manipulate data files is a useful skill to have. For very simple tasks it’s often quicker to write a command line to do the job rather than switching mental gears and starting to write a new program. Also, the command-line tools we’ll be discussing here are likely to be faster than Python for many jobs.

Tabular data

By tabular data I mean simple text files where each line stores a single record and consists of a number of different fields separated by a delimiter. The delimiter – i.e. the thing that separates the fields – can be anything, but the most common examples are tabs, spaces, or commas. In this last case, we sometimes refer to the file as being in Comma Separated Value or CSV format.

Let’s look at a few concrete examples. A common type of tabular file that we encounter in biology is BLAST output. Normally when we run BLAST, we get the output in a human-readable format which shows the alignments (or if we’re using a web BLAST server we get a nice webpage with graphics). This format is easy for humans to read but not so easy for computers to deal with; if we need to do some processing of our BLAST results then tabular output is often a better option. We can get tabular output from BLAST by giving it the -m 9 option if we’re using the old BLAST package, or giving it the -outfmt 7 option if we’re using the newer BLAST+ package. If we’re using a web-based BLAST server, we can download our results in tabular format using the download link at the top of the results page.


However we get hold of our tabular BLAST results they’re going to look something like this:

# blastp
# Iteration: 0
# Query: my_protein
# RID: BFFAX0Z3015
# Database: nr
# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 100 hits found
my_protein gi|60654447|gb|AAX29914.1| 100.00 100.00 599 0 0 1 599 1 599 0.0 1243
my_protein gi|397526517|ref|XP_003833169.1| 99.50 100.00 599 3 0 1 599 1 599 0.0 1240
my_protein gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN 99.83 99.83 599 1 0 1 599 1 599 0.0 1240
my_protein gi|674214810|gb|AIK97765.1| 99.83 99.83 599 1 0 1 599 1 599 0.0 1239
my_protein gi|694936914|ref|XP_009455498.1| 99.33 100.00 599 4 0 1 599 1 599 0.0 1239
...

This is only the first few lines; in reality there will be many more, but they’ll all share the same format. The first few lines are comments which tell us various things – the name of the query, the number of hits, etc. By far the most important comment line is this one:

# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

because it tells us the order of the fields in each of the result lines – in other words, it tells us how to interpret the other lines. From looking at the list of fields we can see that, for example, the second field in each line is the name of the subject sequence, and the fifth field is the length of the match.

We’ll be using BLAST tabular output for the examples in this tutorial, but it’s worth remembering that exactly the same tools can be used for any type of tabular data. For example, here’s the first few lines of a BED file which holds annotation for genomic coordinates:

browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2
itemRgb="On"
chr7    127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255

The data being stored are different, but the layout is just the same as our BLAST example: a handful of comment lines at the top followed by a bunch of lines with tab-separated fields. One more example: here’s GFF format, also used for annotating sequence data:

X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.
X	Ensembl	Variation	2413805	2413805	.	+	.

Thinking about the data formats that you encounter in your own work will furnish further examples.

Before we dive in and start looking at the tools we can use to manipulate tabular data, a few points about your computing environment. In this tutorial we’ll be looking at command-line tools – cut, sort, head, tail, uniq, wc, sed and awk. If you’re running Linux, or any UNIX-flavoured operating system, including Mac OS, these tools are part of the standard library and should already be installed. If you’re using Windows, you’ll need to either install them (cygwin is probably the best way to do this) or find somebody who can give you an account on a Linux server.

All the example command lines we’ll be looking at use the BLAST example from above – if you want to play along at home you can download a copy here. For the command-line examples below, the lines starting with a dollar sign ($) are the ones that we type, and the first few lines of the output are shown below them. This will look slightly different on your machine depending on your settings.

Getting rid of comments

The first thing we need to do in order to start processing the file is to remove the comment lines at the start. There are seven comment lines, so we need to start looking at the eighth line. The easiest way to do this is with tail. The tail command is normally used for extracting the last few lines of a file, but with the -n option we can tell it to extract all the lines starting from a given line. Here, we want to start at line number eight, so we can use the command

$ tail -n +8 blast_example.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|397526517|ref|XP_003833169.1|	99.50	100.00	599	3	0	1	599	1	599	0.0	 1240
my_protein	gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	99.83	99.83	599	1	0	1	599	1	599	0.0	 1240
my_protein	gi|674214810|gb|AIK97765.1|	99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|694936914|ref|XP_009455498.1|	99.33	100.00	599	4	0	1	599	1	599	0.0	 1239
my_protein	gi|387018|gb|AAA36439.1|	99.50	99.50	599	3	0	1	599	1	599	0.0	 1236

Notice that we have to put a plus sign (+) before the 8. The output of this command will be all the lines of the file from the eighth line onwards, which we can either save by redirecting it into a new file:

$ tail -n +8 blast_example.txt >blast_example_nocomments.txt

or use a pipe to send the output directly to another program:

$ tail -n +8 blast_example.txt | some_other_program

For the rest of the examples we’ll use the file blast_example_nocomments.txt so that we don’t need to worry about the comment lines.

Extracting fields

One of the simplest things we can do with tabular data is to just display the fields we’re interested in. The command for doing this is called cut, and we use the -f option to say which field we want to extract. For example, we can extract just the percent identity for each hit, which is stored in the third field:

$ cut -f 3 blast_example_nocomments.txt
100.00
99.50
99.83
99.83
99.33
99.50

To get multiple fields, we just list them separated by commas. Here’s how to get the hit name, number of mismatches, and bitscore:

$ cut -f 2,6,13 blast_example_nocomments.txt
gi|60654447|gb|AAX29914.1|	0	 1243
gi|397526517|ref|XP_003833169.1|	3	 1240
gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	1	 1240
gi|674214810|gb|AIK97765.1|	1	 1239
gi|694936914|ref|XP_009455498.1|	4	 1239
gi|387018|gb|AAA36439.1|	3	 1236
gi|76885916|gb|ABA60099.1|	1	 1234
...

Notice that in the output from this command, the fields don’t line up nicely because some of the hit names are longer than others. By piping the results to the column -t command we can get the fields to line up:

$ cut -f 2,6,13 blast_example_nocomments.txt | column -t
gi|60654447|gb|AAX29914.1|                                        0  1243
gi|60654447|gb|AAX29914.1|                                        0  1243
gi|60654447|gb|AAX29914.1|                                        0  1243
gi|397526517|ref|XP_003833169.1|                                  3  1240
gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN  1  1240
gi|674214810|gb|AIK97765.1|                                       1  1239
gi|674214810|gb|AIK97765.1|                                       1  1239
gi|694936914|ref|XP_009455498.1|                                  4  1239
gi|387018|gb|AAA36439.1|                                          3  1236
gi|76885916|gb|ABA60099.1|                                        1  1234

Be careful when using column -t, as the fields are effectively padded to the length of the longest value, so if you have a very long value somewhere in the file you’ll end up with a lot of whitespace. Note that if you want to use cut on a file where the delimiter is something other than a tab, you’ll need to use the -d option. For example, on a CSV file use

$ cut -d , -f 1,2,3 myfile.txt

Sorting tabular data

The sort command, by default, sorts alphabetically on entire lines, so if we use it without any options it will effectively sort on the query name (if our BLAST report contains multiple queries) and then on the hit name (since that is the second field):

$ sort blast_example_nocomments.txt
my_protein	gi|109110319|ref|XP_001088270.1|;gi|544492470|ref|XP_005580931.1|;gi|685591740|ref|XP_009186406.1|	98.33	99.33	599	10	0	1	599	1	599	0.0	 1227
my_protein	gi|109110321|ref|XP_001088157.1|	98.32	99.33	596	10	0	4	599	37	632	0.0	 1222
my_protein	gi|130485856|ref|NP_001076150.1|;gi|75039091|sp|O97554.1|PGH1_RABIT;gi|4103591|gb|AAD01796.1|	91.49	95.83	576	49	0	24	599	31	606	0.0	 1121
my_protein	gi|14278642|pdb|1HT5|A;gi|14278643|pdb|1HT5|B;gi|14278644|pdb|1HT8|A;gi|14278645|pdb|1HT8|B	93.83	97.10	551	34	0	32	582	1	551	0.0	 1093
my_protein	gi|157835592|pdb|2OYE|P;gi|157835593|pdb|2OYU|P	92.57	95.85	579	43	0	21	599	22	600	0.0	 1130
....
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|999675|pdb|1PRH|A;gi|999676|pdb|1PRH|B	93.48	96.92	552	36	0	32	583	1	552	0.0	 1091

For our example file, where the hit names begin with GI numbers, this effectively sorts by GI number (though notice that the sorting is not numerical: 76885916 comes before 999675 because it starts with a lower digit, even though it’s a higher number).

We can get much more use out of sort if we specify the field we want to sort on. To do this we use the -k option (short for key) and give two numbers (separated by a comma) which are the first and last fields to sort on. This looks a bit weird since most of the time we want to sort on just a single field, so the first and last fields are the same. Here’s how we sort by percent identity:

$ sort -k3,3  blast_example_nocomments.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|507558178|ref|XP_004662827.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103
my_protein	gi|507558180|ref|XP_004662828.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103
my_protein	gi|507558182|ref|XP_004662829.1|	90.62	94.79	576	54	0	24	599	24	599	0.0	 1102
my_protein	gi|586470237|ref|XP_006865713.1|	90.62	96.01	576	54	0	24	599	25	600	0.0	 1094
...
my_protein	gi|674214810|gb|AIK97765.1|	99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234

Percent identity is the third field, so the first and last fields to sort on are both three. Looking closely at the output above reveals something odd: the first line, where the percent identity is 100, should be last, but instead it is first. The reason for this is that we’re still sorting alphabetically, so 100 comes before 99 because it starts with a 1. To switch to numerical sorting, we use the -n option:

$ sort -k3,3  -n blast_example_nocomments.txt
my_protein	gi|507558178|ref|XP_004662827.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103
my_protein	gi|507558180|ref|XP_004662828.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103
my_protein	gi|507558182|ref|XP_004662829.1|	90.62	94.79	576	54	0	24	599	24	599	0.0	 1102
...
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243

and the hit with 100% identity moves to its correct place at the end of the output. If we want to see the highest identity first – i.e. to reverse the sorting order – we can just use the -r option:

$ sort -k3,3  -n -r blast_example_nocomments.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
...
my_protein	gi|507558182|ref|XP_004662829.1|	90.62	94.79	576	54	0	24	599	24	599	0.0	 1102
my_protein	gi|507558180|ref|XP_004662828.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103
my_protein	gi|507558178|ref|XP_004662827.1|	90.62	94.79	576	54	0	24	599	25	600	0.0	 1103

If we want to sort on multiple fields we can use multiple -k options. For example, here’s how we sort by percent identity (field number three) then by hit length (field number five):

$ sort -k3,3  -k5,5 -n -r blast_example_nocomments.txt

Head

Often when we sort BLAST hits by some criterion we’re interested in the first or last hits. For example, we might only want to see the five longest hits. To do this, we can take the output of sort and pipe it directly into head. The head command displays just the first few lines of its input – by default, the first ten lines, though we can specify a different number with the -n option. Here’s how we extract the five longest hits from our BLAST report – we sort by hit length (the fifth field) in reverse order then pipe the results into head and extract the first 5 lines:

$ sort -k5,5 -n -r blast_example_nocomments.txt | head -n 5
my_protein	gi|724957135|ref|XP_010353695.1|	98.33	98.83	599	10	0	1	599	1	599	0.0	 1224
my_protein	gi|694936914|ref|XP_009455498.1|	99.33	100.00	599	4	0	1	599	1	599	0.0	 1239
my_protein	gi|674214810|gb|AIK97765.1|	        99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|635071455|ref|XP_007966350.1|	98.33	99.33	599	10	0	1	599	60	658	0.0	 1225
my_protein	gi|60654447|gb|AAX29914.1|	       100.00	100.00	599	0	0	1	599	1	599	0.0	 1243

If we combine this with the cut command that we discussed earlier, we can do more sophisticated things. For example, we can look at just the hit names and bitscores for the five longest hits by piping the results of the above command into cut and extracting the second and thirteenth fields:

$ sort -k5,5 -n -r blast_example_nocomments.txt | head -n 5 | cut -f 2,13
gi|724957135|ref|XP_010353695.1|	 1224
gi|694936914|ref|XP_009455498.1|	 1239
gi|674214810|gb|AIK97765.1|	         1239
gi|635071455|ref|XP_007966350.1|	 1225
gi|60654447|gb|AAX29914.1|	         1243

This is an interesting example because the field that we’re using for sorting – the hit length – doesn’t form part of the final output. Because of this, we have to be careful with the order of commands – the sort has to come before the cut. If we try and do it the other way round:

$ cut -f 2,13 blast_example_nocomments.txt | sort -k5,5 -n -r | head -n 5

it won’t work, because by the time sort sees the data the hit length field is no longer there – it’s been removed by cut.

Filtering with awk

The last example we saw involved grabbing the five longest hits; what if, instead, we wanted to grab all the hits that were longer than a given length? We can’t use the sort/head approach outlined above, because we don’t know in advance how many lines we want to display. Instead, we need a tool that’s capable of filtering lines based on specific criteria, and the one we’re going to use is called awk. The awk command is a very flexible tool for text manipulation and can do lots of things; here we’re only going to use a tiny bit of its capability. To use awk for filtering, we run it with a command string that describes the criterion which lines must pass. Inside this string, we can use $1 to refer to the first field, $2 to refer to the second, and so on. We can also use mathematical comparisons like equals and less-than. A few examples will make it clearer – here’s how we display just the hits that are longer than 590 bases (i.e. just the lines where the fifth field is greater than 590):

$ awk '$5 > 590' blast_example_nocomments.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|397526517|ref|XP_003833169.1|	99.50	100.00	599	3	0	1	599	1	599	0.0	 1240
my_protein	gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	99.83	99.83	599	1	0	1	599	1	599	0.0	 1240
my_protein	gi|674214810|gb|AIK97765.1|	99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|694936914|ref|XP_009455498.1|	99.33	100.00	599	4	0	1	599	1	599	0.0	 1239
my_protein	gi|387018|gb|AAA36439.1|	99.50	99.50	599	3	0	1	599	1	599	0.0	 1236
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|109110319|ref|XP_001088270.1|;gi|544492470|ref|XP_005580931.1|;gi|685591740|ref|XP_009186406.1|	98.33	99.33	599	10	0	1	599	1	599	0.0	 1227
my_protein	gi|635071455|ref|XP_007966350.1|	98.33	99.33	599	10	0	1	599	60	658	0.0	 1225
my_protein	gi|724957135|ref|XP_010353695.1|	98.33	98.83	599	10	0	1	599	1	599	0.0	 1224
my_protein	gi|109110321|ref|XP_001088157.1|	98.32	99.33	596	10	0	4	599	37	632	0.0	 1222
my_protein	gi|544492468|ref|XP_005580930.1|;gi|355567462|gb|EHH23803.1|;gi|355753051|gb|EHH57097.1|	98.32	99.33	596	10	0	4	599	37	632	0.0	 1221
my_protein	gi|403266039|ref|XP_003925205.1|	97.16	98.50	599	17	0	1	599	1	599	0.0	 1209
my_protein	gi|667269156|ref|XP_008570258.1|	92.80	95.98	597	43	0	3	599	35	631	0.0	 1123
my_protein	gi|470608129|ref|XP_004314816.1|	90.95	94.81	597	54	0	3	599	4	600	0.0	 1122
my_protein	gi|403310639|ref|NP_001258093.1|	91.82	91.82	599	1	1	1	599	1	551	0.0	 1115
my_protein	gi|194033503|ref|XP_001926164.1|	91.12	95.48	597	53	0	3	599	4	600	0.0	 1102

Here’s how we display just the hits that include the very start of the query sequence (i.e. the lines where the eighth field is equal to one):

$ awk '$8 == 1' blast_example_nocomments.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|397526517|ref|XP_003833169.1|	99.50	100.00	599	3	0	1	599	1	599	0.0	 1240
my_protein	gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	99.83	99.83	599	1	0	1	599	1	599	0.0	 1240
my_protein	gi|674214810|gb|AIK97765.1|	99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|694936914|ref|XP_009455498.1|	99.33	100.00	599	4	0	1	599	1	599	0.0	 1239
my_protein	gi|387018|gb|AAA36439.1|	99.50	99.50	599	3	0	1	599	1	599	0.0	 1236
my_protein	gi|109110319|ref|XP_001088270.1|;gi|544492470|ref|XP_005580931.1|;gi|685591740|ref|XP_009186406.1|	98.33	99.33	599	10	0	1	599	1	599	0.0	 1227
my_protein	gi|635071455|ref|XP_007966350.1|	98.33	99.33	599	10	0	1	599	60	658	0.0	 1225
my_protein	gi|724957135|ref|XP_010353695.1|	98.33	98.83	599	10	0	1	599	1	599	0.0	 1224
my_protein	gi|403266039|ref|XP_003925205.1|	97.16	98.50	599	17	0	1	599	1	599	0.0	 1209
my_protein	gi|403310639|ref|NP_001258093.1|	91.82	91.82	599	1	1	1	599	1	551	0.0	 1115

Here’s how we display just the hits with fewer than four mismatches (i.e. the lines where the sixth field is less than four):

$ awk '$6 < 4' blast_example_nocomments.txt
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi|397526517|ref|XP_003833169.1|	99.50	100.00	599	3	0	1	599	1	599	0.0	 1240
my_protein	gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	99.83	99.83	599	1	0	1	599	1	599	0.0	 1240
my_protein	gi|674214810|gb|AIK97765.1|	99.83	99.83	599	1	0	1	599	1	599	0.0	 1239
my_protein	gi|387018|gb|AAA36439.1|	99.50	99.50	599	3	0	1	599	1	599	0.0	 1236
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|530391358|ref|XP_005252162.1|	99.47	99.65	571	3	0	29	599	4	574	0.0	 1186
my_protein	gi|403310639|ref|NP_001258093.1|	91.82	91.82	599	1	1	1	599	1	551	0.0	 1115

You get the idea. We can combine multiple criteria with && – here’s how we display just the hits with fewer than four mismatches whose length is less than 599 bases:

$ awk '$6 < 4 && $5 < 599' blast_example_nocomments.txt
my_protein	gi|76885916|gb|ABA60099.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1234
my_protein	gi|76885914|gb|ABA60098.1|	99.83	99.83	596	1	0	4	599	35	630	0.0	 1233
my_protein	gi|530391358|ref|XP_005252162.1|	99.47	99.65	571	3	0	29	599	4	574	0.0	 1186

Unsurprisingly, combining awk with the other tools allows us to ask even more specific questions. For example, what are the bitscores (field number 13) of the hits with at least 99% identical positions (field number 3)? To get the answer, we filter the lines using awk then pipe the output to cut to extract just the fields we want:

 $ awk '$3 >= 99' blast_example_nocomments.txt | cut -f 13
 1243
 1240
 1240
 1239
 1239
 1236
 1234
 1233
 1186

Summarizing output

All of the examples we’ve seen so far result in multiple lines of output, but sometimes we want to summarize the output. Consider a variation on the example above: how many hits have at least 99% identical bases? We know how to use awk to filter out just the hits that satisfy the criterion, so all we need to do is count the number of lines of output we get. The wc tool (the name is short for word count) will do this if we use the -l (for lines) option:

$ awk '$3 >= 99' blast_example_nocomments.txt | wc -l
9

So rather than getting back a list, we now get back a single number. Another way to summarize output is to take the average of a list of numbers – for example, what’s the average bitscore of all hits with at least 99% identity? A combination of awk and cut will give us the bitscores:

$ awk '$3 >= 99' blast_example_nocomments.txt | cut -f 13
 1243
 1240
 1240
 1239
 1239
 1236
 1234
 1233
 1186

and in fact, awk is also capable of calculating the mean. The command that does so is ‘{a+=$13} END{print a/NR}’, which roughly translates as “for each line, add the thirteenth field to a variable called a, then at the end of the file divide a by the number of lines and print it”. Any further explanation of this command string would require a massive diversion into the syntax of awk, so we’ll just treat it as a magic string for now and look at how it’s used:

$ awk '$3 >= 99' blast_example_nocomments.txt | awk '{a+=$13} END{print a/NR}'
1232.22

To calculate the average for a different field, we just replace $13 with something else. Notice that in the above example we’re using awk twice: once to filter out the lines and once to calculate the average. We could do the whole thing in a single awk command:

$ awk '$3 >= 99  {a+=$13;b+=1} END{print a/b}' blast_example_nocomments.txt
1232.22

but it’s harder to read and would take even more space to explain.

An interesting feature of BLAST reports is that we can (depending on the settings) potentially have multiple hits to a single subject sequence, which will show up as multiple lines with identical second columns. To turn a list with duplicates into a unique list, we use the uniq command. For example, to get a list of all unique subject sequence names:

$ cut -f 2 blast_example_nocomments.txt | uniq 
gi|60654447|gb|AAX29914.1|
gi|397526517|ref|XP_003833169.1|
gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN
gi|674214810|gb|AIK97765.1|
...
gi|14278642|pdb|1HT5|A;gi|14278643|pdb|1HT5|B;gi|14278644|pdb|1HT8|A;gi|14278645|pdb|1HT8|B
gi|999675|pdb|1PRH|A;gi|999676|pdb|1PRH|B
gi|507932370|ref|XP_004677426.1|
gi|7245654|pdb|1EBV|A

If we compare the number of lines with and without uniq, we can see how many subject names are duplicates:

$ cut -f 2 blast_example_nocomments.txt | wc -l
102
$ cut -f 2 blast_example_nocomments.txt | uniq | wc -l
98

In this case we get 102 lines but only 98 unique lines, so we know that 4 are duplicated. But we have no way of knowing how the duplicates are distributed; do we have a single subject sequence with five hits (resulting in four duplicates) or four subject sequences each with two hits (also resulting in four duplicates)? To find out, we can add the -c option to uniq, which will prefix each line with the number of times it occurs:

$ cut -f 2 blast_example_nocomments.txt | uniq -c
      3 gi|60654447|gb|AAX29914.1|
      1 gi|397526517|ref|XP_003833169.1|
      1 gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN
      2 gi|674214810|gb|AIK97765.1|
      1 gi|694936914|ref|XP_009455498.1|
      ...
      1 gi|471370351|ref|XP_004375688.1|
      2 gi|545512378|ref|XP_005625066.1|
      ...
      1 gi|507932370|ref|XP_004677426.1|
      1 gi|7245654|pdb|1EBV|A

An important limitation of uniq is that it only identifies duplicated lines that are adjacent in the file. This is fine in the case of subject sequence names, because BLAST groups multiple hits to the same sequence together, but if we try the same trick with a field where duplicated values are spread throughout the file it won’t work. For example, we might want to see how many hits there are of each length (remember, alignment length is the fifth field), but if we try doing this:

$ cut -f 5 blast_example_nocomments.txt | uniq -c
      1 599
      2 123
      3 599
      1 123
      2 599
      ...

we can see that the lines containing 599 have not been correctly grouped together because they are not adjacent. To make this command work we need to sort the alignment lengths numerically before passing them to uniq:

$ cut -f 5 blast_example_nocomments.txt | sort -n | uniq -c
      3 123
      1 309
      2 551
      1 552
      2 553
      1 567
     12 568
      1 569
      2 570
      3 571
      1 573
      5 574
      1 575
     35 576
      9 579
      1 580
      4 582
      4 596
      3 597
     11 599

Sorting the lengths in this way not only gives us the correct counts, but also makes it easier to interpret the results.

Editing fields

In the last section of this tutorial we’ll look at a few different ways that we can make changes to the fields in a tabular data file. Let’s start with our old friend awk; we already know that inside an awk command we can use $3, for example, to get the value of the third field. We can also use the same notation to set the value of the third field – say, to a specific number:

$ awk '$3=123456789' blast_example_nocomments.txt 
my_protein gi|60654447|gb|AAX29914.1| 123456789 100.00 599 0 0 1 599 1 599 0.0 1243
my_protein gi|60654447|gb|AAX29914.1| 123456789 100.00 123 0 0 1 599 602 866 0.0 1243
my_protein gi|60654447|gb|AAX29914.1| 123456789 100.00 123 0 0 1 599 702 904 0.0 1243
my_protein gi|397526517|ref|XP_003833169.1| 123456789 100.00 599 3 0 1 599 1 599 0.0 1240
...

This by itself is not very useful – let’s look at a more realistic example. Imagine that we want to express the proportion of identical bases for each hit as a decimal fraction rather than as a percentage. To accomplish this, we use the awk command ‘$3=$3/100’ i.e. “set the new value of the third field to be the current value of the third field divided by one hundred”:

$ awk '$3=$3/100' blast_example_nocomments.txt 
my_protein gi|60654447|gb|AAX29914.1| 1 100.00 599 0 0 1 599 1 599 0.0 1243
my_protein gi|60654447|gb|AAX29914.1| 1 100.00 123 0 0 1 599 602 866 0.0 1243
my_protein gi|60654447|gb|AAX29914.1| 1 100.00 123 0 0 1 599 702 904 0.0 1243
my_protein gi|397526517|ref|XP_003833169.1| 0.995 100.00 599 3 0 1 599 1 599 0.0 1240
my_protein gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN 0.9983 99.83 599 1 0 1 599 1 599 0.0 1240
my_protein gi|674214810|gb|AIK97765.1| 0.9983 99.83 599 1 0 1 599 1 599 0.0 1239
my_protein gi|674214810|gb|AIK97765.1| 0.9983 99.83 123 1 0 1 599 600 700 0.0 1239
my_protein gi|694936914|ref|XP_009455498.1| 0.9933 100.00 599 4 0 1 599 1 599 0.0 1239
my_protein gi|387018|gb|AAA36439.1| 0.995 99.50 599 3 0 1 599 1 599 0.0 1236
...

You’ll notice that in the output shown above, awk has separated each field with a space rather than a tab as they were originally. Depending on what we want to do with the data at this point, this may not be a problem, but if we want to get the original tab separator back we just have to set the OFS (short for Output Field Separator) variable to the tab character, which we represent as ‘\t’:

$ awk  '$3=$3/100' OFS='\t' blast_example_nocomments.txt 
my_protein	gi|60654447|gb|AAX29914.1|	1	100.00	599	0	0	1	599	1	599	0.0	1243
my_protein	gi|60654447|gb|AAX29914.1|	1	100.00	123	0	0	1	599	602	866	0.0	1243
my_protein	gi|60654447|gb|AAX29914.1|	1	100.00	123	0	0	1	599	702	904	0.0	1243
my_protein	gi|397526517|ref|XP_003833169.1|	0.995	100.00	599	3	0	1	599	1	599	0.0	1240
my_protein	gi|18104967|ref|NP_000953.2|;gi|317373262|sp|P23219.2|PGH1_HUMAN	0.9983	99.83	599	1	0	1	599	1	599	0.0	1240
my_protein	gi|674214810|gb|AIK97765.1|	0.9983	99.83	599	1	0	1	599	1	599	0.0	1239
my_protein	gi|674214810|gb|AIK97765.1|	0.9983	99.83	123	1	0	1	599	600	700	0.0	1239
...

If we want to make multiple edits to each line, we simply separate them with a comma. Imagine that we want to take our list of hits to a protein sequence and translate the start/stop positions to the corresponding DNA sequence by multiplying them both by three. Here’s how we do it (remember that the query start and stop are fields number 8 and 9):

$ awk  '$8=$8*3,$9=$9*3' OFS='\t' blast_example_nocomments.txt 
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	599	0	0	3	1797	1	599	0.0	1243
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	123	0	0	3	1797	602	866	0.0	1243
my_protein	gi|60654447|gb|AAX29914.1|	100.00	100.00	123	0	0	3	1797	702	904	0.0	1243
my_protein	gi|397526517|ref|XP_003833169.1|	99.50	100.00	599	3	0	3	1797	1	599	0.0	1240
...

Notice that the operator for multiplication is an asterisk (*), not an x.

Finally, what if we want to alter just a part of a field, rather than the whole field? For example, all those bar characters (|) in the subject hit names make them difficult to read, so we might want to change them to spaces. We could do this job with awk, but the sed tool is a bit easier to use in this case. When using sed to replace one character with another, we give it a command string that looks like this: ‘s/X/Y/g’, where X is the character we want to change and Y is the replacement. In our example, we want to change all bar characters to spaces:

$ sed 's/|/ /g' blast_example_nocomments.txt
my_protein	gi 60654447 gb AAX29914.1 	100.00	100.00	599	0	0	1	599	1	599	0.0	 1243
my_protein	gi 60654447 gb AAX29914.1 	100.00	100.00	123	0	0	1	599	602	866	0.0	 1243
my_protein	gi 60654447 gb AAX29914.1 	100.00	100.00	123	0	0	1	599	702	904	0.0	 1243
my_protein	gi 397526517 ref XP_003833169.1 	99.50	100.00	599	3	0	1	599	1	599	0.0	 1240

It’s important to note that these last few examples of editing fields do not alter the original file – if we want to save the output then we need to redirect it to a new file.

Doing more complicated things

The examples in this tutorial just give a very brief overview of the capabilities of command-line tools to deal with tabular data. Each of the tools mentioned here – especially sed and awk – have many more options and are capable of doing much more complicated jobs. However, as command-lines and scripts get larger, they become harder to edit and maintain.

For any job involving tabular data that’s more complex than the examples outlined above, it’s probably best to switch to a “real” programming language – Python and Perl are both excellent for this type of text manipulation. If you’d like to learn how to use Python to tackle more complicated versions of these types of problems, sign up for a mini-course by email below, or check out the Python for Biologists books.
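
As a taste of what that switch looks like, here’s a rough Python equivalent of the earlier awk length filter (a sketch only, assuming the same tab-separated blast_example_nocomments.txt file):

import sys

with open("blast_example_nocomments.txt") as blast_file:
    for line in blast_file:
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) > 590:   # the fifth field is the alignment length
            sys.stdout.write(line)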



Randomly sampling reads from a FASTQ file

Attention readers: this article is about how to write a Python program to randomly sample reads from a FASTQ file. If you just want to run the program, save it from this link and run it with -h to view usage. Alternatively, use one of the many other tools which perform this job, and were probably not written in an afternoon as an example. 

If you’re interested in how to write such a program yourself, then read on…

A common task in bioinformatics is to take a data file – here we’re looking at next-generation sequencing reads in FASTQ format – and generate random samples from it. Let’s say we want to sample 10% of the reads from a given file. Here’s one way of doing it – we can take advantage of the fact that each FASTQ record is exactly four lines long and grab four lines at a time, and only write them to the output if the number of records we’ve seen is an exact multiple of 10:

record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number % 10 == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This works pretty well – on a test file containing 10 million FASTQ records it runs in about two minutes on my computer, with no particular attempt at optimization. An obvious problem with this code is that we’re going to get exactly the same 10% of reads every time we run it, and for some types of analysis we’d like to be able to generate a bunch of random 10% samples. To do this we need to change the logic of our code a bit; rather than looking at the record number to decide whether or not to write a given read to the output file, we’ll pick a random number between 0 and 9 and write the record if it’s equal to zero:

import random

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(0,10) == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This actually slows things down quite a bit – this version of the code takes twice as long to run, as generating random numbers is relatively computationally expensive.

This version lets us easily extend the idea for any sample size – for each record we will pick a number between 1 and 100 and if it’s less than or equal to our sample percent, then we write the record to the output file:

import random

percent = 35
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This will do nicely for when we want to sample a given proportion of reads from our file, but what about when we need to sample a given number of reads? The easiest thing to do is probably to add a pre-processing step where we count the number of lines in the file and divide by four to get the number of records, then turn that into a percentage to use in our existing code:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

percent = (number_to_sample / total_records) * 100
print("sampling " + str(percent) + " % of records")

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This code works but has a pretty severe limitation; because random.randrange() only returns integers it can’t deal correctly with situations where the correct percentage is not an integer. To fix this, we could express the proportion of reads to keep as a decimal and switch from using randrange() to random(), which returns a floating point number. However, this might still run into problems with floating-point accuracy. A neater way to do it might be to use random.sample() to pick which records to keep after counting the total number, then just compare each record to that list. Storing the list of record numbers to keep as a set allows for rapid lookup:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

# record numbers run from 0 to total_records - 1
records_to_keep = set(random.sample(xrange(total_records), number_to_sample))
record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number in records_to_keep:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This version of the code works to sample any number of records, and as a bonus is much quicker (at the expense of memory) since it only has to generate one random number per output sequence, rather than one random number per input sequence as in the earlier versions.

If we’re planning on using this program to create multiple samples of records from a single file, then there’s one more refinement we might want to make. Rather than simply running the program multiple times, we can create a bunch of output files in a single pass, meaning that we only have to count the number of lines in the input file once, and iterate over the input file once. Here we use a list to store the output file objects, and another list to store the sets of selected records for each output file:

from __future__ import division
import random

number_to_sample = 3000000
number_of_replicates = 10

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

output_files = []
output_sequence_sets = []
for i in range(number_of_replicates):
    output_files.append(open("sample.fastq." + str(i), "w"))
    output_sequence_sets.append(set(random.sample(xrange(total_records), number_to_sample)))

record_number = 0
with open("test.fastq") as input:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            for i, output in enumerate(output_files):
                if record_number in output_sequence_sets[i]:
                        output.write(line1)
                        output.write(line2)
                        output.write(line3)
                        output.write(line4)
            record_number += 1  # increment once per record, outside the loop over output files

for output in output_files:
    output.close()

Here’s a link to a slightly tidied-up version of the code with a minimal user interface to make running it a bit more pleasant. Interested in learning how to write tools like this yourself? Check out the Python for Biologists books.

Update: see the very informative comment below by Andrew Dalke, who suggests a number of changes to improve performance and fix stylistic things. The most important of these is probably to replace the list comprehension:

num_lines = sum([1 for line in input])

with a generator expression, which we can do just by changing the square brackets to round ones:

num_lines = sum((1 for line in input))

This greatly reduces the memory footprint of the program, since it no longer has to store a list of 1s in memory but rather sums them as it goes along.



Bad variable and function names

I teach a lot of beginner-targeted programming courses, and something that I’ve experimented with recently is trying to introduce the idea of self-documenting code early on in the learning process. I usually start off by talking about the difference between good and bad names for things (mostly functions and variables, though many of the same arguments apply to class names), and I’ve noticed a few common patterns that tend to crop up in beginners’ code. I thought it might be useful to lay out these common errors in one place…. most examples are biological in nature, but hopefully they will still be comprehensible to non-biologists.

Single-letter names

OK, we’re writing a program and we need to create a new variable, but we can’t think of a good name….let’s just start with a and work our way through the alphabet. Later on we find that we have a bit of code like this:

a = 'acgatagc'
b = len(a) - 2
d = ""
for e in range(0,b,3):
    g = a[e:e+3]
    h = i.get(g.upper(), 'X')
    d = d + h

Which is well on its way to becoming completely incomprehensible. Sometimes single-letter names are the result of a desire to avoid typing or a worry about running out of space in your text editor. Neither of these are good enough reasons to write code as unreadable as the example above! Rest assured that using longer, more descriptive names will not make your program slow, or make any appreciable difference to the time it takes to type a line. Here’s a sane version:

dna = 'acgatagc'
last_codon_start = len(dna) - 2
protein = ""
for codon_start in range(0,last_codon_start,3):
    codon = dna[codon_start:codon_start+3]
    amino_acid = genetic_code.get(codon.upper(), 'X')
    protein = protein + amino_acid

which might now be interpretable as part of a program that translates DNA sequences.

Sometimes we might want to use non-meaningful variable names to illustrate a very generic bit of code: for example, if we want to demonstrate how to append a value to a list:

a = []
a.append(b)

but for these purposes it’s better to use the well-known set of metasyntactic variables:

foo = []
foo.append(bar)

You can find a handy list of such variable names in several languages at Wikipedia.

There are a couple of situations where single-letter variables do make sense; mostly where there are strong conventions for their use. For example, if we’re writing a program to deal with Cartesian co-ordinates then I won’t be too upset to see variables called x and y (though I might make a case for x_pos and y_pos). Similarly, i and j are the traditional names for variables used as counters in a loop:

for i in range(10):
    for j in range(20):

but remember that the most common use for these variables – to hold an index when iterating over a list – doesn’t often occur in Python because we generally iterate over the elements of lists directly, in which case there’s no excuse for not picking a meaningful variable name.

Naming things after their type

This is a habit that people are most likely to fall into shortly after having learned of the existence of types, or shortly after having learned about a new type. The logic goes something like this: I’ve just been told that it’s important to remember whether a variable is a string or a number, so I’ll make that fact part of the name. This is not necessarily a terrible idea – in fact there is an entire system of variable naming (Hungarian notation) based on it. Where it becomes a problem is when the type becomes the most important part of the name:

my_number = 20
my_string = "Homo sapiens"
the_list = [1,2,3]
a_file = open('foo.txt')
def my_function(dna):
    ...

This is obviously problematic: it’s generally much more important to know what values are stored in a variable:

minimum_orf_length = 20
species_name = "Homo sapiens"
reading_frames = [1,2,3]
input_file = open('foo.txt')
def translate_dna(dna):
    ...

There’s a more subtle problem with types-as-variable-names – the dynamic nature of Python means that it tends to work best when we worry about the various ways that a variable can be used, rather than its type (see Duck Typing). It’s this magic that allows us, for example, to iterate over lists, strings and files using a single syntax.
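
As a quick illustration of that point, the same loop syntax happily consumes a list, a string or an open file – nothing in the code cares what type it’s looping over:

for number in [1, 2, 3]:       # elements of a list
    print(number)
for base in "ACGT":            # characters of a string
    print(base)
for line in open("foo.txt"):   # lines of a file
    print(line)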

Extremely vague names

Often when we create a variable, or start writing a function, we’re not exactly sure what its job is going to be in our program. This is especially true when we first start writing code. Unfortunately, this can lead to some very unhelpful variable names – examples I have seen in the wild include data, input, output, do_stuff(), process_files(), params, object, and true_or_false. If you find yourself using a “placeholder” name like these during the process of coding, it’s a good idea to go back and change them once you’ve figured out what the function or variable is actually doing.

Sequential names

This is a perennial problem in the world of naming, whether we are talking about Python variables or word-processing files (how many people have a folder containing files with names like final_draft.doc, final_draft2.doc, final_draft2.1.doc, final_draft2_update.doc?) Thought of the perfect variable name, but then realized that you’ve already used it? No problem, just stick a “2” on the end of that bad boy:

dna = 'agtcgnatgc'
dna2 = dna.upper()
dna3 = dna2.replace('N', '')

Hopefully it’s not necessary to point out why this can be confusing when you come back to read the code. We can rescue the above example in a couple of ways. One is to use more descriptive names:

dna = 'agtcgnatgc'
uppercase_dna = dna.upper()
clean_dna = uppercase_dna.replace('N', '')

Another way is to recognize that we’re probably not going to use dna or uppercase_dna in our program, and just do the whole thing in one step:

clean_dna = 'agtcgnatgc'.upper().replace('N', '')

When we find ourselves needing to create a new variable name by sticking a number onto the end of an existing one it’s often a good indication that the code in question should be turned into a function. One of the great things about encapsulation using functions is that they provide a way for multiple variables with the same name to happily co-exist in a program without interfering with each other (have a look at the concept of a namespace for more on this idea.)
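
A minimal sketch of what that refactoring might look like for the DNA clean-up example above:

def clean_up(dna):
    # inside the function we can re-use the name dna freely, without
    # clashing with any other dna variable elsewhere in the program
    dna = dna.upper()
    dna = dna.replace('N', '')
    return dna

clean_dna = clean_up('agtcgnatgc')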

An even worse version of sequential names is….

Re-using names

Thought of the perfect variable name but you’ve already used it? Never mind, just overwrite it! Often this is a symptom of variable names that are too general to begin with:

# store a DNA sequence
sequence = 'ATGC'
...
# now we need to store a protein sequence
sequence = 'LSPV'

Re-using variables to store different things can make a program extremely difficult to understand. Solutions involve either using more descriptive names, which renders the above example fine:

dna_sequence = 'ATGC'
...
protein_sequence = 'LSPV'

or splitting up the code into functions where the same variable name can be re-used without fear of confusion.

Don’t confuse the idea of re-using variables for a different type of data, as in the dna/protein example above, with the idea of changing the value that’s stored in a variable, but keeping the type the same. For example, we often want to update a count, or append a character to a string:

count = count + 1
dna = dna + codon

This is fine, and doesn’t count as re-use, because the variables are still storing the same thing.

Names that only vary by case or punctuation

How many different ways can we write essentially the same variable name?

mydna
myDna
myDNA
Mydna
MyDna
MYDNA
my_dna
my_Dna
my_DNA
My_dna
My_Dna
MY_DNA

Quite apart from the Python style guidelines (all-caps names should be reserved for module-level constants), using more than one of the above in a program will lead to madness…. don’t do it!

Names copied from example code

This is a trap that’s easy to fall into when working from examples in a course or a textbook. Imagine we are looking at a piece of example code that prints the number of each of the four bases in a DNA sequence:

dna = 'ATCGTGACTGTCAT'
for base in ['A', 'T', 'G', 'C']:
    print("count for " + base + " is " + str(dna.count(base)))

Later on, we want to implement the same idea for counting the number of times each amino acid occurs in a protein sequence, so we copy and paste the example code and modify it. We replace the dna sequence with a protein sequence, and replace the bases with amino acid codes:

dna = 'MSRSLLLRFLLFLLLLPPLP'
for base in ['M', 'L', 'F']:
    print("count for " + base + " is " + str(dna.count(base)))

The program works, but the variable names are now very misleading. Using a piece of example code as a starting point for your own programs is an excellent way to learn – but be sure to go back once you’ve finished modifying it and check that the variable names still make sense.

Anything I’ve missed, or that you disagree with? Leave a comment!



Why readable, documented code is especially important for scientists (and a three-step plan for getting there)

During my most recent teaching engagement I spent some time talking specifically about code readability and documentation. As often happens, presenting these ideas to a roomful of novice programmers helped to crystallize my thoughts on the topic, and made me realize that I’d never written about it – plus I thought that it would be an ideal topic for the first post of the new year, since documentation is something that many programmers constantly resolve to do better at!

There’s no shortage of articles and book chapters explaining the general importance of documenting your code – if you learned to program from a book or an online tutorial (such as Python for Biologists) then it will almost certainly have been mentioned. The arguments in favour of documentation are well-rehearsed: it makes it easier for you to work on your own code over a long period of time, it makes it easier for others to contribute fixes and features, it forces you to think about the purpose of each section, etc. In this post, I want to talk about why documentation is particularly important for you – somebody who is using programming to carry out scientific work. The basis of my argument is that for us as scientists, code serves two important purposes over and above simply being executed: it acts as both the ultimate supplementary methods and the medium in which you express your original ideas.

Code as supplementary methods

It’s probably fair to say that most users of most programs don’t care about how they actually work, as long as they do what they’re supposed to. When you fire up an image editing program to brighten up a dull digital photo or rotate one where the horizon isn’t straight, you probably aren’t interested in exactly what transformation is being applied to the RGB values, or what trigonometry is being used to straighten the image – you’re only interested in the end result.

Scientific software is different: the end-users are often extremely interested in how the program works internally, since understanding that is a part of understanding the results. And the ultimate way to resolve questions or disagreements about what a program is doing is to examine the source code. This is a great advantage we have when working in bioinformatics. For wet-lab work, there is only so much information you can give in the pages of a journal about how an experiment was carried out. Using supplementary information can help, but even then you’re limited to what the authors thought important enough to write down. For a bioinformatics experiment, however, one can always see exactly where the data came from and what happened to it, providing one has access to the source code. You can read about a piece of bioinformatics software in a journal, listen to a talk on it, discuss it with the authors, but at the end of the day if you still have questions about how it works, you can always go back to the source code.

The vast majority of programmers don’t have to worry about their users wanting to read the source, but we do – so we should make readability and documentation a priority, to make sure that our source code is as useful as possible.

Code as a way of expressing original ideas

The vast majority of software projects don’t implement any ideas that are particularly original. This isn’t a problem, it’s just a reflection of the fact that many pieces of software do very similar things to other pieces of software, and do them in similar ways. There are fairly standard ways of writing a blog engine, a stock management program, an image manipulation program etc. We could make an argument, therefore, that for those categories of software it’s not super-important that the code is really well-documented, since it’s unlikely to be doing anything surprising, and a reader can probably work out what’s going on in each section by referring to other software that carries out the same task.

Scientific software is different. Yes, we tend to write scripts to carry out tedious everyday tasks like tweaking file formats and correcting sequence headers, but we also use it for implementing entirely new ideas about how to assemble genomes, or how to correct frameshift mutations, or how to pick species for phylogenetic analysis. We’re far more likely than other programmers to write code that does something entirely new. As such our programs (at least the ones that do something interesting) are going to be harder to understand than yet another text editor or chat program.

As programmers, we’re very lucky in that the language we use to implement our original ideas – code – is also an excellent way to communicate them to other researchers. But the usefulness of that language depends on whether we write it in a readable way and document it well.

Three steps to readable, documented code

Documentation is a skill that is learned over the course of a career, but here’s an exercise that I often have my students do. Using a framework like this can make documenting your code less daunting if you’ve no idea where to start.

Step one: make sure your variable and function names are meaningful

Programmers are fond of talking about self-documenting code – i.e. code that doesn’t require external documentation to be understood. A large part of this is using meaningful variable names. Examples of bad variable and function/method names include:

  • Single-letter names e.g. a, b, f (with the exception of variable names that follow common conventions such as x and y for co-ordinates or i for an index)
  • Names that describe the type of data rather than the contents e.g. my_list, dict
  • Names that are extremely generic e.g. process_file(), do_stuff(), my_data
  • Names that come in multiples e.g. file1, file2
  • Names that are excessively shortened e.g. gen_ref_seq_uc
  • Multiple names that are only distinguished by case or punctuation e.g. input_file and inputfile, DNA_seq and dna_seq
  • Names that are misspelled – the computer does not care about spelling but your readers might

Go through your code and look for any instances of the above, and replace them with good names. Good variable names tell us the job of the variable or function. This is also a good opportunity to replace so-called magic numbers – constants that appear in the code with no explanation – with meaningful variable names e.g. 64 might be replaced by number_of_codons.
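
Here’s a minimal before-and-after sketch of the magic number fix (the codon-frequency context is invented for illustration):

codon_count = 12
frequency = float(codon_count) / 64                 # why 64? the reader has to guess

number_of_codons = 64                               # 4 bases ** 3 positions
frequency = float(codon_count) / number_of_codons   # the intent is now explicit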

Example: we want to define two variables which hold the DNA sequence for a contig and a frame, then pass them to a method which will carry out protein translation and store the result. Here’s how not to do it, even though the code is perfectly valid Python:

a = 2
b = 'ATGCGATTGGA'
c = do_stuff(a, b)

This is much better:

frame = 2
contig_dna_seq = 'ATGCGATTGGA'
contig_protein_seq = translate(frame, contig_dna_seq)

Step two: write brief comments explaining the reasoning behind particularly important or complex statements

For most programs, it’s probably true to say that the complexity lies in a very small proportion of the code. There tends to be a lot of straightforward code concerned with parsing command-line options, opening files, getting user input, etc. The same applies to functions and methods: there are likely many statements that do things like unpacking tuples, iterating over lists, and concatenating strings. These lines of code, if you’ve followed step one above, are self-documenting – they don’t require any additional commentary to understand, so there’s no need to write comments for them.

This allows you to concentrate your documentation efforts on the few lines of code that are harder to understand – those whose purpose is not clear, or which are inherently difficult to understand. Here’s one such example – this is the start of a function for processing a DNA sequence codon-by-codon (e.g. for producing a protein translation):

for codon_start in range(0, len(dna)-2, 3):
    codon_stop = codon_start+3
    codon = dna[codon_start:codon_stop]
    ...

The first line is not trivial to understand, so we want to write a comment explaining it. Here’s an example of how not to do it:

# iterate over numbers from zero to the length of
# the dna sequence minus two in steps of three
for codon_start in range(0, len(dna)-2, 3):
    ...

The reason that this is a bad comment is that it simply restates what the code does – it doesn’t tell us why. Reading the comment leaves us no better off in knowing why the range stops at the length of the DNA sequence minus two. This is much better:

# get the start position for each codon
# stopping two bases before the end means the last start position
# always has two more bases after it, so we never get an incomplete
# codon if the length isn't a multiple of three
for codon_start in range(0, len(dna)-2, 3):
    ...

Now we can see from reading the comment that the reason for the -2 is to ensure that we don’t end up processing a codon which is only one or two bases long in the event that there are incomplete codons at the end of the DNA sequence.

Go through your code and look for lines whose function isn’t obvious just from reading them, and add explanations.

Step three: add docstrings to your functions/methods/classes/modules

Functions and methods are the way that we break up our code into discrete, logical units, so it makes sense that we should also document them as discrete, logical units. Everything in this section also applies to methods, classes and modules, but to keep things readable I’ll just refer to functions below.

Python has a very straightforward convention for documenting functions: we add a triple-quoted string at the start of the function which holds the documentation e.g.

def get_at_content(dna):
  """return the AT content of a DNA string.
     The string must be in upper case.
     The AT content is returned as a float"""
  length = len(dna)
  a_count = dna.count('A')
  t_count = dna.count('T')
  at_content = float(a_count + t_count) / length
  return at_content

This triple-quoted line is called a docstring. The advantage of including function documentation in this way as opposed to in a comment is that, because it uses a standard format, the docstring can be extracted automatically. This allows us to do useful things like automatically generate API documentation from docstrings, or provide interactive help when running the Python interpreter in a shell (take a look at the chapter on testing and documentation in Advanced Python for Biologists for an in-depth look at how this works).
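
For example, once get_at_content() has been defined as above, its documentation is available straight from the interpreter – a quick sketch:

print(get_at_content.__doc__)   # prints the raw docstring
help(get_at_content)            # displays it as formatted interactive help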

There are various conventions for writing docstrings. As a rule, a useful docstring needs to describe the order and types of the function’s arguments, and the type and meaning of its return value. It’s also helpful to mention any restrictions on the arguments (for instance, as above, that the DNA string must be in upper case). The example above is written in a very unstructured way, but because triple-quoted strings can span multiple lines, we could also adopt a more structured approach:

def get_at_content(dna):
  """return the AT content of a DNA string.

     Arguments: a string containing a DNA sequence.
                The string must be in upper case.

     Returns: the AT content as a float"""
  ...

If you think it’s helpful, you can also give examples of how to use the function in the docstring. Notice that we’re not saying anything in the docstring about how the function works. The whole point of encapsulating code into functions is that we can change the implementation without worrying about how it will affect the calling code!
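
If you do include usage examples, writing them in the style of an interactive session means that the standard doctest module can check them automatically. Here’s a sketch using a simplified version of the function above:

def get_at_content(dna):
    """Return the AT content of an upper case DNA string as a float.

    >>> get_at_content('ATGC')
    0.5
    """
    return float(dna.count('A') + dna.count('T')) / len(dna)

if __name__ == '__main__':
    import doctest
    doctest.testmod()   # runs the docstring example and checks the output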

Summary

These three steps represent the minimum amount of work that you should do on any code that you plan on keeping around for more than a few weeks, or that you plan on showing to anybody else. As always, if you have questions or suggestions, leave a comment.



What you have in common with the Wright brothers

Warning: vast historical oversimplification below in pursuit of a point :-)

Famously, the Wright brothers built and flew the first aircraft capable of sustained, powered flight in 1903. Looking at the famous photos with eyes used to seeing modern aircraft, it looks pretty airworthy:

[Image: the Wright Flyer]

There were plenty of other people working on heavier-than-air flying machines around that time, many with much more money and far more resources. So what was the key to the Wright brothers’ success? Did they invent a new type of engine? A new type of wing? Not really – their greatest invention was this:

[Image: the Wright brothers’ wind tunnel]

This unprepossessing-looking box is a wind tunnel, which the Wright brothers – realizing that it was far too time-consuming to test wing designs by building them full scale – used to test their aeronautical designs using models. The innovation that prompted their breakthrough was not an improvement to aircraft, but an improvement in the process for designing aircraft. By using a wind tunnel, they were simply able to make their mistakes faster than anyone else, and to learn from them. Others had to learn by building, and crashing, full-size aircraft.

This is far from an original observation, but I think it has some connection with programming. The story of the Wright brothers illustrates the power of rapid iterative improvement – their approach would probably be called “agile” if it were being used today. The difference between the Wright brothers and their contemporary rivals mirrors one that I often see between the different approaches my students take to writing code.

On the one hand, you have people who favour small, incremental improvements when writing a program or a function, testing each bit of code as soon as possible and uncovering bugs and mistakes early. Students who program in this way end up with programs and functions that resemble the Wright Flyer pictured above: crude and primitive, perhaps, but certainly fit-for-purpose and relatively unlikely to result in broken bones.

On the other hand, you have people who try to write an entire program or function all in one go, never testing any bit of it until the whole thing is written. Students who program in this way end up with programs and functions that resemble other products of early aviation:

[Image: a crashed early aircraft]

As the picture above attests, this is a recipe for pain.


Improving database web interfaces with userscripts

Those of us who spend most of our day working on the command line have generally got into the habit of writing small, simple scripts to solve everyday problems and annoyances. The web-browser equivalent of these everyday problem-solving scripts are userscripts – small snippets of Javascript which run on specific web pages. In this post I’m going to show a quick example of how we can use a userscript to add a missing feature to the NCBI taxonomy browser.

Two quick notes before we get started:

What is a userscript? A userscript is just a piece of Javascript plus a few bits of metadata stored as comments.

How do I use a userscript? You have to install an extension for your web browser – follow the instructions here.

The problem

When I’m in a phylogenetic frame of mind, I often find myself browsing the NCBI taxonomy. This is the database that stores taxonomic information and relationships for all organisms with sequence data in GenBank and we can use it to view a page for any taxonomic group – here’s a screenshot of a bit of the page for Endopterygota (click to embiggen):

[Screenshot: the NCBI taxonomy page for Endopterygota]

This page shows us the various taxonomic groups that belong to Endopterygota, laid out hierarchically. Clicking on the name of one of these groups will take us to the appropriate page, so we can navigate around the tree of life quite easily. We can also display some extra information on this page – checking the “Nucleotide” box above the tree and then clicking “Display” will cause the number of nucleotide records belonging to each group to be displayed after the name:

[Screenshot: the taxonomy page with nucleotide record counts displayed]

This is pretty useful – we can go straight to the list of nucleotide records for a given group by clicking the number, but we can also just use the numbers to survey the distribution of sequence data for the groups we’re looking at. For example, in the above view, there are lots of sequences for butterflies and moths, and for beetles. The trouble is that the view presented above isn’t a very intuitive way to look at the relative numbers – reading the counts and comparing them requires a fair amount of mental overhead. What would be great is if there were an option on the NCBI website to display the number of nucleotide records for each group visually – say, as a bar whose width corresponds to the number of records:

[Screenshot: the taxonomy page with counts displayed as bars]

The NCBI website doesn’t have such a feature but, as you can probably guess from the above screenshot, we are going to add it ourselves using a userscript.

The solution

Before we start coding, we can break the problem down into a few steps. First, we need to get a list of all the counts that appear on the original web page. Then, we need to figure out what the largest count is so that we can scale the bars to a sensible size. Finally, we need to replace each count with a bar of the appropriate width.

Getting the list of counts is pretty easy – if we look at the source HTML for the page we can see that each of the nucleotide counts is an anchor (a) element with the title attribute set to “Nucleotide”. We can use the JQuery library to grab the list of count elements and store it as nuc_counts:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

Calculating the maximum count is a little bit trickier. We need to take each count, strip out all the commas, turn the count into an integer, then grab the maximum value from the list of integers. I won’t spend time here going into the craziness required to get the maximum value from an array in Javascript: suffice it to say that we’ll use JQuery’s map function to turn our list of string counts into a list of integers, then find the maximum and store it in a variable called max_nuc_count:

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}));

Now for the main body of the script. We’ll iterate over our array of count elements, and for each one use JQuery to construct a new element to replace it. The new element will be a div, and we’ll need to set its width to a value that reflects the original count. To do this, we’ll take the count, multiply it by five hundred, then divide the result by the maximum count that we calculated earlier – in other words, we’ll scale all the bars so that the widest one is five hundred pixels wide. The only other tricky bit is making sure that the bar gets displayed inline with the taxon name, rather than on a line of its own – to do this, we set the “display” attribute of the bar to “inline” or “inline-block”:

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];

    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));

    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max

    // replace the original count element with the new bar
    $(count_element).replaceWith(bar);
}

So far so good: this gives us a nicely scaled set of bars and makes sure that the widest bar (i.e. the one at the top, which is the sum of all the others) fits on the screen. We could easily make the bar scale bigger or smaller by changing the 500 in the above code to something else – we could even take into account the width of the browser window if we wanted.

Finally, let’s add a couple of finishing touches. There are two things missing from the above solution: firstly, there’s no way to see the actual numbers, and secondly, there’s no way to click through to the list of records themselves. We can solve the second problem by creating an anchor element to wrap around the bar, with the target url copied from the original count. And we can solve the first problem by giving the anchor a “title” attribute which contains the original count, so that when we hover the mouse cursor over a given bar, it will display the exact number of nucleotide records. JQuery does most of the hard work here:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}));

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    anchor = $('<a></a>')
    	.attr('href', count_element.href)	// grab the nucleotide search url from the original element
    	.attr('title', count_element.text); // use the original count as a tooltip
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on it
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}

And there we have it. To turn this into a userscript, all we have to do is add a set of specially-formatted comments at the top which can be parsed by whichever browser extension we want to use. In particular, we need to specify which web pages the script should run on using a regular expression (the @match line below). Here’s the script in full:

// ==UserScript==
// @name       NCBI Taxonomy nucleotide record count barchart
// @namespace  http://pythonforbiologists.com/
// @version    0.1
// @description replace nucleotide record counts in NCBI taxonomy with bars, see http://pythonforbiologists.com/index.php/adding-features-ncbi-taxonomy/
// @match      http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi*
// @require    http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
// @copyright  2012+, You
// ==/UserScript==

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}));

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    anchor = $('<a></a>')
    	.attr('href', count_element.href)	// grab the nucleotide search url from the original element
    	.attr('title', count_element.text); // use the original count as a tooltip
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on it
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}

If you want to install this extension and try it out, I’ve added it to the userscripts.org repository – you should be able to install it by going here, once you have installed the browser extension. If you come up with any improvements to the code, or have any suggestions for other database web interface fixes or features, shout out in the comments!



The world’s worst genome assembler in six lines of Python

So, after I posted my new business cards the other day I got a comment to the effect that I should have made one with an aligner. That got me thinking about the biggest thing that I could conceivably fit on a business card, if I didn’t care about readability. So I decided that I could probably fit an incredibly bad sequence assembly program on one. Just for fun, I wrote the whole thing in a purely functional style – a total of six lambda expressions.

[Image: the six-line assembler on a business card]

Warning: the rest of this post contains discussions of horrible code, rampant abuse of Python features, and the complete opposite of all good programming practice. Read on with caution!

The API is quite straightforward: give the ah() expression a list of DNA sequences, and it will return a consensus string. Here’s an ultra-short example:

reads = ['TCCCAGTGAACCCA', 'TTCCGTGCGGTTAAG', 'GTCCCAGTGAACCCACAA', 'TGAACCCACAAAACG', 'ACCCACAAAACGTGA', 'GAACCCACAAAACGTGA', 'TCCGTGCGGTTAAGC', 'TGAACCCACAAA', 'CCGTGCGGTTAAGCGTGA', 'TGACAGTCCCAGTGAA', 'AACCCACAAAACGTGA', 'AGTGAACCCACAAAACGT', 'GTTAAGCGTGA', 'CCGTGCGGTTAAGCGTGA', 'AGCGTGACAGT', 'TGCGGTTAAGCG', 'ACAAAACGTGATG', 'ACAGTCCCAGTGAACC', 'TAAGCGTGACAGTCCCA', 'TCGAATTCCGT', 'TTCTCGAATTCCGTGCG', 'ACAAAACGTG', 'CCACAAAACGTG', 'TGCGGTTAAG', 'GAACCCACAAAACGTGA', 'TCTCGAATTCC', 'ATTCCGTGCGGTTAA', 'ACCCACAAAAC', 'CGTGCGGTTAAGCGTGA', 'CCAGTGAACCCACAA', 'TGCGGTTAAGCGTG', 'CCCACAAAACG', 'TCTCGAATTC', 'AATTCCGTGCGGTT', 'ACAGTCCCAGTGA', 'GTCCCAGTGAACCCA', 'TGAACCCACAAA', 'CCCACAAAACGTG', 'TCCCAGTGAACCCACA', 'CTCGAATTCCGTGCG']
print(ah(reads))
# TGCGGACAAAACGTGTGAACGTGAGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT

And here’s a local alignment to the sequence from which the fake reads were generated, just to prove that it’s actually doing something slightly better than picking the longest read:

 6 GAATTCCGTGCGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT     53
   |||   ||||.|||||||||||||||||||||||||||||||||||||
17 GAA---CGTGAGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT     61

The longest read is 18 bases, and the local match region is 48 bases.

How it works

Let’s start by putting each expression on its own line:

cm = lambda d,ds: max([m(d,e) for e in ds if e != d])
m = lambda d,e: max([(s(d,e,o),o,e,d) for o in range(1-len(e),len(d))])
s = lambda d,e,o: sum([1 for p in range(max(0-o,0), min([len(e)-o, len(e), len(d)-o])) if e[p] == d[p+o]])
con = lambda x,o,s,c : c[0:max(0,o)] + s +  c[len(s)+o:]
a = lambda s, o : con(*cm(s, o)) if len(o) == 1 else a(con(*cm(s, o)), [ y for y in o if y != cm(s, o)[2]])
ah = lambda d : a(d[0],d[1:])

Next we’ll convert each expression to an equivalent function:

def cm(d,ds):
    return max([m(d,e) for e in ds if e != d])

def m(d,e):
    return max([(s(d,e,o),o,e,d) for o in range(1-len(e),len(d))])

def s(d,e,o):
    return sum([1 for p in range(max(0-o,0), min([len(e)-o, len(e), len(d)-o])) if e[p] == d[p+o]])

def con(x,o,s,c):
    return c[0:max(0,o)] + s +  c[len(s)+o:]

def a(s, o):
    return con(*cm(s, o)) if len(o) == 1 else a(con(*cm(s, o)), [ y for y in o if y != cm(s, o)[2]])

def ah(d):
    return a(d[0],d[1:])

And change the names to be slightly more descriptive (this makes some lines extremely long, so you’ll have to scroll right to read them), and rearrange them slightly:

# given two sequences and an offset, count the number of matching bases
def score(sequence1,sequence2,offset):
    return sum([1 for position in range(max(0-offset,0), min([len(sequence2)-offset, len(sequence2), len(sequence1)-offset])) if sequence2[position] == sequence1[position+offset]])

# given two sequences, find the offset which gives the best score
def find_best_offset(sequence1,sequence2):
    return max([(score(sequence1,sequence2,offset),offset,sequence2,sequence1) for offset in range(1-len(sequence2),len(sequence1))])

# given a single sequence and a collection of others, find the other sequence with the best match score
def find_best_match(sequence,others):
    return max([find_best_offset(sequence,sequence2) for sequence2 in others if sequence2 != sequence])

# given two sequences and an offset, calculate the consensus
def consensus(score,offset,sequence1,sequence2):
    return sequence2[0:max(0,offset)] + sequence1 +  sequence2[len(sequence1)+offset:]

# given a sequence and collection of others, return the complete consensus using recursion
def assemble(sequence, others):
    return consensus(*find_best_match(sequence, others)) if len(others) == 1 else assemble(consensus(*find_best_match(sequence, others)), [ y for y in others if y != find_best_match(sequence, others)[2]])

# given a collection of sequences, call assemble() to start the recursion
def assemble_helper(dnas):
    return assemble(dnas[0],dnas[1:])

Now we can look at each function in more detail.

score() is the basic scoring function. It takes two sequences and an offset, and returns the number of characters that match. It’s the simplest possible function for scoring an ungapped alignment between two sequences. It works by calculating the first and last positions of the overlapping region relative to sequence2, then counting up the number of positions for which the base in sequence2 is the same as the base in sequence1 at that position plus the offset. To get everything in one expression, it uses a list comprehension to build a list of 1s for each matching position, then sums the list. Here it is written out a bit more conventionally. The only complicated thing going on here is the calculation of the start and stop positions.

def score(sequence1,sequence2,offset):
    start_of_overlap = max(0-offset,0)
    end_of_overlap = min([len(sequence2)-offset, len(sequence2), len(sequence1)-offset])
    total_score = 0
    for position in range(start_of_overlap, end_of_overlap):
        if sequence2[position] == sequence1[position+offset]:
            total_score = total_score + 1
    return total_score

find_best_offset() is the function that tries to maximize the score for a pair of sequences by trying every possible offset. It works by first calculating the range of possible offsets, then using a list comprehension to build a list of tuples, one tuple for each possible offset. Each tuple contains the score, the offset, and the two sequences – this slightly weird way of storing the results is necessary so that the information can be passed to the other functions, as we’ll see in a minute. To find the single best offset, we take advantage of the fact that in Python, sorting a list of tuples sorts them by their first element. Since the first element of each of our tuples is the score, if we simply ask for the max() of the list we get the tuple with the highest first element i.e. the one representing the best score. Here’s the sensible version:

def find_best_offset(sequence1,sequence2):
    lowest_offset = 1-len(sequence2)
    highest_offset = len(sequence1)
    all_offsets = []
    for offset in range(lowest_offset,highest_offset):
        # add the 4-tuple for this offset
        all_offsets.append((score(sequence1,sequence2,offset),offset,sequence2,sequence1))
    return max(all_offsets)
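
If the tuple comparison trick is unfamiliar, here’s a tiny standalone demonstration (the values are made up):

# tuples are compared element by element, starting with the first,
# so max() picks the tuple with the highest score
offsets = [(3, -2, 'GTT', 'TTAG'), (7, 1, 'GTT', 'TTAG'), (5, 0, 'GTT', 'TTAG')]
print(max(offsets))   # (7, 1, 'GTT', 'TTAG')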

find_best_match() is probably the most straightforward function of the bunch. Given a single sequence and a list of other sequences, it finds the other sequence that has the best match by calling find_best_offset() for each of them in turn. It uses the same tuple-sorting trick as before to figure out which match is the best:

def find_best_match(sequence,others):
    all_matches = []
    for sequence2 in others:
        if sequence2 != sequence:
            all_matches.append(find_best_offset(sequence,sequence2))
    return max(all_matches)

The consensus() function gave me quite a bit of trouble. Its job is to take two sequences plus a given offset, and return the consensus sequence of the two. Of course, it doesn’t do anything like what we normally mean by consensus – it simply concatenates the relevant bits of the two sequences to make a longer one. The logic behind how it works is a little bit hard to follow. We construct the consensus sequence by taking the full length of sequence1, and sticking any left-hand overhang from sequence2 on the left end and any right-hand overhang from sequence2 on the right end. In other words, you should read the return line as “return any bits of sequence2 that stick out to the left, followed by the whole of sequence1, followed by any bits of sequence2 that stick out to the right”. For most overlapping pairs of sequences, either the first or last bit of the returned string will be zero-length, which is why the thing works as a single expression in the compact version.

def consensus(score,offset,sequence1,sequence2):
    sequence2_left_overhang = sequence2[0:max(0,offset)]
    sequence2_right_overhang = sequence2[len(sequence1)+offset:]
    return sequence2_left_overhang + sequence1 + sequence2_right_overhang

The assemble() function is probably the most complicated (and certainly the most inefficient). I cheated a little bit to get it onto a single line by using the ternary operator “x if y else z”. It’s a recursive function that takes a single sequence and a collection of other sequences. It finds the best match for the sequence among the others and calculates the consensus of the sequence and the best-matching other. If that’s the only member of others (i.e. the others list has just one element) it simply returns the consensus. If the others list has more than one element, it removes the best-matching one and calls itself recursively with the newly-built consensus as the single sequence. Here it is expanded:

def assemble(sequence, others):
    # remember, best_matching_other is a 4-tuple
    best_matching_other = find_best_match(sequence, others)
    # the * expands the elements of the tuple so we can use them as arguments to consensus()
    consensus_sequence = consensus(*best_matching_other)
    if len(others) == 1:
        return consensus_sequence
    else:
        # get the third element of the best_matching_other tuple, which is the best-matching sequence
        best_matching_sequence = best_matching_other[2]
        others.remove(best_matching_sequence)
        return assemble(consensus_sequence, others)

assemble_helper() is, as the name suggests, just a helper function which kicks off the recursion by calling assemble() with the first element as the single sequence and the remainder of the elements as the list of other sequences.

Let’s sum up the algorithm then (described iteratively, even though it’s written recursively). To assemble a list of N DNA sequences, we take the first sequence, and find the member of the remaining N-1 sequences which has the best match. We remove this best-matching member from the list (leaving N-2 sequences) and calculate the consensus of these two sequences. We then append the newly-built consensus onto the end of the list (bringing the sequence count back up to N-1), then go back to the start and begin again. Hopefully it’s clear that, since the number of sequences in the list decreases by one in each iteration, we will eventually end up with a list of just a single sequence, which is our result.

A discussion of the performance of this algorithm is both beyond the scope of this post, and entirely unnecessary. Suffice it to say that it has horrible performance in terms of both computation time and results! At around 430 characters I think that the compact version is pushing the limits of what can fit on a business card in 12 point text. If anyone can think of a way to squeeze a few characters, please let me know in the comments!



New business cards!

I’ve been meaning for a while to get round to making some business cards to hand out to folks who ask me about learning to program. Normally I just tell people to google “python for biologists” and they’ll end up in the right place, but it would be nice to have a physical reminder to give out. At first I thought about having some USB memory stick business cards made – there are some really cool ones that are the shape of a normal business card but can fold in half to reveal a set of USB contacts. Unfortunately they’re way expensive, and the minimum order is far more than I need.

Next I thought about making a “cheat sheet” style business card – the type with contact information on the front and some useful quick-reference information (e.g. a list of regular expression characters) on the back. I guess the idea would be that the recipient is more likely to hang onto the card if it has useful information on it. But I couldn’t think of anything that would fit in well with my website – after all, the emphasis of pythonforbiologists is on learning to program, not simply the practice of programming itself.

Finally, I had an idea; I would put a tiny biology-themed programming exercise on the back of each of my business cards, along with a link to a web page giving the solution.

[Image: business cards]

This would hopefully mean that when somebody gets hold of one of my cards they can see straight away what kind of material and training I provide, and can head over to the website for more information. I wrote five different nano-exercises on five different biological topics:

  • parsing FASTQ file format
  • counting the number of occurrences of short motifs in DNA sequences
  • calculating AT content using a sliding window
  • generating the reverse complement of a DNA sequence
  • calculating restriction fragment lengths

Fitting the sample code onto the business cards was quite difficult. I wanted to make sure that the code would be readable and not too hard to understand – I even found room for a few comments – but it also had to be very concise. I only had about ten lines to work with, so I had to use very short variable names.

[Image: the back of a business card, showing a code sample]

You can see images of all the reverse sides at this link.

After I’d designed the code samples and exercises I wrote web pages for each of the solutions. I decided to put the exercise description and the link to the solution pages on the front of the card, as I’d used up all the room on the back with the code samples.

[Image: the front of a business card, showing the exercise description]

I tried to make each solution page interesting to read. As well as giving an answer to the exercise, I included extra material about useful bits of the Python language that some people don’t know about. For example, in the solution page to the FASTQ parser exercise I talked about generator functions, and in the sliding window exercise solution I talked about higher-order functions.

You can browse all of the exercises along with links to their solution pages here. Comments appreciated!

