
A new programming book for non-biologists

Since writing and publishing Python for Biologists I occasionally get email from folks who found the books useful, and want to pass them on to non-scientist friends. This is, of course, very gratifying to hear, but the problem is that the examples and exercises won’t make much sense to anyone without a biology background.

With this in mind, here’s my first programming book for non-scientists:

[Book cover image]

As the title suggests, it’s suitable for anyone just getting started with programming, and assumes no background in biology or even science. The only things it shares with Python for Biologists are some of the text and a friendly, approachable style. If you have friends outside biology who are interested in programming, please send them along!


Five things I hate about teaching Python

I’ve just finished teaching an intensive Python programming course and, as usual, spending a week thinking about how best to introduce my students to programming has given me something to write about. I realised that, while I’ve spent a lot of time talking about why Python is a great language, I have a number of pet peeves that I’ve never written down.

I’m not talking about the usual problems, like Python’s relative lack of performance or lack of compile-time type checking – these things are deliberate design trade-offs and changing them would involve making Python not-Python. I’m talking about the small things that cause friction, especially in a teaching environment.

Note: I realise that there are good reasons for all these things to be the way they are, so don’t take this too seriously…

1. Floating point vs. integer division

Anyone who’s written in Python for any length of time probably types this line automatically without really thinking about it:

from __future__ import division

but take a moment to consider how you would explain what’s going on in this piece of code to a beginner. In order to really understand what’s happening here, you have to know about:

  • Python’s system for importing modules
  • Python’s system for grouping modules into packages
  • the fact that there are different versions of Python with slightly different behaviour
  • the difference between floating-point and integer numbers
  • the mechanisms of operator overloading, whereby we can define the behaviour of things like + and / for different types
  • the concept of polymorphic functions and operators, which allow us to treat different classes the same, some of the time

Explaining all this to someone who has never written a line of code before is unlikely to be productive, but none of the alternatives are particularly attractive either. We can just present this as a magic piece of code and save the explanation for later (this is normally what I do). We can instruct students to use explicit floating point numbers:

answer = float(4)/3
answer = 4.0/3

but eventually they will forget, use integers, and find that it works only some of the time. We can carefully craft our examples and exercises to avoid the need for floating point division, but this is setting students up for pain further down the line. We can use the command-line option -Qnew to force true division, or just use Python 3 for teaching, but both of these options will cause confusion once the student goes back to their own environment.
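
For reference, here’s a minimal sketch of the behaviour in question: in Python 2, 4 / 3 quietly gives 1 unless the __future__ import (which has to sit at the top of the module) is in effect.

from __future__ import division  # must appear at the top of the module

print(4 / 3)    # 1.333... with the import; plain Python 2 prints 1
print(4.0 / 3)  # 1.333... either way, because one operand is a float
print(4 // 3)   # 1 - floor division is still available via //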

2. split() vs. join()

“OK class, this is how we take a string and split it up into a list of strings using a fixed delimiter:”

sentence = "The all-England summarize Proust competition"
words = sentence.split(" ")

“So I guess, logically, to put the words back together again we just say:

sentence = words.join(" ")

right? Look at that elegant symmetry…… Wait a minute, you’re telling me it doesn’t work like that? The list and the delimiter actually go the other way around, so that we have to write this ugly line?

sentence = " ".join(words)

Wow, that just looks wrong.”

Yes, I know that there are good reasons for collection classes to only have methods that are type-agnostic, but would it really be so bad to just str() everything?
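
For what it’s worth, the standard workaround when the list holds non-strings is to convert each element first, which is a little clunky:

numbers = [1, 2, 3, 4]
# " ".join(numbers) raises a TypeError because the elements aren't strings
line = " ".join(str(n) for n in numbers)
print(line)  # 1 2 3 4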

3. Exhaustible files

It’s perfectly logical that you shouldn’t be able to iterate through a file object twice without re-opening it… once you know a fair bit about how iteration is actually implemented in Python. As a beginner, though, it’s a bit like Python is giving with one hand and taking away with the other – you can use an opened file object just like a list, except in this one specific but very important way:

my_list = [1,2,3,4]
for number in my_list:
    do_something(number)
# second loop works just as you'd expect
for number in my_list:
    do_something_else(number)

my_file = open("some.input")
for line in my_file:
    do_something(line)
# second loop silently never runs
for line in my_file:
    do_something_else(line)

This problem also rears its ugly head when students try to iterate over a file having already consumed its contents using read():

my_file = open("some.input")
my_contents = my_file.read()
....
# this loop silently never runs
for line in my_file:
    do_something(line)

That second line can be difficult to spot for student and teacher alike when there are many intervening lines between it and the loop.
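
The usual fixes are to rewind the file with seek() or to read the lines into a real list up front. A quick sketch, reusing do_something() from the examples above:

my_file = open("some.input")
my_contents = my_file.read()

# option 1: rewind to the start of the file before looping again
my_file.seek(0)
for line in my_file:
    do_something(line)

# option 2: readlines() gives a list, which can be looped over repeatedly
lines = open("some.input").readlines()
for line in lines:
    do_something(line)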

4. Lambda expressions

OK, this one is more annoying when writing code than when teaching it, since I rarely get round to talking about functional programming in introductory courses. I totally get why there should be a big, obvious flag when we are doing something clever (which lambda expressions generally are). Nevertheless, it seems a shame to have a style of coding that lends itself to elegant brevity marred by so many unnecessary keystrokes.

I think that the reason this bugs me so much is that I first got into functional programming by way of Groovy, which has (to me) a very pleasing syntax for anonymous functions (actually closures):

{x,y -> x**y}

compared to Python:

lambda x,y : x**y

Of course, Python lessens the sting of having to type lambda with its various comprehensions:

squares = map(lambda x : x**2, range(10))
squares = [x**2 for x in range(10)]

so I can’t complain too loudly.

5. Variables aren’t declared

It’s just way too easy for beginners to make a typo that brings their progress to a screeching halt. Consider this real-life example from my most recent course:

positions = [0]
for pos in [12,54,76,103]:
    postions  = positions + [pos]
print(positions) # prints [0] rather than [0,12,54,76,103]

Leaving aside the fact that this particular example could have been salvaged by using positions.append(), it took way too long for us to track down the typo. In real-life code, this is the kind of thing that would ideally be caught by unit testing. This is one (rare!) case in which I pine for the old days of teaching Perl – use strict and my would have taken care of this type of problem.
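
For the record, here’s the append() version; a typo in the variable name would now raise a NameError straight away instead of silently creating a new list:

positions = [0]
for pos in [12, 54, 76, 103]:
    positions.append(pos)  # mistyping "postions" here would fail loudly
print(positions)  # [0, 12, 54, 76, 103]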


Randomly sampling reads from a FASTQ file

Attention readers: this article is about how to write a Python program to randomly sample reads from a FASTQ file. If you just want to run the program, save it from this link and run it with -h to view usage. Alternatively, use one of the many other tools which perform this job, and were probably not written in an afternoon as an example. 

If you’re interested in how to write such a program yourself, then read on…

A common task in bioinformatics is to take a data file – here we’re looking at next-generation sequencing reads in FASTQ format – and generate random samples from it. Let’s say we want to sample 10% of the reads from a given file. Here’s one way of doing it – we can take advantage of the fact that each FASTQ record is exactly four lines long and grab four lines at a time, and only write them to the output if the number of records we’ve seen is an exact multiple of 10:

record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number % 10 == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This works pretty well – on a test file containing 10 million FASTQ records it runs in about two minutes on my computer, with no particular attempt at optimization. An obvious problem with this code is that we’re going to get exactly the same 10% of reads every time we run it, and for some types of analysis we’d like to be able to generate a bunch of random 10% samples. To do this we need to change the logic of our code a bit; rather than looking at the record number to decide whether or not to write a given read to the output file, we’ll pick a random number between 0 and 9 and write the record if it’s equal to zero:

import random

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(0,10) == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This actually slows things down quite a bit – this version of the code takes twice as long to run, as generating random numbers is relatively computationally expensive.

This version lets us easily extend the idea to any sample size – for each record we pick a number between 1 and 100, and if it’s less than or equal to our sample percentage then we write the record to the output file:

import random

percent = 35
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This will do nicely for when we want to sample a given proportion of reads from our file, but what about when we need to sample a given number of reads? The easiest thing to do is probably to add a pre-processing step where we count the number of lines in the file and divide by four to get the number of records, then turn that into a percentage to use in our existing code:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

percent = (number_to_sample / total_records) * 100
print("sampling " + str(percent) + " % of records")

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This code works but has a pretty severe limitation; because random.randrange() only returns integers it can’t deal correctly with situations where the correct percentage is not an integer. To fix this, we could express the proportion of reads to keep as a decimal and switch from using randrange() to random(), which returns a floating point number. However, this might still run into problems with floating-point accuracy. A neater way to do it might be to use random.sample() to pick which records to keep after counting the total number, then just compare each record to that list. Storing the list of record numbers to keep as a set allows for rapid lookup:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

records_to_keep = set(random.sample(xrange(total_records), number_to_sample))
record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number in records_to_keep:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This version of the code works to sample any number of records, and as a bonus is much quicker (at the expense of memory) since it only has to generate one random number per output sequence, rather than one random number per input sequence as in the earlier versions.
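
As an aside, the simpler floating-point approach mentioned above would look something like this. It’s just a sketch (proportion_to_sample is an illustrative name); random.random() returns a float between 0 and 1, so comparing it to the desired proportion does the job:

import random

proportion_to_sample = 0.357  # e.g. keep roughly 35.7% of reads

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            # keep this record with probability proportion_to_sample
            if random.random() < proportion_to_sample:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)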

Returning to the set-based version: if we’re planning on using this program to create multiple samples of records from a single file, then there’s one more refinement we might want to make. Rather than simply running the program multiple times, we can create a bunch of output files in a single pass, meaning that we only have to count the number of lines in the input file once, and iterate over the input file once. Here we use a list to store the output file objects, and another list to store the sets of selected records for each output file:

from __future__ import division
import random

number_to_sample = 3000000
number_of_replicates = 10

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

output_files = []
output_sequence_sets = []
for i in range(number_of_replicates):
    output_files.append(open("sample.fastq." + str(i), "w"))
    output_sequence_sets.append(set(random.sample(xrange(total_records), number_to_sample)))

record_number = 0
with open("test.fastq") as input:
    for line1 in input:
        line2 = input.next()
        line3 = input.next()
        line4 = input.next()
        for i, output in enumerate(output_files):
            if record_number in output_sequence_sets[i]:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)
        record_number += 1

for output in output_files:
    output.close()

Here’s a link to a slightly tidied-up version of the code with a minimal user interface to make running it a bit more pleasant. Interested in learning how to write tools like this yourself? Check out the Python for Biologists books.

Update: see the very informative comment below by Andrew Dalke, who suggests a number of changes to improve performance and fix stylistic things. The most important of these is probably to replace the list comprehension:

num_lines = sum([1 for line in input])

with a generator expression, which we can do just by changing the square brackets to round ones:

num_lines = sum((1 for line in input))

This greatly reduces the memory footprint of the program, since it no longer has to store a list of 1’s in memory but rather sums them as it goes along.
