Author Archive | martin

Randomly sampling reads from a FASTQ file

Attention readers: this article is about how to write a Python program to randomly sample reads from a FASTQ file. If you just want to run the program, save it from this link and run it with -h to view usage. Alternatively, use one of the many other tools which perform this job, and were probably not written in an afternoon as an example. 

If you’re interested in how to write such a program yourself, then read on…

A common task in bioinformatics is to take a data file – here we’re looking at next-generation sequencing reads in FASTQ format – and generate random samples from it. Let’s say we want to sample 10% of the reads from a given file. Here’s one way of doing it – we can take advantage of the fact that each FASTQ record is exactly four lines long and grab four lines at a time, and only write them to the output if the number of records we’ve seen is an exact multiple of 10:

record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number % 10 == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This works pretty well – on a test file containing 10 million FASTQ records it runs in about two minutes on my computer, with no particular attempt at optimization. An obvious problem with this code is that we’re going to get exactly the same 10% of reads every time we run it, and for some types of analysis we’d like to be able to generate a bunch of random 10% samples. To do this we need to change the logic of our code a bit; rather than looking at the record number to decide whether or not to write a given read to the output file, we’ll pick a random number between 0 and 9 and write the file if it’s equal to zero:

import random

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(0,10) == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This actually slows things down quite a bit – this version of the code takes twice as long to run, as generating random numbers is relatively computationally expensive.

This version lets us easily extend the idea for any sample size – for each record we will pick a number between 1 and 100 and if it’s less than our sample percent, then we write the record to the output file:

import random

percent = 35
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This will do nicely for when we want to sample a given proportion of reads from our file, but what about when we need to sample a given number of reads? The easiest thing to do is probably to add a pre-processing step where we count the number of lines in the file and divide by four to get the number of records, then turn that into a percentage to use in our existing code:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

percent = (number_to_sample / total_records) * 100
print("sampling " + str(percent) + " % of records")

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This code works but has a pretty severe limitation; because random.randrange() only returns integers it can’t deal correctly with situations where the correct percentage is not an integer. To fix this, we could express the proportion of reads to keep as a decimal and switch from using randrange() to random(), which returns a floating point number. However, this might still run into problems with floating-point accuracy. A neater way to do it might be to use random.sample() to pick which records to keep after counting the total number, then just compare each record to that list. Storing the list of record numbers to keep as a set allows for rapid lookup:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

records_to_keep = set(random.sample(xrange(total_records + 1), number_to_sample))
record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number in records_to_keep:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This version of the code works to sample any number of records, and as a bonus is much quicker (at the expense of memory) since it only has to generate one random number per output sequence, rather than one random number per input sequence as in the earlier versions.

If we’re planning on using this program to create multiple samples of records from a single file, then there’s one more refinement we might want to make. Rather than simply running the program multiple times, we can create a bunch of output files in a single pass, meaning that we only have to count the number of lines in the input file once, and iterate over the input file once. Here we use a list to store the output file objects, and another list to store the sets of selected records for each output file:

from __future__ import division
import random

number_to_sample = 3000000
number_of_replicates = 10

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

output_files = []
output_sequence_sets = []
for i in range(number_of_replicates):
    output_files.append(open("sample.fastq." + str(i), "w"))
    output_sequence_sets.append(set(random.sample(xrange(total_records + 1), number_to_sample)))

record_number = 0
with open("test.fastq") as input:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            for i, output in enumerate(output_files):
                if record_number in output_sequence_sets[i]:
                        output.write(line1)
                        output.write(line2)
                        output.write(line3)
                        output.write(line4)
                record_number += 1

for output in output_files:
    output.close()

Here’s a link to a slight tidied-up version of the code with a minimal user interface to make running it a bit more pleasant. Interested in learning how to write tools like this yourself? Check out the Python for Biologists books.

Update: see the very informative comment below by Andrew Dalke, who suggests a number of changes to improve performance and fix stylistic things. The most important of these is probably to replace the list comprehension:

num_lines = sum([1 for line in input])

with a generator expression, which we can do just by changing the square brackets to round ones:

num_lines = sum((1 for line in input))

This greatly reduces the memory footprint of the program, since it no longer has to store a list of 1’s in memory but rather sums them as it goes along.

3

Why readable, documented code is especially important for scientists (and a three-step plan for getting there)

During my most recent teaching engagement I spent some time talking specifically about code readability and documentation. As often happens, presenting these ideas to a roomful of novice programmers helped to crystallize my thoughts on the topic, and made me realize that I’d never written about it – plus I thought that it would be an ideal topic for first post of the new year, since documentation is something that many programmers constantly resolve to do better at!

There’s no shortage of articles and book chapters explaining the general importance of documenting your code – if you learned to program from a book or an online tutorial (such as Python for Biologists) then it will almost certainly have been mentioned. The arguments in favour of documentation are well-rehearsed: it makes it easier for you to work on your own code over a long period of time, it makes it easier for others to contribute fixes and features, it forces you to think about the purpose of each section, etc. In this post, I want to talk about why documentation is particularly important for you – somebody who is using programming for carrying out scientific work. The basis of my argument is that for us as scientists, code serves two important features over and above simply being executed: it acts as both the ultimate supplementary methods, and it’s the way in which you express your original ideas.

Code as supplementary methods

It’s probably fair to say that most users of most programs don’t care about how they actually work, as long as they do what they’re supposed to. When you fire up an image editing program to brighten up a dull digital photo or rotate one where the horizon isn’t straight, you probably aren’t interested in exactly what transformation is being applied to the RGB values, or what trigonometry is being used to straighten the image – you’re only interested in the end result.

Scientific software is different: the end-users are often extremely interested in how the program works internally, since understanding that is a part of understanding the results. And the ultimate way to resolve questions or disagreements about what a program is doing is to examine the source code. This is a great advantage we have when working in bioinformatics. For wet-lab work, there is only so much information you can give in the pages of a journal about how an experiment was carried out. Using supplementary information can help, but even then you’re limited to what the authors thought important enough to write down. For a bioinformatics experiment, however, one can always see exactly where the data came from and what happened to it, providing one has access to the source code. You can read about a piece of bioinformatics software in a journal, listen to a talk on it, discuss it with the authors, but at the end of the day if you still have questions about how it works, you can always go back to the source code.

The vast majority of programmers don’t have to worry about their users wanting to read the source, but we do – so we should make readability and documentation a priority to make sure that it’s as useful as possible.

Code as a way of expressing original ideas

The vast majority of software projects don’t implement any ideas that are particularly original. This isn’t a problem, it’s just a reflection of the fact that many pieces of software do very similar things to other pieces of software, and do them in similar ways. There are fairly standard ways of writing a blog engine, a stock management program, an image manipulation program etc. We could make an argument, therefore, that for those categories of software it’s not super-important that the code is really well-documented, since it’s unlikely to be doing anything surprising, and a reader can probably work out what’s going on in each section by referring to other software that carries out the same task.

Scientific software is different. Yes, we tend to write scripts to carry out tedious everyday tasks like tweaking file formats and correcting sequence headers, but we also use it for implementing entirely new ideas about how to assemble genomes, or how to correct frameshift mutations, or how to pick species for phylogenetic analysis. We’re far more likely than other programmers to write code that does something entirely new. As such our programs (at least the ones that do something interesting) are going to be harder to understand than yet another text editor or chat program.

As programmers, we’re very lucky in that the language we use to implement our original ideas – code – is also an excellent way to communicate them to other researchers. But the usefulness of that language depends on whether we write it in a readable way and document it well.

Three steps to readable, documented code

Documentation is a skill that is learned over the course of a career, but here’s an exercise that I often have my students do. Using a framework like this can make documenting your code less daunting if you’ve no idea where to start.

Step one: make sure your variable and function names are meaningful

Programmers are fond of talking about self-documenting code – i.e. code that doesn’t require external documentation to be understood. A large part of this is using meaningful variable names. Examples of bad variable and function/method names include:

  • Single-letter names e.g. a, b, f (with the exception of variable names that follow common conventions such as x and y for co-ordinates or i for an index)
  • Names that describe the type of data rather than the contents e.g. my_list, dict
  • Names that are extremely generic e.g. process_file(), do_stuff(), my_data
  • Names that come in multiples e.g. file1, file2
  • Names that are excessively shortened e.g. gen_ref_seq_uc
  • Multiple names that are only distinguished by case or punctuation e.g. input_file and inputfile, DNA_seq and dna_seq
  • Names that are misspelled – the computer does not care about spelling but your readers might

Go through your code and look for any instances of the above, and replace them with good names. Good variable names tell us the job of the variable or function. This is also a good opportunity to replace so-called magic numbers – constants that appear in the code with no explanation – with meaningful variable names e.g. 64 might be replaced by number_of_codons.

Example: we want to define two variables which hold the DNA sequence for a contig and a frame, then pass them to a method which will carry out protein translation and store the result. Here’s how not to do it, even though the code is perfectly valid Python:

a = 2
b = 'ATGCGATTGGA'
c = do_stuff(a, b)

This is much better:

frame = 2
contig_dna_seq = 'ATGCGATTGGA'
contig_protein_seq = translate(frame, contig_dna_seq)

Step two: write brief comments explaining the reasoning behind particularly important or complex statements

For most programs, it’s probably true to say that the complexity lies in a very small proportion of the code. There tends to be a lot of straightforward code concerned with parsing command-line options, opening files, getting user input, etc. The same applies to functions and methods: there are likely many statements that do things like unpacking tuples, iterating over lists, and concatenating strings. These lines of code, if you’ve followed step one above, are self-documenting – they don’t require any additional commentary to understand, so there’s no need to write comments for them.

This allows you to concentrate your documentation efforts on the few lines of code that are harder to understand – those whose purpose is not clear, or which are inherently difficult to understand. Here’s one such example – this is the start of a function for processing a DNA sequence codon-by-codon (e.g. for producing a protein translation):

for codon_start in range(0, len(dna)-2, 3):
codon_stop = codon_start+3
codon = dna[codon_start:codon_stop]
    ...

The first line is not trivial to understand, so we want to write a comment explaining it. Here’s an example of how not to do it:

# iterate over numbers from zero to the length of
# the dna sequence minus two in steps of three
for codon_start in range(0, len(dna)-2, 3):
...

The reason that this is a bad comment is that it simply restates what the code does – it doesn’t tell us why. Reading the comment leaves us no better off in knowing why the last start position is the length of the DNA sequence minus two. This is much better:

# get the start position for each codon
# the final codon starts two bases before the end of the sequence
# so we don't get an incomplete codon if the length isn't a multiple of three
for codon_start in range(0, len(dna)-2, 3):
...

Now we can see from reading the comment that the reason for the -2 is to ensure that we don’t end up processing a codon which is only one or two bases long in the event that there are incomplete codons at the end of the DNA sequence.

Go through your code and look for lines whose function isn’t obvious just from reading them, and add explanations

Step three: add docstrings to your functions/methods/classes/modules

Functions and methods are the way that we break up our code into discrete, logical units, so it makes sense that we should also document them as discrete, logical units. Everything in this section also applies to methods, classes and modules, but it keep things readable I’ll just refer to functions below.

Python has a very straightforward convention for documenting functions: we add a triple-quoted string at the start of the function which holds the documentation e.g.

def get_at_content(dna):
  """return the AT content of a DNA string.
     The string must be in upper case.
     The AT content is returned as a float"""
  length = len(dna)
  a_count = dna.count('A')
  t_count = dna.count('T')
  at_content = float(a_count + t_count) / length
  return at_content

This triple-quoted line is called a docstring. The advantage of including function documentation in this way as opposed to in a comment is that, because it uses a standard format, the docstring can be extracted automatically. This allows us to do useful things like automatically generate API documentation from docstrings, or provide interactive help when running the Python interpreter in a shell (take a look at the chapter on testing and documentation in Advanced Python for Biologists for an in-depth look at how this works).

There are various different conventions for writing docstrings. As a rule, useful docstrings need to describe the order and types of the function arguments and the description and type of the return value. It’s also helpful to mention any restrictions on the argument (for instance, as above, that the DNA string must be in upper case). The example above is written in a very unstructured way, but because triple-quoted strings can span multiple lines, we could also adopt a more structured approach:

def get_at_content(dna):
  """return the AT content of a DNA string.

     Arguments: a string containing a DNA sequence.
                The string must be in upper case.

     Returns: the AT content as a float"""
  ...

If you think it’s helpful, you can also give examples of how to use the function in the docstring. Notice that we’re not saying anything in the docstring about how the function works. The whole point of encapsulating code into functions is that we can change the implementation without worrying about how it will affect the calling code!

Summary

These three steps represent the minimum amount of work that you should do on any code that you plan on keeping around for more than a few weeks, or that you plan on showing to anybody else. As always, if you have questions or suggestions, leave a comment.

1

Powered by WordPress. Designed by Woo Themes