Randomly sampling reads from a FASTQ file

Attention readers: this article is about how to write a Python program to randomly sample reads from a FASTQ file. If you just want to run the program, save it from this link and run it with -h to view usage. Alternatively, use one of the many other tools which perform this job, and were probably not written in an afternoon as an example. 

If you're interested in how to write such a program yourself, then read on…

A common task in bioinformatics is to take a data file – here we're looking at next-generation sequencing reads in FASTQ format – and generate random samples from it. Let's say we want to sample 10% of the reads from a given file. Here's one way of doing it – we can take advantage of the fact that each FASTQ record is exactly four lines long and grab four lines at a time, and only write them to the output if the number of records we've seen is an exact multiple of 10:

record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number % 10 == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This works pretty well – on a test file containing 10 million FASTQ records it runs in about two minutes on my computer, with no particular attempt at optimization. An obvious problem with this code is that we're going to get exactly the same 10% of reads every time we run it, and for some types of analysis we'd like to be able to generate a bunch of random 10% samples. To do this we need to change the logic of our code a bit; rather than looking at the record number to decide whether or not to write a given read to the output file, we'll pick a random number between 0 and 9 and write the file if it's equal to zero:

import random

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(0,10) == 0:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This actually slows things down quite a bit – this version of the code takes twice as long to run, as generating random numbers is relatively computationally expensive.

This version lets us easily extend the idea for any sample size – for each record we will pick a number between 1 and 100 and if it's less than our sample percent, then we write the record to the output file:

import random

percent = 35
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) <= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This will do nicely for when we want to sample a given proportion of reads from our file, but what about when we need to sample a given number of reads? The easiest thing to do is probably to add a pre-processing step where we count the number of lines in the file and divide by four to get the number of records, then turn that into a percentage to use in our existing code:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

percent = (number_to_sample / total_records) * 100
print("sampling " + str(percent) + " % of records")

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if random.randrange(1,101) &lt;= percent:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)

This code works but has a pretty severe limitation; because random.randrange() only returns integers it can't deal correctly with situations where the correct percentage is not an integer. To fix this, we could express the proportion of reads to keep as a decimal and switch from using randrange() to random(), which returns a floating point number. However, this might still run into problems with floating point accuracy. A neater way to do it might be to use random.sample() to pick which records to keep after counting the total number, then just compare each record to that list. Storing the list of record numbers to keep as a set allows for rapid lookup:

from __future__ import division
import random

number_to_sample = 3000000

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

records_to_keep = set(random.sample(xrange(total_records + 1), number_to_sample))
record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            if record_number in records_to_keep:
                    output.write(line1)
                    output.write(line2)
                    output.write(line3)
                    output.write(line4)
            record_number += 1

This version of the code works to sample any number of records, and as a bonus is much quicker (at the expense of memory) since it only has to generate one random number per output sequence, rather than one random number per input sequence as in the earlier versions.

If we're planning on using this program to create multiple samples of records from a single file, then there's one more refinement we might want to make. Rather than simply running the program multiple times, we can create a bunch of output files in a single pass, meaning that we only have to count the number of lines in the input file once, and iterate over the input file once. Here we use a list to store the output file objects, and another list to store the sets of selected records for each output file:

from __future__ import division
import random

number_to_sample = 3000000
number_of_replicates = 10

with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

output_files = []
output_sequence_sets = []
for i in range(number_of_replicates):
    output_files.append(open("sample.fastq." + str(i), "w"))
    output_sequence_sets.append(set(random.sample(xrange(total_records + 1), number_to_sample)))

record_number = 0
with open("test.fastq") as input:
        for line1 in input:
            line2 = input.next()
            line3 = input.next()
            line4 = input.next()
            for i, output in enumerate(output_files):
                if record_number in output_sequence_sets[i]:
                        output.write(line1)
                        output.write(line2)
                        output.write(line3)
                        output.write(line4)
                record_number += 1

for output in output_files:
    output.close()

Here's a link to a slight tidied-up version of the code with a minimal user interface to make running it a bit more pleasant. Interested in learning how to write tools like this yourself? Check out the books page.