Archive | Programming

What you have in common with the Wright brothers

Warning: vast historical oversimplification below in pursuit of a point 🙂

Famously, the Wright brothers built and flew the first aircraft capable of sustained, powered flight in 1903. Looking at the famous photos with eyes used to seeing modern aircraft, it looks pretty airworthy:

220px-1904WrightFlyer

 

There were plenty of other people working on heavier-than-air flying machines around that time, many with much more money and far more resources. So what was the key to the Wright brothers’ success? Did they invent a new type of engine? A new type of wing? Not really – their greatest invention was this:

wright_tunnel

This unprepossessing-looking box is a wind tunnel, which the Wright brothers – realizing that it was far too time-consuming to test wing designs by building them full scale – used to test their aeronautical designs using models. The innovation that prompted their break-through was not an improvement to aircraft, but an improvement in the process for designing aircraft. By using a wind tunnel, they were simply able to make their mistakes faster than anyone else, and to learn from them. Others had to learn by building, and crashing, full-size aircraft.

This is far from an original observation, but I think it has some connection with programming. The story of the Wright brothers illustrates the power of rapid iterative improvement – their approach would probably be called “agile” if it were being used today. The difference between the Wright brothers and their contemporary rivals mirrors one that I often see between the different approaches to writing code I see being used by my students.

On the one hand, you have people who favour small, incremental improvements when writing a program or a function, testing each bit of code as soon as possible and uncovering bugs and mistakes early. Students who program in this way end up with programs and functions that resemble the Wright Flyer pictured above: crude and primitive, perhaps, but certainly fit-for-purpose and relatively unlikely to result in broken bones.

On the other hand, you have people who try to write an entire program or function all in one go, never testing any bit of it until the whole thing is written. Students who program in this way end up with programs and function that resemble other products of early aviation:

images

 

As the picture above attests, this is a recipe for pain.

 

0

Improving database web interfaces with userscripts

Those of us who spend most of our day working on the command line have generally got into the habit of writing small, simple scripts to solve everyday problems and annoyances. The web-browser equivalent of these everyday problem-solving scripts are userscripts – small snippets of Javascript which run on specific web pages. In this post I’m going to show a quick example of how we can use a userscript to add a missing feature to the NCBI taxonomy browser.

Two quick notes before we get started…..

What is a userscript? A userscript is just a piece of Javascript plus a few bits of metadata stored as comments.

How do I use a userscript? You have to install an extension for your web browser – follow the instructions here.

The problem

When I’m in a phylogenetic frame of mind, I often find myself browsing the NCBI taxonomy. This is the database that stores taxonomic information and relationships for all organisms with sequence data in GenBank and we can use it to view a page for any taxonomic group – here’s a screenshot of a bit of the page for Endopterygota (click to embiggen):

Selection_006

This page shows us the various taxonomic groups that belong to Endopterygota laid out in hierarchically. Clicking on the name of one of these groups will take us to the appropriate page, so we can navigate around the tree of life quite easily. We can also display some extra information on this page – checking the “Nucleotide” box above the tree and then clicking “Display” will cause the number of nucleotide records belonging to each group to be displayed after the name:

Selection_007

This is pretty useful – we can go straight to the list of nucleotide records for a given group by clicking the number, but we can also just use the numbers to survey the distribution of sequence data for the groups we’re looking at. For example, in the above view, there are lots of sequence for butterflies and moths, and beetles. The trouble is that the view presented above isn’t a very intuitive way to look at the relative numbers – reading the counts and comparing requires a fair amount of mental overhead. What would be great is if there were an option on the NCBI website to display the number of nucleotide records for each group visually – say, as a bar whose width corresponds to the number of records:

Selection_008

The NCBI website doesn’t have such a feature but, as you can probably guess from the above screenshot, we are going to add it ourselves using a userscript.

The Solution

Before we start coding, we can break the problem down into a few steps. First, we need to get a list of all the counts that appear on the original web page. Then, we need to figure out what the largest count is so that we can scale the bars to a sensible size. Finally, we need to replace each count with a bar of the appropriate width.

Getting the list of counts is pretty easy – if we look at the source HTML for the page we can see that each of the nucleotide counts is an a element with the title attribute set to “Nucleotide”. We can use the JQuery library to grab the list of count elements and store it as nuc_counts:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

Calculating the maximum count is a little bit trickier. We need to take each count, strip out all the commas, turn the count into an integer, then grab the maximum value from the list of integers. I won’t spend time here going into the craziness required to get the maximum value from an array in Javascript: suffice it to say that we’ll use JQuery’s map function to turn our list of string counts into a list of integers, then find the maximum and store it in a variable called max_nuc_count:

// calcaulate maximum nucleotide count
max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}))

Now for the main body of the script. We’ll iterate over our array of count elements, and for each one use JQuery to construct a new element to replace it. The new element will be a div, and we’ll need to set its width to a value that reflects the original count. To do this, we’ll take the count, multiply it by five hundred, then divide the result by the maximum count that we calculated earlier – in other words, we’ll scale all the bars so that the widest one is five hundred pixels wide. The only other tricky bit is making sure that the bar gets displayed inline with the taxon name, rather than on a line of its own – to do this, we set the “display” attribute of the bar to “inline” or “inline-block”:

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];

    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));

    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max

    // replace the original count element with the new bar
    $(count_element).replaceWith(bar);
}

So far so good: this gives us a nicely scaled set of bars and makes sure that the widest bar (i.e. the one at the top, which is the sum of all the others) fits on the screen. We could easily make the bar scale bigger or smaller by changing the 500 in the above code to something else – we could even take into account the width of the browser window if we wanted.

Finally, let’s add a couple of finishing touches. There are two things missing from the above solution: firstly, there’s no way to see the actual numbers, and there’s no way to click through to the list of records themselves. We can solve the second problem by creating an anchor element to wrap around the bar, with the target url copied from the original count. And we can solve the first problem by giving the anchor a “title” attribute which contains the original count, so that when we hover the mouse cursor over a given bar, it will display the exact number of nucleotide records. JQuery does most of the hard work here:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calcaulate maximum nucleotide count
max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}))

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    anchor = $('<a></a>')
    	.attr('href', count_element.href)	// use the original count as a tooltip
    	.attr('title', count_element.text); // grap the nucleotide search url from the original element
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}

And there we have it. To turn this into a userscript, all we have to do is add a set of specially-formatted comments at the top which can be parsed by whichever browser extension we want to use. In particular, we need to specify which web pages the script should run on using a regular expression (the @match line below). Here’s the script in full:

// ==UserScript==
// @name       NCBI Taxonomy nucleotide record count barchart
// @namespace  http://pythonforbiologists.com/
// @version    0.1
// @description replace nucleotide record counts in NCBI taxonomy with bars, see http://pythonforbiologists.com/index.php/adding-features-ncbi-taxonomy/
// @match      http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi*
// @copyright  2012+, You
// ==/UserScript==

// @require  http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calcaulate maximum nucleotide count
max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x,i){return parseInt(x.text.replace(/,/g,""))}))

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""));
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    anchor = $('<a></a>')
    	.attr('href', count_element.href)	// use the original count as a tooltip
    	.attr('title', count_element.text); // grap the nucleotide search url from the original element
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}

If you want to install this extension and try it out, I’ve added it to the userscripts.org repository – you should be able to install it by going here, once you have installed the browser extension. If you come up with any improvements to the code, or have any suggestions for other database web interface fixes or features, shout out in the comments!

0

The world’s worst genome assembler in six lines of Python

So, after I posted my new business cards the other day I got a comment to the effect that I should have made one with an aligner. That got me thinking about the biggest thing that I could conceivably fit on a business card, if I didn’t care about readability. So I decided that I could probably fit an incredibly bad sequence assembly program on one. Just for fun, I wrote the whole thing in a purely functional style – a total of six lambda expressions.

lp (4)

Warning: the rest of this post contains discussions of horrible code, rampant abuse of Python features, and the complete opposite of all good programming practise. Read on with caution!

The API is quite straightforward: give the ah() expression a list of DNA sequences, and it will return a consensus string. Here’s an ultra-short example:

reads = ['TCCCAGTGAACCCA', 'TTCCGTGCGGTTAAG', 'GTCCCAGTGAACCCACAA', 'TGAACCCACAAAACG', 'ACCCACAAAACGTGA', 'GAACCCACAAAACGTGA', 'TCCGTGCGGTTAAGC', 'TGAACCCACAAA', 'CCGTGCGGTTAAGCGTGA', 'TGACAGTCCCAGTGAA', 'AACCCACAAAACGTGA', 'AGTGAACCCACAAAACGT', 'GTTAAGCGTGA', 'CCGTGCGGTTAAGCGTGA', 'AGCGTGACAGT', 'TGCGGTTAAGCG', 'ACAAAACGTGATG', 'ACAGTCCCAGTGAACC', 'TAAGCGTGACAGTCCCA', 'TCGAATTCCGT', 'TTCTCGAATTCCGTGCG', 'ACAAAACGTG', 'CCACAAAACGTG', 'TGCGGTTAAG', 'GAACCCACAAAACGTGA', 'TCTCGAATTCC', 'ATTCCGTGCGGTTAA', 'ACCCACAAAAC', 'CGTGCGGTTAAGCGTGA', 'CCAGTGAACCCACAA', 'TGCGGTTAAGCGTG', 'CCCACAAAACG', 'TCTCGAATTC', 'AATTCCGTGCGGTT', 'ACAGTCCCAGTGA', 'GTCCCAGTGAACCCA', 'TGAACCCACAAA', 'CCCACAAAACGTG', 'TCCCAGTGAACCCACA', 'CTCGAATTCCGTGCG']
print(ah(reads))
# TGCGGACAAAACGTGTGAACGTGAGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT

And here’s a local alignment to the sequence from which the fake reads were generated, just to prove that it’s actually doing something slightly better than picking the longest read:

 6 GAATTCCGTGCGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT     53
   |||   ||||.|||||||||||||||||||||||||||||||||||||
17 GAA---CGTGAGGTTAAGCGTGACAGTCCCAGTGAACCCACAAAACGT     61

The longest read is 20 bases, and the local match region is 48 bases.

How it works

Let’s start by putting each expression on its own line:

cm = lambda d,ds: max([m(d,e) for e in ds if e != d])
m = lambda d,e: max([(s(d,e,o),o,e,d) for o in range(1-len(e),len(d))])
s = lambda d,e,o: sum([1 for p in range(max(0-o,0), min([len(e)-o, len(e), len(d)-o])) if e[p] == d[p+o]])
con = lambda x,o,s,c : c[0:max(0,o)] + s +  c[len(s)+o:]
a = lambda s, o : con(*cm(s, o)) if len(o) == 1 else a(con(*cm(s, o)), [ y for y in o if y != cm(s, o)[2]])
ah = lambda d : a(d[0],d[1:])

Next we’ll convert each expression to an equivalent function:

def cm(d,ds):
    return max([m(d,e) for e in ds if e != d])

def m(d,e):
    return max([(s(d,e,o),o,e,d) for o in range(1-len(e),len(d))])

def s(d,e,o):
    return sum([1 for p in range(max(0-o,0), min([len(e)-o, len(e), len(d)-o])) if e[p] == d[p+o]])

def con(x,o,s,c):
    return c[0:max(0,o)] + s +  c[len(s)+o:]

def a(s, o):
    return con(*cm(s, o)) if len(o) == 1 else a(con(*cm(s, o)), [ y for y in o if y != cm(s, o)[2]])

def ah(d):
    return a(d[0],d[1:])

And change the names to be slightly more descriptive (this makes some lines extremely long, so you’ll have to scroll right to read them), and rearrange them slightly:

# given two sequences and an offset, count the number of matching bases
def score(sequence1,sequence2,offset):
    return sum([1 for position in range(max(0-offset,0), min([len(sequence2)-offset, len(sequence2), len(sequence1)-offset])) if sequence2[position] == sequence1[position+offset]])

# given two sequences, find the offset which gives the best score
def find_best_offset(sequence1,sequence2):
    return max([(score(sequence1,sequence2,offset),offset,sequence2,sequence1) for offset in range(1-len(sequence2),len(sequence1))])

# given a single sequence and a collection of others, find the other sequence with the best match score
def find_best_match(sequence,others):
    return max([find_best_offset(sequence,sequence2) for sequence2 in others if sequence2 != sequence])

# given two sequences and an offset, calculate the consensus
def consensus(score,offset,sequence1,sequence2):
    return sequence2[0:max(0,offset)] + sequence1 +  sequence2[len(sequence1)+offset:]

# given a sequence and collection of others, return the complete consensus using recursion
def assemble(sequence, others):
    return consensus(*find_best_match(sequence, others)) if len(others) == 1 else assemble(consensus(*find_best_match(sequence, others)), [ y for y in others if y != find_best_match(sequence, others)[2]])

# given a collection of sequences, call assemble() to start the recursion
def assemble_helper(dnas):
    return assemble(dnas[0],dnas[1:])

Now we can look at each function in more detail.

score() is the basic scoring function. It takes two sequences and an offset, and returns the number of characters that match. It’s the simplest possible function for scoring an ungapped alignment between two sequences. It works by calculating the first and last positions of the overlapping region relative to sequence2, then counts up the number of positions for which the base in sequence2 is the same as the base in sequence1 at that position plus the offset. To get everything in one expression, it uses a list comprehension to build a list of 1’s for each matching position, then sums the list. Here it is written out a bit more conventionally. The only complicated thing going on here is the calculation of the start and stop positions.

def score(sequence1,sequence2,offset):
    start_of_overlap = max(0-offset,0)
    end_of_overlap = min([len(sequence2)-offset, len(sequence2), len(sequence1)-offset])
    total_score = 0
    for position in range(start_of_overlap, end_of_overlap):
        if sequence2[position] == sequence1[position+offset]:
            total_score = total_score + 1
    return total_score

find_best_offset() is the function that tries to maximize the score for a pair of sequences by trying every possible offset. It works by first calculating the range of possible offsets, then using a list comprehension to build a list of tuples, one tuple for each possible offset. Each tuple contains the score, the offset, and the two sequences – this slightly weird way of storing the results is necessary so that the information can be passed to the other functions, as we’ll see in a minute. To find the single best offset, we take advantage of the fact that in Python, sorting a list of tuples sorts them by their first element. Since the first element of each of our tuples is the score, if we simply ask for the max() of the list we get the tuple with the highest first element i.e. the one representing the best score. Here’s the sensible version:

def find_best_offset(sequence1,sequence2):
    lowest_offset = 1-len(sequence2)
    highest_offset = len(sequence1)
    all_offsets = []
    for offset in range(lowest_offset,highest_offset):
        # add the 4-tuple for this offset
        all_offsets.append((score(sequence1,sequence2,offset),offset,sequence2,sequence1))
    return max(all_offsets)

find_best_match() is probably the most straightforward function of the bunch. Given a single sequence and a list of other sequences, it finds the other sequence that has the best match by calling find_best_offset() for each of them in turn. It uses the same tuple-sorting trick as before to figure out which match is the best:

def find_best_match(sequence,others):
    all_matches = []
    for sequence2 in others:
        if sequence2 != sequence:
            all_matches.append(find_best_offset(sequence,sequence2))
    return max(all_matches)

The consensus() function gave me quite a bit of trouble. Its job is to take two sequences plus a given offset, and return the consensus sequence of the two. Of course, it doesn’t do anything like what we normally mean by consensus – it simply concatenates the relevant bits of the two sequences to make a longer one. The logic behind how it works is a little bit hard to follow. We construct the consensus sequence by taking the full length of sequence one, and sticking any left-hand overhang from sequence two on the left end and any right-hand overhang from sequence two on the right end. In other words, you should read the return line as “return any bits of sequence2 that stick out to the left, followed by the whole of sequence1, followed by any bits of sequence two that stick out to the right”. For most overlapping pairs of sequences, either the first or last bit of the returned string will be zero-length, which is why the thing works as a single expression in the compact version.

def consensus(score,offset,sequence1,sequence2):
    sequence2_left_overhang = sequence2[0:max(0,offset)]
    sequence2_right_overhang = sequence2[len(sequence1)+offset:]
    return sequence2_left_overhang + sequence1 + sequence2_right_overhang

The assemble() function is probably the most complicated (and certainly the most inefficient). I cheated a little bit to get it onto a single line by using the ternary operator “x if y else z”. It’s a recursive function that takes a single sequence and a collection of other sequences. It finds the best match for the sequence among the others and calculates the consensus of the sequence and the best-matching other. If that’s the only member of others (i.e. the others list has just one element) it simply returns the consensus. If the others list has more than one element, it removes the best-matching one and calls itself recursively with the newly-built consensus as the single sequence. Here it is expanded:

def assemble(sequence, others):
    # remember, best_matching_other is a 4-tuple
    best_matching_other = find_best_match(sequence, others)
    # the * expands the elements of the tuple so we can use them as arguments to consensus()
    consensus_sequence = consensus(*best_matching_other)
    if len(others) == 1:
        return consensus_sequence
    else:
        # get the second element of the best_matching_other tuple, which is the sequence
        best_matching_sequence = best_matching_other[2]
        others.remove(best_matching_sequence)
        return assemble(consensus_sequence, others)

assemble_helper() is, as the name suggests, just a helper function which kicks off the recursion by calling assemble() with the first element as the single sequence and the remainder of the elements as the list of other sequences.

Let’s sum up the algorithm then (described iteratively, even though it’s written recursively). To assemble a list of N DNA sequences, we take the first sequence, and find the member of the remaining N-1 sequences which has the best match. We remove this best-matching member from the list (leaving N-2 sequences) and calculate the consensus of these two sequences. We then append the newly-built consensus onto the end of the list (bringing the sequence count back up to N-1), then go back to the start and begin again. Hopefully it’s clear that, since the number of sequences in the list decreases by one in each iteration, we will eventually end up with a list of just a single sequence, which is our result.

A discussion of the performance of this algorithm is both beyond the scope of this post, and entirely unnecessary. Suffice it to say that it has horrible performance in terms of both computation time and results! At around 430 characters I think that the compact version is pushing the limits of what can fit on a business card in 12 point text. If anyone can think of a way to squeeze a few characters, please let me know in the comments!

0

Powered by WordPress. Designed by Woo Themes