Archive | Not just Python

Improving database web interfaces with userscripts

Those of us who spend most of our day working on the command line have generally got into the habit of writing small, simple scripts to solve everyday problems and annoyances. The web-browser equivalent of these everyday problem-solving scripts is the userscript – a small snippet of JavaScript which runs on specific web pages. In this post I’m going to show a quick example of how we can use a userscript to add a missing feature to the NCBI taxonomy browser.

Two quick notes before we get started…

What is a userscript? A userscript is just a piece of JavaScript plus a few bits of metadata stored as comments.

How do I use a userscript? You have to install an extension for your web browser – follow the instructions here.

The problem

When I’m in a phylogenetic frame of mind, I often find myself browsing the NCBI taxonomy. This is the database that stores taxonomic information and relationships for all organisms with sequence data in GenBank and we can use it to view a page for any taxonomic group – here’s a screenshot of a bit of the page for Endopterygota:

Selection_006

This page shows us the various taxonomic groups that belong to Endopterygota laid out hierarchically. Clicking on the name of one of these groups will take us to the appropriate page, so we can navigate around the tree of life quite easily. We can also display some extra information on this page – checking the “Nucleotide” box above the tree and then clicking “Display” will cause the number of nucleotide records belonging to each group to be displayed after the name:

Selection_007

This is pretty useful – we can go straight to the list of nucleotide records for a given group by clicking the number, but we can also just use the numbers to survey the distribution of sequence data for the groups we’re looking at. For example, in the above view, there are lots of sequences for butterflies and moths, and for beetles. The trouble is that the view presented above isn’t a very intuitive way to look at the relative numbers – reading the counts and comparing them requires a fair amount of mental overhead. What would be great is if there were an option on the NCBI website to display the number of nucleotide records for each group visually – say, as a bar whose width corresponds to the number of records:

Selection_008

The NCBI website doesn’t have such a feature but, as you can probably guess from the above screenshot, we are going to add it ourselves using a userscript.

The Solution

Before we start coding, we can break the problem down into a few steps. First, we need to get a list of all the counts that appear on the original web page. Then, we need to figure out what the largest count is so that we can scale the bars to a sensible size. Finally, we need to replace each count with a bar of the appropriate width.

Getting the list of counts is pretty easy – if we look at the source HTML for the page we can see that each of the nucleotide counts is an <a> element with the title attribute set to “Nucleotide”. We can use the jQuery library to grab the list of count elements and store it as nuc_counts:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

Calculating the maximum count is a little bit trickier. We need to take each count, strip out all the commas, turn the count into an integer, then grab the maximum value from the list of integers. I won’t spend time here going into the craziness required to get the maximum value from an array in JavaScript: suffice it to say that we’ll use jQuery’s map function to turn our list of string counts into a list of integers, then find the maximum and store it in a variable called max_nuc_count:

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x, i){ return parseInt(x.text.replace(/,/g, ""), 10); }));
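
For anyone curious about the "craziness", here is a minimal sketch of the same calculation without jQuery or a web page. The count strings are made-up sample data, and parseCount is a hypothetical helper name, not something from the NCBI page. The point is that Math.max expects separate arguments rather than an array, which is why the array has to be unpacked with apply (or, in newer JavaScript, with the spread syntax):

```javascript
// Made-up sample data: count strings in the comma-separated style used on the page.
const countTexts = ["1,234,567", "89,012", "345"];

// Hypothetical helper: strip the thousands separators, then parse as a base-10 integer.
function parseCount(text) {
  return parseInt(text.replace(/,/g, ""), 10);
}

const counts = countTexts.map(parseCount);

// Math.max takes its values as separate arguments, not as a single array,
// so the array has to be unpacked. These two forms are equivalent:
const maxViaApply = Math.max.apply(Math, counts);
const maxViaSpread = Math.max(...counts);

console.log(counts, maxViaApply, maxViaSpread);
```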

Now for the main body of the script. We’ll iterate over our array of count elements, and for each one use jQuery to construct a new element to replace it. The new element will be a div, and we’ll need to set its width to a value that reflects the original count. To do this, we’ll take the count, multiply it by five hundred, then divide the result by the maximum count that we calculated earlier – in other words, we’ll scale all the bars so that the widest one is five hundred pixels wide. The only other tricky bit is making sure that the bar gets displayed inline with the taxon name, rather than on a line of its own – to do this, we set the CSS “display” property of the bar to “inline-block” (plain “inline” won’t do, because inline elements ignore the width we set):

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];

    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""), 10);

    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    var bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max

    // replace the original count element with the new bar
    $(count_element).replaceWith(bar);
}

So far so good: this gives us a nicely scaled set of bars and makes sure that the widest bar (i.e. the one at the top, which is the sum of all the others) fits on the screen. We could easily make the bar scale bigger or smaller by changing the 500 in the above code to something else – we could even take into account the width of the browser window if we wanted.
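
As a rough sketch of that last idea, the scaling could be pulled out into a couple of small functions. The names barWidth and maxBarWidth, the rounding, and the 40% fraction are all assumptions for illustration, not part of the script above. Outside a browser there is no window object, so the sketch falls back to the fixed 500 pixels:

```javascript
// Scale a count to a bar width in pixels so that maxCount maps to maxWidth.
function barWidth(count, maxCount, maxWidth) {
  if (maxCount <= 0) return 0; // avoid dividing by zero when there are no records
  return Math.round((count * maxWidth) / maxCount);
}

// Hypothetical choice: let the widest bar take a fraction of the window width,
// falling back to 500 pixels when no browser window is available.
function maxBarWidth(fraction) {
  const hasWindow =
    typeof window !== "undefined" && typeof window.innerWidth === "number";
  return hasWindow ? Math.floor(window.innerWidth * fraction) : 500;
}

const widest = maxBarWidth(0.4);
console.log(barWidth(250, 1000, widest)); // a quarter of the widest bar
```

In the loop, the width line would then become something like .css('width', barWidth(count, max_nuc_count, widest)).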

Finally, let’s add a couple of finishing touches. There are two things missing from the above solution: there’s no way to see the actual numbers, and there’s no way to click through to the list of records themselves. We can solve the second problem by creating an anchor element to wrap around the bar, with the target URL copied from the original count. And we can solve the first problem by giving the anchor a “title” attribute which contains the original count, so that when we hover the mouse cursor over a given bar, it will display the exact number of nucleotide records. jQuery does most of the hard work here:

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x, i){ return parseInt(x.text.replace(/,/g, ""), 10); }));

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""), 10);
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    var anchor = $('<a></a>')
    	.attr('href', count_element.href)	// grab the nucleotide search url from the original element
    	.attr('title', count_element.text); // use the original count as a tooltip
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    var bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on it
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}

And there we have it. To turn this into a userscript, all we have to do is add a set of specially-formatted comments at the top which can be parsed by whichever browser extension we want to use. In particular, we need to specify which web pages the script should run on using a URL pattern with a wildcard (the @match line below), and which libraries to load before the script runs (the @require line). Here’s the script in full:

// ==UserScript==
// @name       NCBI Taxonomy nucleotide record count barchart
// @namespace  http://pythonforbiologists.com/
// @version    0.1
// @description replace nucleotide record counts in NCBI taxonomy with bars, see http://pythonforbiologists.com/index.php/adding-features-ncbi-taxonomy/
// @match      http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi*
// @require    http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
// @copyright  2012+, You
// ==/UserScript==

// get list of nucleotide counts
var nuc_counts = $('[title="Nucleotide"]');

// calculate maximum nucleotide count
var max_nuc_count = Math.max.apply(Math, $.map(nuc_counts, function(x, i){ return parseInt(x.text.replace(/,/g, ""), 10); }));

// for each count element...
for (var i=0; i<nuc_counts.length; i++){
    var count_element = nuc_counts[i];
    
    // remove the commas from the number and turn it into an integer
    var count = parseInt(count_element.text.replace(/,/g,""), 10);
    
    // use jquery to create a new anchor element which will link to the nucleotide records
    var anchor = $('<a></a>')
    	.attr('href', count_element.href)	// grab the nucleotide search url from the original element
    	.attr('title', count_element.text); // use the original count as a tooltip
    
    // use jquery to create a new div element which will be the bar representing the nucleotide record count
    var bar = $('<div>&nbsp;</div>')	// the div needs to contain a non-breaking space; if it is completely empty then it will not be displayed
    	.css('margin-bottom', 2)	// add a tiny space at the bottom so that there's a little gap between bars
    	.css('display', 'inline-block')	// force the div to display as an inline element so that it can share a line with the taxon name
    	.css('background-color', 'RoyalBlue') // pick a nice colour for the bar
    	.css('width', (count * 500) / max_nuc_count);	// calculate the width for the bar, scaled to the max
    
    // put the bar inside the anchor so that you can click on it
    anchor.append(bar);	
    
    // replace the original count element with the new anchor/bar
    $(count_element).replaceWith(anchor);
}
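
To get a feel for what the @match line does, here is a deliberately simplified sketch that treats * as "any run of characters" and everything else as literal text. The function name patternToRegExp and the example URLs (including the id value) are made up for illustration. Real userscript extensions apply stricter rules, handling the scheme, host and path parts of the pattern separately, so this is only an approximation:

```javascript
// Simplified sketch: turn a wildcard URL pattern into a regular expression.
// Each "*" becomes ".*"; every other character is matched literally.
function patternToRegExp(pattern) {
  const escaped = pattern
    .split("*")
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + escaped + "$");
}

const pattern = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi*";
const matcher = patternToRegExp(pattern);

// Hypothetical URLs: one taxonomy browser page and one unrelated NCBI page.
const taxonomyUrl = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=12345";
const otherUrl = "http://www.ncbi.nlm.nih.gov/nuccore/";

console.log(matcher.test(taxonomyUrl)); // true: the script would run here
console.log(matcher.test(otherUrl));    // false: the script would be skipped
```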

If you want to install this script and try it out, I’ve added it to the userscripts.org repository – you should be able to install it by going here, once you have installed the browser extension. If you come up with any improvements to the code, or have any suggestions for other database web interface fixes or features, shout out in the comments!


New business cards!

I’ve been meaning for a while to get round to making some business cards to hand out to folks who ask me about learning to program. Normally I just tell people to google “python for biologists” and they’ll end up in the right place, but it would be nice to have a physical reminder to give out. At first I thought about having some USB memory stick business cards made – there are some really cool ones that are the shape of a normal business card but can fold in half to reveal a set of USB contacts. Unfortunately they’re way expensive, and the minimum order is far more than I need.

Next I thought about making a “cheat sheet” style business card – the type with contact information on the front and some useful quick-reference information (e.g. a list of regular expression characters) on the back. I guess the idea would be that the recipient is more likely to hang onto the card if it has useful information on it. But I couldn’t think of anything that would fit in well with my website – after all, the emphasis of pythonforbiologists is on learning to program, not simply the practice of programming itself.

Finally, I had an idea; I would put a tiny biology-themed programming exercise on the back of each of my business cards, along with a link to a web page giving the solution.

cards1

This would hopefully mean that when somebody gets hold of one of my cards they can see straight away what kind of material and training I provide, and can head over to the website for more information. I wrote five different nano-exercises on five different biological topics:

  • parsing FASTQ file format
  • counting the number of occurrences of short motifs in DNA sequences
  • calculating AT content using a sliding window
  • generating the reverse complement of a DNA sequence
  • calculating restriction fragment lengths

Fitting the sample code onto the business cards was quite difficult. I wanted to make sure that the code would be readable and not too hard to understand – I even found room for a few comments – but it also had to be very concise. I only had about ten lines to work with, so I had to use very short variable names.

cards2

You can see images of all the reverse sides at this link.

After I’d designed the code samples and exercises I wrote web pages for each of the solutions. I decided to put the exercise description and the link to the solution pages on the front of the card, as I’d used up all the room on the back with the code samples.

cards3

I tried to make each solution page interesting to read. As well as giving an answer to the exercise, I included extra material about useful bits of the Python language that some people don’t know about. For example, in the solution page to the FASTQ parser exercise I talked about generator functions, and in the sliding window exercise solution I talked about higher-order functions.

You can browse all of the exercises along with links to their solution pages here. Comments appreciated!
