Applied Python 3


Large datasets and tree structures

NCBI taxonomy

Last time we were looking at the web interface to the ncbi taxonomy. The taxonomy contains a node for every species that
has sequence data in GenBank, so it’s probably the most complete “universal” taxonomy we have. It’s not hard to think of questions that we might want to ask about it:

  • list all the species in Arthropoda
  • what family is Panagrolobus vanmegenae in?
  • how many genera of placental mammals are there?
  • what is the smallest group that contains both Lychnorhiza lucerna and Pheronema raphanus?
  • are the three species with taxids 6279,6239 and 135651 monophyletic with respect to taxid 10090?
  • is Acanthella cavernosa or Boltenia echinata more closely related to Aplysilla sulfurea?

Each page for a node has links to its parent and children, so theoretically we could write a screen-scraper to move up and down the tree. This is going to be slow going, though; whatever the answer to the first question is, it will be on the order of many thousands, which is a lot of web page requests to make.

Happily, you can download a dump of the whole NCBI tree as a pair of simple text files. Here is the link:

If you download this file and extract it, you’ll see a bunch of files. The ones we are interested in are names.dmp which stores the relationships between taxids and names,
and nodes.dmp which store the relationships between nodes. The readme file has a nice concise description of the file formats.


Let’s take a look at the names first. Here’s a random chunk of lines from the names.dmp file:

306556  |   Ligusticum cyprium Spreng.  |       |   synonym |
306557  |   Carum heldreichii   |       |   scientific name |
306557  |   Carum heldreichii Boiss.    |       |   authority   |
306557  |   Carum rupestre  |       |   synonym |
306558  |   Sedum lanceolatum   |       |   scientific name |
306559  |   Symphysodon discus  |       |   scientific name |
306559  |   red discus  |       |   common name |
306560  |   Angustonicus    |       |   scientific name |
306561  |   Angustonicus lifou  |       |   scientific name |

Each row is a single name, and the columns are separated by tab, pipe, tab (in Python we would write it “t|t”). There are four columns per row, from left to right:

  • taxid
  • name
  • unique name
  • type of name

There are several different types of name, so there can be several rows for a given taxid. Look at the three lines for taxid 306557. There is a scientific name, an authority (i.e. a scientific name along with the name of the person who described it) and a synonym. Taxid 306559 also has a common name. The unique name field is normally blank, so we can pretty much ignore it.


1. Write a script that will go through the names.dmp file one line at a time and construct two dictionaries. One will store name->taxid, the other will store taxid->scientific name.
Note that we cannot simply store taxid->name becase the same taxid can have multiple names, and we are not allowed to have multiple values with the same key.

Warning: be careful about how you store taxids – they can be stored either as a string or as a number. The exercises we will be working on will work either way, but you have to
pick one and be consistent!

The names.dmp file has 1.3 million rows (be careful if you try to open it in a text editor!), so this script is going to take a while to run. Your script will have to have some way of informing the
user that it is running. You can’t simply print a message after processing each row – try something like this:

lines_processed = 0
for line in file:
    lines_processed = lines_processed + 1
    if (lines_processed % 100000 == 0):
         print('processing line ' + str(lines_processed))
    # process the line

use your script to answer the following questions:

  • how many taxids are there? (assuming one scientific name per taxid)
  • how many names, on average, does each node have?
  • what is the taxid of Panagrolobus vanmegenae?
  • what is the taxid of Boltenia echinata?
  • what is the scientific name of the node with taxid 861513?
  • how about 212424?

Hint: if you run python -i then Python will enter interactive mode after the script has finished running. This will let you experiment with the dicts without having to
re-run the program each time.

Interesting note: remember when we were talking about dictionaries, I made the claim that they allow you to very quickly look up the value associated with a given key, even when
the number of items is very large. This is a good oppourtunity to test that claim! Python has a built-in benchmarking module
called timeit. We will talk more about benchmarking / profiling in the future.
Running the code to generate the dict results in a taxid->name dict that has nearly one million elements, but it still only takes 0.038 milliseconds to look up a name on my desktop!
In other words, even with a one-million element dict, we can still perform 25,000 lookups per second.


Now let’s turn our attention to nodes.dmp. The structure is very similar – it has the same separator and one row per node.
The difference is in the information that is stored in each row:

27 | 49928 | species | HE | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 |  |
28 | 49928 | species | HE | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 |  |
29 | 28221 | order |  | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 |  |
31 | 80811 | family |  | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 |  |
32 | 31 | genus |  | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 |  |
33 | 32 | species | MF | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 |  |

The first three columns are taxid, parent taxid, and rank. Rank is species, genus, family, phylum, etc.


2. Write a script that reads the nodes.dmp file and creates two dictionaries: taxid->parent and taxid->rank.
Use these two dicts to answer the following questions:

  • what is the rank of taxid 10116?
  • what is the parent taxid of the species with taxid 10116?
  • what is the rank of the parent of the species with taxid 10116?
  • what is the grandparent taxid of the species with taxid 10116?
  • how many Orders are there in the whole taxonomy?

Putting it together

If we combine our two scripts then we will end up with a script that defines 4 dicts:

taxid -> name
name  -> taxid
taxid -> parent
taxid -> rank

We can start putting this information together in useful ways. For example, let’s define a function to find the Order to
which a species belongs. This function will have one parameter – a scientific name – and return a string which is the
scientific name of the order. Here’s one way of writing it:

# we will assume that these dicts have already been defined
# taxid2name, name2taxid, taxid2parent, taxid2rank

def find_order(species):
    # first look up the taxid for the species of interest
    current_taxid = name2taxid[species]
    # use a while loop to continue searching until we find what we are looking for
    found = False
    while found == False:
        # find the taxid of the parent
        parent_taxid = taxid2parent[current_taxid]
        # find the rank of the parent
        parent_rank = taxid2rank[parent_taxid]
        # if it is the order, then turn the taxid into a name and return the value
        if (parent_rank == 'order'):
            parent_name = taxid2name[parent_taxid]
            found = True
            return parent_name
        # if not, then set the current taxid to be the parent taxid and go round the loop again
            current_taxid = parent_taxid

The above code uses a while loop to go up the tree one step at a time until we find the order. Notice that we don’t
generate the dicts inside the function, but rather rely on them being generated outside. This is an exception to the
rule that we have been using which states that functions are defined entirely in terms of their inputs and outputs. In this
case it’s necessary to break the rule, because the process of generating the dicts is very slow and we only want to do it once.

Here is another way of writing the function using recursion:

def find_order(species):
    return find_order_recursive(name2taxid[species])

def find_order_recursive(taxid):
    # if the taxid in the argument is the order, then return the name
    if taxid2rank[taxid] == 'order':
        return taxid2name[taxid]
    # otherwise, run the same function with the parent of the taxid
        return find_order_recursive(taxid2parent[taxid])

The recursive function only works on taxids, so we have to write a little ‘helper’ function who’s job is to
first convert the name into a taxid then call the recursive function. This version may be harder to read, but
it is much more concise, because the structure of the function is similar to the structure of the data that it acts on.

This is a general rule for functions that work on trees – they can be written either with a while loop, or with recursion.

All the exercises we will be working on can be written either way.


3. Pick whichever of the above versions you find easier to read, and modify it so that can search for any rank. Be careful when
writing functions that involve recursion or while loops – it is quite easy to accidentally write a function that never stops!

Now we can answer the second question from the list we made at the start:

  • what family is Panagrolobus vanmegenae in?

Getting a list of parents

4. A very useful building block for answering questions about trees is a function that will give us a list of all the parents
of a given node. Modify your function from above (either version) so that it returns a list of all the taxids of the parents
of the input node. The new function can take as an input either a name or a taxid (you decide!). Important: the root node
(i.e. the top of the tree) has taxid 1 and is named ‘root’ and is its own parent. Your function has to know when it
has reached the top of the tree, otherwise it will never return.

You should find
that your new function works on nodes of any rank, not just species! Test it with a few input names/taxids.

Last common ancestor

5. You will notice that the function which returns a list of parents is ordered from smallest group to largest group. If
we think of the tree as having its root at the top, the list is ordered from lowest to highest. This gives us a very easy
way to identify the most recent common ancestor of two nodes. We simply have to look through one of the lists of parents
and find the first parent that also appears in the second list. Satisfy yourself that this is true by drawing a diagram!

Write a function that finds the last common ancestor (= smallest containg group) of two nodes. The function will have two
parameters (node names or taxids) and return a single taxid or name. The function will have to call the list_parents function
first, then compare the two lists. Use the function to answer the fourth question from the list:

  • what is the smallest group that contains both Lychnorhiza lucerna and Pheronema raphanus?


6. With a bit of thought, we can also use the last_common_ancestor function to answer the fifth question from the list:

  • are the three species with taxids 6279,6239 and 135651 monophyletic with respect to taxid 10090?

This question is equivalent to asking the following: calculate the last common ancestor of nodes 6279,6239 and 135651.
Now look at the list of parents for node 10090. If the LCA is in the list of parents, then the group is not monophyletic.
If the LCA is not in the list of parents, then node 10090 is an outgroup to 6279,6239 and 135651 therefore the group is monophyletic.
Again, draw a diagram to prove that this is true!

The trick to finding the last common ancestor of three nodes is quite straightforward: simply find the LCA of any two
nodes,then find the LCA of the result and the third node. Diagram time again!

Use your LCA function to answer the monophyly question.

Identifying the outgroup of triplets

7. The last question from the list can also be answered by thinking about last common ancestors:

  • is Acanthella cavernosa or Boltenia echinata more closely related to Aplysilla sulfurea?

Calculate the LCA of A. cavernosa and A. sulfurea. If this node doesn’t appear in the parent list for B. echinata, then
_A. cavernosa
is more closely related. If it does, then B. echinata is more closely related. Of course, this could
be a trick question, and they are equally closely related! We can easily check; if this is the case then the LCA of
cavernosa + sulfurea and the LCA of echinata + sulfurea will be the same.

Identifying unassigned groups

8. In an ideal world, we would have complete classification data for all species – i.e. they would all be assigned to a genus, family, phylum, etc. However, this is a dangerous assumption to make!
Write a script to calculate how many species are not assigned to a genus. Modify it to calculate how many orders are not assigned to a phylum.

Trees are everywhere

The algorithms we have been talking about here – finding parents, finding last common ancestors, defining monophyletic groups – crop up in many contexts outside the taxonomic / phylogenetic.

Gene ontology

Here is a small chunk of the Gene Ontology for biological function. Suppose we have a very specific GO term (e.g. “regulation of substrate adhesion-dependent cell spreading”) and we want
to get a list of all the more general terms it is part of. This is the find-all-parents algorithm.


Cell lineage

Here is a figure of cell lineage in the nematode C. elegans.. Suppose we pick two distant adult cells – a neuron and a cell from the pharynx – and want to know at what point
the lineages leading to these two cells diverged. This is the last-common-ancestor algorithm.

Cell lineage


Here is a heat map showing a distance matrix between samples in an RNA-seq experiment, with a tree showing the samples clustered by similarity of expression profile. There are four
different conditions, and three biological replicates for each condition. In such an experiment, we hope to see that each group of three biological replicates cluster together, indicating
that the differences between conditions are due to biological rather than technical variation. This is a monophyly-type problem – we can check this programatically
by finding, for each group of biological replicates, the last common ancestor, then confirming that none of the samples outside the group are also decendents of the LCA.

RNA seq

Org chart

Here is an organizational chart of Microsoft. It shows the hierarchy of people in charge of difference divisions and departments. We can pick two people from the chart and ask: who is the
least-senior person who is in charge of both of them? This is a last-common-ancestor problem.

Org chart

Decending the tree

The two remaining unanswered questions from our list both involve traversing the tree in the opposite direction – from
top (root) to bottom (tips). We cannot go from parent -> child using the dicts that we currently have – we will
need to create a new one. This is tricky, because a parent can have multiple children, but keys in a dict have to
be unique. The solution is to make a dict where the keys are taxids, and the values are lists of taxids. Once we have
created a dict like this then we can write a recursive function that will give a list of all the children of a given node.
The logic involved is quite tricky, so I will write it out here:

# assume that the dict parent2child has already been created

def get_children(taxid):
    # create a new list to hold the result
    result = []
    # get the children of the current node
    children = parent2child[taxid]
    # add all the children of this node to the list
    # then add all of their children as well
    for child in children:
    return result


9. Modify your script to generate this fifth dict, parent2child and use it in conjunction with the function above
to answer the questions from the list:

  • list all the species in Arthropoda
  • how many genera of placental mammals are there?

To answer the first question, you will need to look up the taxid for Arthropoda, get a list of all the children,
loop through the children to find out which ones have the rank species, then get the names for them and print them.

To answer the second question, you will need to get all the children of Eutheria, filter out the ones with rank ‘genus’
and count them.

Powered by WordPress. Designed by Woo Themes