Dictionaries

Storing paired data

Suppose we want to count the number of As in a DNA sequence. Carrying out the calculation is quite straightforward – in fact it’s one of the first things we did in section 2:

dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")

How will our code change if we want to generate a complete list of base counts for the sequence? We’ll add a new variable for each base:

dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")

and now our code is starting to look rather repetitive. It’s not too bad for the four individual bases, but what if we want to generate counts for the 16 dinucleotides:

dna = "ATCGATCGATCGTACGCTGA"
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
...etc. etc.

or the 64 trinucleotides:

dna = "ATCGATCGATCGTACGCTGA"
aaa_count = dna.count("AAA")
aat_count = dna.count("AAT")
aag_count = dna.count("AAG")
...etc. etc.

For trinucleotides and longer, the situation is particularly bad. The DNA sequence is 20 bases long, so it only contains 18 overlapping trinucleotides in total. This means that we’ll end up with 64 different variables, at least 46 of which will hold the value zero.

One possible way round this is to store the values in a list. If we use three nested loops, we can generate all possible trinucleotides, calculate the count for each one, and store all the counts in a list:

dna = "AATGATCGATCGTACGCTGA"
all_counts = []
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
            for base3 in ['A', 'T', 'G', 'C']:
                trinucleotide = base1 + base2 + base3
                count = dna.count(trinucleotide)
                print("count is " + str(count) + " for " + trinucleotide)
                all_counts.append(count)
print(all_counts)

Although the code is above is quite compact, and doesn’t require huge numbers of variables, the output shows two problems with this approach:

count is 0 for AAA
count is 1 for AAT
count is 0 for AAG
count is 0 for AAC
count is 0 for ATA
count is 0 for ATT
count is 1 for ATG
count is 2 for ATC
 … many lines removed …
[0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

Firstly, the data are still incredibly sparse – the vast majority of the counts are zero. Secondly, the counts themselves are now disconnected from the trinucleotides. If we want to look up the count for a single trinucleotide – for example, TGA – we first have to figure out that TGA was the 25th trinucleotide generated by our loops. Only then can we get the element at the correct index:

print("count for TGA is " + str(all_counts[24]))

We can try various tricks to get round this problem. What if we generated two lists – one of counts, and one of the trinucleotides themselves?

dna = "AATGATCGATCGTACGCTGA"
all_trinucleotides = []
all_counts = []
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
            for base3 in ['A', 'T', 'G', 'C']:
                trinucleotide = base1 + base2 + base3
                count = dna.count(trinucleotide)
                all_trinucleotides.append(trinucleotide)
                all_counts.append(count)
print(all_counts)
print(all_trinucleotides)

Now we have two lists of the same length, with a one-to-one correspondence between the elements:

[0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
['AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATT', 'ATG', 'ATC', 'AGA', 'AGT', 'AGG', 'AGC', 'ACA', 'ACT', 'ACG', 'ACC', 'TAA', 'TAT', 'TAG', 'TAC', 'TTA', 'TTT', 'TTG', 'TTC', 'TGA', 'TGT', 'TGG', 'TGC', 'TCA', 'TCT', 'TCG', 'TCC', 'GAA', 'GAT', 'GAG', 'GAC', 'GTA', 'GTT', 'GTG', 'GTC', 'GGA', 'GGT', 'GGG', 'GGC', 'GCA', 'GCT', 'GCG', 'GCC', 'CAA', 'CAT', 'CAG', 'CAC', 'CTA', 'CTT', 'CTG', 'CTC', 'CGA', 'CGT', 'CGG', 'CGC', 'CCA', 'CCT', 'CCG', 'CCC']

This allows us to look up the count for a given trinucleotide in a slightly more appealing way – we can look up the index of the trinucleotide in the all_trinucleotides list, then get the count at the same index in the all_counts list:

i = all_trinucleotides.index('TGA')
c = all_counts[i]
print('count for TGA is ' + str(c))

This is a little bit nicer, but still has major drawbacks. We’re still storing all those zeros, and now we have two lists to keep track of. We need to be incredibly careful when manipulating either of the two lists to make sure that they stay perfectly synchronized – if we make any change to one list but not the other, then there will no longer be a one-to-one correspondence between elements and we’ll get the wrong answer when we try to look up a count.

This approach is also slow1 . To find the index of a given trinucleotide in the all_trinucleotides list, Python has to look at each element one at a time until it finds the one we’re looking for. This means that as the size of the list grows2 , the time taken to look up the count for a given element will grow alongside it.

If we take a step back and think about the problem in more general terms, what we need is a way of storing pairs of data (in this case, trinucleotides and their counts) in a way that allows us to efficiently look up the count for any given trinucleotide. This problem of storing paired data is incredibly common in programming. We might want to store:

  • protein sequence names and their sequences
  • DNA restriction enzyme names and their motifs
  • codons and their associated amino acid residues
  • colleagues’ names and their email addresses
  • sample names and their co-ordinates
  • words and their definitions

All these are examples of what we call key-value pairs. In each case we have pairs of keys and values:

Key Value
trinucleotide count
name protein sequence
name restriction enzyme motif
codon amino acid residue
name email address
sample coordinates
word definition

The last example in this table – words and their definitions – is an interesting one because we have a tool in the physical world for storing this type of data – a dictionary. Python’s tool for solving this type of problem is also called a dictionary (usually abbreviated to dict) and in this section we’ll see how to create and use them.

Creating a dictionary

The syntax for creating a dictionary is similar to that for creating a list, but we use curly brackets rather than square ones. Each pair of data, consisting of a key and a value, is called an item. When storing items in a dictionary, we separate them with commas. Within an individual item, we separate the key and the value with a colon. Here’s a bit of code that creates a dictionary of restriction enzymes (using data from the previous section) with three items:

enzymes = { 'EcoRI':r'GAATTC', 'AvaII':r'GG(A|T)CC', 'BisI':'GC[ATGC]GC' }

In this case, the keys and values are both strings3 . Splitting the dictionary definition over several lines makes it easier to read:

enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}

but doesn’t affect the code at all. To retrieve a bit of data from the dictionary – i.e. to look up the motif for a particular enzyme – we write the name of the dictionary, followed by the key in square brackets:

print(enzymes['BisI'])

The code looks very similar to using a list, but instead of giving the index of the element we want, we’re giving the key for the value that we want to retrieve.

Dictionaries are a very useful way to store data, but they come with some restrictions. The only types of data we are allowed to use as keys are strings and numbers4 , so we can’t, for example, create a dictionary where the keys are file objects. Values can be whatever type of data we like. Also, keys must be unique – we can’t store multiple values for the same key. You might think that this makes dicts less useful, but there are ways round the problem of storing multiple values – we won’t need them for the examples in this chapter, but the chapter on complex data structures in Advanced Python for Biologists gives details.

In real-life programs, it’s relatively rare that we’ll want to create a dictionary all in one go like in the example above. More often, we’ll want to create an empty dictionary, then add key/value pairs to it (just as we often create an empty list and then add elements to it).

To create an empty dictionary we simply write a pair of curly brackets on their own, and to add elements, we use the square-brackets notation on the left-hand side of an assignment. Here’s a bit of code that stores the restriction enzyme data one item at a time:

enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII] =  r'GG(A|T)CC'
enzymes['BisI'] =  r'GC[ATGC]GC'

We can delete a key from a dictionary using the pop method. pop actually returns the value and deletes the key at the same time:

enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}
# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')

Let’s take another look at the trinucleotide count example from the start of the section. Here’s how we store the trinucleotides and their counts in a dictionary:

dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            counts[trinucleotide] = count

print(counts)

We can see from the output that the trinucleotides and their counts are stored together in one variable:

{'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0, 'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0, 'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0, 'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0, 'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0, 'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0, 'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0, 'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0, 'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0, 'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2, 'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}

We still have a lot of repetitive counts of zero, but looking up the count for a particular trinucleotide is now very straightforward:

print(counts['TGA'])

We no longer have to worry about either “memorizing” the order of the counts or maintaining two separate lists.

Let’s now see if we can find a way of avoiding storing all those zero counts. We can add an if statement that ensures that we only store a count if it’s greater than zero:

dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            if count > 0:
                counts[trinucleotide] = count

print(counts)

When we look at the output from the above code, we can see that the amount of data we’re storing is much smaller – just the counts for the trinucleotides that are greater than zero:

{'ATG': 1, 'ACG': 1, 'ATC': 2, 'GTA': 1, 'CTG': 1, 'CGC': 1, 'GAT': 2, 'CGA': 1, 'AAT': 1, 'TGA': 2, 'GCT': 1, 'TAC': 1, 'TCG': 2, 'CGT': 1}

Now we have a new problem to deal with. Looking up the count for a given trinucleotide works fine when the count is positive:

print(counts['TGA'])

But when the count is zero, the trinucleotide doesn’t appear as a key in the dictionary:

print(counts['AAA'])

so we will get a KeyError when we try to look it up:

KeyError: 'AAA'

There are two possible ways to fix this. We can check for the existence of a key in a dictionary (just like we can check for the existence of an element in a list), and only try to retrieve it once we know it exists:

if 'AAA' in counts:
    print(counts('AAA'))

Alternatively, we can use the dictionary’s get method. get usually works just like using square brackets: the following two lines do exactly the same thing:

print(counts['TGA'])
print(counts.get('TGA'))

The thing that makes get really useful, however, is that it can take an optional second argument, which is the default value to be returned if the key isn’t present in the dictionary. In this case, we know that if a given trinucleotide doesn’t appear in the dictionary then its count is zero, so we can give zero as the default value and use get to print out the count for any trinucleotide:

print("count for TGA is " + str(counts.get('TGA', 0)))
print("count for AAA is " + str(counts.get('AAA', 0)))
print("count for GTA is " + str(counts.get('GTA', 0)))
print("count for TTT is " + str(counts.get('TTT', 0)))

As we can see from the output, we now don’t have to worry about whether or not each trinucleotide appears in the dictionary – get takes care of everything and returns zero when appropriate:

count for TGA is 2
count for AAA is 0
count for GTA is 1
count for TTT is 0

Iterating over a dictionary

What if, instead of looking up a single item from a dictionary, we want to do something for all items? For example, imagine that we wanted to take our counts dictionary variable from the code above and print out all trinucleotides where the count was 2. One way to do it would be to use our three nested loops again to generate all possible trinucleotides, then look up the count for each one and decide whether or not to print it:

for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            if counts.get(trinucleotide, 0) == 2:
                print(trinucleotide)

As we can see from the output, this works perfectly well:

ATC
TGA
TCG
GAT

But it seems inefficient to go through the whole process of generating all possible trinucleotides again, when the information we want – the list of trinucleotides – is already in the dictionary. A better approach would be to read the list of keys directly from the dictionary, which is what the keys method does.

Iterating over keys

When used on a dictionary, the keys method returns a list of all the keys in the dictionary:

print(counts.keys())

Looking at the output5 confirms that this is the list of trinucleotides we want to consider (remember that we’re looking for trinucleotides with a count of two, so we don’t need to consider ones that aren’t in the dictionary as we already know that they have a count of zero):

['ATG', 'ACG', 'ATC', 'GTA', 'CTG', 'CGC', 'GAT', 'CGA', 'AAT', 'TGA', 'GCT', 'TAC', 'TCG', 'CGT']

Using keys, our code for printing out all the trinucleotides that appear twice in the DNA sequence becomes a lot more concise:

for trinucleotide in counts.keys():
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

This version prints exactly the same set of trinucleotides as the more verbose method:

ATC
GAT
TGA
TCG

Before we move on, take a moment to compare the output immediately above this paragraph with the output from the three-loop version from earlier in this section. You’ll notice that while the set of trinucleotides is the same, the order in which they appear is different. This illustrates an important point about dictionaries – they are inherently unordered. That means that when we use the keys method to iterate over a dictionary, we can’t rely on processing the items in the same order that we added them. This is in contrast to lists, which always maintain the same order when looping. If we want to control the order in which keys are printed we can use the sorted method to sort the list before processing it:

for trinucleotide in sorted(counts.keys()):
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

Iterating over items

In the example code above, the first thing we need to do inside the loop is to look up the value for the current key. This is a very common pattern when iterating over dictionaries – so common, in fact, that Python has a special shorthand for it. Instead of doing this:

for key in my_dict.keys():
    value = my_dict.get(key)
    # do something with key and value

We can use the items method to iterate over pairs of data, rather than just keys:

for key, value in my_dict.items():
    # do something with key and value

The items method does something slightly different from all the other methods we’ve seen so far in this book; rather than returning a single value, or a list of values, it returns a list of pairs of values. That’s why we have to give two variable names at the start of the loop. Here’s how we can use the items method to process our dictionary of trinucleotide counts just like before:

for trinucleotide, count in counts.items():
    if count == 2:
        print(trinucleotide)

This method is generally preferred for iterating over items in a dictionary, as it makes the intention of the code very clear. Many of the problems that we solve by iterating over dicts can also be solved using comprehensions – there’s a whole chapter devoted to comprehensions in Advanced Python for Biologists, so take a look.

Recap

We started this section by examining the problem of storing paired data in Python. After looking at a couple of unsatisfactory ways to do it using tools that we’ve already learned about, we introduced a new type of data structure – the dictionary – which offers a much nicer solution to the problem of storing paired data.

Later in the section, we saw that the real benefit of using dictionaries is the efficient lookup they provide. We saw how to create dictionaries and manipulate the items in them, and several different ways to look up values for known keys. We also saw how to iterate over all the items in dictionary.

In the process, we uncovered a few restrictions on what dictionaries are capable of – we’re only allowed to use a couple of different data types for keys, they must be unique, and we can’t rely on their order. Just as a physical dictionary allows us to rapidly look up the definition for a word but not the other way round, Python dictionaries allow us to rapidly look up the value associated with a key, but not the reverse.

Because of their ability to look up a given value very rapidly given a key, dicts are extremely useful when storing complex data. Take a look at the last few sections of the chapter on complex data structures in Advanced Python for Biologists for a set of examples of this technique.

After you’ve tried the exercise below, take a look at one possible solution, then move on to the last section: files, programs, and user input.

Exercises

DNA translation

Write a program that will translate a DNA sequence into protein. Your program should use the standard genetic code which can be found at this URL:

http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1

[get_solutions]


  1. As a rule, we’ve avoided talking about performance in this book, but we’ll break the rule in this case.
     

  2. For instance, imagine carrying out the same exercise with the approximately one million unique 10-mers.
     

  3. The values are actually raw strings, but that’s not important.
     

  4. Not strictly true; we can use any immutable type, but that is beyond the scope of this book.
     

  5. If you’re using Python 3 you might see slightly different output here, but all the code examples will work just the same
     

Powered by WordPress. Designed by Woo Themes