Lists and loops

Why do we need lists and loops?

Think back over the exercises that we’ve seen in the previous two sections – they’ve all involved dealing with one bit of information at a time. In section 2, we used string manipulation tools to process single sequences, and in section 3, we practised reading and writing files one at a time. The closest we got to using multiple pieces of data was during the final exercise in section 3, where we were dealing with three DNA sequences.

If that’s all that Python allowed us to do, it wouldn’t be a very helpful tool for biology. In fact, there’s a good chance that you’re working through this course because you want to be able to write programs to help you deal with large datasets. A very common situation in biological research is to have a large collection of data (DNA sequences, SNP positions, gene expression measurements) that all need to be processed in the same way. In this section, we’ll learn about the fundamental programming tools that will allow our programs to do this.

So far we have learned about several different data types (strings, numbers, and file objects), all of which store a single bit of information [1]. When we’ve needed to store multiple bits of information (for example, the three DNA sequences in the section 3 exercises) we have simply created more variables to hold them:

# set the values of all the sequence variables
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

The limitations of this approach became clear quite quickly as we looked at the solution code – it only worked because the number of sequences was small, and we knew the number in advance. If we were to repeat the exercise with three hundred or three thousand sequences, the vast majority of the code would be given over to storing variables and it would become completely unmanageable. And if we were to try to write a program that could process an unknown number of input sequences (for instance, by reading them from a file), we wouldn’t be able to do it. To make our programs able to process multiple pieces of data, we need an entirely new type of structure which can hold many pieces of information at the same time – a list.

We’ve also dealt exclusively with programs whose statements are executed from top to bottom in a very straightforward way. This has great advantages when first starting to think about programming – it makes it very easy to follow the flow of a program. The downside of this sequential style of programming, however, is that it leads to very redundant code like we saw at the end of the previous section:

# make three files to hold the output
output_1 = open(header_1 + ".fasta", "w")
output_2 = open(header_2 + ".fasta", "w")
output_3 = open(header_3 + ".fasta", "w")

Again, it was only possible to solve the exercise in this manner because we knew in advance the number of output files we were going to need. Looking at the code, it’s clear that these three lines consist of essentially the same statement being executed multiple times, with some slight variations. This idea of repetition-with-variation is incredibly common in programming problems, and Python has built-in tools for expressing it – loops.

Creating lists and retrieving elements

To make a new list, we put several strings or numbers [2] inside square brackets, separated by commas:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]

Each individual item in a list is called an element. To get a single element from the list, write the variable name followed by the index of the element you want in square brackets:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
print(apes[0])
# index 2 is the third element, because counting starts at zero
third_site = conserved_sites[2]

If we want to go in the other direction – i.e. we know which element we want but we don’t know the index – we can use the index method:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
chimp_index = apes.index("Pan troglodytes")
# chimp_index is now 1

Remember that in Python we start counting from zero rather than one, so the first element of a list is always at index zero. If we give a negative number, Python starts counting from the end of the list rather than the beginning – so it’s easy to get the last element from a list:

last_ape = apes[-1]

What if we want to get more than one element from a list? We can give a start and stop position, separated by a colon, to specify a range of elements:

ranks = ["kingdom","phylum", "class", "order", "family"]
lower_ranks = ranks[2:5]
# lower ranks are class, order and family

Does this look familiar? It’s the exact same notation that we used to get substrings back in section 2, and it works in exactly the same way – numbers are inclusive at the start and exclusive at the end. The fact that we use the same notation for strings and lists hints at a deeper relationship between the two types. In fact, what we were doing when extracting substrings in section 2 was treating a string as though it were a list of characters. This idea – that we can treat a variable as though it were a list when it’s not – is a powerful one in Python and we’ll come back to it later in this section (and also in the chapter on iterators in Advanced Python for Biologists).
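As a quick illustration of the shared notation, the sketch below applies the same index and slice expressions to a list and to a string. The dna variable and its value are made up for this example.

```python
ranks = ["kingdom", "phylum", "class", "order", "family"]
dna = "ATCGTACG"

# the same slice notation works on both types
first_two_ranks = ranks[0:2]   # a list holding the first two elements
first_two_bases = dna[0:2]     # a string holding the first two characters

# negative indices count from the end in both cases
last_rank = ranks[-1]
last_base = dna[-1]

print(first_two_ranks)
print(first_two_bases)
print(last_rank + " " + last_base)
```

Slicing a list gives back a list, while slicing a string gives back a string.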

Working with list elements

To add another element onto the end of an existing list, we can use the append method:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
apes.append("Pan paniscus")

append is an interesting method because it actually changes the variable on which it’s used – in the above example, the apes list goes from having three elements to having four. We can get the length of a list by using the len function, just like we did for strings:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are " + str(len(apes)) + " apes")
apes.append("Pan paniscus")
print("Now there are " + str(len(apes)) + " apes")

The output shows that the number of elements in apes really has changed:

There are 3 apes
Now there are 4 apes

We can concatenate two lists just as we did with strings, by using the plus symbol:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys

print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")

As we can see from the output, this doesn’t change either of the two original lists – it makes a brand new list which contains elements from both:

3 apes
2 monkeys
5 primates

If we want to add elements from a list onto the end of an existing list, changing it in the process, we can use the extend method. extend behaves like append but takes a list as its argument rather than a single element.
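Here is a minimal sketch of the difference between the two methods, reusing the species names from the examples above. The variable names primates and nested are chosen just for this illustration.

```python
primates = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]

# extend adds each element of monkeys onto the end of primates
primates.extend(monkeys)
print(str(len(primates)) + " elements after extend")

# append, by contrast, adds its argument as one single element
nested = ["Homo sapiens"]
nested.append(monkeys)
print(str(len(nested)) + " elements after append")
```

After extend the list has five elements. After append the second list has only two, because the whole monkeys list was added as a single element.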

Here are two more list methods that change the variable they’re used on: reverse and sort. Both reverse and sort work by changing the order of the elements in the list. If we want to print out a list to see how this works, we need to use str (just as we did when printing out numbers):

ranks = ["kingdom","phylum", "class", "order", "family"]
print("at the start : " + str(ranks))
ranks.reverse()
print("after reversing : " + str(ranks))
ranks.sort()
print("after sorting : " + str(ranks))

If we take a look at the output, we can see how the order of the elements in the list is changed by these two methods:

at the start : ['kingdom', 'phylum', 'class', 'order', 'family']
after reversing : ['family', 'order', 'class', 'phylum', 'kingdom']
after sorting : ['class', 'family', 'kingdom', 'order', 'phylum']

By default, Python sorts strings in alphabetical order and numbers in ascending numerical order [3].
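The short sketch below shows the default behaviour for numbers, and one detail of string sorting that is worth knowing: the comparison is case-sensitive, so names starting with an uppercase letter come before names starting with a lowercase letter. The example lists are made up for illustration.

```python
conserved_sites = [132, 24, 56]
conserved_sites.sort()
print("sorted sites : " + str(conserved_sites))

# uppercase letters sort before lowercase ones
names = ["pan", "Homo", "gorilla"]
names.sort()
print("sorted names : " + str(names))
```

The second list comes out as Homo, gorilla, pan rather than in strict dictionary order.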

Writing a loop

Imagine we wanted to take our list of apes:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

and print out each element on a separate line, like this:

Homo sapiens is an ape
Pan troglodytes is an ape
Gorilla gorilla is an ape

One way to do it would be to just print each element separately:

print(apes[0] + " is an ape")
print(apes[1] + " is an ape")
print(apes[2] + " is an ape")

but this is very repetitive and relies on us knowing the number of elements in the list. What we need is a way to say something along the lines of “for each element in the list of apes, print out the element, followed by the words ‘ is an ape’“. Python’s loop syntax allows us to express those instructions like this:

for ape in apes:
    print(ape + " is an ape")

Let’s take a moment to look at the different parts of this loop. We start by writing for x in y, where y is the name of the list we want to process and x is the name we want to use for the current element each time round the loop.

x is just a variable name (so it follows all the rules that we’ve already learned about variable names), but it behaves slightly differently to all the other variables we’ve seen so far. In all previous examples, we create a variable and store something in it, and then the value of that variable doesn’t change unless we change it ourselves. In contrast, when we create a variable to be used in a loop, we don’t set its value – the value of the variable will be automatically set to each element of the list in turn, and it will be different each time round the loop.

Importantly, the loop variable x is given a new value at the start of each loop iteration, so whatever it held on the previous iteration is replaced. Unlike in some other programming languages, x is not restricted to the loop: once the loop has finished running for the last time, the variable still exists and holds the last element of the list, although it is rarely a good idea to rely on this. The region of code in which a variable can be used is called the variable’s scope – we will see several more examples later in the book.

This first line of the loop ends with a colon, and all the subsequent lines (just one, in this case) are indented. Indented lines can start with any number of tab or space characters, but they must all be indented in the same way. This pattern – a line which ends with a colon, followed by some indented lines – is very common in Python, and we’ll see it in several more places throughout this book. A group of indented lines is often called a block of code [4].

In this case, we refer to the indented block as the body of the loop, and the lines inside it will be executed once for each element in the list. To refer to the current element, we use the variable name that we wrote in the first line. The body of the loop can contain as many lines as we like, and can include all the functions and methods that we’ve learned about, with one important exception: we should not change the list that we are looping over while inside the body of the loop [5].
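One common way to respect that exception is to leave the original list alone and collect results in a second list. The sketch below assumes we want an uppercase copy of each name; the upper_apes variable is introduced just for this example.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

# collect results in a separate list instead of changing apes itself
upper_apes = []
for ape in apes:
    upper_apes.append(ape.upper())

print(str(apes))
print(str(upper_apes))
```

The list being looped over is never modified, so there is no ambiguity about which elements have been processed.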

Here’s an example of a loop with a more complicated body:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")

The body of the loop in the code above has four statements, two of which are print statements, so each time round the loop we’ll get two lines of output. If we look at the output we can see all six lines:

Homo sapiens is an ape. Its name starts with H
Its name has 12 letters
Pan troglodytes is an ape. Its name starts with P
Its name has 15 letters
Gorilla gorilla is an ape. Its name starts with G
Its name has 15 letters

Why is the above approach better than printing out these six lines in six separate statements? Well, for one thing, there’s much less redundancy – here we only needed to write two print statements. This also means that if we need to make a change to the code, we only have to make it once rather than three separate times. Another benefit of using a loop here is that if we want to add some elements to the list, we don’t have to touch the loop code at all. Consequently, it doesn’t matter how many elements are in the list, and it’s not a problem if we don’t know how many are going to be in it at the time when we write the code. Many problems that can be solved with loops can also be solved using a tool called list comprehensions – see the chapter on comprehensions in Advanced Python for Biologists.

Indentation errors

Unfortunately, introducing tools like loops that require an indented block of code also introduces the possibility of a new type of error – an IndentationError. Notice what happens when the indentation of one of the lines in the block does not match the others:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
  first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")

When we run this code, we get an error message before the program even starts to run:

IndentationError: unindent does not match any outer indentation level

When you encounter an IndentationError, go back to your code and double-check that all the lines in the block match up. Also double-check that you are using either tabs or spaces for indentation, not both. The easiest way to do this, as mentioned in section 1, is to enable tab emulation in your text editor.

Using a string as a list

We’ve already seen how a string can pretend to be a list – we can use list index notation to get individual characters or substrings from inside a string. Can we also use loop notation to process a string as though it were a list? Yes – if we write a loop statement with a string in the position where we’d normally find a list, Python treats each character in the string as a separate element. This allows us to very easily process a string one character at a time:

name = "martin"
for character in name:
    print("one character is " + character)

In this case, we’re just printing each individual character:

one character is m
one character is a
one character is r
one character is t
one character is i
one character is n

The process of repeating a set of instructions for each element of a list (or character in a string) is called iteration, and we often talk about iterating over a list or string.
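As a slightly more biological sketch of iterating over a string, the code below walks along a short DNA sequence and reports each base with its position. The sequence and the position counter are assumptions made for this example.

```python
dna = "ATGC"

# keep a running count so we can report where each base sits
position = 1
for base in dna:
    print("base " + str(position) + " is " + base)
    position = position + 1
```

The counter is updated by hand in the body of the loop, so it goes up by one each time round.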

Splitting a string to make a list

So far in this section, all our lists have been written manually. However, there are plenty of functions and methods in Python that produce lists as their output. One such method that is particularly interesting to biologists is the split method which works on strings. split takes a single argument, called the delimiter, and splits the original string wherever it sees the delimiter, producing a list. Here’s an example:

names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))

We can see from the output that the string has been split wherever there was a comma leaving us with a list of strings:

['melanogaster', 'simulans', 'yakuba', 'ananassae']

Of course, once we’ve created a list in this way we can iterate over it using a loop, just like any other list.
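A minimal sketch of combining the two ideas, splitting and then looping, might look like this. The genus prefix and the full_names list are added purely for illustration.

```python
names = "melanogaster,simulans,yakuba,ananassae"

# split the string into a list, then process each element in turn
full_names = []
for species in names.split(","):
    full_name = "Drosophila " + species
    full_names.append(full_name)
    print(full_name)
```

Note that the result of split can be used directly in the for line without first storing it in a variable.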

Iterating over lines in a file

Another very useful thing that we can iterate over is a file. Just as a string can pretend to be a list for the purposes of looping, a file object can do the same trick [6]. When we treat a string as a list, each character becomes an individual element, but when we treat a file as a list, each line becomes an individual element. This makes processing a file line-by-line very easy:

file = open("some_input.txt")
for line in file:
    # do something with the line; here we just print its length
    print(len(line))
file.close()

A quick warning: when you’re writing a program that reads data from a file, it’s best to use either the read method (to store the entire contents in a variable) or the loop method (to deal with each line separately). If you try to mix them, you might get unexpected behaviour. The reason for this is that Python keeps track of its position in each file, so if you read the contents of a file object using the read method, and then later try to process it one line at a time with a loop, you won’t get any input because Python thinks it’s already at the end of the file. If you absolutely have to use one method and then the other, you can get round this problem by closing and then re-opening the file.
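The behaviour described in this warning can be seen in a small experiment. The sketch below first writes a three-line file so that it is self-contained; the file name example_lines.txt and its contents are assumptions made for the example, and the file is placed in the system's temporary folder.

```python
import os
import tempfile

# create a small example file so the sketch can run on its own
path = os.path.join(tempfile.gettempdir(), "example_lines.txt")
output = open(path, "w")
output.write("ATCG\nGGCC\nTTAA\n")
output.close()

file = open(path)
contents = file.read()      # reads everything, leaving Python at the end of the file

lines_seen = 0
for line in file:           # nothing is left to read, so the body never runs
    lines_seen = lines_seen + 1
print("lines seen after read: " + str(lines_seen))
file.close()

# closing and re-opening the file starts again from the beginning
file = open(path)
lines_after_reopen = 0
for line in file:
    lines_after_reopen = lines_after_reopen + 1
print("lines seen after re-opening: " + str(lines_after_reopen))
file.close()

os.remove(path)             # tidy up the example file
```

The first loop sees no lines at all, while the loop over the re-opened file sees all three.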

Looping with ranges

Sometimes we want to loop over a list of numbers. Imagine we have a protein sequence:

protein = "vlspadktnv"

and we want to print out the first three residues, then the first four residues, and so on:

vls
vlsp
vlspa
vlspad
...etc...

One way to tackle the problem would be to use a loop – we could extract a substring from the protein sequence and print it in the body of the loop, and the only thing that would need to change is the stop position in the substring. But what are we going to iterate over? We can’t just iterate over the protein string, because that will give us individual residues, which is not what we want. We can manually assemble a list of stop positions, and loop over that:

stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
    substring = protein[0:stop]
    print(substring)

but this seems cumbersome, and only works if we know the length of the protein sequence in advance.

A better solution is to use the range function. range is a built-in Python function that generates sequences of numbers for us to loop over (in Python 3 it produces the numbers one at a time rather than building a full list, but in a loop it behaves just like a list of numbers). The behaviour of the range function depends on how many arguments we give it. Below are a few examples, with the output following directly after the code.

With a single argument, range will count up from zero to that number, excluding the number itself:

for number in range(6):
    print(number)

0
1
2
3
4
5

With two numbers, range will count up from the first number (inclusive [7]) to the second (exclusive):

for number in range(3, 8):
    print(number)

3
4
5
6
7

With three numbers, range will count up from the first to the second with the step size given by the third:

for number in range(2, 14, 4):
    print(number)

2
6
10
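Returning to the protein example from earlier in this section, one way to replace the hand-written list of stop positions is sketched below. It assumes we want every prefix from three residues up to the full length; the substrings list is added only so that the results can be inspected afterwards.

```python
protein = "vlspadktnv"

# stop positions run from 3 up to and including the length of the protein;
# range excludes its second argument, hence the + 1
substrings = []
for stop in range(3, len(protein) + 1):
    substring = protein[0:stop]
    substrings.append(substring)
    print(substring)
```

Because the upper limit is calculated with len, the same code works for a protein sequence of any length.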

Recap

In this section we’ve seen several tools that work together to allow our programs to deal elegantly with multiple pieces of data. Lists let us store many elements in a single variable, and loops let us process those elements one by one. In learning about loops, we’ve also been introduced to the block syntax and the importance of indentation in Python.

We’ve also seen several useful ways in which we can use the notation we’ve learned for working with lists with other types of data. Depending on the circumstances, we can use strings, files, and ranges as if they were lists. This is a very helpful feature of Python, because once we’ve become familiar with the syntax for working with lists, we can use it in many different places. Learning about these tools has also helped us make sense of some interesting behaviour that we observed in earlier sections.

Lists are the first example we’ve encountered of structures that can hold multiple pieces of data. We’ll encounter another such structure – the dict – in section 8. In fact, Python has several more such data types – you’ll find a full survey of them in the chapter on complex data structures in Advanced Python for Biologists.

Exercises

Note: all the files mentioned in these exercises can be found in the section_4 folder of the exercises download.

Processing DNA in a file

The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment – a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.

Multiple exons from genomic DNA

The file genomic_dna.txt contains a section of genomic DNA, and the file exons.txt contains a list of start/stop positions of exons. Each exon is on a separate line and the start and stop positions are separated by a comma. Write a program that will extract the exon segments, concatenate them, and write them to a new file.


  1. We know that files are slightly different to strings and numbers because they can store a lot of information, but each file object still only refers to a single file. 

  2. or any other single variable or value 

  3. We can sort in other ways too, but that’s beyond the scope of this book. 

  4. If you’re familiar with any other programming languages, you might know code blocks as things that are surrounded with curly brackets – the indentation does the same job in Python.

  5. Changing the list while looping can cause Python to become confused about which elements have already been processed and which are yet to come. 

  6. If you’re interested in how this “pretending” actually works, look up the Python documentation for iterators – but be prepared to do quite a bit of reading! 

  7. The rules for ranges are the same as for the start and stop positions we use with lists and strings – inclusive on the low end, exclusive on the high end – so you only have to memorize them once!
