Programs need to make decisions
If we look back at the examples and exercises in previous sections, something that stands out is the lack of decision-making. We’ve gone from doing simple calculations on individual bits of data to carrying out more complicated procedures on collections of data, but each bit of data (a sequence, a base, a species name, an exon) has been treated identically.
Real-life problems, however, often require our programs to act as decision-makers: to examine a property of some bit of data and decide what to do with it. In this section, we’ll see how to do that using conditional statements. Conditional statements are features of Python that allow us to build decision points in our code. They allow our programs to decide which out of a number of possible courses of action to take – instructions like “print the name of the sequence if it’s longer than 300 bases” or “group two samples together if they were collected less than 10 metres apart”.
Before we can start using conditional statements, however, we need to understand conditions.
Conditions, True and False
A condition is simply a bit of code that can produce a true or false answer. The easiest way to understand how conditions work in Python is to try out a few examples. The following example prints out the result of testing (or evaluating) a bunch of different conditions – some mathematical examples, some using string methods, and one for testing if a value is included in a list:
print(3 == 5)
print(3 > 5)
print(3 <= 5)
print(len("ATGC") > 5)
print("GAATTC".count("T") > 1)
print("ATGCTT".startswith("ATG"))
print("ATGCTT".endswith("TTT"))
print("ATGCTT".isupper())
print("ATGCTT".islower())
print("V" in ["V", "W", "L"])
If we look at the output, we can match up each condition with its result, line by line:
False
False
True
False
True
True
False
True
False
True
But what’s actually being printed here? At first glance, it looks like we’re printing the strings “True” and “False”, but those strings don’t appear anywhere in our code. What is actually being printed is the special built-in values that Python uses to represent true and false – they are capitalized so that we know they’re these special values.
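We can show that these values are special by trying to print them. Printing True or False without quotation marks, as in print(True), runs without errors, whereas trying to print an arbitrary unquoted word, as in print(Hello), causes a NameError because Python treats the word as the name of a variable that doesn’t exist.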
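The contrast can be sketched as follows. The word Hello is an arbitrary, made-up name chosen for illustration, and the try/except wrapper is only there so that the error can be displayed without stopping the program:

```python
# True and False are built-in values, so they can be printed without quotes.
print(True)
print(False)
print(type(True))

# An arbitrary unquoted word is treated as a variable name.
# Hello is a hypothetical, undefined name, so Python raises a NameError.
try:
    print(Hello)
    name_error_raised = False
except NameError as error:
    name_error_raised = True
    print("NameError:", error)
```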
There’s a wide range of things that we can include in conditions, and it would be impossible to give an exhaustive list here. The basic building blocks are:
- equals (represented by ==)
- greater and less than (represented by > and <)
- greater and less than or equal to (represented by >= and <=)
- not equal (represented by !=)
- is a value in a list (represented by in)
- are two objects the same (represented by is)
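As a minimal sketch, here is each building block evaluated once. The variable names and values are made up for illustration. Note the difference between == (equal contents) and is (the very same object):

```python
bases = ["A", "T", "G", "C"]
same_list = bases                    # a second name for the same list object
copied_list = ["A", "T", "G", "C"]   # a different list with equal contents

results = {
    "equals": 4 == 4,
    "not_equal": "ATG" != "ATC",
    "greater": 7 > 3,
    "less_or_equal": 3 <= 3,
    "in_list": "G" in bases,
    "equal_contents": bases == copied_list,
    "same_object": bases is same_list,
    "copy_is_same_object": bases is copied_list,
}

# Every condition above is True except the last one: the copy has the
# same contents as the original list, but it is a separate object.
for name, value in results.items():
    print(name, value)
```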
Many data types also provide methods that return True or False values, which are often a lot more convenient to use than the building blocks above. We’ve already seen a few in the code sample above: for example, strings have a startswith method that returns True if the string starts with the string given as an argument. We’ll mention these true/false methods when they come up.
Notice that the test for equality is two equals signs, not one. Forgetting the second equals sign will cause an error.
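A small sketch of the difference, reusing an expression_level variable like the one in the examples that follow. The built-in compile function is used here only so that the error from the single equals sign can be caught and shown instead of stopping the program:

```python
expression_level = 125

# Two equals signs: a condition that evaluates to True or False.
print(expression_level == 125)

# One equals sign inside an if statement is not valid Python.
bad_code = "if expression_level = 125:\n    print('match')\n"
try:
    compile(bad_code, "<example>", "exec")
    syntax_error_raised = False
except SyntaxError as error:
    syntax_error_raised = True
    print("SyntaxError:", error.msg)
```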
Now that we know how to express tests as conditions, let’s see what we can do with them.
The simplest kind of conditional statement is an if statement. Hopefully the syntax is fairly simple to understand:
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
We write the word if, followed by a condition, and end the first line with a colon. There follows a block of indented lines of code (the body of the if statement), which will only be executed if the condition is true. This colon-plus-block pattern should be familiar to you from the sections on loops and functions.
Most of the time, we want to use an if statement to test a property of some variable whose value we don’t know at the time when we are writing the program. The example above is obviously useless, as the value of the expression_level variable is not going to change!
Here’s a slightly more interesting example: we’ll define a list of gene accession names and print out just the ones that start with “a”:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        print(accession)
Looking at the output allows us to check that this works as intended:
ab56
ay93
ap97
If you take a close look at the code above, you’ll see something interesting – the lines of code inside the loop are indented (just as we’ve seen before), but the line of code inside the if statement is indented twice – once for the loop, and once for the if. This is the first time we’ve seen multiple levels of indentation, but it’s very common once we start working with larger programs – whenever we have one loop or if statement nested inside another, we’ll have this type of indentation.
Python is quite happy to have as many levels of indentation as needed, but you’ll need to keep careful track of which lines of code belong at which level. If you find yourself writing a piece of code that requires more than three levels of indentation, it’s generally an indication that that piece of code should be turned into a function.
Closely related to the if statement is the else statement. The examples above use a yes/no type of decision-making: should we print the gene accession number or not? Often we need an either/or type of decision, where we have two possible actions to take. To do this, we can add on an else clause after the end of the body of an if statement:
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is lowly expressed")
The else statement doesn’t have any condition of its own – rather, the body of the else statement is executed when the condition of the if statement to which it’s attached is false, so that the body of the if statement is not executed.
Here’s an example which uses else to split up a list of accession names into two different files – accessions that start with “a” go into the first file, and all other accessions go into the second file:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    else:
        file2.write(accession + "\n")
# close the files so that everything is written to disk
file1.close()
file2.close()
Notice how there are multiple indentation levels as before, but that the if and else statements are at the same level.
What if we have more than two possible branches? For example, say we want three files of accession names: ones that start with “a”, ones that start with “b”, and all others. We could have a second if statement nested inside the else clause of the first if statement:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    else:
        if accession.startswith('b'):
            file2.write(accession + "\n")
        else:
            file3.write(accession + "\n")
# close the files so that everything is written to disk
file1.close()
file2.close()
file3.close()
This works, but is difficult to read – we can quickly see that we need an extra level of indentation for every additional choice we want to include. To get round this, Python has an elif statement, which merges together else and if and allows us to rewrite the above example in a much more elegant way:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")
# close the files so that everything is written to disk
file1.close()
file2.close()
file3.close()
Notice how this version of the code only needs two levels of indentation. In fact, using elif we can have any number of branches and still only require a single extra level of indentation:
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    elif accession.startswith('c'):
        file3.write(accession + "\n")
    elif accession.startswith('d'):
        file4.write(accession + "\n")
    elif accession.startswith('e'):
        file5.write(accession + "\n")
    else:
        file6.write(accession + "\n")
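To check where each accession ends up without opening any files, the same if/elif/else branching can be sketched with lists instead. The choose_file function and the groups dictionary are hypothetical names introduced for this illustration. The branches are tested in order and only the first one whose condition is true is executed:

```python
def choose_file(accession):
    """Return the name of the output file that an accession belongs in."""
    if accession.startswith('a'):
        return "one.txt"
    elif accession.startswith('b'):
        return "two.txt"
    else:
        return "three.txt"

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

# Collect accessions in lists, keyed by the file they would be written to.
groups = {"one.txt": [], "two.txt": [], "three.txt": []}
for accession in accs:
    groups[choose_file(accession)].append(accession)

print(groups["one.txt"])    # ['ab56', 'ay93', 'ap97']
print(groups["two.txt"])    # ['bh84', 'bd72']
print(groups["three.txt"])  # ['hv76']
```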
Another way of handling complex decision branches like this – especially useful when dealing with validation and errors – is using exceptions, which have their own chapter in Advanced Python for Biologists.
Here’s one final thing we can do with conditions: use them to determine when to exit a loop. In section 4 we learned about loops that iterate over a collection of items (like a list, a string or a file). Python has another type of loop called a while loop. Rather than running a set number of times, a while loop runs until some condition is met. For example, here’s a bit of code that increments a count variable by one each time round the loop, stopping when the count variable reaches ten:
count = 0
while count < 10:
    print(count)
    count = count + 1
Because normal loops in Python are so powerful, while loops are used much less frequently than in other languages, so we won’t discuss them further.
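As a rough illustration of why the while loop is rarely needed, the counting example can be written as a normal loop over a range. This is a sketch for comparison only, and the list names are made up; both loops visit the numbers 0 to 9:

```python
# The while loop: we manage the count variable ourselves.
count = 0
while_values = []
while count < 10:
    while_values.append(count)
    count = count + 1

# The equivalent normal loop: range(10) supplies the numbers 0 to 9.
for_values = []
for number in range(10):
    for_values.append(number)

print(while_values == for_values)  # True
```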
Building up complex conditions
What if we wanted to express a condition that was made up of several parts? Imagine we want to go through our list of accessions and print out only the ones that start with “a” and end with “3”. We could use two nested if statements:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        if accession.endswith('3'):
            print(accession)
but this brings in an extra, unneeded level of indentation. A better way is to join up the two conditions with and to make a complex expression:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)
This version is nicer in two ways: it doesn’t require the extra level of indentation, and the condition reads in a very natural way. We can also use or to join up two conditions, to produce a complex condition that will be true if either of the two simple conditions is true:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a') or accession.startswith('b'):
        print(accession)
We can even join up complex conditions to make more complex conditions – here’s an example which prints accessions if they start with either “a” or “b” and end with “4”:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for acc in accs:
    if (acc.startswith('a') or acc.startswith('b')) and acc.endswith('4'):
        print(acc)
Notice how we have to include parentheses in the above example to avoid ambiguity. Finally, we can negate any type of condition by prefixing it with the word not. This example will print out accessions that start with “a” and don’t end with “6”:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for acc in accs:
    if acc.startswith('a') and not acc.endswith('6'):
        print(acc)
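Returning to the parentheses in the “a” or “b” example: in Python, and binds more tightly than or, so leaving the parentheses out changes the meaning of the condition. The following sketch collects the matches for both versions in lists (the list names are made up for illustration) so that the difference is visible:

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

with_parens = []
without_parens = []
for acc in accs:
    # (starts with "a" or "b") and ends with "4"
    if (acc.startswith('a') or acc.startswith('b')) and acc.endswith('4'):
        with_parens.append(acc)
    # Without parentheses, Python reads this as:
    # starts with "a" or (starts with "b" and ends with "4")
    if acc.startswith('a') or acc.startswith('b') and acc.endswith('4'):
        without_parens.append(acc)

print(with_parens)     # ['bh84']
print(without_parens)  # ['ab56', 'bh84', 'ay93', 'ap97']
```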
By using a combination of and, or and not (along with parentheses where necessary) we can build up arbitrarily complex conditions. This kind of use for conditions – identifying elements in a list – can often be done better using either the filter function, or a list comprehension. You’ll find examples of each in the chapters on functional programming and comprehensions respectively in Advanced Python for Biologists.
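As a brief, hedged preview of those two alternatives, here is the last example (accessions that start with “a” and don’t end with “6”) written both ways. The helper function name is made up for this sketch, and this is not necessarily how the other book presents it:

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

def starts_with_a_not_6(acc):
    """Hypothetical helper: True for accessions starting with 'a' and not ending with '6'."""
    return acc.startswith('a') and not acc.endswith('6')

# filter keeps the elements for which the function returns True.
filtered = list(filter(starts_with_a_not_6, accs))

# A list comprehension puts the condition inside the square brackets.
comprehension = [acc for acc in accs
                 if acc.startswith('a') and not acc.endswith('6')]

print(filtered)       # ['ay93', 'ap97']
print(comprehension)  # ['ay93', 'ap97']
```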
The three words and, or and not are collectively known as boolean operators and crop up in a lot of places. For example, if you wanted to search for information on using Python in biology, but didn’t want to see pages that talked about biology of snakes, you might do a search for “biology python -snake”. This is actually a complex condition just like the ones above – Google automatically adds and between words, and uses the hyphen to mean not. So you’re asking for pages that mention python and biology but not snakes.
Writing true/false functions
Sometimes we want to write a function that can be used in a condition. This is very easy to do – we just make sure that our function always returns either True or False. Remember that True and False are built-in values in Python, so they can be passed around, stored in variables, and returned, just like numbers or strings.
Here’s a function that determines whether or not a DNA sequence is AT-rich (we’ll say that a sequence is AT-rich if it has an AT content of more than 0.65):
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False
Testing this function on a few sequences shows that it returns True or False, just like the other conditions we’ve been looking at.
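A sketch of such a test follows. The function is repeated so that the example runs on its own, and the three test sequences are made up for illustration:

```python
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False

# Made-up test sequences with AT contents of 9/11, 2/10 and 6/10.
print(is_at_rich("ATTATCTACTA"))  # True
print(is_at_rich("CGGCAGCGCT"))   # False
print(is_at_rich("atatatgcgc"))   # False (lower case input is handled too)
```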
Therefore we can use our function in an if statement:
if is_at_rich(my_dna):
    # do something with the sequence
Because the last four lines of our function are devoted to evaluating a condition and returning True or False, we can write a slightly more compact version. In this example we evaluate the condition, and then return the result right away:
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content > 0.65
This is a little more concise, and also easier to read once you’re familiar with the idiom.
In this short section, we’ve dealt with two things: conditions, and the statements that use them.
We’ve seen how simple conditions can be joined together to make more complex ones, and how the concepts of truth and falsehood are built in to Python on a fundamental level. We’ve also seen how we can incorporate True and False in our own functions in a way that allows them to be used as part of conditions.
We’ve been introduced to four different tools that use conditions – if, else, elif, and while – in approximate order of usefulness. You’ll probably find, in the programs that you write and in your solutions to the exercises in this book, that you use if and else very frequently, elif occasionally, and while almost never.
In the section_6 folder in the exercises download, you’ll find a text file called data.csv, containing some made-up data for a number of genes. Each line contains the following fields for a single gene in this order: species name, sequence, gene name, expression level. The fields are separated by commas (hence the name of the file – csv stands for Comma Separated Values). Think of it as a representation of a table in a spreadsheet – each line is a row, and each field in a line is a column. All the exercises for this section use the data read from this file.
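As a starting point, here is a minimal sketch of splitting one line of this kind into its four fields. The sample line is made up rather than taken from data.csv, and converting the expression level with int assumes the values are whole numbers, which may not hold for the real file:

```python
# A made-up line in the field order described above:
# species name, sequence, gene name, expression level
sample_line = "Hypothetical species,ATGCATTTAA,xyz1,125\n"

# Remove the trailing newline, then split on commas.
fields = sample_line.rstrip("\n").split(",")

species = fields[0]
sequence = fields[1]
gene_name = fields[2]
expression_level = int(fields[3])  # assumption: whole-number values

print(gene_name, len(sequence), expression_level)
```

With the real file, the same splitting would be applied to each line inside a loop over the opened file.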
Reminder: if you’re using Python 2 rather than Python 3, include this line at the top of your programs:
from __future__ import division
Print out the gene names for all genes belonging to Drosophila melanogaster or Drosophila simulans.
Print out the gene names for all genes between 90 and 110 bases long.
Print out the gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.
Print out the gene names for all genes whose name begins with “k” or “h” except those belonging to Drosophila melanogaster.
High low medium
For each gene, print out a message giving the gene name and saying whether its AT content is high (greater than 0.65), low (less than 0.45) or medium (between 0.45 and 0.65).