# Applied Python 1

[sc:fullwidth_signup]

# Large programs and complex data structures

## Sequence similarity search

A subject of great interest to biologists is the problem of identifying regions of similarity between DNA sequences.
In particular, we are interested in the case where we have a large collection of sequences about which something is known,
and we want to tell which, if any, are similar to a new sequence (this is pretty much the most common use case for BLAST).

How can we start to tackle this problem using Python? To start with, we need to define what we mean when we say that two
regions of DNA share similarity. In bioinformatics, we usually accomplish this using a scoring matrix. For each pair of bases in a chunk of
two sequences, we will look up the score in a matrix, and add them all together. Sometimes we include gaps in an alignment, but let’s forget

Here is a very simple scoring matrix:

```     A    T    G    C

A    1   -1   -1   -1

T    -1   1   -1   -1

G    -1  -1    1   -1

C    -1  -1   -1    1
```

or to put it another way, we score 1 for a match and -1 for a missmatch.

Given that we are not using gaps, a ‘match’ between two sequences is completely described by five pieces of information:

• the query sequence
• the subject sequence
• the start of the match on the query
• the start of the match on the subject
• the length of the match

Note that we are using the standard-ish names for the sequences – query is the unknown sequence, and subject is the known sequence. These names are completely
arbitrary, but using them will (1) avoid the need for inelegant names like ‘sequence A’ and (2) be more consistent with other programs.

So, let’s say we have the following two sequences:

```                     1         2
012345678901234567890123
subject = 'actgatcgattgatcgatcgatcg'
query   = 'tttagatcgatctttgatc'
```

I’ve added numbers along the top so that we can easily see the positions of individual characters. It is not particularly easy to see by eye, but there is a region of similarity which
is 8 bases long and starts at position 4 on the query and position 7 on the subject. It is easier to spot if we format it slightly differently:

```subject = 'actgatcGATTGATCgatcgatcg'
query   =    'tttaGATCGATCtttgatc'
```

What is the score of this match? There are seven matches and one missmatch, so the total score is six. It’s not too tricky to write a Python function to calculate the score.

```one_sequence = 'actgatcgattgatcgatcgatcg'
another_sequence   = 'tttagatcgatctttgatc'

# here are the five bits of information we described before
def score_match(subject, query, subject_start, query_start, length):
score = 0
# for each base in the match
for i in range(0,length):
# first figure out the matching base from both sequences
subject_base = subject[subject_start + i]
query_base = query[query_start + i]
# then adjust the score up or down depending on
# whether or not they are the same
if subject_base == query_base:
score = score + 1
else:
score = score - 1
return score

# here is the score for the match we were looking at above
print(score_match(one_sequence, another_sequence, 7, 4, 8))

# let's try a few other potential matches

# here is the same match but shorter
print(score_match(one_sequence, another_sequence, 7, 4, 4))

# how about a longer match
print(score_match(one_sequence, another_sequence, 7, 4, 12))

# and a random match
print(score_match(one_sequence, another_sequence, 10, 1, 5))

```
```6
2
4
-1
```

What if we didn’t know about the ‘good’ match in advance? We need some kind of ‘match proposal mechanism’ – something that can generate
proposed matches which we can then feed into our scoring function and decide whether or not they are good matches. Here is a brute-force approach:

```def try_all_matches(subject, query):

# try every possible value for subject start, query start, and length of match
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):

# the length can never be longer than the length of the shortest
# input sequence, so it doesn't matter whether we use the query or the subject
for length in range(1,len(query)):

# this will generate lots of proposed matches which go beyond the
#length of one of the input sequences
# so we will only try to score those that fit within the two sequences
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
print(subject_start, query_start, length, score)

try_all_matches(one_sequence, another_sequence)
```

The output from this script looks like this:

```0 0 1 -1
0 0 2 -2
0 0 3 -1
0 0 4 -2
0 0 5 -3
0 0 6 -4
0 0 7 -5
0 0 8 -6
0 0 9 -7
0 0 10 -8
0 0 11 -7
0 0 12 -8
0 0 13 -9
0 0 14 -8
```

This generates a huge amount of output; the vast majority of the matches are terrible – they score below zero. Let’s try a version that only shows us the good matches:

```def try_all_matches(subject, query, score_limit):
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):
for length in range(0,len(query)):
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
# only print a line of output if the score is better than some limie
if (score >= score_limit):
print(subject_start, query_start, length, score)

try_all_matches(one_sequence, another_sequence, 6)
```

output:

```2 3 8 6
3 4 6 6
3 4 7 7
3 4 8 6
4 5 6 6
5 2 10 6
7 0 12 6
7 4 8 6
8 1 10 6
8 1 11 7
8 1 12 6
8 1 14 6
9 2 8 6
```

It is quite tricky to visualise these matches in our heads. Let’s write another function who’s job is to display a match in a nicely-formatted way.

```# the arguments are the five bits of information that define a match
def pretty_print_match(subject, query, subject_start, query_start, length):

# first print the start/stop positions for the subject sequence
print(str(subject_start) + (' ' * length) + str(subject_start+length))

# then print the bit of the subject that matches
print(' ' + subject[subject_start:subject_start+length])

# then print the bit of the query that matches
print(' ' + query[query_start:query_start+length])

# finally print the start/stop positions for the query
print(str(query_start) + (' ' * length) + str(query_start+length))

print('n--------------------n')

pretty_print_match(one_sequence, another_sequence, 7, 4, 8)
```
```7        15
gattgatc
gatcgatc
4        12

--------------------
```

Now we can modify our `try_all_matches` to get nicely formatted output:

```def try_all_matches(subject, query, score_limit):
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):
for length in range(0,len(query)):
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
# only print a line of output if the score is better than some limie
if (score >= score_limit):
print('Score : ' + str(score))
pretty_print_match(subject, query, subject_start, query_start, length)

try_all_matches(one_sequence, another_sequence, 7)
```

Output:

```Score : 7
3       10
gatcgat
gatcgat
4       11

--------------------

Score : 7
8           19
attgatcgatc
ttagatcgatc
1           12

--------------------

Score : 7
9         18
ttgatcgat
tagatcgat
2         11

```

## A new proposal mechanism

Now we have a few nice building blocks to play around with. `try_all_matches` generates potential matches and uses `score_match` to calculate the score. If the score is good enough,
it then uses `pretty_print_match` to print a result. We can now alter the behaviour of our program by replacing `try_all_matches` with something a bit more sophisticated, and using the other two
functions as before. We will steal some ideas from a program that you might have heard of called BLAST ðŸ™‚

## How does BLAST work?

Rather than using a brute-force approach to consider all possible matches between two sequences, BLAST first identifies short regions (‘seeds’) of high similarity between the two sequences
by splitting both sequences into short ‘words’ of a fixed size and looking for words that are shared between both sequences. It then tries to extend each seed in both directions, stopping when
the score of the match falls below some threshold. Let’s try to express this procedure in a slightly more rigorous way:

• split the subject sequence up into words and record the position of each one
• split the query sequence up into words and for each word do the following:
• look at the list of positions to see if the same word occurs in the subject sequence
• if so, then for each matching position on the subject sequence do the following:
• extend the match on the right end by one base
• check the score of the new, extended match
• extend the match on the left end and check the score again
• keep extending the match in this way, alternating left and right, until the score drops below the threshold, then stop and report the match

This is simpler than BLAST, but it will do for a start!

## Exercises

1. A key part of the algorithm outlined above is generating a record of the positions of each word that occurs in the subject sequence (this record is what the index file of a BLAST database
is – this is what the makeblastdb program does when you are using command-line BLAST). Start off by writing a function to calculate this record. It will take as an input a sequence, and return a
record of all the positions of all the words in the sequence. What is the best way to store the return value? Let’s look at the data for our test subject sequence – we will use a word size of 4:

```                     1         2
012345678901234567890123
actgatcgattgatcgatcgatcg

word   positions
--------------------
actg   0
ctga   1
tgat   2
gatc   3, 11, 15, 19
atcg   4, 12, 16, 20
tcga   5, 13, 17
cgat   6, 14, 18
gatt   7
attg   8
ttga   9
tgat   10

```

This looks like a good candidate for a dict – the keys will be words and the values will be lists of positions. This will fit in nicely with the requirements of the algorithm.
Remember that a dict allows us to look up the value very quickly if we know the key. This is exactly what we need to do in the third bullet point in our description. So, if we use the subject
as the input sequence, our new function will return a dict that looks like this:

```{
'actg' :  [0],
'ctga' :  [1],
'tgat' :  [2],
'gatc' :  [3, 11, 15, 19],
'atcg' :  [4, 12, 16, 20],
'tcga'  : [5, 13, 17],
'cgat' :  [6, 14, 18],
'gatt' :  [7],
'attg' :  [8],
'ttga' :  [9],
'tgat' :  [10]
}
```

Look carefully at the structure here. The keys are all strings, the values are lists of numbers. Sometimes the lists have just one element, sometimes the have several (because a particular
word occurs more than once in the input sequence).

2. Once you have written (and tested!) your new function, you can start to write a replacement for `try_all_matches`. The easiest way to do this is in stages. First write a function (let’s call it `find_matches`) that will take
a subject sequence and a query sequence, and simply list the positions of all the words that are shared between the two. This will give us a set of very short ‘matches’ that are the same length
as the word size.

3. Then extend this function so that it also tries to extend the the match. The cleanest way to do this is to write a new function (let’s call it `extend_match`) who’s job is to take a single match, extend it as far
as possible in both directions, and print it. Just as before, build up the function in stages – first make it so that it extends the match to the
right by one base and calculates the new score. Then extend the function so that it does the same to the left. Then extend the function so that it does right and left alternately and stops when
the score falls too low. This will be easiest to express using a `while` loop

### Summary of functions

 name input output job score_match subject, query, subject_pos, query_pos, length score to calculate the score of a match get_words a sequence a dict mapping word – > list of positions to generate a list of all the positions of all the words in the sequence pretty_print_match subject, query, subject_pos, query_pos, length None (but it does print some ouput to the screen) to print out a match in a nicely-formatted way find_matches subject, query, wordsize None (but it causes some output to be printed) to get subject words dict using `get_words`, loop through query words, call `extend_match` for each shared word extend_match subject, query, subject_pos, query_pos, length None (but it causes some output to be printed) to extend a match as far as possible, then call `pretty_print_match` to print it

Remember to test your function at each step. Use a very small input sequence to test it (possibly even smaller than the one in the examples above).

Notice that there are quite a few numbers that affect the matches which we can tweak:

• the score for a match
• the score for a missmatch
• the word size
• the minimum match score to report
• the threshold match score for extending a match

Try different values for these numbers – how do they affect the results?

## Extensions to the program

Obviously, the program we have written here is much, much simpler than BLAST! Once you have it working, try adding back in some of the features. I have
listed them roughly in order of increasing difficulty. Pick one that looks interesting and try to implement it.

### Reporting more stats

4. Modify the `pretty_print_match` function so that it always reports
– the length of the match
– the percent identity
– the number of matches
– the number of missmatches

### Multiple subject sequences

5. The test scenario we have set up here is not very realistic. Normally we would expect to have a large number of subject sequences. Modify your program so that it can read a FASTA file full of
subject sequences. For this to work, we will need one dict per input sequence to store the word positions.

### Persistent word dicts

6. The whole reason for using BLAST-style searching instead of brute-force searching is to speed things up. However, our current program is not very good at this – the dict of subject sequence
word positions is re-calculated every time the program runs. To fix this, split the program into two parts. The first part will read some input sequences, create the dicts, and save them to a file.
The second part will read the dicts from the file and do the actual searching (this is the way command-line BLAST works). Python has a feature called ‘pickling’ that lets us save variables
to a file – take a look here:

http://wiki.python.org/moin/UsingPickle

### Match sorting

7. For most applications, rather than setting a minimum match score, we would like to see the matches sorted by score, with the best at the top. Modify your program to do this. The new program
will have to be a bit more complicated, because rather than just printing each match as it is found, it will need to score all matches, then sort them. Here is one way of storing matches – a list
of dicts. A match is defined by 5 bits of information (as described above) so a dict with 5 items can store a single match, and a list of such dicts can store a list of matches:

```my_matches = [
{
'subject'       : 'actgatcgattgatcgatcgatcg',
'query'         : 'tttagatcgatctttgatc',
'subject_start  : 3,
'query_start'   : 5,
'length'        : 6
},
{
'subject'       : 'actgatcgattgatcgatcgatcg',
'query'         : 'tttagatcgatctttgatc',
'subject_start  : 8,
'query_start'   : 12,
'length'        : 7
}
]
```

You will have to write a custom sort function to sort the dicts appropriately.

### Other alphabets

8. As it stands, your program will probably work ok for protein sequences too, but the simplified scoring matrix is unlikely to yield good results.
Modify your program to use the BLOSUM62 matrix. You will need to construct a dict of dicts
in order to look up the score for each amino-acid pair.

### Different extension rules

9. In its current state, your program may sometimes stop extending a hit before it should. Remember that your program tries to extend the hit on the
left and right alternately, and ‘gives up’ when extending the hit would lower the score too much. It’s possible for your program to encounter a hit that
could be successfully extended on the right side, but not on the left side. If it tries to extend it on the left side first, then your program will
‘give up’ too soon. Modify it so that it only ‘gives up’ after unsucessfully extending the match on both the left and right hand sides. Think about
other extension strategies. Maybe your program should extend as far as possible to the right, then as far as possible to the left. Maybe it should extend,
but then backtrack to the limits that gave the highest score (at the moment, it prefers a longer match even if it has a lower score, as long as the score
is still above the threshold).

### Merging HSPs

10. In its current state, your program can end up reporting multiple overlapping hits. Modify it so that it merges overlapping hits, provided that the merged hit wouldn’t be below the scoring threshold.