# Simple sequence similarity search

A subject of great interest to biologists is the problem of identifying regions of similarity between DNA sequences. In particular, we are interested in the case where we have a large collection of sequences about which something is known, and we want to tell which, if any, are similar to a new sequence (this is pretty much the most common use case for BLAST).

How can we start to tackle this problem using Python? To start with, we need to define what we mean when we say that two regions of DNA share similarity. In bioinformatics, we usually accomplish this using a scoring matrix. For each pair of bases in a chunk of two sequences, we will look up the score in a matrix, and add them all together. Sometimes we include gaps in an alignment, but let's forget about that for now.

Here is a very simple scoring matrix:

     A    T    G    C

A    1   -1   -1   -1

T    -1   1   -1   -1

G    -1  -1    1   -1

C    -1  -1   -1    1



or to put it another way, we score 1 for a match and -1 for a missmatch.

Given that we are not using gaps, a 'match' between two sequences is completely described by five pieces of information:

• the query sequence
• the subject sequence
• the start of the match on the query
• the start of the match on the subject
• the length of the match

Note that we are using the standard-ish names for the sequences – query is the unknown sequence, and subject is the known sequence. These names are completely arbitrary, but using them will (1) avoid the need for inelegant names like 'sequence A' and (2) be more consistent with other programs.

So, let's say we have the following two sequences:

                     1         2
012345678901234567890123
subject = 'actgatcgattgatcgatcgatcg'
query   = 'tttagatcgatctttgatc'



I've added numbers along the top so that we can easily see the positions of individual characters. It is not particularly easy to see by eye, but there is a region of similarity which is 8 bases long and starts at position 4 on the query and position 7 on the subject. It is easier to spot if we format it slightly differently:

subject = 'actgatcGATTGATCgatcgatcg'
query   =    'tttaGATCGATCtttgatc'



What is the score of this match? There are seven matches and one missmatch, so the total score is six. It's not too tricky to write a Python function to calculate the score:

In [ ]:
one_sequence = 'actgatcgattgatcgatcgatcg'
another_sequence   = 'tttagatcgatctttgatc'

# here are the five bits of information we described before
def score_match(subject, query, subject_start, query_start, length):
score = 0
# for each base in the match
for i in range(0,length):
# first figure out the matching base from both sequences
subject_base = subject[subject_start + i]
query_base = query[query_start + i]
# then adjust the score up or down depending on
# whether or not they are the same
if subject_base == query_base:
score = score + 1
else:
score = score - 1
return score


Here is the score for the match we were looking at above:

In [2]:
score_match(one_sequence, another_sequence, 7, 4, 8)

Out[2]:
6

Let's try a few other potential matches:

In [3]:
# here is the same match but shorter
print(score_match(one_sequence, another_sequence, 7, 4, 4))

# how about a longer match
print(score_match(one_sequence, another_sequence, 7, 4, 12))

# and a random match
print(score_match(one_sequence, another_sequence, 10, 1, 5))

2
4
-1


What if we didn't know about the 'good' match in advance? We need some kind of 'match proposal mechanism' – something that can generate proposed matches which we can then feed into our scoring function and decide whether or not they are good matches. Here is a brute-force approach:

In [11]:
def try_all_matches(subject, query):

# try every possible value for subject start and query start
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):

# try every possible value for length of match
# the length can never be longer than the length of the shortest
# input sequence, so it doesn't matter whether we use the query or the subject
for length in range(1,len(query)):

# this will generate lots of proposed matches which go beyond the
# length of one of the input sequences
# so we will only try to score those that fit within the two sequences
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
print(subject_start, query_start, length, score)


This generates a huge amount of output looking like this:

0 0 1 -1
0 0 2 -2
0 0 3 -1
0 0 4 -2
0 0 5 -3
0 0 6 -4
0 0 7 -5
0 0 8 -6
0 0 9 -7
0 0 10 -8

We can see how it's trying every possible combination of starting position and length. Unsurprisingly, the vast majority of the matches are terrible – they score below zero. Let's try a version that only shows us the good matches:

In [5]:
def try_all_matches(subject, query, score_limit):
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):
for length in range(0,len(query)):
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
# only print a line of output if the score is better than some limie
if (score >= score_limit):
print(subject_start, query_start, length, score)


Here's the output with a score_limit of 6 i.e. we only want to see the matches that score at least 6:

In [6]:
try_all_matches(one_sequence, another_sequence, 6)

2 3 8 6
3 4 6 6
3 4 7 7
3 4 8 6
4 5 6 6
5 2 10 6
7 0 12 6
7 4 8 6
8 1 10 6
8 1 11 7
8 1 12 6
8 1 14 6
9 2 8 6
9 2 9 7
9 2 10 8
9 2 11 7
9 2 12 6
9 2 13 7
9 2 14 6
10 3 8 6
10 3 9 7
10 3 10 6
10 3 12 6
11 0 12 6
11 4 6 6
11 4 7 7
11 4 8 8
11 4 9 7
11 4 10 6
11 4 11 7
11 4 12 6
12 1 10 6
12 1 11 7
12 5 6 6
12 5 7 7
12 5 8 6
12 5 10 6
13 2 8 6
13 2 9 7
13 2 10 8
13 6 6 6
14 3 8 6
14 3 9 7
15 4 6 6
15 4 7 7
15 4 8 8
16 5 6 6
16 5 7 7
17 6 6 6


It is quite tricky to visualise these matches in our heads. Let's write another function who's job is to display a match in a nicely formatted way:

In [8]:
# the arguments are the five bits of information that define a match
def pretty_print_match(subject, query, subject_start, query_start, length):

# first print the start/stop positions for the subject sequence
print(str(subject_start) + (' ' * length) + str(subject_start+length))

# then print the bit of the subject that matches
print(' ' + subject[subject_start:subject_start+length])

# then print the bit of the query that matches
print(' ' + query[query_start:query_start+length])

# finally print the start/stop positions for the query
print(str(query_start) + (' ' * length) + str(query_start+length))

print('\n--------------------\n')


We can pick one of the good matches and visualize it:

In [9]:
pretty_print_match(one_sequence, another_sequence, 7, 4, 8)

7        15
gattgatc
gatcgatc
4        12

--------------------



Now we can modify our try_all_matches() to get nicely formatted output:

In [10]:
def try_all_matches(subject, query, score_limit):
for subject_start in range(0,len(subject)):
for query_start in range(0,len(query)):
for length in range(0,len(query)):
if (subject_start + length < len(subject) and query_start + length < len(query)):
score = score_match(subject, query, subject_start, query_start, length)
# only print a line of output if the score is better than some limie
if (score >= score_limit):
print('Score : ' + str(score))
pretty_print_match(subject, query, subject_start, query_start, length)

try_all_matches(one_sequence, another_sequence, 7)

Score : 7
3       10
gatcgat
gatcgat
4       11

--------------------

Score : 7
8           19
attgatcgatc
ttagatcgatc
1           12

--------------------

Score : 7
9         18
ttgatcgat
tagatcgat
2         11

--------------------

Score : 8
9          19
ttgatcgatc
tagatcgatc
2          12

--------------------

Score : 7
9           20
ttgatcgatcg
tagatcgatct
2           13

--------------------

Score : 7
9             22
ttgatcgatcgat
tagatcgatcttt
2             15

--------------------

Score : 7
10         19
tgatcgatc
agatcgatc
3         12

--------------------

Score : 7
11       18
gatcgat
gatcgat
4       11

--------------------

Score : 8
11        19
gatcgatc
gatcgatc
4        12

--------------------

Score : 7
11         20
gatcgatcg
gatcgatct
4         13

--------------------

Score : 7
11           22
gatcgatcgat
gatcgatcttt
4           15

--------------------

Score : 7
12           23
atcgatcgatc
ttagatcgatc
1           12

--------------------

Score : 7
12       19
atcgatc
atcgatc
5       12

--------------------

Score : 7
13         22
tcgatcgat
tagatcgat
2         11

--------------------

Score : 8
13          23
tcgatcgatc
tagatcgatc
2          12

--------------------

Score : 7
14         23
cgatcgatc
agatcgatc
3         12

--------------------

Score : 7
15       22
gatcgat
gatcgat
4       11

--------------------

Score : 8
15        23
gatcgatc
gatcgatc
4        12

--------------------

Score : 7
16       23
atcgatc
atcgatc
5       12

--------------------



## A new proposal mechanism¶

Now we have a few nice building blocks to play around with. try_all_matches() generates potential matches and uses score_match() to calculate the score. If the score is good enough, it then uses pretty_print_match() to print a result. We can now alter the behaviour of our program by replacing try_all_matches() with something a bit more sophisticated, and using the other two functions as before. We will steal some ideas from a program that you might have heard of called BLAST :-)

## How does BLAST work?¶

Rather than using a brute-force approach to consider all possible matches between two sequences, BLAST first identifies short regions ('seeds') of high similarity between the two sequences by splitting both sequences into short 'words' of a fixed size and looking for words that are shared between both sequences. It then tries to extend each seed in both directions, stopping when the score of the match falls below some threshold. Let's try to express this procedure in a slightly more rigorous way:

• split the subject sequence up into words and record the position of each one
• split the query sequence up into words and for each word do the following:
• look at the list of positions to see if the same word occurs in the subject sequence
• if so, then for each matching position on the subject sequence do the following:
• extend the match on the right end by one base
• check the score of the new, extended match
• extend the match on the left end and check the score again
• keep extending the match in this way, alternating left and right, until the score drops below the threshold, then stop and report the match

This is simpler than BLAST, but it will do for a start! Implementing this idea is left as an exercise for now; I'll drop an update here if I write a solution.