Simple sequence similarity search

Sequence similarity search

A subject of great interest to biologists is the problem of identifying regions of similarity between DNA sequences. In particular, we are interested in the case where we have a large collection of sequences about which something is known, and we want to tell which, if any, are similar to a new sequence (this is pretty much the most common use case for BLAST).

How can we start to tackle this problem using Python? To start with, we need to define what we mean when we say that two regions of DNA share similarity. In bioinformatics, we usually accomplish this using a scoring matrix. For each pair of bases in a chunk of two sequences, we will look up the score in a matrix, and add them all together. Sometimes we include gaps in an alignment, but let’s forget about that for now.

Here is a very simple scoring matrix:

     A    T    G    C

A    1   -1   -1   -1

T    -1   1   -1   -1

G    -1  -1    1   -1

C    -1  -1   -1    1

or to put it another way, we score 1 for a match and -1 for a missmatch.

Given that we are not using gaps, a ‘match’ between two sequences is completely described by five pieces of information:

the query sequence
the subject sequence
the start of the match on the query
the start of the match on the subject
the length of the match

Note that we are using the standard-ish names for the sequences – query is the unknown sequence, and subject is the known sequence. These names are completely arbitrary, but using them will (1) avoid the need for inelegant names like ‘sequence A’ and (2) be more consistent with other programs.

So, let’s say we have the following two sequences:

                     1         2
           012345678901234567890123
subject = 'actgatcgattgatcgatcgatcg'
query   = 'tttagatcgatctttgatc'

I’ve added numbers along the top so that we can easily see the positions of individual characters. It is not particularly easy to see by eye, but there is a region of similarity which is 8 bases long and starts at position 4 on the query and position 7 on the subject. It is easier to spot if we format it slightly differently:

subject = 'actgatcGATTGATCgatcgatcg'
query   =    'tttaGATCGATCtttgatc'

What is the score of this match? There are seven matches and one missmatch, so the total score is six. It’s not too tricky to write a Python function to calculate the score:

one_sequence = 'actgatcgattgatcgatcgatcg'
another_sequence   = 'tttagatcgatctttgatc'

# here are the five bits of information we described before
def score_match(subject, query, subject_start, query_start, length):
    score = 0
    # for each base in the match
    for i in range(length):
        # first figure out the matching base from both sequences
        subject_base = subject[subject_start + i]
        query_base = query[query_start + i]
        # then adjust the score up or down depending on 
        # whether or not they are the same
        if subject_base == query_base:
            score = score + 1
        else:
            score = score - 1
    return score

Here is the score for the match we were looking at above:

score_match(one_sequence, another_sequence, 7, 4, 8)

Let’s try a few other potential matches:

# here is the same match but shorter
print(score_match(one_sequence, another_sequence, 7, 4, 4))

# how about a longer match
print(score_match(one_sequence, another_sequence, 7, 4, 12))

# and a random match
print(score_match(one_sequence, another_sequence, 10, 1, 5))

2
4
-1

What if we didn’t know about the ‘good’ match in advance? We need some kind of ‘match proposal mechanism’ – something that can generate proposed matches which we can then feed into our scoring function and decide whether or not they are good matches. Here is a brute-force approach:

def try_all_matches(subject, query):

 # try every possible value for subject start and query start
    for subject_start in range(0,len(subject)):
        for query_start in range(0,len(query)):

            # try every possible value for length of match    
            # the length can never be longer than the length of the shortest 
            # input sequence, so it doesn't matter whether we use the query or the subject
            for length in range(1,len(query)):

                # this will generate lots of proposed matches which go beyond the 
                # length of one of the input sequences
                # so we will only try to score those that fit within the two sequences
                if (subject_start + length < len(subject) and query_start + length < len(query)):
                    score = score_match(subject, query, subject_start, query_start, length)
                    print(subject_start, query_start, length, score)

This generates a huge amount of output looking like this:

We can see how it’s trying every possible combination of starting position and length. Unsurprisingly, the vast majority of the matches are terrible – they score below zero. Let’s try a version that only shows us the good matches:

def try_all_matches(subject, query, score_limit):
    for subject_start in range(0,len(subject)):
        for query_start in range(0,len(query)):
            for length in range(0,len(query)):
                if (subject_start + length < len(subject) and query_start + length < len(query)):
                    score = score_match(subject, query, subject_start, query_start, length)
                    # only print a line of output if the score is better than some limie
                    if (score >= score_limit):
                        print(subject_start, query_start, length, score)

Here’s the output with a score_limit of 6 i.e. we only want to see the matches that score at least 6:

try_all_matches(one_sequence, another_sequence, 6)

It is quite tricky to visualise these matches in our heads. Let’s write another function who’s job is to display a match in a nicely formatted way:

# the arguments are the five bits of information that define a match
def pretty_print_match(subject, query, subject_start, query_start, length):

    # first print the start/stop positions for the subject sequence
    print(str(subject_start) + (' ' * length) + str(subject_start+length))

    # then print the bit of the subject that matches
    print(' ' + subject[subject_start:subject_start+length])

    # then print the bit of the query that matches
    print(' ' + query[query_start:query_start+length])

    # finally print the start/stop positions for the query
    print(str(query_start) + (' ' * length) + str(query_start+length))

    print('\n--------------------\n')

We can pick one of the good matches and visualize it:

pretty_print_match(one_sequence, another_sequence, 7, 4, 8)

7        15
 gattgatc
 gatcgatc
4        12

--------------------

Now we can modify our try_all_matches() to get nicely formatted output:

def try_all_matches(subject, query, score_limit):
    for subject_start in range(0,len(subject)):
        for query_start in range(0,len(query)):
            for length in range(0,len(query)):
                if (subject_start + length < len(subject) and query_start + length < len(query)):
                    score = score_match(subject, query, subject_start, query_start, length)
                    # only print a line of output if the score is better than some limie
                    if (score >= score_limit):
                        print('Score : ' + str(score))
                        pretty_print_match(subject, query, subject_start, query_start, length)
    
try_all_matches(one_sequence, another_sequence, 7)

Score : 7
3       10
 gatcgat
 gatcgat
4       11

--------------------

Score : 7
8           19
 attgatcgatc
 ttagatcgatc
1           12

--------------------

Score : 7
9         18
 ttgatcgat
 tagatcgat
2         11

--------------------

Score : 8
9          19
 ttgatcgatc
 tagatcgatc
2          12

--------------------

Score : 7
9           20
 ttgatcgatcg
 tagatcgatct
2           13

--------------------

Score : 7
9             22
 ttgatcgatcgat
 tagatcgatcttt
2             15

--------------------

Score : 7
10         19
 tgatcgatc
 agatcgatc
3         12

--------------------

Score : 7
11       18
 gatcgat
 gatcgat
4       11

--------------------

Score : 8
11        19
 gatcgatc
 gatcgatc
4        12

--------------------

Score : 7
11         20
 gatcgatcg
 gatcgatct
4         13

--------------------

Score : 7
11           22
 gatcgatcgat
 gatcgatcttt
4           15

--------------------

Score : 7
12           23
 atcgatcgatc
 ttagatcgatc
1           12

--------------------

Score : 7
12       19
 atcgatc
 atcgatc
5       12

--------------------

Score : 7
13         22
 tcgatcgat
 tagatcgat
2         11

--------------------

Score : 8
13          23
 tcgatcgatc
 tagatcgatc
2          12

--------------------

Score : 7
14         23
 cgatcgatc
 agatcgatc
3         12

--------------------

Score : 7
15       22
 gatcgat
 gatcgat
4       11

--------------------

Score : 8
15        23
 gatcgatc
 gatcgatc
4        12

--------------------

Score : 7
16       23
 atcgatc
 atcgatc
5       12

--------------------

A new proposal mechanism

Now we have a few nice building blocks to play around with. try_all_matches() generates potential matches and uses score_match() to calculate the score. If the score is good enough, it then uses pretty_print_match() to print a result. We can now alter the behaviour of our program by replacing try_all_matches() with something a bit more sophisticated, and using the other two functions as before. We will steal some ideas from a program that you might have heard of called BLAST :-)

How does BLAST work?

Rather than using a brute-force approach to consider all possible matches between two sequences, BLAST first identifies short regions (‘seeds’) of high similarity between the two sequences by splitting both sequences into short ‘words’ of a fixed size and looking for words that are shared between both sequences. It then tries to extend each seed in both directions, stopping when the score of the match falls below some threshold. Let’s try to express this procedure in a slightly more rigorous way:

split the subject sequence up into words and record the position of each one
split the query sequence up into words and for each word do the following:
look at the list of positions to see if the same word occurs in the subject sequence
if so, then for each matching position on the subject sequence do the following:
extend the match on the right end by one base
check the score of the new, extended match
extend the match on the left end and check the score again
keep extending the match in this way, alternating left and right, until the score drops below the threshold, then stop and report the match

This is simpler than BLAST, but it will do for a start! Implementing this idea is left as an exercise for now; I’ll drop an update here if I write a solution.