Simple sequence similarity search

A new proposal mechanism

Now we have a few nice building blocks to play around with. try_all_matches() generates potential matches and uses score_match() to calculate the score. If the score is good enough, it then uses pretty_print_match() to print a result. We can now alter the behaviour of our program by replacing try_all_matches() with something a bit more sophisticated, and using the other two functions as before. We will steal some ideas from a program that you might have heard of called BLAST :-)

How does BLAST work?

Rather than using a brute-force approach to consider all possible matches between two sequences, BLAST first identifies short regions (‘seeds’) of high similarity between the two sequences by splitting both sequences into short ‘words’ of a fixed size and looking for words that are shared between both sequences. It then tries to extend each seed in both directions, stopping when the score of the match falls below some threshold. Let’s try to express this procedure in a slightly more rigorous way:

  • split the subject sequence up into words and record the position of each one
  • split the query sequence up into words and for each word do the following:
  • look at the list of positions to see if the same word occurs in the subject sequence
  • if so, then for each matching position on the subject sequence do the following:
  • extend the match on the right end by one base
  • check the score of the new, extended match
  • extend the match on the left end and check the score again
  • keep extending the match in this way, alternating left and right, until the score drops below the threshold, then stop and report the match

This is simpler than BLAST, but it will do for a start! Implementing this idea is left as an exercise for now; I’ll drop an update here if I write a solution.