# Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology

All Stack Overflow 7
This Month Stack Overflow 3

I would go for the "longest common subsequence". A standard implementation can be found here:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence

If you need a better algorithm and/or lots of additional ideas for inexact matching of strings, THE standard reference is this book:

http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

Don't be fooled by the title. Compuatational biology is mostly about matching of large database of long strings (also known as DNA sequences).

Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.

I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences which is all about string operations for computational biology.

Your intuition behind why the algorithm should be Θ(n2) is a good one, but most suffix trees are designed in a way that eliminates the need for this time complexity. Intuitively, it would seem that you need Θ(n2) different nodes to hold all of the different suffixes, because you'd need n + (n - 1) + ... + 1 different nodes. However, suffix trees are typically designed so that there isn't a single node per character in the suffix. Instead, each edge is typically labeled with a sequence of characters that are substrings of the original string. It still may seem that you'd need Θ(n2) time to construct this tree because you'd have to copy the substrings over to these edges, but typically this is avoided by a cute trick - since all the edges are labeled with strings that are substrings of the input, the edges can instead be labeled with a start and end position, meaning that an edge spanning Θ(n) characters can be constructed in O(1) time and using O(1) space.

That said, constructing suffix trees is still really hard to do. The Θ(n) algorithms referenced in Wikipedia aren't easy. One of the first algorithms found to work in linear time is Ukkonen's Algorithm, which is commonly described in textbooks on string algorithms (such as Algorithms on Strings, Trees, and Sequences). The original paper is linked in Wikipedia. More modern approaches work by first building a suffix array and using that to then construct the suffix tree.

Hope this helps!

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.

The most fundamental data structure used in bioinformatics is string. There are also a whole range of different data structures representing strings. And algorithms like string matching are based on the efficient representation/data structures.

A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences

Suffix tree related algorithms are useful here.

The same book describes simpler O(N2) algorithm for this problem, also using suffix trees.