1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time (and even in O(N) time using Ukkonen's algorithm), where N is the length of the input string. It then allows many otherwise difficult tasks to be solved in O(M) time, where M is the length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that, with a suffix tree and a smart formulation of the problem, it can be solved in reasonable time.
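To make the O(M) lookup idea concrete, here is a minimal sketch in Python. Note the assumptions: this is a naive, uncompressed suffix trie (not a true suffix tree and not Ukkonen's algorithm), so building it costs O(N^2) time and space. The class name `SuffixTrie` and its `contains` method are made up for illustration. Only the query side shows the O(M) behaviour described above.

```python
class SuffixTrie:
    """Naive, uncompressed suffix trie (illustrative only).

    Building takes O(N^2) time and space; a real suffix tree
    compresses edges and can be built in O(N).
    """

    def __init__(self, text):
        self.root = {}
        # Insert every suffix text[start:] character by character.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        """Return True if pattern is a substring of the text, in O(M) steps."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True


trie = SuffixTrie("banana")
print(trie.contains("nan"))  # substring of "banana"
print(trie.contains("nab"))  # not a substring
```

Every substring of the text is a prefix of some suffix, which is why walking M characters from the root is enough to answer the query.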

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.

But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

I am not sure, though: this one might be somewhat off-topic with respect to your particular problem.
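In case it helps to judge whether Aho-Corasick fits, here is a rough Python sketch of the usual textbook formulation: a trie of the patterns plus failure links computed by breadth-first search. The function names `build_automaton` and `find_all` are my own, and this is a plain illustration under those assumptions rather than a tuned implementation.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns ending at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: depth-1 states keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The point of the failure links is that the text is scanned once, whatever the number of patterns, so this is mostly useful when many patterns have to be searched for at the same time.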

The most fundamental data structure used in bioinformatics is the string. There is also a whole range of different data structures for representing strings, and algorithms such as string matching depend on these efficient representations.

A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences.

Suffix tree related algorithms are useful here.

One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It combines a divide-and-conquer approach with suffix trees and has time complexity O(N log N + Z), where Z is the number of substring repetitions.

The same book describes a simpler O(N^2) algorithm for this problem, also using suffix trees.