Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition (The Morgan Kaufmann Series in Multimedia Information and Systems)

Author: Ian H. Witten, Alistair Moffat, Timothy C. Bell
All Stack Overflow 8
This Year Hacker News 2
This Month Hacker News 1


by DamonHD   2022-11-03
You had to know things and keep them in your head, and have a pile of text books otherwise! Plus no normal app had access to gigabytes of storage, and indeed that was a vast amount for many years more[1]. Skills learnt then in terms of memory and CPU efficiency are still valuable now, though your typical smartphone is more powerful than the supercomputer replacements I used to look after for an oil company...

[1] (1999)

by mindcrime   2021-12-05
I don't even know if anybody has written a book specifically about search at "web scale" (no MongoDB jokes here, please). But about the closest things I know of would be something like:

by PaulHoule   2021-06-21
This book has a nice treatment of that kind of compression:

For instance you might be keep track of facts like

   the word "the" is contained in document 1
   the word "john" is contained in document 1
   the word "the" is contained in document 2
   the word "john" is contained in document 12
and you code the gaps; the word "the" appears in every document and the gap is always 1, but the gap for "john" is 11. With a variable-sized encoding you use fewer bits for smaller gaps -- with that kind of encoding you don't have to make "the" be a stopword because you can afford to encode all the postings.
by verytrivial   2017-10-23
That name sound very familiar, as does the feature set. Managing Gigabytes[1], or "mg" was the output of a University of Melbourne and RMIT research in the 1990s. It went on to be commercialized as SIM and later TeraText[2] and has largely disappeared into the government intelligence indexing and consulting-heavy systems space (where it is now presumably being trounced by Palantir).

[1] - Note review from Peter Norvig!


by nostrademons   2017-09-14
Separate out the concepts of "search infrastructure" (how documents and posting lists are stored in terms of bits on disk & RAM) and "ranking functions" (how queries are matched to documents).

The former is basically a solved problem. Lucene/ElasticSearch and Google are using basically the same techniques, and you can read about them in Managing Gigabytes [1], which was first published over 2 decades ago. Google may be a generation or so ahead - they were working on a new system to take full advantage of SSDs (which turn out to be very good for search, because it's a very read-heavy workload) when I left, and I don't really know the details of it. But ElasticSearch is a perfectly adequate retrieval system, and it does basically the same stuff that Google's systems did circa 2013, and even does some stuff better than Google.

The real interesting work in search is in ranking functions, and this is where nobody comes close to Google. Some of this, as other commenters note, is because Google has more data than anyone else. Some of it is just because there've been more man-hours poured into it. IMHO, it's pretty doubtful that an open-source project could attract that sort of focused knowledge-work (trust me; it's pretty laborious) when Google will pay half a mil per year for skilled information-retrieval Ph.Ds.


by anonymous   2017-08-20

OK, I only have a half-answer for you... a good place to start might be to go look through the source of an open-source project like Nutch or Solr or Apache Lucene.

If you're interested in options aside from open-source, a really, really good textbook on this very topic is "Managing Gigabytes". The book goes through many different search, IR, and storage algorithms for developing search engines:

by anonymous   2017-08-20

You need a better understanding about search engines first. There are normally

1) a web crawler, something that get the documents you want to add to your search data space. THis is usually totally outside the scope of what you call "search engine".

2) a parser which is taking the document and splitting it into indexable text fragments. If usually works with different file formats, human languages and is preprocessing the text in maybe some fixed records and flow text. Linguistic algorithms (like stemmers - search for Porter Stemmer to get simple one) are also applied here.

3) A indexer which might be as simple as an inverted list of words per document or as complex as you want if you try to be as clever as google. Building an index is the really magic part of a successfull search engine. Usually there are multiple ranking algorithms that are put together.

4) The frontend with an optional query language. THis is where google is really bad but as you can see on googles success it might not be so important for 98% of the people. But i really miss this.

I think you are asking for (3) the indexer. Basically there are 2 different kind of algorithms you find in classic information retrieval literature. Vector Space model and Boolean Search. The later is easy, just check if the search words are inside the document and return a boolean value. Each search term can be given a relevanz probability. And for different search terms you can use Bayesian probability to sum up the relevanz and add return the highest ranked documents. The vector model treats a document as a vector of all its words you can build a scalar vector product between documents to judge if they are close together - this is a much more complex theroy. The father of IR (information retrieval) was Gerald Salton, you will find a lot of literature under his name.

This was the state of IR art until 1999 (i wrote my diploma thesis about a usenet news search engine in 1998). Then google came and all the theory went into the trashcan of academic stupidity and pratical irrelevanz.

Google was not build on mainstream IR theory. Read in the link that Srirangan gave you about it. Its just an ad hock relevanz function build on many many different sources. You will not find anything in this area beside white paper marketing blablabla. This algorithms are the business secret and capital of the search engine companies.

For simple search engines look at the lucence library or at dtsearch which was always my choice for an embeddable search engine library.

There is not really a lot of example code nor available information in the open source world about IR technology. Most of them like lucense are just implementing the most primitive operations. You have to buy books and go to a university library to get access to research literature.

As literature i would recommend starting with this book link text alt text,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg

by anonymous   2017-08-20

There's no single technique or 'silver bullet'. But if you do start from scratch better grok this and this standard text on the topic.