Found in 6 comments on Hacker News
nostrademons · 2017-09-13 · Original thread
Separate out the concepts of "search infrastructure" (how documents and posting lists are stored in terms of bits on disk & RAM) and "ranking functions" (how queries are matched to documents).

The former is basically a solved problem. Lucene/ElasticSearch and Google are using basically the same techniques, and you can read about them in Managing Gigabytes [1], which was first published over 2 decades ago. Google may be a generation or so ahead - they were working on a new system to take full advantage of SSDs (which turn out to be very good for search, because it's a very read-heavy workload) when I left, and I don't really know the details of it. But ElasticSearch is a perfectly adequate retrieval system, and it does basically the same stuff that Google's systems did circa 2013, and even does some stuff better than Google.

The really interesting work in search is in ranking functions, and this is where nobody comes close to Google. Some of this, as other commenters note, is because Google has more data than anyone else. Some of it is just because there have been more man-hours poured into it. IMHO, it's pretty doubtful that an open-source project could attract that sort of focused knowledge work (trust me; it's pretty laborious) when Google will pay half a mil per year for skilled information-retrieval Ph.D.s.
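
To make the split between "search infrastructure" and "ranking functions" concrete, here is a minimal Python sketch. It is not how Lucene, ElasticSearch, or Google implement either side; the function names `build_index` and `rank`, the sample documents, and the plain TF-IDF scoring are all assumptions chosen for illustration. The index only stores posting lists, and the ranking function can be swapped out without touching that storage.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Infrastructure side: map each term to a posting list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Documents are visited in doc_id order, so each posting list stays sorted by doc_id.
        for term, tf in Counter(text.lower().split()).items():
            index[term].append((doc_id, tf))
    return index


def rank(index, num_docs, query):
    """Ranking side: score documents with a plain TF-IDF sum (illustrative only)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by the lower doc_id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample documents, not taken from the thread.
docs = [
    "compressing and indexing documents",
    "indexing images",
    "ranking documents by relevance",
]
index = build_index(docs)
results = rank(index, len(docs), "indexing documents")
for doc_id, score in results:
    print(doc_id, round(score, 3), docs[doc_id])
```

In a sketch like this, replacing `rank` with a different scoring function leaves `build_index` unchanged, which is the separation the comment describes.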


verytrivial · 2017-04-26 · Original thread
That name sounds very familiar, as does the feature set. Managing Gigabytes[1], or "mg", was the output of University of Melbourne and RMIT research in the 1990s. It went on to be commercialized as SIM and later TeraText[2] and has largely disappeared into the government intelligence indexing and consulting-heavy systems space (where it is now presumably being trounced by Palantir).

[1] - Note review from Peter Norvig!


drblast · 2014-07-28 · Original thread
If you enjoy reading articles about the rediscovery of indexing large amounts of read-only data, I'd highly recommend reading this book, which is a treasure trove about this kind of work:

sajid · 2011-12-21 · Original thread
I recommend reading 'Managing Gigabytes' by Witten, Moffat and Bell:

dejv · 2010-05-04 · Original thread
You can take a look at Managing Gigabytes (

It is a nice book, but it might be a little bit outdated.

slackerIII · 2008-03-03 · Original thread
I always have to plug Managing Gigabytes whenever a discussion of computer books comes up. Great reference for anyone dealing with searching or compressing large amounts of information:
