Thursday, April 9

word searching

So... I was reading http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf and I ran across this bit:

Ways of Index Partitioning
  • By doc: each shard has index for subset of docs
    • pro: each shard can process queries independently 
    • pro: easy to keep additional per-doc information 
    • pro: network traffic (requests/responses) small 
    • con: query has to be processed by each shard 
    • con: O(K*N) disk seeks for K word query on N shards  
  • By word: shard has subset of words for all docs 
    • pro: K word query => handled by at most K shards 
    • pro: O(K) disk seeks for K word query 
    • con: much higher network bandwidth needed 
      • data about each word for each matching doc must be collected in one place 
    • con: harder to have per-doc information

Hopefully they have moved beyond that. I mean, it's not like they lack the time or the talent. Also, some of the search algorithms I have heard described strongly suggest word-based indexing.

That said, here are some thoughts:

  1. Hybrid approaches are valid. You can speed up doc searches if you only search docs which contain the word. This isn't all that useful when the word is "the" but can be quite useful when the word is "drachma".
  2. You can have word indices which select documents, and word indices which select elements within documents (paragraphs, sentences, table cells, titles, ...) - a rough sketch of the element-scoped kind follows this list.
  3. You can have word indices which also track adjacent words (more on this below).
  4. Words themselves are an exact sequence of characters, but you can build out lists of "this word likely means this list of words in this context" for users where that makes sense. Identifying contexts takes work, of course - and feedback.
If you are looking for a single word, having a canned search result for that word makes the search trivial - precached. That said, you could also think of this from an Oxford English Dictionary approach, where words tend to have many definitions and examples, all of which are slight variations on each other. Here, you try to give the user a way of selecting the context they are currently interested in. (For example: are they researching spam, are they dealing with biochemistry, are they searching for cute cat pics, are they building a bridge, etc.?)
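
To make point 2 concrete before going further, here's a minimal sketch (Python, invented names, naive whitespace tokenization - none of this is from the slides) of a positional, element-scoped word index. Each posting records the document, the element within it, and the word's position in that element:

from collections import defaultdict

def build_index(docs):
    # docs: {doc_id: [(element_id, text), ...]}
    # Returns: word -> [(doc_id, element_id, word_position), ...]
    index = defaultdict(list)
    for doc_id, elements in docs.items():
        for element_id, text in elements:
            for position, word in enumerate(text.lower().split()):
                index[word].append((doc_id, element_id, position))
    return index

docs = {"d1": [("title", "office of the president"),
               ("p1", "the president has an office")]}
index = build_index(docs)
print(index["office"])  # [('d1', 'title', 0), ('d1', 'p1', 4)]
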

But when you have multiple words, it gets more interesting. Here, you can start with the rarest word and use its postings as the basis for your search result, constraining them by the more frequent words. Phrase search should maybe be the default, falling back first to words in the same document element, then (if that fails) to words in the same document, and possibly (if that fails) to words in documents linking to the document. (Note that these fallbacks need some description to the user of what they're getting and how to reach the slower approaches - and you can impose a small search delay on the fallbacks to hint at the underlying mechanical costs.) And if you have adjacent words stored, you might be able to conduct a phrase search entirely without visiting the indices for the most common words in the phrase.
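
Continuing the sketch above (same made-up index shape, reusing the `index` built earlier), rarest-word-first search with the element-then-document fallback might look roughly like this:

def postings(index, word):
    # Hypothetical helper over the index sketched earlier:
    # word -> [(doc_id, element_id, word_position), ...]
    return index.get(word.lower(), [])

def search(index, query_words):
    # Rarest word first: its posting list is the smallest candidate
    # set, and the commoner words then merely constrain it.
    ordered = sorted(query_words, key=lambda w: len(postings(index, w)))

    def scopes(word, key):
        return {key(p) for p in postings(index, word)}

    # Pass 1: every word in the same element of the same document.
    hits = set.intersection(*(scopes(w, lambda p: (p[0], p[1])) for w in ordered))
    if hits:
        return "same-element", sorted(hits)

    # Pass 2, the slower fallback (perhaps behind a small delay, as a
    # hint at the cost): every word somewhere in the same document.
    hits = set.intersection(*(scopes(w, lambda p: p[0]) for w in ordered))
    return "same-document", sorted(hits)

print(search(index, ["office", "president"]))
# ('same-element', [('d1', 'p1'), ('d1', 'title')])
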

For example "office of the president" has a specific meaning which is different from other meanings involving the words "office" and "president", but "of" and "the" are typically "stop words", but if every entry in the "office" word index recorded its word position in the document and the words immediately preceding and following it, and if all word indices were structured the same way, you would just need to query the indices for "office" and "president" to complete this search.

Bandwidth? Seriously, how was that "much higher network bandwidth" con measured?

That said, there are and will be costs to this, having to do with normalization rules and so on. In addition to the logical "word position" you almost certainly also need a physical "byte offset" (or maybe "character position") for some of the UI-related tasks. And these word indices never eliminate the need for document indices. So you wind up having both (and several versions of both), with a certain amount of aging...
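
For what it's worth, here's a guess at what such a posting entry might carry - the names are mine, not anything from the slides:

from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: str
    element_id: str   # paragraph, title, cell, ...
    word_pos: int     # logical position, used by phrase/adjacency logic
    byte_offset: int  # physical offset, used by UI tasks like highlighting

def highlight(raw_bytes: bytes, posting: Posting, length: int) -> bytes:
    # UI-side use of the physical offset: pull the matched span back out
    # of the stored document without re-tokenizing it.
    return raw_bytes[posting.byte_offset : posting.byte_offset + length]

raw = b"The Office of the President"
p = Posting("d1", "title", 1, 4)
print(highlight(raw, p, len(b"Office")))  # b'Office'
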

So, yeah, hopefully Google is already long past the stage of doing things this way. 

If I were Google I'd maintain several different competing ways of representing word searches and watch each of them for signs that they are returning substandard results. (And, yeah, that would have to be a mix of metrics (counting) and samples (random inspection), and I'd want to contrast current presentations with those from years ago - both for signs of stability and for signs of current areas of progress.)
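
A loose sketch of what that monitoring could look like - hypothetical engines, and a crude "empty results" counter standing in for real quality metrics:

import random

def compare(engines, queries, sample_rate=0.01):
    # engines: {name: callable(query) -> list of hits}
    metrics = {name: {"served": 0, "empty": 0} for name in engines}
    samples = []
    for q in queries:
        results = {name: engine(q) for name, engine in engines.items()}
        for name, hits in results.items():
            metrics[name]["served"] += 1
            if not hits:
                metrics[name]["empty"] += 1  # crude "substandard" signal
        if random.random() < sample_rate:
            samples.append((q, results))  # set aside for random inspection
    return metrics, samples
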

[I'd also want to be hooking up with people whose areas of focus are outside the computer realm - farmers, plumbers, etc. People doing tangible work which directly benefits other people. Not for their impulse quips, but just to make sure I had my head on straight.]