Emergic: Rajesh Jain's Blog

iReader and Vector-based Searching

March 6th, 2007

Robert Cringely writes:

Vector-based searching begins with making an index of words in a document. Using this column as an example, the software would examine all the words I have written here, throw away words that carry no real information — words like “the,” “and” and most verbs — then count the instances of each of the remaining words. Each word in the column becomes a vector in a multidimensional space. If I have used the word “Internet” 15 times in this column, then “internet” defines the direction of the vector and 15 is its length. Adding all the vectors in this column yields a single vector that represents the entire column in a multidimensional space defined by all the words in all the articles in the entire database.
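To make the indexing step Cringely describes concrete, here is a minimal Python sketch. It is an illustration only, not the actual Excite or iReader code: the `term_vector` function, the tiny `STOP_WORDS` list, and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real index would use a far larger one.
STOP_WORDS = {"the", "and", "a", "of", "in", "is", "to", "this"}

def term_vector(text):
    """Return a sparse vector mapping each remaining word to its count."""
    # Lowercase so that "Internet" and "internet" share one dimension.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

column = "The Internet is big and the Internet is growing"
vec = term_vector(column)
print(vec["internet"])  # 2: the length of the vector along "internet"
```

Each distinct word is one dimension, and the count is the length along that dimension, so the whole dictionary stands in for the single vector that represents the column.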

Doing a search using this system is simply a matter of entering a natural-language query, which is parsed and indexed in exactly the same manner, yielding another vector. This search vector is plotted in the multidimensional space and the search results are those vectors (those articles) that are nearest in space to the query vector. The closer to the query vector an article vector lies, the more likely that article is to answer the question posed in the query.
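The search step can be sketched the same way. Cringely does not say how "nearest in space" is measured, so this example assumes cosine similarity, a common choice for word-count vectors. The function names and the two sample articles are hypothetical, and stop-word removal is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Sparse word-count vector (stop-word removal omitted for brevity)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, articles):
    """Return article titles ordered from nearest to farthest from the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), title)
              for title, text in articles.items()]
    return [title for score, title in sorted(scored, reverse=True)]

# Hypothetical two-article database.
articles = {
    "networking": "internet routers move internet traffic",
    "cooking": "slow roasted garlic and tomato soup",
}
results = search("internet traffic", articles)
print(results)  # ['networking', 'cooking']
```

The query is indexed exactly like an article, and the articles whose vectors point in the most similar direction come back first.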

EPrecis and now iReader use a similar approach, but where the actual words didn’t matter to Excite, they matter a LOT to these new products.

Tags: Software