Latent Semantic Indexing

[via David Weinberger] A paper by Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne:

In talking about search engines and how to improve them, it helps to remember what distinguishes a useful search from a fruitless one. To be truly useful, there are generally three things we want from a search engine:

1. We want it to give us all of the relevant information available on our topic.
2. We want it to give us only information that is relevant to our search.
3. We want the information ordered in some meaningful way, so that we see the most relevant results first.
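The first two criteria are usually measured as recall and precision. As a minimal sketch of how they could be computed for a single query, assuming we already know which documents are relevant (the document IDs and the `precision_recall` helper below are hypothetical, made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision: fraction of the retrieved documents that are relevant.
    recall:    fraction of the relevant documents that were retrieved.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the engine returned four documents,
# and three documents in the collection are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

Neither number says anything about the third criterion, ordering; that would need a rank-aware measure on top of these two.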

Improving our trinity of precision, ranking and recall, however, requires more than brute force. In the following pages, we will describe one promising approach, called latent semantic indexing, that lets us make improvements in all three categories. LSI was first developed at Bellcore in the late 1980s, and is the object of active research, but is surprisingly little known outside the information retrieval community.

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant… Although the LSI algorithm doesn’t understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
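To make the "words in common" idea concrete, here is a rough sketch that counts the words in each document and scores every pair of documents by cosine similarity. This is only the raw word-overlap starting point, not LSI itself, which goes on to factor the resulting term-document matrix (with a singular value decomposition). The three documents and all function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are invented for illustration.
DOCS = {
    "d1": "car engine wheel",
    "d2": "automobile engine wheel",
    "d3": "flower petal garden",
}


def term_counts(text):
    """Record which words a document contains, and how often."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pairwise_similarity(docs):
    """Score every pair of documents by the words they share."""
    counts = {doc_id: term_counts(text) for doc_id, text in docs.items()}
    ids = sorted(counts)
    return {
        (a, b): cosine(counts[a], counts[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


similarity = pairwise_similarity(DOCS)
for pair, score in similarity.items():
    print(pair, round(score, 2))
```

Here d1 and d2 share two of their three words and score about 0.67, while d3 shares nothing with either and scores 0.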

When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the keyword at all.
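The difference between the two kinds of search can be sketched on a toy collection. The code below is an illustration under stated assumptions, not the paper's implementation: the five documents are invented, all function names are hypothetical, and a simple power iteration stands in for the library SVD routine a real system would use. Documents and the query are projected onto the two strongest latent directions of the term-document matrix and compared by cosine similarity there.

```python
import math
import random

# Hypothetical toy collection: two topics with disjoint vocabularies.
DOCS = {
    "d1": "car engine wheel",
    "d2": "automobile engine wheel",
    "d3": "car automobile engine",
    "d4": "flower petal garden",
    "d5": "flower garden bloom",
}
VOCAB = sorted({word for text in DOCS.values() for word in text.split()})


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def latent_basis(doc_vectors, k, iterations=200):
    """Approximate the top-k left singular vectors of the term-document
    matrix A by power iteration on the term-by-term matrix A * A^T."""
    size = len(VOCAB)
    gram = [[sum(vec[i] * vec[j] for vec in doc_vectors) for j in range(size)]
            for i in range(size)]
    rng = random.Random(0)  # fixed seed so runs are repeatable
    basis = []
    for _ in range(k):
        vec = [rng.random() for _ in range(size)]
        for _ in range(iterations):
            vec = [dot(row, vec) for row in gram]
            # Remove the directions already found, then renormalise.
            for found in basis:
                overlap = dot(found, vec)
                vec = [x - overlap * f for x, f in zip(vec, found)]
            norm = math.sqrt(dot(vec, vec))
            vec = [x / norm for x in vec]
        basis.append(vec)
    return basis


def project(vector, basis):
    """Coordinates of a term vector in the latent space."""
    return [dot(axis, vector) for axis in basis]


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def keyword_search(query):
    """Plain keyword search: a document must contain a query word."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in DOCS.items()
            if terms & set(text.split())]


def lsi_search(query, k=2):
    """Rank all documents by cosine similarity in the k-dimensional space."""
    doc_vectors = {doc_id: to_vector(text) for doc_id, text in DOCS.items()}
    basis = latent_basis(list(doc_vectors.values()), k)
    query_coords = project(to_vector(query), basis)
    scores = {doc_id: cosine(query_coords, project(vec, basis))
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(keyword_search("car"))  # prints: ['d1', 'd3']
for doc_id, score in lsi_search("car"):
    print(doc_id, round(score, 3))
```

The keyword search for "car" misses d2, which says "automobile" instead. In the latent space d2 scores close to 1, alongside d1 and d3, because it shares "engine" and "wheel" with the documents that do mention "car"; the two flower documents score close to 0. A collection this small and this cleanly separated flatters the method, so treat the numbers as a demonstration of the mechanism rather than of typical quality.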

Published by Rajesh Jain, an entrepreneur based in Mumbai, India.