Azeem Azhar provides a perspective on Google’s purchase of Applied Semantics, and also gives a nice backgrounder on the technology of searching.
Information retrieval is the core of every search business. It is about building software that solves a hard problem: getting computers to understand human language, with all its vagaries. These vagaries include:
– polysemy (words with multiple meanings like DRIVE or SET)
– synonymy (different words with similar meanings like AIRPLANE and AIRCRAFT)
– multi-word expressions which need to be treated as such (BILL CLINTON)
– errors, typos and poor grammar
For example, a keyword search engine would find it hard to distinguish between A RED FISH and A FISH IN THE RED SEA.
Broadly speaking, there have been two major schools of thought. The first is one I call the statistical school; the second is the semantic.
The statistical school held that context could be determined by looking at statistical patterns within documents and across documents in a collection. Essentially, its adherents use a variety of techniques to recognise word co-occurrence. So when words like DRIVE, CAR and HIGHWAY are frequently used together, we can make assumptions about the context of those words. This means that a search on a term like SADDAM HUSSEIN may turn up documents that never mention him but contain related terms like TARIQ AZIZ or IRAQ.
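The co-occurrence idea can be sketched in a few lines of Python. This is a toy illustration, not how any production engine works: the corpus, the word pairs and the scoring are all made up for the example.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; a real system would index millions of documents
# and filter out stopwords like "the" and "in".
docs = [
    "drive the car onto the highway",
    "the car broke down on the highway",
    "saddam hussein met tariq aziz in iraq",
    "tariq aziz spoke for iraq",
]

# Count how often each pair of distinct words appears in the same document.
cooccur = Counter()
for doc in docs:
    words = sorted(set(doc.split()))
    for a, b in combinations(words, 2):
        cooccur[(a, b)] += 1

def related(word, top=5):
    """Return the words that most often co-occur with `word`."""
    scores = Counter()
    for (a, b), n in cooccur.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return [w for w, _ in scores.most_common(top)]

print(related("iraq"))  # 'tariq' and 'aziz' rank highest
```

A search engine built on this would expand a query with the highest-scoring co-occurring terms, which is how documents mentioning only TARIQ AZIZ can surface for a SADDAM HUSSEIN query.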
The other approach is the semantic approach. Here knowledge engineers build up a complex network of relationships, an ontology, that relates words to one another. So a CAR is defined as a type of VEHICLE and identical to the word AUTOMOBILE. A search on the word CAR will also turn up documents containing the word AUTOMOBILE, even if they never mention CAR. Such semantic networks require a good deal of work and a lot of maintenance to keep them up to date.
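A minimal sketch of the semantic approach looks like this. The tiny hand-built ontology here is hypothetical; real systems like WordNet encode the same kinds of synonym and is-a relations on a much larger scale.

```python
# Hypothetical toy ontology: each word maps to its synonyms
# and the broader category it belongs to (is-a relation).
ontology = {
    "car":        {"synonyms": {"automobile"}, "is_a": {"vehicle"}},
    "automobile": {"synonyms": {"car"},        "is_a": {"vehicle"}},
    "airplane":   {"synonyms": {"aircraft"},   "is_a": {"vehicle"}},
}

def expand(term):
    """Expand a query term with its synonyms from the ontology."""
    entry = ontology.get(term, {})
    return {term} | entry.get("synonyms", set())

docs = [
    "the automobile industry is recovering",
    "new aircraft orders rose sharply",
]

def search(term):
    """Return documents matching the term or any of its synonyms."""
    terms = expand(term)
    return [d for d in docs if terms & set(d.split())]

print(search("car"))  # finds the AUTOMOBILE document
```

The trade-off the paragraph describes is visible even at this scale: every synonym pair had to be entered by hand, and the ontology must be updated whenever the vocabulary changes.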