Text Mining

NYTimes writes:

[Text-mining is] a technique that academics have been experimenting with for years but for which tools have only recently become commercially available. The prospect of rapidly scanning through reams of documents is stirring interest among researchers and analysts faced with more material than they can handle.

To the uninitiated, it may seem that Google and other Web search engines do something similar, since they also pore through reams of documents in split-second intervals. But, as experts note, search engines are merely retrieving information, displaying lists of documents that contain certain keywords.

Text-mining programs go further, categorizing information, making links between otherwise unconnected documents and providing visual maps (some look like tree branches or spokes on a wheel) to lead users down new pathways that they might not have been aware of.

In most cases, text-mining software is built upon the foundations of data mining, which uses statistical analysis to pull information out of structured databases like product inventories and customer demographics. But text mining starts with information that doesn’t come in neat rows and columns. It works on unstructured data – e-mail messages, news articles, internal reports, transcripts of phone calls and the like.

To make sense of what it is reading, the software uses algorithms to examine the context behind words. If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations. It can then categorize them and draw connections among the categories.
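As an aside to the quoted passage: the "computer modeling versus fashion models" step could be approximated, very roughly, by checking which words surround the ambiguous term. The sketch below is only an illustration, not how any commercial text-mining product works. The context vocabularies in SENSE_CONTEXT and the function names guess_sense and filter_by_sense are hypothetical and hand-written; a real system would presumably learn such context from data.

```python
import re

# Hypothetical, hand-picked context words for each sense of "model".
# These lists are assumptions made for illustration only.
SENSE_CONTEXT = {
    "computer modeling": {"simulation", "software", "algorithm",
                          "data", "computer", "statistical"},
    "fashion modeling": {"runway", "designer", "fashion",
                         "photographer", "agency", "catwalk"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def guess_sense(text):
    """Return the sense whose context words overlap the text most.

    Returns None when no context word from any sense appears.
    """
    words = set(tokenize(text))
    scores = {sense: len(words & vocab)
              for sense, vocab in SENSE_CONTEXT.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def filter_by_sense(docs, wanted):
    """Keep only the documents whose guessed sense matches `wanted`."""
    return [doc for doc in docs if guess_sense(doc) == wanted]


docs = [
    "The model ran a statistical simulation on the computer.",
    "The model walked the runway for a famous designer.",
]
relevant = filter_by_sense(docs, "computer modeling")
```

Here relevant keeps only the first sentence, since the second one shares words with the fashion vocabulary instead. Counting overlapping words is about the crudest notion of "context" possible, but it shows why a plain keyword match on "model" is not enough.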

It would be nice to apply some of these ideas to blog posts.
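As a rough idea of what "making links between otherwise unconnected documents" might look like for blog posts, here is a minimal sketch that links each post to its most similar neighbour using TF-IDF weights and cosine similarity. This is a standard textbook technique swapped in for illustration, not the method used by the tools in the article. The sample posts and the function names tfidf_vectors, cosine and most_similar are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that occur in every document get weight log(1) = 0.
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(docs):
    """Map each document index to the index of its closest other document."""
    vecs = tfidf_vectors(docs)
    links = {}
    for i, v in enumerate(vecs):
        scores = [(cosine(v, w), j) for j, w in enumerate(vecs) if j != i]
        links[i] = max(scores)[1] if scores else None
    return links


# Made-up sample posts, for illustration only.
posts = [
    "Text mining tools categorize documents",
    "New text mining software links documents",
    "Cricket season opens in Mumbai",
]
links = most_similar(posts)
```

With these sample posts, the two text-mining posts end up linked to each other, while the cricket post shares no terms with either. A blog could use something along these lines to suggest "related posts", though real posts would need stop-word handling and a similarity threshold so that unrelated posts are not linked at all.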

Published by

Rajesh Jain

An entrepreneur based in Mumbai, India.