Emergic: Rajesh Jain's Blog


Amazon’s New Search

October 24th, 2003

Search is turning out to be a big focus area for many. Google’s success (it is now considering an online IPO auction that could value the company at more than USD 15 billion) has highlighted the importance of one of the key activities that we do on the Net. Expect all kinds of specialised searches to become possible. One such initiative comes from Amazon.

News.com reports on Amazon’s “Search Inside the Book” feature, which “allows you to search millions of pages to find exactly the book you want to buy. Now instead of just displaying books whose title, author, or publisher-provided keywords match your search terms, your search results will surface titles based on every word inside the book.”
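Amazon has not said how the feature is built, but the basic idea of matching "every word inside the book" can be illustrated with a toy inverted index. The sketch below is only an illustration: the book titles, page texts, and the function names `tokenize`, `build_index`, and `search` are all made up for this example, and a real system would also need ranking, stemming, and far more scale.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(books):
    """Map each word to the set of (title, page number) pairs where it occurs.

    `books` is a hypothetical dict of title -> list of page texts.
    """
    index = defaultdict(set)
    for title, pages in books.items():
        for page_number, page_text in enumerate(pages, start=1):
            for word in tokenize(page_text):
                index[word].add((title, page_number))
    return index

def search(index, query):
    """Return sorted titles that have at least one page containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # Intersect the page sets so all query words must appear on the same page.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted({title for title, _ in hits})

# Made-up sample data: two tiny "books" of two pages each.
books = {
    "Gardening Basics": ["Tomatoes need sun and water.", "Compost improves the soil."],
    "Kitchen Notes": ["Roast the tomatoes with garlic.", "Serve with bread."],
}
index = build_index(books)

print(search(index, "tomatoes"))      # both books mention tomatoes
print(search(index, "compost soil"))  # only the gardening book
```

Neither title contains the word "tomatoes", so a title-and-keyword search would miss both books; indexing the page text is what surfaces them.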

What is interesting is that this opens up content that has remained invisible so far. Wired has more:

An ingenious attempt to illuminate the dark region of books is under way at Amazon.com. Over the past spring and summer, the company created an unrivaled digital archive of more than 120,000 books. The goal is to quickly add most of Amazon’s multimillion-title catalog. The entire collection is searchable, and every page is viewable.

The Amazon archive is dizzying not because it unearths books that would necessarily have languished in obscurity, but because it renders their contents instantly visible in response to a search. It allows quick query revisions, backtracking, and exploration. It provides a new form of map.

Getting to this point represents a significant technological feat. Most of the material in the archive comes from scanned pages of actual books. This may be surprising, given that most books today are written on PCs, e-mailed to publishers, typeset on computers, and printed on digital presses. But many publishers still do not have push-button access to the digital files of the books they put out. Insofar as the files exist, they are often scattered around the desktops of editors, designers, and contract printers. For books more than a few years old, complete digital files may be lost. John Wiley & Sons contributed 5,000 titles to the Amazon project — all of them in physical form.

Fortunately, mass scanning has grown increasingly feasible, with the cost dropping to as low as $1 each. Amazon sent some of the books to scanning centers in low-wage countries like India and the Philippines; others were run in the United States using specialty machines to ensure accurate color and to handle oversize volumes. Some books can be chopped out of their bindings and fed into scanners, others have to be babied by a human, who turns pages one by one. Remarkably, Amazon was already doing so much data processing in its regular business that the huge task of reading the images of the books and converting them into a plain-text database was handled by idle computers at one of the company’s backup centers.

The magic of the archive lies in the assumption that physical books are irreplaceable. The electronic text is simply an enhancement of the physical object. The Amazon project represents a bold step toward the dream of a universal library.

A battle royale may be brewing between Amazon and Google:

With retail at the center of the Internet industry, Google is a key competitor because customers begin their online shopping trips at search engines that offer neat algorithms for comparing prices across multiple vendors. Everybody (Yahoo!, eBay, AOL, Microsoft, and, of course, Amazon) wants to be the site of first resort.

All the leading retail sites have better knowledge of their customers than Google. But Google is the leading Internet information tool, period. Google is a window onto the entire Web. On the other hand, the contents of books may be the only publicly accessible data set with the potential to match Google’s Web index both for size and utility. Search Inside the Book makes Amazon the sole guide to tens and ultimately hundreds of millions of pages of information. And while Google’s business is vulnerable to any competitor that builds a better search engine, Amazon’s book archive is the product of negotiated contracts with hundreds of publishers. Amazon has cornered the market on information that was once hidden away in books. The burden of the physical (the fact that the database Amazon uses is linked into a complex system involving real things) gives it a stunning, if perhaps temporary, advantage.


Tags: Software
