Emergic: Rajesh Jain's Blog


TECH TALK: Let’s Build a Business: 3. India Search Engine

April 5th, 2006

The current search engines return results of low relevance for India-specific queries because they cannot identify India-only websites. Since most Indian sites have a .com suffix, are in English and are typically hosted on US servers, none of the three standard parameters for identifying India-centric content (domain, language and hosting location) works well. As such, there is an opportunity to build an India-centric search engine, provided it starts from the right basis set.

To build the basis set, one approach would be as follows:

  • Identify about 10,000 India-centric URLs from current search engines and from links on known Indian sites, and have human editors vet them prior to crawling. Crawling these should yield a few million pages. This process would take 5 content editors, along with a couple of software programmers (to write software that automatically picks URLs from specified pages), about 2 months. [1 content editor should be able to identify/vet 40 URLs a day, or about 1,000 a month, so 5 editors can cover 10,000 in about 2 months.]

  • Next, crawl these sites. That is the initial basis set.

  • From these sites, identify outgoing and incoming links. Heuristics should be used to estimate the India-relevance of these URLs, which would then be submitted to the editorial team for vetting.

  • In parallel, webmasters would be invited to submit India-specific sites, which would also be vetted by the editorial team.

  • Our goal should be to reach 90% coverage of Indian sites in about 3 months after launch and 100% in about 6 months.
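The link-discovery and vetting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which heuristics to use, so the term list, the ".in" domain bonus, the threshold and all function names here are hypothetical. The only figures taken from the post are 5 editors and 40 URLs per editor per day.

```python
import math
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical signals; the post does not specify which heuristics to use.
INDIA_TERMS = ("india", "mumbai", "delhi", "bangalore", "chennai", "kolkata", "rupee")


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href=...> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            if urlparse(url).scheme in ("http", "https"):
                self.links.append(url)


def outgoing_links(base_url, html):
    """Return the outgoing links found in one crawled page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def india_score(url, anchor_text=""):
    """Crude relevance score: a .in host counts 2, each matching term counts 1."""
    host = urlparse(url).hostname or ""
    text = (url + " " + anchor_text).lower()
    score = 2 if host.endswith(".in") else 0
    score += sum(1 for term in INDIA_TERMS if term in text)
    return score


def candidates_for_vetting(base_url, html, known_hosts, threshold=1):
    """Hosts not yet in the basis set whose links score at or above threshold."""
    found = set()
    for url in outgoing_links(base_url, html):
        host = urlparse(url).hostname
        if host and host not in known_hosts and india_score(url) >= threshold:
            found.add(host)
    return sorted(found)


def vetting_days(n_urls, editors=5, urls_per_editor_per_day=40):
    """Working days the editorial team needs to vet n_urls (figures from the post)."""
    return math.ceil(n_urls / (editors * urls_per_editor_per_day))
```

With the post's numbers, vetting_days(10000) comes to 50 working days, which is roughly the two months estimated for the initial set. The score only ranks candidates for the editors; the final decision stays with human vetting, as the post proposes.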

    There should be three offerings, each available on both the web and mobile platforms:

  • A directory of the best Indian sites, organised hierarchically
  • A Reference Web search engine based on the sites
  • An Incremental Web search engine based on RSS feeds
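To make the Incremental Web idea a little more concrete, here is a minimal sketch of an index that is fed from RSS and only adds items it has not seen before. It assumes RSS 2.0 feeds with item, title, link and description elements; the class and method names are hypothetical, and a real engine would need fetching, ranking and persistent storage on top of this.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


class IncrementalIndex:
    """Tiny in-memory inverted index that only adds feed items it has not seen."""

    def __init__(self):
        self.seen_links = set()
        self.postings = defaultdict(set)  # token -> set of item links

    def add_feed(self, rss_xml):
        """Index new <item> entries from an RSS 2.0 document; return how many were new."""
        root = ET.fromstring(rss_xml)
        added = 0
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            if not link or link in self.seen_links:
                continue  # skip items without a link or already indexed
            self.seen_links.add(link)
            text = " ".join(item.findtext(tag) or "" for tag in ("title", "description"))
            for token in re.findall(r"[a-z0-9]+", text.lower()):
                self.postings[token].add(link)
            added += 1
        return added

    def search(self, query):
        """Return links of items that contain every token in the query."""
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)
```

Polling the same feed repeatedly is cheap under this design, because add_feed skips any link it has already indexed and reports only the number of new items.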

    Points to Ponder:

  • How can we build a more scalable model for soliciting India-specific content, e.g. tagging?
  • What additional differentiation can we achieve against the likes of Google and Yahoo?
  • How can we get local content?
  • What about maps?
  • How would search work in the context of mobiles?
  • How do we support local languages?

    When I created khoj.com in 1997, the focus was on building an India-specific search engine. It is now time to rethink khoj.com, building on a lot of new search-related ideas and leveraging the community.

    Interested in leading or being part of this venture? Email me at rajesh-at-netcore.co.in or fill out this feedback form with a brief profile of yourself, your thoughts on the ideas presented, and your thinking about the role that you’d like to play in the venture.

    Tomorrow: Computing Grid



  • Tags: Tech Talk
