On Search Engines

ACM Queue has an interview with Matt Wells, the designer of the Gigablast search engine, who previously worked at Infoseek. “Wells began writing the Gigablast search engine from scratch in C++ about three years ago. He says he built it from day one to be the cheapest, most scalable search engine on the market, able to index billions of Web pages and serve thousands of queries per second at very low cost. Another major feature of Gigablast is its ability to index Web pages almost instantly, Wells explains. The network machines give priority to query traffic, but when they’re not answering questions, they spend their time spidering the Web.” Excerpts from the fascinating interview:

Search was quoted by Microsoft as being the hardest problem in computer science today, and they weren’t kidding. You can break it down into three primary arenas: (1) scalability and performance; (2) quality control; and (3) research and development.

I would say that we are just beginning to realize the full potential of the massive data store on the Internet. Companies are struggling to sort and filter it all out.

I would say Google’s strength is its cached Web pages, index size, and search speed. Cached Web pages allowed Google to generate dynamic summaries that contain the most query terms. It gives greater perceived relevance. It allows searchers to determine if a document is one they’d like to view. The other engines are starting to jump on board here, though. I do not think Google’s results are the best anymore, but the other engines really are not offering enough for searchers to switch. That, coupled with the fact that Google delivers very fast results, gives it a fairly strong position, but it’s definitely not impenetrable.

I do not think Google’s success is due that much to its PageRank algorithm. It certainly touts that as a big deal, but it’s really just marketing hype. In fact, the idea predated Google in IBM’s CLEVER project [a search engine developed at the IBM Almaden Computer Science Research Center], so it’s not even Google’s, after all.

PageRank is just a silly idea in practice, but it is beautiful mathematically. You start off with a simple idea, such as the quality of a page is the sum of the quality of the pages that link to it, times a scalar. This sets you up with finding the eigenvectors of a huge sparse matrix. And because it is so much work, Google appears not to be updating its PageRank values that much.
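The "simple idea" Wells describes can be sketched as repeated averaging over a link graph. This is only an illustration of his one-sentence description, not Google's production algorithm: it leaves out the damping factor and out-link normalization that published PageRank uses. The three-page graph and the function name `link_quality` are made up for the example.

```python
def link_quality(links, iterations=50):
    """Power iteration for 'quality = scalar * sum of quality of linking pages'.

    links maps each page to the list of pages it links to (hypothetical data).
    The 'scalar' is applied by rescaling the scores to sum to 1 each round.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_score = {page: 0.0 for page in pages}
        # Each page passes its current quality to every page it links to.
        for source, targets in links.items():
            for target in targets:
                new_score[target] += score[source]
        total = sum(new_score.values())
        if total == 0:
            break  # no links at all, nothing to propagate
        score = {page: value / total for page, value in new_score.items()}
    return score


# Toy graph: a links to b and c, b links to c, c links back to a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = link_quality(example_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The scores settle on the dominant eigenvector of the link matrix, which is the eigenvector problem mentioned above. On a real Web graph the matrix has billions of rows, which is why recomputing it is so much work.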

A lot of people think this way of analyzing links was invented by Google. It wasn’t. As I said, IBM’s CLEVER project did it first. Furthermore, it doesn’t work. Well, it works, but no better than the simple link and link text analysis methods employed by the other engines. I know this because we implemented our own version at Infoseek and didn’t see a whole lot of difference. And Yahoo did a result comparison between Google and Inktomi before it purchased Inktomi and concluded the same thing.

What I think really brought Google into the spotlight was its index size, speed, and dynamically generated summaries. Those were Google’s winning points and still are to this day, not PageRank.

Gigablast is a search engine that I’ve been working on for about the last three years. I wrote it entirely from scratch in C++. The only external tool or library I use is the zlib compression library. It runs on eight desktop machines, each with four 160-GB IDE hard drives, two gigs of RAM, and one 2.6-GHz Intel processor. It can hold up to 320 million Web pages (on 5 TB), handle about 40 queries per second and spider about eight million pages per day. Currently it serves half a million queries per day to various clients, including some meta search engines and some pay-per-click engines. This, in addition to licensing my technology to other companies, provides me with a small income.

Search is very cutthroat so I don’t like to go too far on the plank, for reasons I mentioned before. However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL. If an engine can sort and constrain on particular XML fields, then it can be very useful. For instance, if you are spidering a bunch of e-commerce Web pages, you’d like to be able to do a search and sort and constrain by price, color, and other page-defined dimensions, right? Search engines today are not very number friendly, but I think that will be changing soon.
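As a rough sketch of what "sort and constrain on particular XML fields" could look like, the snippet below filters a tiny product catalog by color and maximum price, then sorts the hits by price. The catalog, the field names, and the `search` function are all hypothetical. A real engine would answer this from an index rather than by scanning documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical e-commerce data, standing in for spidered product pages.
CATALOG = """
<products>
  <product><name>Mug</name><color>red</color><price>7.50</price></product>
  <product><name>Scarf</name><color>red</color><price>19.00</price></product>
  <product><name>Lamp</name><color>red</color><price>45.00</price></product>
  <product><name>Cap</name><color>blue</color><price>12.00</price></product>
</products>
"""


def search(xml_text, color=None, max_price=None):
    """Return product names matching the constraints, cheapest first."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for product in root.iter("product"):
        price = float(product.findtext("price"))
        # Constrain on the color field, if a color was requested.
        if color is not None and product.findtext("color") != color:
            continue
        # Constrain on the numeric price field, if a ceiling was given.
        if max_price is not None and price > max_price:
            continue
        hits.append((price, product.findtext("name")))
    hits.sort()  # sort by the numeric price field
    return [name for _, name in hits]


print(search(CATALOG, color="red", max_price=20))
```

The point of the example is the numeric comparison: treating price as a number instead of a text token is the kind of "number friendly" behavior the passage says engines lack.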

I also think the advancements of tomorrow will be based on the search engines of today. Today’s search engines will serve as a core for higher-level operations. The more queries per second an engine can deliver, the higher the quality its search results will be. All of the power and information in today’s engines have yet to be exploited. It is like a large untapped oil reservoir.

To understand the direction search is taking, it helps to know what search is. Search is a natural function of the brain. What it all comes down to is that the brain itself is rather like a search engine. For instance, you might look out the window and see your car. Your brain subconsciously converts the image of your car into a search term, not too unlike the word car itself. The brain first consults the index in your short-term memory. That memory is the fast memory, the L1 cache so to speak. Events and thoughts that have occurred recently are indexed into your short-term memory, and, later, while you are sleeping, are sorted and merged into your long-term memory. This is why I think you sometimes have weird dreams. Thoughts from long-term memory are being pulled into short-term memory so they can be merged and then written back out to long term, and you sometimes notice these thoughts while they are being sorted.

So, the brain has to keep everything in order. That way, you can do fast associations. So looking up the word car, the brain lets you know that your gas tank is empty. You were driving the car this afternoon, so that thought is fresh in your short-term memory and is therefore the first thought you have. A second later, after the brain has had a chance to pull some search results from slower, long-term memory, you are informed that your car insurance payment is due very soon. That thought is important and is stored in the cache section of your short-term memory for fast recall. You also assign it a high priority in your directive queue.

When the brain is performing a search, it must actually result in a list of doc IDs, just like a search engine. However, these doc IDs correspond to thoughts, not documents, so perhaps it would be better to call them thought IDs. Once you get your list of thought IDs, you can look up the associated thoughts. Thoughts would be the title records I discussed earlier.

The brain is continuously doing searches in the background. You can see one thing and it makes you think of something else very quickly because of the powerful search architecture in your head. You can do the same types of associations by going to a search engine and clicking on the related topics. Clicking on one topic yields another list of topics, just like one thought leads to another. In reality, a thought is not much more than a set of brain queries, all of which, when executed, will lead to more thoughts.

Now that the Internet is very large, it makes for some well-developed memory. I would suppose that the amount of information stored on the Internet is around the level of the adult human brain. Now we just need some higher-order functionality to really take advantage of it. At one point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine.

All of this is the main reason I’m working with search now. I see the close parallel between the search engine and the human mind. Working on search gives us insights into how we function on the intellectual level. The ultimate goal of computer science is to create a machine that thinks like we do, and that machine will have a search engine at its core, just like our brain.

MySQL’s Success

News.com writes:

MySQL, which sells an open-source database of the same name, was nearly unheard of in corporate technology circles a few years ago. Now the company’s competitively priced, easy-to-use database is becoming increasingly popular with business customers looking for smaller, less-expensive options.

The MySQL database is taking over that lower-price, lesser-need market Microsoft started with. It’s a niche the company says is underserved by database industry heavyweights Oracle, IBM and Microsoft. MySQL appeals to organizations looking for a database that is “good enough” for most needs, said Mark Shainman, a database analyst at Meta Group. MySQL is also riding a wave of growing awareness around the cost-effective use of open-source software, notably the Linux operating system.

Rather than have an aggressive marketing strategy, MySQL often comes in the back door of corporations and spreads from there, Urlocker said. A programmer or a department in a company may use MySQL when it can’t get the budget to purchase a database license, and then the company considers the software for broader use, he said.

MySQL has a novel open-source business model designed to appeal to smaller companies. The company offers its product under a dual license, charging customers for support services with a commercial license and offering its database for free download under the open-source GNU General Public License (GPL).

HBS Conference on India

HBS Working Knowledge has a collection of reports on India, based on the India Business Conference, sponsored by Harvard Business School’s South Asian Business Association on April 4. From the venture capital discussion:

The pool of “extraordinary” technical talent in India is a big plus, said Venetia Kontogouris, of Trident Capital, a thirty-five-person firm with $1.3 billion under management that invests in information and business services companies. Finding good middle managers is more of a challenge.

“I think the quality of management in India is pretty good. You just have to work hard to find it,” said Ashish Dhawan, co-founder and senior managing director at ChrysCapital, a private equity fund that manages about $200 million. People’s aspirations have changed, said Dhawan, with many hoping to work at large multinational corporations that come to India. The trick is in luring talent to entrepreneurial ventures.

Ramanan Raghavendran, a senior partner at TH Lee Putnam Ventures, leads his firm’s investment activities in business process outsourcing (BPO) and software. “India is a standard now,” he said. “If you’re starting a technology company of some size and don’t have an India strategy, you’re not likely to survive long.”

Software portability means that offshoring will continue to be a big opportunity, with companies moving software services and customer interactions offshore, continued Guerster.

Navin Chaddha, of Mobius Venture Capital, said that his firm would be focusing on cross-border deals in the software and BPO sectors involving U.S. companies with Indian subsidiaries that build scalable operations around low-end transactions in health care, financial services, and life science services.

“I think the next big opportunity is to find the new sectors where you can take an Indian business global,” said Dhawan. Asian Paints, for example, is swiftly becoming one of the largest paint companies in the world. “They have proven that they can crush the international competition,” he said.

Raghavendran counseled the audience of largely Indian nationals to look beyond the established BPO model and think of companies like Asian Paints as the most exciting new category of opportunity: “businesses coming out of India that may go off and conquer the world.”

The Future of Work

David Kirkpatrick interviews Tom Malone and writes about his new book “The Future of Work”:

[Malone’s] new book posits that the central transformative development of our time is the radically decreased cost of communications caused by the Internet, wireless voice and data, and cheap long distance, among other new technologies. It is all fundamentally changing the nature of work, Malone says: “This change may be as important for business as the change to democracy has been for government.” He stopped by the office the other day to talk about the book, published this month, and his ideas.

Malone sees a parallel between the evolution of human society and the evolution of business. “For millennia,” he says, “all human societies were organized as small, autonomous, egalitarian groups called bands. Then we saw the rise of bigger and bigger, more centralized societies called kingdoms. Only in the last 200 years have we seen the rise on a large scale of the third way of organizing human society: democracy.” Each of those stages, Malone says, can be explained by a change in a single factor–the cost of communication. In his view, writing is what enabled hierarchically organized kingdoms to arise. Printing led to democracy.

Likewise, he says, “until a couple hundred years ago businesses were still organized like bands. It was only when new communications technologies like telegraph and telephone and even the Xerox machine made communication cheap enough to coordinate larger groups of people that we saw the rise of the centralized corporation–the kingdoms of the business world.” I like the way this guy thinks.

So where are we now? It’s the revolution, he says. “Near the end of the 20th century, it became possible for the first time to exchange the detailed kind of information necessary to coordinate a business on a very large scale even as lots of individuals made decisions for themselves. When communications costs fall it becomes possible for vastly more people to be well-enough informed to make decisions instead of just following orders from their uniquely well-informed superiors.”

For most of our lives, Malone says, “the big message of business history was that getting bigger and more centralized was the way you succeed. But now you can have both the economic benefits of bigness and the human benefits of smallness.”

He cites all the small companies that now can sell around the country and the world via the Internet: “They’re no longer limited by being in a certain region. They can buy and sell anywhere.”

Here is an excerpt on Decentralisation from the book on HBS Working Knowledge.

Local Google = STIM?

Some interesting thoughts from Esther Dyson:

Consider Google’s AdWords system a subtle mechanism for metadata collection. Right now, you can specify geographic targeting. Someday soon, perhaps, you’ll be able to specify targeting by opening hours, or by language spoken, or by other criteria. For now, that information is used only for targeting rather than displayed.

But just as Google is implicitly if transparently planning to collect huge amounts of e-mail, it’s also beginning to collect metadata about businesses. And it has the market presence to make such a collection interesting. For now, the information provided by AdWords advertisers is an interesting database; someday, perhaps it could support a variety of open APIs. (Take a look at SMB meta, courtesy of Dan Bricklin.)

The best analogy, perhaps, is to Wal-Mart’s efforts to get its suppliers to use RF-ID, faltering though they may be. In the long run, suppliers will adopt Wal-Mart’s standards, and other large customers will likely start to use those standards too. Here are some scenarios: Currently, most commerce searches are for products and the establishments that sell them. But unless you’re ordering online, those two searches are generally separate. There are few listings for what’s on sale at an individual store. But soon, it could make sense for a store to make limited access to its inventories available online, so that people could know exactly where to buy things.

And, of course, Google could sell anonymous data about those queries to merchants who wanted to stay in stock or pre-order based on what looks hot, or to manufacturers, fashion mavens and so on.

While right now Google is collecting information through AdWords for targeting, there’s no reason it couldn’t start using advertiser-entered data for display as well, as it already does with data feeds in Froogle. Some companies may start sending these new kinds of feeds expressly, while others might fill out a slightly more complex, domain-specific form when they advertise. Then hotels could start to compete on the basis of their swimming pool hours.

So, a local search engine could be used to build an SME Trade Information Marketplace.

TECH TALK: As India Develops: Vision and Will

If Vision is the art of the long-view, Will is the determination to make it happen. India and Indians need both. For much of the period since our Independence, we have taken short-term, half-baked measures. The results are all around for us to see, and we can feel it every day around us – be it something as trivial as the non-existent numbering of the buildings on a road which makes it hard to find new places or the construction and annual re-construction of the roads because we just cannot seem to do them right. India needs a Few to think things right, get the standards in place and then for the others to follow. As Atanu Dey puts it:

I think the time has come to speak of little things. Things that add up like little grains of sand and little drops of water. Individually, they seem irrelevant and inconsequential. But they matter very much in the end.

Economists like to remind people that learning by doing is a very powerful device. If you are at the forefront of some technology, the only way to learn is by doing and making mistakes and so on. But I believe that if you are not at the cutting-edge, then learning by imitating is the way to go. It does not require a rocket scientist to keep one’s eyes open, note very carefully how others have solved a specific problem, and simply copy that solution if it is applicable. That way you don’t have to pay the price of having to discover the solution and yet you get the benefit of having the solution. This is the advantage that can come of being a late-comer. Among siblings, it is often the case that the second born appears to be sharper than the first born because of this learning by imitation.

Atanu adds:

The larger point is that standardization matters. It eases the friction that accompanies transactions which increase as an economy develops into a more complex web of interactions. Reducing transaction costs is what increases the pie because transaction costs are sheer losses (or dead-weight losses) that benefit no one. In a village economy, street addresses are not needed because everyone knows where everyone is and what he is up to today. In a city of a few million people and a few hundred square kilometers of buildings, one has to be more systematic.

We need better technology, not necessarily ICT with its computers and cell phones and internet and world wide web. By technology I mean know-how — how to do stuff. The know-how exists. One just has to observe and learn and adopt. But observing, learning, and adopting takes thinking and effort; it is not as easy as simply buying a bunch of computers and firing off Microsoft Windows.

Where does all this connect with Vision and Will? Here’s the point: India needs a collective to come together and agree on the basic things that we need to do (build the Vision), make these into standards that everyone will follow, and then see through the execution across the country. Be it the way we educate our young or the way we build our roads, be it the way we keep our streets clean or number them, be it the energy solutions we adopt or the broadband that we deliver, what India needs is to pick and choose the most appropriate ideas and then roll them out nationally. We need the innovation in the thinking, and not necessarily as much in the execution.

There is no danger of a monoculture here; that fear might be valid if we had the base infrastructure in place. But we are starting from so far behind, and have so little time to put in place development that is uniform and not just in pockets, that we absolutely need to get to scale very rapidly across the length and breadth of India.

Tomorrow: Innovation and Entrepreneurship
