ACM Queue has an interview with Gigablast search engine designer Matt Wells, who previously worked at Infoseek. “Wells began writing the Gigablast search engine from scratch in C++ about three years ago. He says he built it from day one to be the cheapest, most scalable search engine on the market, able to index billions of Web pages and serve thousands of queries per second at very low cost. Another major feature of Gigablast is its ability to index Web pages almost instantly, Wells explains. The network machines give priority to query traffic, but when they’re not answering questions, they spend their time spidering the Web.” Excerpts from the fascinating interview:
Microsoft has called search the hardest problem in computer science today, and they weren’t kidding. You can break it down into three primary arenas: (1) scalability and performance; (2) quality control; and (3) research and development.
I would say that we are just beginning to realize the full potential of the massive data store on the Internet. Companies are struggling to sort and filter it all out.
I would say Google’s strength is its cached Web pages, index size, and search speed. Cached Web pages allow Google to generate dynamic summaries that contain the most query terms. That gives greater perceived relevance and lets searchers determine whether a document is one they’d like to view. The other engines are starting to jump on board here, though. I do not think Google’s results are the best anymore, but the other engines really are not offering enough for searchers to switch. That, coupled with the fact that Google delivers very fast results, gives it a fairly strong position, but it’s definitely not impenetrable.
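As a rough sketch of what those dynamic summaries involve: given the cached page and the query, the engine can scan for the window of text that covers the most distinct query terms and show that as the snippet. The C++ below is only an illustration of that general idea, with made-up helper names, not any engine’s actual snippet code.

```cpp
// Illustrative only: choose the window of a cached page that covers the most
// distinct query terms, to use as the result summary ("snippet").
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: split text into whitespace-delimited tokens.
static std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string t;
    while (in >> t) tokens.push_back(t);
    return tokens;
}

// Return the window of `width` tokens that contains the most distinct query terms.
std::string bestSnippet(const std::string& cachedPage,
                        const std::vector<std::string>& queryTerms,
                        size_t width = 12) {
    std::vector<std::string> tokens = tokenize(cachedPage);
    std::unordered_set<std::string> query(queryTerms.begin(), queryTerms.end());
    size_t bestStart = 0, bestScore = 0;
    for (size_t start = 0; start + width <= tokens.size(); ++start) {
        std::unordered_set<std::string> seen;
        for (size_t i = start; i < start + width; ++i)
            if (query.count(tokens[i])) seen.insert(tokens[i]);
        if (seen.size() > bestScore) { bestScore = seen.size(); bestStart = start; }
    }
    std::string snippet;
    for (size_t i = bestStart; i < std::min(bestStart + width, tokens.size()); ++i)
        snippet += tokens[i] + " ";
    return snippet;
}
```

A real engine would also respect sentence boundaries, highlight the matched terms, and stem or lowercase both sides before comparing.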
I do not think Google’s success is due that much to its PageRank algorithm. It certainly touts that as a big deal, but it’s really just marketing hype. In fact, the idea predated Google in IBM’s CLEVER project [a search engine developed at the IBM Almaden Computer Science Research Center], so it’s not even Google’s, after all.
PageRank is just a silly idea in practice, but it is beautiful mathematically. You start off with a simple idea: the quality of a page is the sum of the quality of the pages that link to it, times a scalar. This sets you up to find the eigenvectors of a huge sparse matrix. And because it is so much work, Google appears not to be updating its PageRank values that much.
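Written out, that simple idea is a fixed-point recurrence: each page’s score is a scalar times the sum of the scores of the pages linking to it, and iterating it is power iteration on the sparse link matrix, converging to its principal eigenvector. The C++ sketch below adds the usual damping/teleport term; it is purely illustrative, not Google’s implementation.

```cpp
// Illustrative power iteration for the recurrence "a page's quality is a scalar
// times the sum of the quality of the pages linking to it". Iterating to a fixed
// point computes the principal eigenvector of the sparse link matrix.
#include <algorithm>
#include <vector>

// outlinks[p] lists the pages that page p links to.
std::vector<double> pagerank(const std::vector<std::vector<int>>& outlinks,
                             double d = 0.85, int iterations = 50) {
    const size_t n = outlinks.size();
    std::vector<double> rank(n, 1.0 / n), next(n);
    for (int it = 0; it < iterations; ++it) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);   // teleport term
        for (size_t p = 0; p < n; ++p) {
            if (outlinks[p].empty()) continue;                 // dangling page: its mass is simply dropped in this sketch
            double share = d * rank[p] / outlinks[p].size();
            for (int q : outlinks[p]) next[q] += share;        // score flows along each link
        }
        rank.swap(next);
    }
    return rank;
}
```

Doing this over billions of pages is the “so much work” part: the matrix is enormous, and every update is a full pass over the link graph.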
A lot of people think this way of analyzing links was invented by Google. It wasn’t. As I said, IBM’s CLEVER project did it first. Furthermore, it doesn’t work. Well, it works, but no better than the simple link and link text analysis methods employed by the other engines. I know this because we implemented our own version at Infoseek and didn’t see a whole lot of difference. And Yahoo did a result comparison between Google and Inktomi before it purchased Inktomi and concluded the same thing.
What I think really brought Google into the spotlight was its index size, speed, and dynamically generated summaries. Those were Google’s winning points and still are to this day, not PageRank.
Gigablast is a search engine that I’ve been working on for about the last three years. I wrote it entirely from scratch in C++. The only external tool or library I use is the zlib compression library. It runs on eight desktop machines, each with four 160-GB IDE hard drives, two gigs of RAM, and one 2.6-GHz Intel processor. It can hold up to 320 million Web pages (on 5 TB), handle about 40 queries per second and spider about eight million pages per day. Currently it serves half a million queries per day to various clients, including some meta search engines and some pay-per-click engines. This, in addition to licensing my technology to other companies, provides me with a small income.
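For rough scale (back-of-the-envelope arithmetic, not figures from the interview), the disk and per-page numbers line up:

$$8 \times 4 \times 160\,\mathrm{GB} \approx 5\,\mathrm{TB}, \qquad \frac{5\,\mathrm{TB}}{3.2 \times 10^{8}\ \text{pages}} \approx 16\,\mathrm{KB\ per\ page},$$

presumably covering the compressed page plus its share of the index.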
Search is very cutthroat, so I don’t like to go too far out on the plank, for reasons I mentioned before. However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL. If an engine can sort and constrain on particular XML fields, then it can be very useful. For instance, if you are spidering a bunch of e-commerce Web pages, you’d like to be able to do a search and sort and constrain by price, color, and other page-defined dimensions, right? Search engines today are not very number friendly, but I think that will be changing soon.
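As a sketch of what sorting and constraining on XML fields could look like once fields such as <price> and <color> have been parsed out of pages, the fragment below filters results by a price range and a color and orders them by price. The Result struct and the field names are hypothetical, not Gigablast’s API.

```cpp
// Illustrative only: filter indexed results by page-defined XML fields and sort by one of them.
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Result {
    std::string url;
    std::map<std::string, double> numericFields;    // e.g. parsed from <price>19.99</price>
    std::map<std::string, std::string> textFields;  // e.g. parsed from <color>red</color>
};

// Keep results whose price lies in [lo, hi] and whose color matches, then sort by price ascending.
std::vector<Result> constrainAndSort(std::vector<Result> results,
                                     double lo, double hi,
                                     const std::string& color) {
    results.erase(std::remove_if(results.begin(), results.end(),
        [&](const Result& r) {
            auto p = r.numericFields.find("price");
            auto c = r.textFields.find("color");
            return p == r.numericFields.end() || p->second < lo || p->second > hi ||
                   c == r.textFields.end() || c->second != color;
        }), results.end());
    std::sort(results.begin(), results.end(),
              [](const Result& a, const Result& b) {
                  return a.numericFields.at("price") < b.numericFields.at("price");
              });
    return results;
}
```

The interesting engineering problem is doing this inside the index itself, at query time, rather than post-filtering a result list as this sketch does.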
I also think the advancements of tomorrow will be based on the search engines of today. Today’s search engines will serve as a core for higher-level operations. The more queries per second an engine can deliver, the more of those higher-level operations it can afford to run behind each user request, and the higher the quality of its search results will be. All of the power and information in today’s engines have yet to be exploited. It is like a large untapped oil reservoir.
To understand the direction search is taking, it helps to know what search is. Search is a natural function of the brain. What it all comes down to is that the brain itself is rather like a search engine. For instance, you might look out the window and see your car. Your brain subconsciously converts the image of your car into a search term, not too unlike the word car itself. The brain first consults the index in your short-term memory. That memory is the fast memory, the L1 cache so to speak. Events and thoughts that have occurred recently are indexed into your short-term memory, and, later, while you are sleeping, are sorted and merged into your long-term memory. This is why I think you sometimes have weird dreams. Thoughts from long-term memory are being pulled into short-term memory so they can be merged and then written back out to long term, and you sometimes notice these thoughts while they are being sorted.
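In search engine terms, that sort-and-merge step is much like how a two-tier index works: fresh postings land in a small, fast in-memory index, and a background pass folds them into the large, sorted on-disk index. The sketch below only illustrates that pattern; it is not Gigablast’s actual data structures.

```cpp
// Illustrative two-tier index: a small "short-term" buffer of recent postings
// that gets sorted and merged into the big "long-term" index, the step Wells
// compares to what the brain does during sleep.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using DocId = uint64_t;

struct Index {
    std::map<std::string, std::vector<DocId>> shortTerm;  // small, fast, recent
    std::map<std::string, std::vector<DocId>> longTerm;   // large, sorted, historical

    void add(const std::string& term, DocId doc) { shortTerm[term].push_back(doc); }

    // The "sleep" phase: fold recent postings into the long-term index, keeping
    // each posting list sorted so later lookups stay fast.
    void mergeToLongTerm() {
        for (auto& [term, docs] : shortTerm) {
            auto& target = longTerm[term];
            target.insert(target.end(), docs.begin(), docs.end());
            std::sort(target.begin(), target.end());
            target.erase(std::unique(target.begin(), target.end()), target.end());
        }
        shortTerm.clear();
    }

    // A lookup consults the fast short-term index first, then the long-term one.
    std::vector<DocId> lookup(const std::string& term) const {
        std::vector<DocId> out;
        if (auto it = shortTerm.find(term); it != shortTerm.end())
            out.insert(out.end(), it->second.begin(), it->second.end());
        if (auto it = longTerm.find(term); it != longTerm.end())
            out.insert(out.end(), it->second.begin(), it->second.end());
        return out;
    }
};
```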
So, the brain has to keep everything in order. That way, you can do fast associations. So looking up the word car, the brain lets you know that your gas tank is empty. You were driving the car this afternoon, so that thought is fresh in your short-term memory and is therefore the first thought you have. A second later, after the brain has had a chance to pull some search results from slower, long-term memory, you are informed that your car insurance payment is due very soon. That thought is important and is stored in the cache section of your short-term memory for fast recall. You also assign it a high priority in your directive queue.
When the brain performs a search, the result must actually be a list of doc IDs, just as with a search engine. However, these doc IDs correspond to thoughts, not documents, so perhaps it would be better to call them thought IDs. Once you get your list of thought IDs, you can look up the associated thoughts. Thoughts would be the title records I discussed earlier.
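In code, that two-step lookup is just a postings list that yields doc IDs, followed by a map from each doc ID to its title record (title, URL, summary). The layout below is a hypothetical illustration; “title record” is Wells’s term, but this is not Gigablast’s actual format.

```cpp
// Illustrative only: a query first produces doc IDs, which are then resolved to
// the title records the user actually sees.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using DocId = uint64_t;

struct TitleRecord {
    std::string title;
    std::string url;
    std::string summary;
};

struct Engine {
    std::unordered_map<std::string, std::vector<DocId>> postings;  // term -> doc IDs
    std::unordered_map<DocId, TitleRecord> titleRecords;           // doc ID -> record

    // Step 1: the search itself only yields doc IDs ("thought IDs" in the analogy).
    std::vector<DocId> search(const std::string& term) const {
        auto it = postings.find(term);
        return it == postings.end() ? std::vector<DocId>{} : it->second;
    }

    // Step 2: resolve each doc ID to the record shown in the results.
    std::vector<TitleRecord> resolve(const std::vector<DocId>& ids) const {
        std::vector<TitleRecord> out;
        for (DocId id : ids)
            if (auto it = titleRecords.find(id); it != titleRecords.end())
                out.push_back(it->second);
        return out;
    }
};
```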
The brain is continuously doing searches in the background. You can see one thing and it makes you think of something else very quickly because of the powerful search architecture in your head. You can do the same types of associations by going to a search engine and clicking on the related topics. Clicking on one topic yields another list of topics, just like one thought leads to another. In reality, a thought is not much more than a set of brain queries, all of which, when executed, will lead to more thoughts.
Now that the Internet is very large, it makes for some well-developed memory. I would suppose that the amount of information stored on the Internet is around the level of what an adult human brain holds. Now we just need some higher-order functionality to really take advantage of it. At some point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine.
All of this is the main reason I’m working with search now. I see the close parallel between the search engine and the human mind. Working on search gives us insights into how we function on the intellectual level. The ultimate goal of computer science is to create a machine that thinks like we do, and that machine will have a search engine at its core, just like our brain.