Emergic: Rajesh Jain's Blog

Bosworth on Databases

January 16th, 2005

Adam Bosworth asks where all the good databases have gone:

Users of databases tend to ask for three very simple things:

1) Dynamic schema so that as the business model/description of goods or services changes and evolves, this evolution can be handled seamlessly in a system running 24 by 7, 365 days a year. This means that Amazon can track new things about new goods without changing the running system. It means that Federal Express can add Federal Express Ground seamlessly to their running tracking system and so on. In short, the database should handle unlimited change.

2) Dynamic partitioning of data across large dynamic numbers of machines. A lot of people track a lot of data these days. It is common to talk to customers tracking 100,000,000 items a day and having to maintain the information online for at least 180 days with 4K or more a pop and that adds (or multiplies) up to a 100 TB or so. Customers tell me that this is best served up to the 1MM users who may want it at any time by partitioning the data because, in general, most of this data is highly partitionable by customer or product or something. The only issue is that it needs to be dynamic so that as items are added or get “busy” the system dynamically load balances their data across the machines. In short, the database should handle unlimited scale with very low latency. It can do this because the vast majority of queries will be local to a product or a customer or something over which you can partition. It is, obviously, going to come at a cost for complex joins and predicates across entire data sets, but as it turns out, this isn’t that normative for these sorts of databases and can be slower as long as point 3 below is handled well. And a lot of them can be solved with some giant indices that cover the datasets that are routinely scanned across customers or products.

3) Modern indexing. Google has spoiled the world. Everyone has learned that just typing in a few words should show the relevant results in a couple of hundred milliseconds. Everyone (whether an Amazon user or a customer looking up a check they wrote a month ago or a customer service rep looking up the history for someone calling in to complain) expects this. This indexing, of course, often has to include indexing through the “blobs” stored in the items such as PDFs and spreadsheets and PowerPoints. This is actually hard to do across all data, but much of the need is within a partitioned data set (e.g. I want to and should only see my checks, not yours, or my airbill status, not yours) and then it should be trivial.
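To make the three requests a little more concrete, here is a minimal Python sketch. It is not taken from Bosworth's post; every name in it (Partition, PartitionedStore, the sample customers and fields) is hypothetical. It first redoes the sizing arithmetic from point 2, taking "4K" as 4,000 bytes and using decimal terabytes, then shows items stored as free-form dictionaries (a stand-in for dynamic schema), routed to a partition by customer, and searched through a small word index that lives inside each partition. One loud caveat: the routing here is a fixed hash modulo the partition count, which is exactly the static scheme Bosworth says is not enough. Real dynamic rebalancing of "busy" items is out of scope for this sketch.

```python
import zlib
from collections import defaultdict

# Back-of-envelope sizing from point 2 (assumes 4K = 4,000 bytes, decimal TB).
ITEMS_PER_DAY = 100_000_000
RETENTION_DAYS = 180
BYTES_PER_ITEM = 4_000

total_bytes = ITEMS_PER_DAY * RETENTION_DAYS * BYTES_PER_ITEM
total_tb = total_bytes / 1e12  # 72.0, the same order as the quoted "100 TB or so"


class Partition:
    """One machine's share of the data: schemaless items plus a local word index."""

    def __init__(self):
        self.items = {}                # key -> dict with arbitrary fields
        self.index = defaultdict(set)  # lowercased word -> set of keys

    def put(self, key, item):
        # Simplification: overwriting a key does not remove its old index entries.
        self.items[key] = item
        for value in item.values():
            for word in str(value).lower().split():
                self.index[word].add(key)

    def search(self, query):
        """Return keys of items containing every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        hits = [self.index.get(word, set()) for word in words]
        return set.intersection(*hits)


class PartitionedStore:
    """Routes each customer's items to one partition (static hashing, an assumption)."""

    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def _route(self, customer):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        slot = zlib.crc32(customer.encode("utf-8")) % len(self.partitions)
        return self.partitions[slot]

    def put(self, customer, item_id, item):
        self._route(customer).put((customer, item_id), item)

    def search(self, customer, query):
        # Only one partition is consulted, and only this customer's items are returned.
        partition = self._route(customer)
        return {item_id for (owner, item_id) in partition.search(query)
                if owner == customer}


store = PartitionedStore(n_partitions=4)
store.put("alice", 1, {"type": "check", "payee": "Acme Hardware"})
# A new kind of item with new fields needs no schema change.
store.put("alice", 2, {"type": "airbill", "service": "Ground", "status": "in transit"})
store.put("bob", 1, {"type": "check", "payee": "Acme Hardware"})

alice_hits = store.search("alice", "acme")  # alice sees her check, never bob's
```

Because a query names its customer, it touches a single partition and a single small index, which is why Bosworth expects the per-customer case to be cheap. A search across all customers would have to visit every partition, which is the cost he accepts for the less common cross-dataset queries.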

If the database vendors ARE solving these problems, then they aren’t doing a good job of telling the rest of us. The customers I talk to who are using the traditional databases are essentially using them as very dumb row stores and trying very hard to move all the logic and searching out into arrays of machines with in-memory caches.

Tags: Software
