About five years ago I started to notice an odd thing. The products that the database vendors were building had less and less to do with what the customers wanted. This is not just an artifact of talking to enterprise customers while at BEA. Google itself (and I’d bet a lot that Yahoo does too) has needs similar to the ones that Federal Express or Morgan Stanley or Ford or others described, quite eloquently, to me. So, what is this growing disconnect?
It is this. Users of databases tend to ask for three very simple things:
1) Dynamic schema so that as the business model/description of goods or services changes and evolves, this evolution can be handled seamlessly in a system running 24 by 7, 365 days a year. This means that Amazon can track new things about new goods without changing the running system. It means that Federal Express can add Federal Express Ground seamlessly to their running tracking system and so on. In short, the database should handle unlimited change.
2) Dynamic partitioning of data across large dynamic numbers of machines. A lot of people track a lot of data these days. It is common to talk to customers tracking 100,000,000 items a day and having to maintain the information online for at least 180 days with 4K or more a pop, and that adds (or multiplies) up to 100 TB or so. Customers tell me that this is best served up to the 1MM users who may want it at any time by partitioning the data because, in general, most of this data is highly partitionable by customer or product or something. The only issue is that it needs to be dynamic so that as items are added or get “busy” the system dynamically load balances their data across the machines. In short, the database should handle unlimited scale with very low latency. It can do this because the vast majority of queries will be local to a product or a customer or something over which you can partition. It is, obviously, going to come at a cost for complex joins and predicates across entire data sets, but as it turns out, this isn’t that normative for these sorts of databases and can be slower as long as point 3 below is handled well. And a lot of them can be solved with some giant indices that cover the datasets that are routinely scanned across customers or products.
3) Modern indexing. Google has spoiled the world. Everyone has learned that just typing in a few words should show the relevant results in a couple of hundred milliseconds. Everyone (whether an Amazon user or a customer looking up a check they wrote a month ago or a customer service rep looking up the history for someone calling in to complain) expects this. This indexing, of course, often has to include indexing through the “blobs” stored in the items such as PDFs and spreadsheets and PowerPoints. This is actually hard to do across all data, but much of the need is within a partitioned data set (e.g. I want to and should only see my checks, not yours, or my airbill status, not yours) and then it should be trivial.
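To make the three requests concrete, here is a toy in-memory sketch in Python. It is only an illustration under my own assumptions, not how any vendor or any real system does it, and every name in it (`Partition`, `Store`, `put`, `search`, `move`) is made up for the example. Items are plain dictionaries, so new fields need no schema change; each customer's data lives in its own partition with its own small word index; and a placement table lets a busy partition be moved to another machine.

```python
import hashlib
from collections import defaultdict


class Partition:
    """All the items for one customer (or product), plus a word index over them."""

    def __init__(self):
        self.items = {}                 # item_id -> dict with arbitrary fields
        self.index = defaultdict(set)   # lowercase word -> set of item_ids

    def put(self, item_id, fields):
        # No fixed schema: any item may carry any fields.
        # (Sketch only: re-putting an item does not clean up old index entries.)
        self.items[item_id] = fields
        for value in fields.values():
            for word in str(value).lower().split():
                self.index[word].add(item_id)

    def search(self, query):
        # Return the ids of items that contain every word of the query.
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted(hits)


class Store:
    """Spreads partitions over machines and remembers where each one lives."""

    def __init__(self, machine_count):
        self.machines = [dict() for _ in range(machine_count)]
        self.placement = {}             # partition key -> machine number

    def _home(self, key):
        # Default placement is a stable hash; the table can override it later.
        if key not in self.placement:
            digest = hashlib.md5(key.encode()).hexdigest()
            self.placement[key] = int(digest, 16) % len(self.machines)
        return self.placement[key]

    def partition(self, key):
        return self.machines[self._home(key)].setdefault(key, Partition())

    def move(self, key, machine_no):
        # Rebalance: relocate one "busy" partition without touching the others.
        part = self.machines[self._home(key)].pop(key, Partition())
        self.machines[machine_no][key] = part
        self.placement[key] = machine_no


store = Store(machine_count=4)
mine = store.partition("customer-17")
mine.put("check-1", {"payee": "Acme Hardware", "amount": 120})
mine.put("airbill-9", {"carrier": "ground", "status": "in transit"})
store.partition("customer-42").put("check-1", {"payee": "Acme Hardware"})

print(store.partition("customer-17").search("acme"))   # ['check-1']
store.move("customer-17", 3)
print(store.partition("customer-17").search("in transit"))   # ['airbill-9']
```

The search only ever looks inside one partition, which is why I only see my checks and not yours, and why it stays cheap. A query across all customers would have to visit every partition, which is the cost mentioned in point 2.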
By the way, the inherent cost of the machines to do all this is relatively negligible. Assume 3 by 400GB cheap disks per machine mounted in racks of 60, and one rack would pretty much do it if there weren’t a need for redundancy and logs; say two racks to cover that. Companies are already coming out this year with highly redundant disk arrays for $1 per GB, or $1200 / machine for the ones above (not counting the $1000 for the machine and memory itself). In short, for 120 such machines, it will probably cost less than $500K, and that’s less than the cost of 3-4 good programmers, and it is a one-time capital cost. But the cost to most people I’ve spoken to in terms of actual people to build and administer such systems is an order of magnitude more. For that matter, configure the 120 machines with 4GB each of memory and you could normally keep the current day’s work in memory, and in many of these cases the data accessed will be the current day’s as people look for their waybills or flight statuses or check their blog comments or whatever.
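For anyone who wants to check the arithmetic, here is a small Python sketch of the back-of-envelope numbers. It assumes the minimum figures from the text (exactly 180 days and exactly 4K per item, taken as 4,000 bytes), decimal units, and two racks of 60 machines; the function and field names are just for illustration.

```python
ITEMS_PER_DAY = 100_000_000
BYTES_PER_ITEM = 4_000          # "4K", taken as 4,000 bytes (assumption)
RETENTION_DAYS = 180

DISKS_PER_MACHINE = 3
GB_PER_DISK = 400
MACHINES_PER_RACK = 60
RACKS = 2                       # one for the data, one for redundancy and logs
DISK_DOLLARS_PER_GB = 1
MACHINE_DOLLARS = 1_000         # machine and memory, excluding disks
MEMORY_GB_PER_MACHINE = 4


def sizing():
    daily_gb = ITEMS_PER_DAY * BYTES_PER_ITEM / 1e9
    retained_tb = daily_gb * RETENTION_DAYS / 1e3
    disk_gb_per_machine = DISKS_PER_MACHINE * GB_PER_DISK
    rack_tb = disk_gb_per_machine * MACHINES_PER_RACK / 1e3
    machines = MACHINES_PER_RACK * RACKS
    disk_dollars = disk_gb_per_machine * DISK_DOLLARS_PER_GB
    return {
        "daily_gb": daily_gb,                          # one day's new data
        "retained_tb": retained_tb,                    # 180 days kept online
        "rack_tb": rack_tb,                            # raw disk in one rack
        "machines": machines,
        "hardware_dollars": machines * (disk_dollars + MACHINE_DOLLARS),
        "memory_gb": machines * MEMORY_GB_PER_MACHINE,
    }


for name, value in sizing().items():
    print(f"{name}: {value:,.0f}")
```

At these minimums the retained data comes to 72 TB, which is what one rack holds, and it grows toward the 100 TB mentioned earlier once items run larger than 4K or are kept longer than 180 days. The 120 machines come to $264,000, and their 480 GB of combined memory is a little more than the 400 GB that arrives in a day.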
Users of databases don’t believe that they are getting any of these three. Salesforce, for example, has a lot of clever technology just to hack around the dynamic schema problem so that 13,000 customers can have 13,000 different views of what a prospect is.
If the database vendors ARE solving these problems, then they aren’t doing a good job of telling the rest of us. The customers I talk to who are using the traditional databases are essentially using them as very dumb row stores and trying very hard to move all the logic and searching out into arrays of machines with in-memory caches. Oracle is doing some very clever high-end things with streaming queries and the ability to see data as of some point in recent history (and even which updates affected the query within some date range) and with integrated pub/sub and queueing, but even Oracle seems to make systems too static and too ponderous to really meet the needs above and, oh yes, they seem to charge about ten times as much as one would expect for them.
Indeed, in these days of open source, I wonder if the software itself should cost anything at all? Open Source solutions would undoubtedly get hacked more quickly to be robust and truly scalable across nice simple software. It wouldn’t be as pointwise fast, but the whole point is that these systems will scale linearly and are so cheap that it doesn’t matter. The advantage of Open Source is that those folks really understand how to build scalable clouds of machines with a default assumption of failure and load balancing. It’s called Apache. There are some other interesting problems that the database vendors are also ignoring (like how do I ask for the set of complaints that are like the ones this customer has) but for now the three above seem like the big ones to me. My message is to the Open Source community that has, so ably, built LAMP (Linux, Apache and Tomcat and MySQL and PHP and Perl and Python). Please finish the job. Do for databases what you did for web servers. Give us dynamism and robustness. Give us systems that scale linearly, are flexible and dynamically reconfigurable and load balanced and easy to use.
Light that LAMP for us please.