The trend toward cloud computing and massively parallel data processing has definitely hit the mainstream media. I can’t believe how much coverage the concept is getting on a weekly basis, despite the fact that few organizations have made much of an investment in this technology, outside of SMP-based parallel databases and ETL.
The impact on basic commercial information management software is potentially substantial. The traditional approach to managing information has been to pull it all together in one place and control its access through a DBMS or Content Repository, typically running on a huge Symmetric Multiprocessor (SMP) box. With the new approach, using infrastructure like Hadoop, that burden can be spread across many smaller servers, which can be distributed across a broader geographic area (to match where the data is likely coming from). These servers work in tandem to process much higher volumes of information, though in truth they act more like a distributed file system than like a DBMS. There is some belief that you won’t need or want those centralized DBMSs once you have this technology. However, I believe a hybrid model will reign for at least the foreseeable future, since businesses need the more mature controls afforded by DBMS infrastructure. Plus, DBMSs are naturally evolving further toward cross-node parallelism (see IBM pureScale), which provides many of the same scalability benefits.
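To make the divide-and-conquer idea concrete, here is a minimal sketch in plain Python of the map/reduce pattern that Hadoop popularized — a hypothetical word count where each "chunk" stands in for the slice of data one server would hold. (This is an illustration of the pattern only, not Hadoop's actual API.)

```python
from collections import Counter
from functools import reduce

# Each chunk stands in for the slice of data one server would hold.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]

def map_chunk(chunk):
    # Map phase: each node independently counts words in its own chunk.
    return Counter(chunk.split())

def merge_counts(a, b):
    # Reduce phase: partial counts from the nodes are merged into one result.
    return a + b

# In a real cluster the map calls run in parallel on separate machines;
# here they simply run in a loop.
partials = [map_chunk(c) for c in chunks]
totals = reduce(merge_counts, partials)
print(totals["the"])  # 4
```

Because each chunk is processed independently, adding servers scales the map phase almost linearly — which is exactly why this model suits high-volume, simple workloads better than a single big SMP box.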
And not only are DBMSs advancing their internal architectures, but they are also beginning to provide seamless interoperability with these distributed file systems. An example of this trend can be seen in Quest’s recent announcement with Cloudera, where they are building adapters for Oracle to allow existing Oracle databases to be extended with Hadoop. This follows on the heels of IBM’s similar announcement about Hadoop support. I find it interesting that Oracle isn’t the one announcing this… they’ve been conspicuously silent on this topic (though they did publish this blog post on how to link the two using scotch tape and baling wire).
Interestingly, the impact is not just on software. This NY Times article talks about the effect this same trend is having on hardware:
“The focus instead is on taking chunks of information, chopping them up and spreading the data across thousands of computers and storage devices. It’s a divide-and-conquer approach to making the avalanche of data produced online manageable.”
The article discusses how larger arrays of smaller processors are showing up in hardware for computing tasks that are not complex, but simply high in volume. The idea is that smaller, less power-hungry chips (like those found in cell phones) can process simple things like Web requests just as effectively as more powerful chips. And these chips can be packed more densely into hardware while still consuming much less power and generating much less heat. Several interesting startups are already pursuing this path with some promising offerings.
So keep an eye out for this trend, and make sure your vendors have a strategy for this, because it is likely to change the way you think about software and hardware in the near future.