In my earlier post on Big Data Goes Mainstream, I referenced an InformationWeek article on Big Data. I made the comment that most of the examples were just big data warehouses rather than true distributed analytics. Interestingly, the other examples that did fit the true Big Data mold were primarily Netezza examples.
Netezza has a great track record of pushing the envelope with analytics, which makes them an ideal acquisition for us, given our strategy.
As I spend more time on the big data issue, it occurs to me that one of the challenges is that big data tends to have a relatively low value density compared with the data we manage in traditional systems. I’m borrowing the idea of value density from the logistics industry, where it is used to determine the best mode of transport for goods based on weight and value. Here, I mean the ratio of business relevance to the size of the data (you can think of it as business-relevant facts per gigabyte).
The issue is that the lower the value density of the data, the less sense it makes to manage that data using conventional enterprise technologies, like data warehouses. The management infrastructure around these conventional technologies is relatively large, and the costs relatively high. At some point, the cost of all that management infrastructure exceeds the value of the incremental insights the data provides. As a result, the things we classify as “Big Data” tend to fall below that break-even point on the value density curve. It just doesn’t make financial sense to spend all that money to bring them into the mainstream.
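To make the break-even argument concrete, here is a minimal sketch of value density as facts per gigabyte, compared against a per-gigabyte management cost. Every name and number in it (the `DataSet` class, `worth_warehousing`, the value per fact, the cost per GB) is an assumption made up for illustration, not a measurement from any real system.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_gb: float
    relevant_facts: int  # hypothetical count of business-relevant facts

    @property
    def value_density(self) -> float:
        """Business-relevant facts per gigabyte."""
        return self.relevant_facts / self.size_gb


def worth_warehousing(ds: DataSet, value_per_fact: float, cost_per_gb: float) -> bool:
    """True when the value carried by each GB exceeds the cost of managing that GB."""
    return ds.value_density * value_per_fact > cost_per_gb


# Illustrative numbers only: a compact transactional set and a large clickstream.
orders = DataSet("orders", size_gb=100, relevant_facts=50_000)
clickstream = DataSet("clickstream", size_gb=10_000, relevant_facts=20_000)

for ds in (orders, clickstream):
    decision = worth_warehousing(ds, value_per_fact=0.10, cost_per_gb=5.0)
    print(ds.name, ds.value_density, decision)
```

With these made-up figures, the clickstream set holds fewer facts per gigabyte than its management would cost, which is the situation described above.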
That said, organizations do want to pull those nuggets of insight out of big data, and likely incorporate that lower volume of filtered information into conventional warehouses, so that it can all be analyzed and sliced and diced together. This is why things like InfoSphere BigInsights (for static distributed information) and InfoSphere Streams (for real-time data streams) are needed to help find those valuable nuggets. Interestingly, I don’t hear the majority of the data warehouse vendors talking about this.
The paradox is that there is no way to know how many facts per GB a given big data set contains until you actually analyze it. At the same time, knowing what to analyze and how (or, as Kevin Weil and Stephen O’Grady describe it, “knowing the right questions to ask”) is impossible unless you spend some time modeling out those questions, which you can really only do effectively using the structured data in your nice high-cost environment.
So, ideally, organizations design (and over time refine) their analytic models against their structured warehouse data, then run those models against big data sets using tools like Hadoop to pull out the good stuff. This is why data warehousing and Hadoop are really codependent.
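The design-then-run pattern can be sketched in a few lines. This is a toy, single-process illustration, not Hadoop code: `build_rule`, `extract_nuggets`, the `amount` field, and the mean-based threshold are all hypothetical stand-ins for whatever model an organization would actually refine against its warehouse.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def build_rule(warehouse_rows: List[Record]) -> Callable[[Record], bool]:
    """Design the model on structured data: flag amounts above the historical mean."""
    amounts = [float(r["amount"]) for r in warehouse_rows]
    threshold = sum(amounts) / len(amounts)
    # Records without an "amount" field are treated as 0 and filtered out.
    return lambda rec: float(rec.get("amount", 0)) > threshold


def extract_nuggets(raw: Iterable[Record], rule: Callable[[Record], bool]) -> Iterator[Record]:
    """Run the model over raw records, keeping only those the rule selects."""
    return (rec for rec in raw if rule(rec))


# Small, clean warehouse sample used to design the rule (illustrative values).
warehouse_rows = [{"amount": 10}, {"amount": 20}, {"amount": 30}]
rule = build_rule(warehouse_rows)

# Larger, messier raw feed; only the filtered subset would be loaded back.
raw_feed = [{"amount": 5}, {"amount": 25}, {"note": "no amount"}, {"amount": 40}]
nuggets = list(extract_nuggets(raw_feed, rule))
print(nuggets)
```

In a real deployment the filtering step would run as a distributed job across the cluster, and only the resulting low-volume subset would be loaded into the warehouse.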