
Visualizing Big Data

One of the big problems with Big Data is quickly understanding what is in it. Visualization tools can help enormously here. IBM provides two sets of tools that help solve this problem for big data sets.

First of all, IBM BigSheets is our insight engine for line-of-business professionals within BigInsights. It layers on top of Hadoop, leveraging MapReduce to process big data sets, and provides what looks and feels like a browser-based spreadsheet for manipulating huge datasets. Like a spreadsheet, the data can be sorted, filtered, pivoted, or calculated. You can even run macros against it, including LanguageWare macros.

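BigSheets generates and runs the underlying MapReduce jobs for you, but to give a feel for what a sheet-level filter and pivot turns into underneath, here is a rough sketch written as a Hadoop Streaming job in Python. This is not how BigSheets itself is implemented; the tab-separated column layout and the "FAILED" status filter are invented for the illustration.

```python
#!/usr/bin/env python3
# Rough sketch of a spreadsheet-style "filter, then pivot and count" operation
# expressed as a Hadoop Streaming job. The input layout (tab-separated:
# id, country, status, amount) is hypothetical.
#
# Typical invocation (jar name and paths vary by Hadoop install):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/orders -output /data/failed-by-country \
#     -mapper "bigsheets_sketch.py map" -reducer "bigsheets_sketch.py reduce" \
#     -file bigsheets_sketch.py
import sys

def map_phase():
    """Keep only rows with status FAILED and emit the pivot key (country)."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                       # skip malformed rows
        _id, country, status, _amount = fields[:4]
        if status == "FAILED":             # the spreadsheet-style filter
            print("%s\t1" % country)       # emit (key, count)

def reduce_phase():
    """Sum counts per key; Hadoop hands keys to the reducer already sorted."""
    current_key, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, total))
            total = 0
        current_key = key
        total += int(count)
    if current_key is not None:
        print("%s\t%d" % (current_key, total))

if __name__ == "__main__":
    reduce_phase() if sys.argv[1:] == ["reduce"] else map_phase()
```

In BigSheets you never write this code; the point is simply that each sheet operation maps onto this kind of scan across the full data set.
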
Also like a spreadsheet, you can visualize the data. For this, BigSheets uses IBM Many Eyes. The result is some really nice visualizations of huge data sets that can be tailored by anybody who knows how to use a regular spreadsheet. There is a great demo on YouTube if you are interested.

For more intensive analytics, we provide Cognos Content Analytics, which offers deeper analytic capabilities, including the ability to find specific associations of concepts across unstructured documents. For example, you can use the tool to discover the incidence of specific part failures across warranty claims, and then understand what the primary customer complaints were for each part failure.

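Content Analytics does the linguistic work (annotators, dictionaries, facet navigation) for you, but the basic idea of associating concepts across documents can be shown with a simple co-occurrence count. The part-failure terms, complaint terms, and sample claims below are all invented for the example; they are not part of the product.

```python
from collections import defaultdict
from itertools import product

# Hypothetical concept dictionaries; a real deployment would derive these from
# text analytics rather than hard-coded keyword lists.
PART_FAILURES = {"fuel pump", "alternator", "brake line"}
COMPLAINTS = {"stalling", "won't start", "grinding noise"}

def cooccurrences(claims):
    """Count how often each (part failure, complaint) pair shows up in the same claim."""
    counts = defaultdict(int)
    for claim in claims:
        text = claim.lower()
        parts = [p for p in PART_FAILURES if p in text]
        complaints = [c for c in COMPLAINTS if c in text]
        for pair in product(parts, complaints):
            counts[pair] += 1
    return counts

# Invented sample claims, just to show the shape of the output.
sample = [
    "Customer reports stalling at idle; dealer replaced the fuel pump.",
    "Alternator failed on the highway and now the car won't start.",
    "Fuel pump warranty work done after repeated stalling complaints.",
]
for (part, complaint), n in sorted(cooccurrences(sample).items()):
    print("%-12s <-> %-15s %d" % (part, complaint, n))
```

The real product scales this across millions of documents and handles the linguistics properly; the underlying question of which concepts occur together, and how often, is the same.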

The Value Density of Information

As I spend more time on the big data issue, it occurs to me that one of the challenges of dealing with big data is that it tends to have a relatively low value density, as compared to the data we manage in traditional systems. I’m stealing the idea of value density from the logistics industry, where it is used to determine the best mode of transport for goods based on weight and value. In this case, I’m referring to the ratio of business relevance to the size of the data (you can think of it as business-relevant facts per gigabyte).

The issue is that the lower the value density, the less sense it makes to manage the data using conventional enterprise technologies, like data warehouses. The management infrastructure around these conventional technologies is relatively large, and the costs relatively high. At some point, the cost of all that management infrastructure exceeds the value of the incremental insights the data provides. As a result, the things we classify as “Big Data” tend to fall beneath the value density curve: it just doesn't make financial sense to spend all that money to bring them into the mainstream.

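To make "facts per gigabyte" concrete, here is a back-of-the-envelope sketch of the break-even argument. Every number in it (data set sizes, fact counts, the per-gigabyte warehouse cost, the value of a fact) is invented purely for illustration.

```python
# Back-of-the-envelope value density comparison; every number here is made up.
datasets = {
    # name: (business-relevant facts, size in GB)
    "warehouse transactions": (5_000_000, 500),
    "clickstream logs":       (2_000_000, 40_000),
    "social media firehose":  (500_000, 200_000),
}

WAREHOUSE_COST_PER_GB = 20.0   # hypothetical fully loaded management cost, $/GB
VALUE_PER_FACT = 0.01          # hypothetical value of one business-relevant fact, $

for name, (facts, size_gb) in datasets.items():
    density = facts / size_gb                 # value density: facts per GB
    value_per_gb = density * VALUE_PER_FACT   # insight value per GB stored
    verdict = ("worth the warehouse" if value_per_gb > WAREHOUSE_COST_PER_GB
               else "below the curve")
    print(f"{name:24s} {density:9.1f} facts/GB   {verdict}")
```

The exact numbers are beside the point; what matters is that once the insight value per gigabyte drops below the management cost per gigabyte, the data falls below the curve.
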
That said, organizations do want to pull those nuggets of insight out of big data, and will likely incorporate that lower volume of filtered information into conventional warehouses so that it can all be analyzed, sliced, and diced together. This is why things like InfoSphere BigInsights (for static distributed information) and InfoSphere Streams (for real-time data streams) are needed to help find those valuable nuggets. Interestingly, I don't hear the majority of the data warehouse vendors talking about this.

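InfoSphere Streams applications are actually written in its own Streams Processing Language, so the snippet below is only a generic Python illustration of the pattern the paragraph describes: watch a feed, keep the small fraction of records worth loading into the warehouse, and drop the rest. The record layout (sensor id, temperature, timestamp) and the threshold are invented.

```python
import csv
import sys

# Generic "filter the stream, keep the nuggets" sketch. The 90-degree threshold
# and the three-column record layout are assumptions for the example.

def nuggets(stream, threshold=90.0):
    """Yield only the readings interesting enough to land in the warehouse."""
    for row in stream:
        if len(row) != 3:
            continue                      # skip malformed records
        _sensor_id, temperature, _timestamp = row
        try:
            if float(temperature) >= threshold:
                yield row
        except ValueError:
            continue                      # skip non-numeric readings

def main():
    reader = csv.reader(sys.stdin)        # stand-in for the live feed
    writer = csv.writer(sys.stdout)       # stand-in for the warehouse load path
    for row in nuggets(reader):
        writer.writerow(row)

if __name__ == "__main__":
    main()
```

The batch side has the same shape: BigInsights jobs reduce terabytes of raw data down to the much smaller filtered set the warehouse actually ingests.
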
The paradox is that there is no way to know how many facts per GB a given big data set contains until you actually do some analysis on it. At the same time, understanding what to analyze and how (or, as Kevin Weil and Stephen O'Grady describe it, "knowing the right questions to ask") is impossible unless you spend some time modeling out those questions (which you can really only do effectively using your structured data in your nice high-cost environment).

So ideally, organizations would like to design (and over time refine) their analytic models against their structured warehouse data, and then run those models against big data sets using things like Hadoop to pull out the good stuff. This is why data warehousing and Hadoop are really codependent.

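As a closing sketch, here is what that codependence can look like in practice: the "model" is just a function, tuned against a small warehouse extract, and then reused unchanged as the mapper of a Hadoop Streaming job over the full big data set. The churn rule, field names, and sample records are all invented.

```python
import sys

# Sketch of the "refine the model on warehouse data, run it at scale on Hadoop"
# loop. The churn rule and all field names are invented for the example.

def churn_risk(record):
    """The 'analytic model': kept as a plain function so the same logic can be
    tuned interactively and then shipped unchanged into a Hadoop job."""
    return record["monthly_spend"] > 100 and record["support_calls"] >= 3

def tune_against_warehouse_extract():
    # In practice this extract would come from a warehouse query; hard-coded here.
    sample = [
        {"customer": "A", "monthly_spend": 150, "support_calls": 4},
        {"customer": "B", "monthly_spend": 40,  "support_calls": 1},
    ]
    return [r["customer"] for r in sample if churn_risk(r)]

def streaming_mapper():
    # Reuse churn_risk() as the mapper of a Hadoop Streaming job; the big data
    # set is assumed to be tab-separated: customer, spend, calls.
    for line in sys.stdin:
        customer, spend, calls = line.rstrip("\n").split("\t")
        if churn_risk({"customer": customer,
                       "monthly_spend": float(spend),
                       "support_calls": int(calls)}):
            print(line.rstrip("\n"))

if __name__ == "__main__":
    print(tune_against_warehouse_extract())   # expect ['A'] with this sample
```

Tune the rule on the warehouse side of that loop, push it out to the Hadoop side, and repeat: that is the codependence in a nutshell.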