Tag Archives: Hadoop

Visualizing Big Data

One of the big problems with Big Data is quickly understanding what is in it. Visualization tools can help enormously here. IBM provides two sets of tools that can help solve this problem for big data sets.

First of all, IBM BigSheets is our insight engine for line-of-business professionals within BigInsights. It layers on top of Hadoop, leveraging MapReduce to process big data sets, and it provides what looks and feels like a browser-based spreadsheet for manipulating huge datasets. Like a spreadsheet, data can be sorted, filtered, pivoted, or calculated. You can even run macros against it, including LanguageWare macros.
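
Under the covers, those spreadsheet operations become MapReduce jobs. As a rough illustration only (not BigSheets' actual output, and with hypothetical column names and thresholds), here is what a simple filter-and-aggregate might look like as a Hadoop Streaming mapper written in Python:

    #!/usr/bin/env python
    # mapper.py -- hypothetical Hadoop Streaming mapper: filter rows, emit (region, amount)
    import csv
    import sys

    for row in csv.reader(sys.stdin):
        if len(row) < 3:
            continue                          # skip malformed rows
        region, product, amount = row[0], row[1], row[2]
        try:
            value = float(amount)
        except ValueError:
            continue
        if value > 100.0:                     # the "filter" a spreadsheet user would click through
            print("%s\t%f" % (region, value))

And the matching reducer, which relies on the Streaming framework sorting the mapper output by key and then sums the filtered values per region:

    #!/usr/bin/env python
    # reducer.py -- hypothetical Hadoop Streaming reducer: sum the filtered amounts per region
    import sys

    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%f" % (current_key, total))
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print("%s\t%f" % (current_key, total))

The pair would normally be submitted with the Hadoop Streaming jar; BigSheets hides all of this behind the spreadsheet metaphor.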

Also like a spreadsheet, you can visualize the data. For this, BigSheets uses IBM Many Eyes. The result is some really nice visualizations of huge data sets that can be tailored by anybody who knows how to use a regular spreadsheet. There is a great demo on YouTube if you are interested.
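
Many Eyes does the charting inside BigSheets itself, but the underlying step is simple: take the now-small, aggregated output of the MapReduce job and plot it. A minimal, hypothetical sketch using matplotlib rather than the Many Eyes service:

    import matplotlib.pyplot as plt

    # Read the reducer output (region<TAB>total), which is now small enough to chart directly.
    regions, totals = [], []
    with open("part-00000") as f:                 # a typical Hadoop output part file
        for line in f:
            region, total = line.rstrip("\n").split("\t")
            regions.append(region)
            totals.append(float(total))

    plt.bar(regions, totals)
    plt.title("Filtered totals by region (hypothetical data)")
    plt.ylabel("Total amount")
    plt.show()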

For more intensive analytics, we provide Cognos Content Analytics, which offers deeper analytic capabilities, including the ability to find specific associations of concepts across unstructured documents. For example, you can use the tool to discover the incidence of specific part failures across warranty claims, and then understand what the primary customer complaints were for each part failure.


Is the database dead?

Is the database dead? It's an interesting question, and one that analysts like Donald Feinberg from Gartner are beginning to raise. The NoSQL movement is in full gear, with a variety of alternatives to the traditional RDBMS, like Google's BigTable, Amazon's Dynamo, and Hadoop's HBase. In fact, most of the new wave of large internet-based businesses use a variation of these key/value stores. So why are they doing this? What is wrong with the traditional database that would drive these companies to alternatives? Were they just looking to avoid paying the DBMS vendors, or has the world simply moved on?

As it turns out, the nature of these new internet-based businesses (Google, Facebook, LinkedIn, Yahoo, Amazon, etc.), the kinds of information they need to store, and their patterns of access were what guided them to a new kind of solution. There are three primary characteristics of the way these companies manage their data that have baffled traditional databases:

  1. Their data tends to be distributed across large grids of systems, often geographically dispersed
  2. Their data is inconsistently structured (no consistent schema)
  3. Access to the data is more search-oriented (what would amount to full table scans without predictable indexes in an RDBMS)

These three characteristics are particularly nasty for your traditional RDBMS to handle. Most RDBMSs were designed to run on large SMP systems, as they rely heavily on fast channels to disk and memory. Some vendors, notably IBM with DB2 pureScale and Oracle with RAC, have provided the ability to scale out across servers (though only pureScale has been shown to have near-linear scalability across hundreds of nodes).

From a structure perspective, RDBMSs require a set schema. Some databases, like DB2, can also store XML data natively, but even that needs to have a set schema. When data isn't rigidly structured, there is no efficient way for the database to manage it.
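
To make the contrast concrete, here is a small hypothetical illustration (in Python, with SQLite standing in for the relational side): the table demands that every row fit the declared columns, while a key/value or document store happily accepts records whose shape varies from one entry to the next.

    import json
    import sqlite3

    # Relational: the schema is fixed up front, and every row must conform to it.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    db.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 'Austin')")
    # A record with an extra, unplanned attribute has nowhere to go without altering the schema.

    # Key/value style: each value is an opaque blob, so records can differ in structure.
    store = {}
    store["customer:1"] = json.dumps({"name": "Acme Corp", "city": "Austin"})
    store["customer:2"] = json.dumps({"name": "Globex", "clicks": [11, 42], "sensor": {"temp": 71}})
    print(json.loads(store["customer:2"])["sensor"]["temp"])   # 71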

In addition, databases are built for fast transactional retrieval based on specific keys and indexes, or for queries based on specific relational structures. Search-like scanning queries that aren't based on indexes can be horribly inefficient in an RDBMS.

So I guess that means the database is dead, right? Well, not so fast… While the key/value stores are extremely scalable, great at fast search retrieval, and able to deal with inconsistent data structures (or even unstructured data), they aren't particularly efficient at managing transactional, application-oriented access. They are really designed for pulling back everything you know about something based on a keyword. When you know exactly what multiple related things you want (like the accounts for a specific customer), relational databases are much more efficient (and much more predictable).
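
A hypothetical sketch of the two access patterns (again in Python, with SQLite standing in for the RDBMS): the key/value store answers "give me everything filed under this key" with a single get, while the relational database efficiently returns a precise, predictable slice, such as the accounts for one customer.

    import json
    import sqlite3

    # Key/value access: one lookup by key returns the whole blob you stored about the entity.
    store = {"customer:1": json.dumps({"name": "Acme Corp", "orders": [101, 102], "notes": "prefers email"})}
    profile = json.loads(store["customer:1"])

    # Relational access: an index-driven join returns exactly the related rows you asked for.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, customer_id INTEGER, balance REAL)")
    db.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
    db.executemany("INSERT INTO account VALUES (?, ?, ?)", [(10, 1, 2500.0), (11, 1, 90.5)])
    accounts = db.execute(
        "SELECT a.id, a.balance FROM account a JOIN customer c ON a.customer_id = c.id WHERE c.id = ?",
        (1,)
    ).fetchall()
    print(profile["orders"], accounts)            # [101, 102] [(10, 2500.0), (11, 90.5)]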

In addition, things like transaction integrity, security, compression, and workload management are much more advanced in RDBMSs – and all of these are table stakes for most business applications. I liken the comparison to the initial debate about REST vs. SOAP. REST, with its lack of restrictions and overhead, appeared poised to overtake standards-happy SOAP. In the end, they both found their niche – SOAP in places where the depth and control were needed, and REST in places where that stuff doesn’t matter.

That said, there is nothing to prevent these new kinds of databases from developing advanced capabilities. And the advantages of free-form search, massive scalability, and data type independence make key/value stores a very attractive addition to a data management portfolio. So, while I think it will be a while before they mature enough to deal with mainstream application processing, I do think that more and more companies will be adopting them as a complement to their RDBMSs – and doing so soon.


The Value Density of Information

As I spend more time on the big data issue, it occurs to me that one of the challenges of dealing with big data is that it tends to have a relatively low value density, as compared to the data we manage in traditional systems. I’m stealing the idea of value density from the logistics industry, where it is used to determine the best mode of transport for goods based on weight and value. In this case, I’m referring to the ratio of business relevance to the size of the data (you can think of it as business-relevant facts per gigabyte).

The issue is that the lower the value density, the less sense it makes to manage the data using conventional enterprise technologies, like data warehouses. The management infrastructure around these conventional technologies is relatively large, and the costs relatively high. At some point, the cost of all that management infrastructure exceeds the value of the incremental insights in the data. As a result, the things we classify as "Big Data" tend to fall beneath the value density curve. It just doesn't make financial sense to spend all that money to bring them into the mainstream.
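
A back-of-the-envelope way to express the idea, with deliberately made-up numbers: if the expected value of the business-relevant facts in a gigabyte is less than what it costs to manage that gigabyte in the warehouse, the data falls below the line.

    # Hypothetical figures, purely to illustrate the "value density" threshold.
    def worth_warehousing(facts_per_gb, value_per_fact, warehouse_cost_per_gb):
        """True if the expected value in each GB exceeds the cost of managing it."""
        return facts_per_gb * value_per_fact > warehouse_cost_per_gb

    # Curated transactional data: dense in business-relevant facts.
    print(worth_warehousing(facts_per_gb=50000, value_per_fact=0.01, warehouse_cost_per_gb=25))  # True

    # Raw clickstream or sensor logs: the same facts buried in far more bytes.
    print(worth_warehousing(facts_per_gb=500, value_per_fact=0.01, warehouse_cost_per_gb=25))    # False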

That said, organizations do want to pull those nuggets of insight out of big data, and likely incorporate that lower volume of filtered information into conventional warehouses, so that it can all be analyzed and sliced and diced together. This is why things like InfoSphere BigInsights (for static distributed information) and InfoSphere Streams (for real-time data streams) are needed to help find those valuable nuggets. Interestingly, I don’t hear the majority of the data warehouse vendors talking about this.

The paradox is that there is no way to know how many facts per GB there are in a given big data set until you actually do some analysis on it. And at the same time, understanding what and how to analyze (or as Kevin Weil and Stephen O’Grady describe it, “knowing the right questions to ask”) is impossible unless you spend some time modeling out the right questions (which you really can only do effectively by using your structured data in your nice high-cost environment).

So, ideally organizations like to design (and over time refine) their analytic models against their structured warehouse data, but then run them against big data sets using things like Hadoop so that they can pull out the good stuff. This is why data warehousing and Hadoop are really codependent.
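
A hedged sketch of that workflow: a simple rule or scoring function is worked out against a sample in the warehouse, then pushed out as a Hadoop Streaming mapper so the same logic can sift the full raw data set, with only the survivors loaded back into the warehouse. The field names and thresholds below are invented for illustration.

    #!/usr/bin/env python
    # filter_mapper.py -- hypothetical Hadoop Streaming mapper that applies a rule
    # refined against warehouse data, keeping only the records worth loading back.
    import json
    import sys

    def interesting(event):
        # Stand-in for a model or rule tuned earlier against the structured warehouse sample.
        return event.get("error_count", 0) > 3 and event.get("duration_ms", 0) > 5000

    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue                              # skip malformed records rather than fail the job
        if interesting(event):
            # Emit a compact, structured record suitable for bulk-loading into the warehouse.
            print("%s\t%d\t%d" % (event.get("device_id", "unknown"),
                                  event["error_count"], event["duration_ms"]))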


Big data goes mainstream

Great article in InformationWeek on Big Data. To be honest, most of the referenced examples are just big data warehouses and don't extend beyond that to streaming data and large distributed file systems, which I think will be the norm in the future, but the article still provides some interesting examples of where people are today.

I’ve spent a lot of time with 3UK, referenced in the article, who went with the IBM Smart Analytic System. They are doing some very interesting stuff, beginning with their model-driven approach to data integration, and extending all the way out to their recent investments in in-database analytics. What they are doing with their warehouse investment is very cool.

What the 3UK example shows, and what many of the other examples in this article fail to show, is that "big data" isn't just about having a big data warehouse and running conventional reporting and BI against it. It is really about running analytic models against data that would be too big and too expensive to analyze conventionally. The analytic models allow you to find nuggets of value by sifting through huge amounts of information. Some of the examples in the article capture this thought, but most are just big data warehouses. I think you can really tell the difference among the warehouse vendors by looking at their vision and investments around analytics. Analytics is the future of the industry. The ones making the investments today will be the leaders of the next wave of change.


A million tiny pieces…

The trend toward cloud computing and massively parallel data processing has definitely hit the mainstream media. I can't believe how much coverage the concept is getting on a weekly basis, despite the fact that few organizations have made much of an investment in this technology, outside of SMP-based parallel databases and ETL.

The impact on basic commercial information management software is potentially substantial. The traditional approach to managing information has been to pull it all together in one place and control its access through a DBMS or content repository, typically running on a huge Symmetric Multiprocessor (SMP) box. With the new approach, using infrastructure like Hadoop, that burden can be spread across many smaller servers, which can be distributed across a broader geographic area (to match where the data is likely coming from). These servers work in tandem to process much higher volumes of information, though in truth they act more like a distributed file system than like a DBMS. There is some belief that you won't need or want those centralized DBMSs once you have this technology. However, I believe a hybrid model will reign for at least the foreseeable future, since businesses need the more mature controls afforded by DBMS infrastructure. Plus, DBMSs are naturally evolving further toward cross-node parallelism (see IBM pureScale), which provides many of the same scalability benefits.

And not only are the DBMSs advancing their internal architectures, they are also beginning to provide seamless interoperability with these distributed file systems. An example of this trend can be seen in Quest's recent announcement with Cloudera, where they are building adapters for Oracle to allow existing Oracle databases to be extended with Hadoop. This follows on the heels of IBM's similar announcement about Hadoop support. I find it interesting that Oracle isn't the one announcing this… they've been conspicuously silent on this topic (though they did publish this blog post on how to link the two using scotch tape and baling wire).

Interestingly, the impact is not just on software. This NY Times article talks about the effect this same trend is having on hardware:

The focus instead is on taking chunks of information, chopping them up and spreading the data across thousands of computers and storage devices. It’s a divide-and-conquer approach to making the avalanche of data produced online manageable.

The article discusses how larger arrays of smaller processors are showing up in hardware where computing tasks are not complex, just high in volume. The idea is that smaller, less power-hungry chips (like those found in cell phones) can process simple things like Web requests just as effectively as more powerful chips. And these chips can be packed more densely into hardware while still consuming much less power and generating much less heat. Some interesting startups have started down this path with some promising offerings.

So keep an eye out for this trend, and make sure your vendors have a strategy for this, because it is likely to change the way you think about software and hardware in the near future.


Jumping through “big” Hadoops

Since my blog is called Information Explosion, I figured I should start out with a discussion on big data. I started to subtitle my blog "The Story of Big Data," but then I remembered Stephen O'Grady's words of wisdom on the topic and decided to change that… That said, I do think that "big data" is as good a descriptor for the problem as anything, so I plan to use it without shame.

I was a bit surprised to find a Wikipedia definition of “big data”:

“The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools.”

Not surprisingly, the definition pulls directly from Tom White’s book on Hadoop. It also correctly raises the same point that Stephen makes, that what constitutes “big” will vary from organization to organization. In fact, I think that “big” is really characterized not just by sheer volume, but also by how complex the types of data are (lots of unstructured data), how distributed it is, and whether or not it is even something you would store.

That last point is important… the more we hook up sensors and other "things" to the internet, the more data we create (obviously), but a lot of that data is just noise. We don't need to use all of it, much less store and manage it. So we use streaming analytics to extract the valuable insights from all that noise and store only those, so that we can analyze and use them in relation to our other data in our conventional systems – no different in concept from the data mining that has been done for years, just on a different time and volume scale (and working against much more complex, less structured data types).
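
As a toy illustration of that idea (a generic sketch, not InfoSphere Streams itself): read a continuous feed of sensor readings, treat everything below a threshold as noise, and keep only the rare events worth storing alongside the rest of your data.

    import random
    import time

    ALERT_TEMP = 85.0                             # made-up threshold; readings below it are noise

    def sensor_feed():
        """Hypothetical stand-in for a continuous stream of sensor readings."""
        while True:
            yield {"sensor": "pump-7", "ts": time.time(), "temp": random.gauss(70, 5)}

    def interesting_events(stream):
        for reading in stream:
            if reading["temp"] >= ALERT_TEMP:     # only the rare, valuable events survive
                yield reading

    # In practice the survivors would be written to the warehouse; here we just take one and stop.
    for event in interesting_events(sensor_feed()):
        print(event)
        break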

For the data that does land someplace, more and more of it is not structured in a way that makes it easy to parse through, analyze, and query using traditional technologies. It is also becoming increasingly distributed due to the methods used to collect and manage it. This kind of data has been the bread and butter of search technology for a long time, so it’s not surprising that Google inspired one of the key technologies for running analytics against this data to find the valuable nuggets of insight within it – Apache Hadoop. I believe Hadoop holds all kinds of promise for how companies will manage this new class of big data. Analytics is just the tip of the iceberg.

So as you see, the patterns used against big data are the same as the ones we’ve always used with our nice, controlled database environments – just working against a much more chaotic set of data.
