In my earlier post on Big Data Goes Mainstream, I referenced an InformationWeek article on Big Data. I made the comment that most of the examples were just big data warehouses rather than true distributed analytics. Interestingly, the other examples that did fit the true Big Data mold were primarily Netezza examples.
Netezza has a great track record of pushing the envelope with analytics, which makes them an ideal acquisition for us, given our strategy.
As I spend more time on the big data issue, it occurs to me that one of the challenges of dealing with big data is that it tends to have a relatively low value density compared to the data we manage in traditional systems. I’m stealing the idea of value density from the logistics industry, where it is used to determine the best mode of transport for goods based on their weight and value. In this case, I’m referring to the ratio of business relevance to the size of the data (you can think of it as business-relevant facts per gigabyte).
The issue is that the lower the value density, the less sense it makes to manage it using conventional enterprise technologies, like data warehouses. The management infrastructure around these conventional technologies is relatively large, and the costs relatively high. At some point, the cost of all that management infrastructure exceeds the value of the incremental insights provided in the data. As a result, the things we tend to classify as “Big Data” tend to fall beneath the value density curve. It just doesn’t make financial sense to spend all that money to bring them into the mainstream.
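The break-even idea above can be sketched as a toy calculation. All of the numbers below are hypothetical, chosen purely to show the shape of the tradeoff, not drawn from any real warehouse pricing:

```python
# Toy illustration of the "value density" break-even idea.
# Every number here is made up; only the shape of the tradeoff matters.

def value_density(relevant_facts: int, size_gb: float) -> float:
    """Business-relevant facts per gigabyte of raw data."""
    return relevant_facts / size_gb

def worth_warehousing(facts: int, size_gb: float,
                      cost_per_gb: float, value_per_fact: float) -> bool:
    """True if the value of the insights exceeds the management cost."""
    return facts * value_per_fact > size_gb * cost_per_gb

# A curated sales table: dense in facts, easy to justify warehousing.
print(value_density(1_000_000, 10))       # 100000.0 facts/GB
print(worth_warehousing(1_000_000, 10,
                        cost_per_gb=50, value_per_fact=0.01))      # True

# Raw clickstream logs: same number of facts spread over 10 TB,
# so the cost of conventional management swamps the insight value.
print(value_density(1_000_000, 10_000))   # 100.0 facts/GB
print(worth_warehousing(1_000_000, 10_000,
                        cost_per_gb=50, value_per_fact=0.01))      # False
```

The second data set "falls beneath the value density curve" in exactly the sense described above: the facts are there, but they are too thinly spread to pay for conventional management.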
That said, organizations do want to pull those nuggets of insight out of big data, and likely incorporate that lower volume of filtered information into conventional warehouses, so that it can all be analyzed and sliced and diced together. This is why things like InfoSphere BigInsights (for static distributed information) and InfoSphere Streams (for real-time data streams) are needed to help find those valuable nuggets. Interestingly, I don’t hear the majority of the data warehouse vendors talking about this.
The paradox is that there is no way to know how many facts per GB there are in a given big data set until you actually do some analysis on it. And at the same time, understanding what and how to analyze (or as Kevin Weil and Stephen O’Grady describe it, “knowing the right questions to ask”) is impossible unless you spend some time modeling out the right questions (which you really can only do effectively by using your structured data in your nice high-cost environment).
So, ideally organizations like to design (and over time refine) their analytic models against their structured warehouse data, but then run them against big data sets using things like Hadoop so that they can pull out the good stuff. This is why data warehousing and Hadoop are really codependent.
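That "refine the model on warehouse data, then run it against the big data set" pattern can be sketched in miniature as a map/reduce job. The threshold below is a stand-in for an analytic model refined against structured warehouse data, and the records are invented; real Hadoop jobs distribute these phases across a cluster, but the shape is the same:

```python
# A minimal map/reduce sketch of "run the refined model against raw data
# to pull out the good stuff." The spend threshold stands in for a model
# developed against warehouse data; all records are illustrative.
from collections import defaultdict

def map_phase(record):
    # Emit (customer, spend) only for records the model deems relevant.
    customer, action, spend = record
    if action == "purchase" and spend > 100:   # stand-in for the trained model
        yield customer, spend

def reduce_phase(pairs):
    # Aggregate the surviving nuggets per customer.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw_log = [
    ("alice", "purchase", 250.0),
    ("bob",   "pageview",   0.0),   # noise: dropped in the map phase
    ("alice", "purchase", 120.0),
    ("carol", "purchase",  30.0),   # below the model's threshold
]

nuggets = reduce_phase(kv for rec in raw_log for kv in map_phase(rec))
print(nuggets)   # {'alice': 370.0}
```

The filtered output is small enough to load back into the conventional warehouse, which is the codependence described above.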
Great article in InformationWeek on Big Data. To be honest, most of the referenced examples are just big data warehouses that haven’t been extended to streaming data and large distributed file systems, which I think will be the norm in the future. Still, it provides some interesting examples of where people are today.
I’ve spent a lot of time with 3UK, referenced in the article, who went with the IBM Smart Analytic System. They are doing some very interesting stuff, beginning with their model-driven approach to data integration, and extending all the way out to their recent investments in in-database analytics. What they are doing with their warehouse investment is very cool.
What the 3UK example shows, and what many of the other examples in this article fail to show, is that “big data” isn’t just about having a big data warehouse and running conventional reporting and BI against it. It is really about running analytic models against data that would be too big and too expensive to analyze conventionally. The analytic models allow you to find nuggets of value by sifting through huge amounts of information. Some of the examples in the article capture this thought, but most are just big data warehouses. I think you can really tell the difference among the warehouse vendors by looking at their vision and investments around analytics. Analytics is the future of the industry. The ones making the investments today will be the leaders of the next wave of change.
IBM has announced a new system called Watson that is capable of answering human-language questions. In fact, there is a great profile article in the NY Times that provides an overview of how it is going to take on a group of Jeopardy! champions.
OK… I deal with a lot of cool implementations of business analytics, but I have to admit, I haven’t seen anything like this… Just imagine the kind of analytical power behind something that needs to be able to parse through human language, recognize subtle associations between words and concepts, and also access an entire universe worth of trivial information. It just goes to show you how much we take the human mind for granted.
When IBM beat Kasparov with Deep Blue back in the ’90s, I was impressed, but that challenge seemed surmountable. Chess has a limited number of permutations. At the end of the day, it is really just a very big and complex math equation. But something like Jeopardy! is completely different. Not only is the number of permutations limitless, but Jeopardy! specializes in twists of human language – puns that require a knowledge of language structure, history, and even popular culture. I could imagine, for example, that Jeopardy! would challenge even an expert linguist who lacked the foundation of being immersed in our culture.
Let me give you some examples. Here are three questions from recent Jeopardy! contests:
Let’s take each one of these individually. The first one is the easiest. As a computer, your first task would be to parse the composition of the sentence to determine what it is asking for. Okay, something that was sold between 1908 and 1927 and cost between $360 and $825. Perhaps as a computer you could just use search to determine what cost that much in those years, but it’s likely that multiple things would come up (in fact, a simple search on Google brings back the 1908 Indian Twin motorcycle first, which sold for… you guessed it, $360). You’d probably have to know that that much money was quite a lot back then, and since it was limited production, it was something manufactured (i.e., not a house). You might be able to narrow in on the fact that this was likely a car or other vehicle, and that the only vehicle with that long a production run starting in 1908 was the Model T. But of course, there is a lot of interpolation in that logic, and all of it has to happen within a second or two in order to beat the other contestants to the buzzer…
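Just the very first step in that chain – pulling the structured constraints out of the clue – can be sketched as a toy. The clue text below is my own paraphrase (the original wording isn’t reproduced here), and a regex is of course a laughably small piece of what Watson actually does:

```python
# A toy sketch of only the first step: extracting the years and prices
# from a clue. The clue text is a paraphrase, not the actual Jeopardy! clue.
import re

clue = "Sold from 1908 to 1927, this limited-production item cost between $360 and $825."

years = [int(y) for y in re.findall(r"\b(19\d{2})\b", clue)]   # four-digit 19xx years
prices = [int(p) for p in re.findall(r"\$(\d+)", clue)]        # dollar amounts

print(years)   # [1908, 1927]
print(prices)  # [360, 825]

# From here the genuinely hard part begins: turning the constraints
# (1908-1927, $360-$825, limited production) into a candidate answer
# like "the Model T" within a second or two.
```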
Okay, so that one was easy. Let’s take on #2, which is a bit harder… The problem with this one is that it is SO open-ended. The answer could be just about anything. The first movie about war? The first movie about pacifism? The first movie to promote white supremacy? All of these might be true… A search on Wikipedia does turn up that this is considered to be the first sequel, but it also turns up that this is the first feature film with its own original symphonic score. So which is it likely to be? A human would understand that the former is likely more significant, because we understand our culture. A computer? I don’t see how it could…
So let’s take on question number 3. This one is a real doozy… The dissection of the parts of speech alone would challenge any English major. Then, for a computer to understand how to recognize a spoonerism… the programming behind something like this would have to be ridiculously complex.
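Even the easy half of the problem – generating a spoonerism by swapping the initial consonant clusters of two words – takes a bit of machinery, and it’s the trivial direction. The sketch below is mine, not anything from Watson; recognizing a spoonerism buried inside a clue, and mapping it back to a real phrase, is the genuinely hard part:

```python
# A sketch of the easy direction only: producing a spoonerism by swapping
# the leading consonant clusters of two words. Illustrative, not Watson's method.
import re

def spoonerize(a: str, b: str) -> tuple:
    """Swap leading consonant clusters: ('jelly', 'beans') -> ('belly', 'jeans')."""
    onset = re.compile(r"^[^aeiou]*", re.IGNORECASE)
    oa = onset.match(a).group()   # e.g. "j" from "jelly"
    ob = onset.match(b).group()   # e.g. "b" from "beans"
    return ob + a[len(oa):], oa + b[len(ob):]

print(spoonerize("jelly", "beans"))   # ('belly', 'jeans')
print(spoonerize("light", "rain"))    # ('right', 'lain')
```

Going the other way – hearing “belly jeans,” recovering “jelly beans,” and knowing it’s a candy – requires exactly the cultural knowledge the post keeps coming back to.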
If you want to see more examples, just sift through the Jeopardy! archives. I can’t imagine a computer being able to understand so many of these (“Do not go gentle into this “ursine” Canadian lake /Rage, rage against the cold; it can be just too much to take” – really? Great Bear Lake? really?). Yet somehow, they’ve gotten Watson to the point where they are willing to challenge the top Jeopardy! champions of all time.
So you might be asking, “why are they doing this?” From a pure analytics perspective, this is something of a holy grail… Something that is capable of parsing through human language and understanding meaning and intent can change the way we think about analytics. Of course, there are a lot of technologies that do this, but typically they are just looking for word associations within a specific subject area. Being able to do this across the range of subjects required for Watson opens up whole new possibilities for extracting value out of oceans of information.
But beyond even the analytics, the implications of a natural language question answering machine are incredible. This is the technology of HAL, and the Star Trek bridge computer, and (gulp) the Matrix, and just about every other Science Fiction movie ever. Within two decades, I predict we’ll have something like this in our offices and houses, answering questions as we ask them, performing tasks on voice command.
“Hey Watson, could ya start up a pot of coffee for me? And while you’re at it, what was the name of that pretty dark-haired woman who used to be married to that bald action actor – you know, the guy that was in those ‘Die Hard’ movies?”
“Coffee is brewing. Demi Moore was married to Bruce Willis from 1987 to 2000.”
“Thanks… hey Watson… you wouldn’t… like… imprison mankind as a source of energy for a world dominated by machines, would ya?”
“I’m sorry, I don’t understand your question.”
“That’s good, Watson… that’s good. Thanks.”
Ok, just a wee bit terrifying, but also extremely cool…
Since my blog is called Information Explosion, I figured I should start out with a discussion on big data. I started to subtitle my blog “The Story of Big Data,” but then I remembered Stephen O’Grady’s words of wisdom on the topic and decided to change that… That said, I do think that “big data” is as good a descriptor for the problem as anything, so I plan to use it without shame.
I was a bit surprised to find a Wikipedia definition of “big data”:
“The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools.”
Not surprisingly, the definition pulls directly from Tom White’s book on Hadoop. It also correctly raises the same point that Stephen makes, that what constitutes “big” will vary from organization to organization. In fact, I think that “big” is really characterized not just by sheer volume, but also by how complex the types of data are (lots of unstructured data), how distributed it is, and whether or not it is even something you would store.
That last point is important… the more we hook up sensors and other “things” to the internet, the more data we’re creating (obviously), but a lot of that data is just noise. We don’t need to use all of it, much less store and manage it. So we use streaming analytics to extract the valuable insights from all that noise and store just those, so that we can analyze them in relation to our other data using our conventional systems. In concept this is no different from the data mining that has been done for years – it just operates on a different time and volume scale, and works against much more complex and less structured data types.
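The streaming pattern can be shown in miniature: score each reading as it arrives and persist only the interesting ones. The thresholds and sensor values below are invented, and InfoSphere Streams itself works very differently; this just shows the filter-at-the-source idea:

```python
# A minimal sketch of streaming analytics: evaluate each reading as it
# arrives and keep only the ones worth storing. Values are illustrative.

def stream_filter(readings, baseline=20.0, tolerance=5.0):
    """Yield only readings that deviate meaningfully from the baseline."""
    for timestamp, value in readings:
        if abs(value - baseline) > tolerance:
            yield timestamp, value   # a nugget worth landing downstream

# Five sensor readings; most are noise hovering around the baseline.
sensor_feed = [(1, 20.1), (2, 19.8), (3, 31.5), (4, 20.3), (5, 12.0)]

stored = list(stream_filter(sensor_feed))
print(stored)   # [(3, 31.5), (5, 12.0)]
```

Only two of the five readings ever get stored; the rest are discarded in flight, which is the whole point of analyzing the stream rather than warehousing it.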
For the data that does land someplace, more and more of it is not structured in a way that makes it easy to parse through, analyze, and query using traditional technologies. It is also becoming increasingly distributed due to the methods used to collect and manage it. This kind of data has been the bread and butter of search technology for a long time, so it’s not surprising that Google inspired one of the key technologies for running analytics against this data to find the valuable nuggets of insight within it – Apache Hadoop. I believe Hadoop holds all kinds of promise for how companies will manage this new class of big data. Analytics is just the tip of the iceberg.
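The canonical example of the Google-inspired approach is a word count over unstructured text, expressed as separate map and reduce steps. Plain Python stands in here for what Hadoop distributes across a cluster; the pattern, not the scale, is the point:

```python
# The classic Hadoop-style example in miniature: counting words in
# unstructured text via explicit map and reduce phases.
from collections import Counter

documents = [
    "big data is big",
    "hadoop handles big data",
]

# Map phase: every document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce phase: sum the counts for each word.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(counts["big"])    # 3
print(counts["data"])   # 2
```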
So as you see, the patterns used against big data are the same as the ones we’ve always used with our nice, controlled database environments – just working against a much more chaotic set of data.