Since my blog is called Information Explosion, I figured I should start out with a discussion on big data. I started to subtitle my blog “The Story of Big Data,” but then I remembered Stephen O’Grady’s words of wisdom on the topic and decided to change that… That said, I do think that “big data” is as good a descriptor for the problem as anything, so I plan to use it without shame.
I was a bit surprised to find a Wikipedia definition of “big data”:
“The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools.”
Not surprisingly, the definition pulls directly from Tom White’s book on Hadoop. It also correctly raises the same point that Stephen makes, that what constitutes “big” will vary from organization to organization. In fact, I think that “big” is really characterized not just by sheer volume, but also by how complex the types of data are (lots of unstructured data), how distributed it is, and whether or not it is even something you would store.
That last point is important… the more we hook up sensors and other “things” to the internet, the more data we’re creating (obviously), but a lot of that data is just noise. We don’t need to use all of it, much less store and manage it. Instead, we use streaming analytics to extract the valuable insights from all that noise and store only those, so that we can analyze and use them alongside our other data in our conventional systems. This is no different in concept from the data mining that has been done for years, just on a different time and volume scale (and working against much more complex and less structured data types).
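To make the streaming idea a bit more concrete, here is a minimal Python sketch of filtering noise out of a sensor feed before anything is stored. Everything in it is an assumption for illustration: the function name `extract_insights`, the rolling-window approach, and the threshold values are mine, not taken from any particular streaming product. A real deployment would use far more sophisticated rules than "this reading is unusual compared with the last few."

```python
from collections import deque
from statistics import mean, pstdev


def extract_insights(readings, window=5, threshold=3.0):
    """Yield only the readings that deviate sharply from the recent past.

    Hypothetical illustration: `window` and `threshold` are made-up
    parameters, not settings from any real streaming analytics tool.
    """
    recent = deque(maxlen=window)  # only a small rolling window is kept in memory
    for value in readings:
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Pass along a reading only when it stands out from the window.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield value
        recent.append(value)


# A made-up temperature feed: mostly steady readings, with one spike.
feed = [20.0, 20.1, 19.9, 20.0, 20.1, 35.0, 20.0, 19.9]
to_store = list(extract_insights(feed))
print(to_store)
```

The point of the sketch is that the raw feed is never stored. Only the handful of readings that pass the filter would be handed to a conventional database for later analysis.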
For the data that does land someplace, more and more of it is not structured in a way that makes it easy to parse, analyze, and query using traditional technologies. It is also becoming increasingly distributed due to the methods used to collect and manage it. This kind of data has been the bread and butter of search technology for a long time, so it’s not surprising that Google’s published work on MapReduce and its distributed file system inspired one of the key technologies for running analytics against this data to find the valuable nuggets of insight within it – Apache Hadoop. I believe Hadoop holds all kinds of promise for how companies will manage this new class of big data. Analytics is just the tip of the iceberg.
So as you see, the patterns used against big data are the same as the ones we’ve always used with our nice, controlled database environments – just working against a much more chaotic set of data.