A recent article entitled “Limited role for big data seen in developing predictive models” splashes a little cold water on the idea that Big Data will magically help develop better predictive analytics tools. The headline caught my attention, as it’s become a truism that a poor algorithm with lots of data will outperform a great algorithm with not enough data. So let’s go ahead and ask: can Big Data help with prediction?
Now, I understand the author’s point. If you are performing a well-structured study and you have a deep understanding of the domain, then a smaller and carefully constructed data set will probably serve you better. Later in the article, however, Peter Amstutz, analytics strategist at advertising agency Carmichael Lynch, mentions that in many cases you’re not even sure what you’re looking for and often need to aggregate loosely structured data from disparate sources. After all, there’s a lot more unstructured data in the world, and it’s growing quickly.
I find myself favoring the dissenting view. In my job I’m often trying to answer questions like, “Will our next release ship on time given what I now know about the backlog, other projects taking away resources…,” and so on. It’s not as simple as looking at a burndown chart to track progress. In my head I’m meshing all kinds of data points – chatter on the engineering forums, vacation schedules, QA panic boards, et cetera. I sometimes get a ‘pit of my stomach’ feeling that the schedule is slipping, but when I try to actually quantify what I’m seeing, it’s difficult. There are so many sources of data to correlate, and none of them reports on a consistent schedule.
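To make the "quantify it" problem a little more concrete, here is a minimal sketch of one way to line up signals that arrive on different days. It is only an illustration: the source names, dates, and values are invented, and summing per ISO week is just one assumed way of putting everything on a shared timeline.

```python
from collections import defaultdict
from datetime import date

# Hypothetical readings: (source, day reported, value). Every name and number
# here is made up for illustration; real feeds would come from wherever each
# signal actually lives.
readings = [
    ("forum_posts", date(2024, 3, 4), 12),
    ("forum_posts", date(2024, 3, 6), 9),
    ("qa_open_blockers", date(2024, 3, 5), 3),
    ("forum_posts", date(2024, 3, 12), 20),
    ("vacation_days_booked", date(2024, 3, 13), 2),
]


def weekly_totals(rows):
    """Sum each source's values per ISO (year, week) so that sources
    reporting on different days share one timeline."""
    table = defaultdict(dict)
    for source, day, value in rows:
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        table[week][source] = table[week].get(source, 0) + value
    return dict(table)


totals = weekly_totals(readings)
for week in sorted(totals):
    print(week, totals[week])
```

A source that did not report in a given week is simply absent from that week's entry rather than counted as zero, which keeps "no data" distinct from "nothing happened".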
Of course, if we had a data warehouse I could run some cool reports on the trends I’m seeing, but I wouldn’t try to convince the higher-ups to make that level of investment (ETL tools, data stores, a visualization front end), and I’m sure they won’t give me JDBC access to all of our databases.
On the other hand, I’ve got a small Hadoop cluster available – just a set of VMs, but sufficient for the volume of data I need to examine – and I know how to pull data using tools like Flume and Sqoop. All of a sudden I’m seeing possibilities.
This is one of the real benefits of ‘Big Data’ for predictive analytics: it can handle the variety of data I need without ETL tools, and at a fairly low cost.