When Big Data is not Big Understanding
Good article from Tim Harford (he of the enjoyable “Undercover Economist” books) in the FT last week called “Big data: are we making a big mistake?“. Tim injects some healthy realism into the Big Data hype without dismissing its importance and potential benefits. The article examines four claims often made about Big Data:
1. Data analysis often produces uncannily accurate results
2. Capturing all the data makes statistical sampling obsolete
3. Statistical correlation is all you need – no need to understand causation
4. With enough data, scientific or statistical models aren’t needed
Now models can have their own problems, but I can see where he is coming from; for instance, claims 3 and 4 above seem to be in direct contradiction. I particularly like the comment later in the article that “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning.”
I also liked the definition, from one of the academics mentioned, of a big data set as one where “N = All” – and the point that assuming you really do have “all” the data is a flawed assumption behind some Big Data analysis. Large data sets can mean that sampling error is low, but sample bias remains a potentially big problem – for example, everyone on Twitter is probably not representative of the human race in general.
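To make that last point concrete, here is a minimal Python sketch (my own illustration, not from the article, with made-up rates standing in for “the whole population” versus “people on Twitter”) of why a huge sample doesn’t rescue a biased one: more data shrinks the noise, but it shrinks it around the wrong answer.

```python
# A minimal sketch: a large biased sample is precise, and precisely wrong.
import random

random.seed(42)

# Hypothetical numbers: 30% of the whole population holds some opinion,
# but among the easily observed subgroup (say, Twitter users) it is 45%.
TRUE_RATE = 0.30
SUBGROUP_RATE = 0.45

def estimate(sample_size, rate):
    """Estimate a proportion from a simple random sample drawn at `rate`."""
    hits = sum(1 for _ in range(sample_size) if random.random() < rate)
    return hits / sample_size

# Small unbiased sample: noisy, but centred on the truth (0.30).
print("unbiased, n=1,000:   ", estimate(1_000, TRUE_RATE))

# Huge biased sample: tiny sampling error, still nowhere near 0.30.
print("biased,  n=1,000,000:", estimate(1_000_000, SUBGROUP_RATE))
```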
So I will now press save on this blog post, publicise it on Twitter and help reinforce the impression that Big Data is a hot topic… which it is, but not for everyone, I guess, is the point.
A related take in the NYTimes: http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html