A few days ago I read a thought-provoking, contrarian perspective on Big Data: Haunted by Data, by Maciej Cegłowski. Maciej argues that data is like radioactive waste, in that it's extremely persistent and dangerous if leaked. He draws parallels between the hype and promises of big data today and the hype and promises of radioactivity a hundred years ago, when people sold products like radium cigarettes and radioactive underwear.
Personally, I think this analogy is extreme. A more apt metaphor is that Big Data is the industrial byproduct of the 21st century. Like the sludge that spewed from factories in the 20th century, vast quantities of data are produced by almost every commercial activity today. Some of this data is valuable, but the overwhelming majority is worthless and potentially dangerous. And we are only beginning to appreciate the risks of these storehouses of data.
Unlike the physical kind of toxic goo, data is cheap to store and easy to destroy (as long as it remains contained). So there's a strong temptation to hold on to all data just in case some value is discovered in the future, but in many cases the responsible thing to do is get rid of it.
The problem with having all this data lying around is that, while any single piece of information may be fairly innocuous, we're finding out more and more often that it's possible to piece together lots of bits of data to learn remarkably personal things. Anyone who knows your recent purchases can figure out not just your hobbies and interests, but also your medical condition, including whether you or your partner is pregnant and whether you suffer from a particular illness. Anyone who has your list of Facebook friends can also infer your sexual orientation and marital status, and can probably figure out how faithful you are.
And let's not even think about what someone can figure out from your search history, the websites you've visited over the years, or the GPS tracking of your phone.
Fortunately there is a middle ground that lets companies find the value in their customer data and dramatically mitigate the risk of uncontrolled leakage: statistical sampling. We use it all the time in customer feedback, since it's usually not practical to try to survey every single customer.
A surprisingly small random sample is enough to get a result remarkably close to what you would get from looking at all the data. Sampling 10,000 customers out of a population of a hundred million (looking at only 0.01% of the data) will almost always get within 1% of the result of looking at all the data: for a simple measure like the share of satisfied customers, the estimate lands within about one percentage point of the true figure roughly 95% of the time. That means you can throw out 99.99% of the data and not get a meaningfully different analysis.
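To make the arithmetic concrete, here is a minimal Python sketch of the standard margin-of-error calculation for a proportion, plus a small simulation. It assumes a simple random sample, a yes/no measure such as "satisfied or not", and a 95% confidence level (z = 1.96). The function names and the simulated 50% satisfaction rate are made up for illustration, and the simulation approximates a very large population with independent draws.

```python
import math
import random


def margin_of_error(sample_size, population_size=None, proportion=0.5, z=1.96):
    """Approximate margin of error for an estimated proportion.

    proportion=0.5 is the worst case (widest margin); z=1.96 corresponds
    to a 95% confidence level. Both defaults are illustrative assumptions.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None:
        # Finite population correction: barely matters when the sample
        # is a tiny fraction of the population.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error


def share_within_one_point(true_rate, sample_size, trials, seed=0):
    """Simulate repeated samples and report how often the sample estimate
    falls within one percentage point of the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each sampled customer is "satisfied" with probability true_rate.
        satisfied = sum(rng.random() < true_rate for _ in range(sample_size))
        if abs(satisfied / sample_size - true_rate) <= 0.01:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    moe = margin_of_error(10_000, population_size=100_000_000)
    print(f"Margin of error: {moe:.4f}")
    coverage = share_within_one_point(0.5, 10_000, trials=100)
    print(f"Share of samples within one point: {coverage:.2f}")
```

Note that the population size hardly enters the calculation: the margin depends almost entirely on the size of the sample, which is why 10,000 customers are enough whether the population is one million or a hundred million.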
Of course the details of the statistical sampling matter, and it needs to be designed to meet the requirements of the particular analysis. But the key point remains that companies which keep all data just in case it might be useful someday are holding far more than they actually need, and creating a lot of risk to themselves and their customers in the process.
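One way to start designing a sample is to work backwards from the precision the analysis needs. The sketch below uses the textbook sample-size formula for a proportion, again assuming simple random sampling, a 95% confidence level, and the worst-case proportion of 0.5. The function name is hypothetical, and real analyses (small subgroups, rare events, skewed measures) generally need more than this.

```python
import math


def required_sample_size(margin, proportion=0.5, z=1.96):
    """Smallest simple random sample that keeps the margin of error for a
    proportion at or below `margin` (for example, 0.01 for one point).

    Assumes a population much larger than the sample, so no finite
    population correction is applied.
    """
    return math.ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


if __name__ == "__main__":
    for margin in (0.03, 0.01, 0.005):
        size = required_sample_size(margin)
        print(f"margin {margin:.3f}: sample of {size}")
```

Halving the margin of error roughly quadruples the required sample, so the precision target is worth deciding before choosing how much data to keep.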
So think before you hold on to data. If it doesn't have a well-defined reason to be kept, you are probably just creating industrial waste.