The Economist is a great “newspaper”, my favorite. A couple of weeks ago they ran a special report on “The Data Deluge”, which explored the recent and rapid expansion of data and how to handle it. Two parts of the report caught my eye because they seemed contradictory. The first was an 1894 quote from Oscar Wilde:
It is a very sad thing that nowadays there is so little useless information.
To which The Economist added, “He did not know the half of it”.
The other part was this chart, which shows the amount of information created each year growing well beyond the available storage capacity:
The question I have is, if there’s “so little useless information”, why doesn’t data storage more closely track data creation? Wouldn’t one want to store and analyze all of this data if it were truly useful? It’s a pretty obvious but important question because the answers could tell us a lot about what we as a society think is valuable. So what are some potential answers?
We Don’t Value the Data We Create Enough, So We Don’t Store It
Maybe we already store all the data we deem important enough to save, and everything else is expendable. It’s not free to store data, so one has to weigh the costs against the benefits when deciding what to keep and what to discard. It’s possible that this excess data doesn’t have any value, but I doubt it.
We Know the Data We Create is Somehow Valuable, But We Don’t Know How to Extract That Value
This scenario argues for more statisticians, or better tools, to extract insights from large data sets. Most data is unstructured, and it takes specific expertise to organize and analyze it. Generally speaking, it’s big companies that have the in-house skills needed to glean real value from these data. Small- and medium-sized businesses generate plenty of data, too, but they may lack the resources and personnel required to make their data useful and actionable. When a business decides what data gets stored and what evaporates, it will spend money storing only what is required to run the business and nothing more.
We Throw Out Old Data, So Available Storage Capacity Lags New Data Creation
Perhaps older data is perceived as less valuable (rightly or wrongly) and is discarded as it expires. This “hole” in the bottom of the proverbial cup would account for the flatter growth in available storage vs. information created.
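To make the “leaky cup” concrete, here’s a toy sketch in Python. The 60% annual growth rate and three-year retention window are made-up numbers chosen purely to show the shape of the effect, not figures from the report:

```python
# Toy "leaky cup" model: new data created each year grows 60% annually,
# but anything older than RETENTION years is discarded.
GROWTH = 1.60      # assumed annual growth factor for new data created
RETENTION = 3      # assumed number of years before data is thrown out
YEARS = 10

created_per_year = [100 * GROWTH**y for y in range(YEARS)]  # arbitrary units

cumulative_created = 0.0
for year in range(YEARS):
    cumulative_created += created_per_year[year]
    # Stored = only the data created within the last RETENTION years
    stored = sum(created_per_year[max(0, year - RETENTION + 1):year + 1])
    print(f"Year {year}: created so far {cumulative_created:9.0f}, "
          f"still stored {stored:9.0f}")
```

Because stored data is capped at a rolling window while creation compounds, the two curves diverge, which is exactly the widening gap the chart shows.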
We Can’t Build Storage Capacity Fast Enough to Keep Up with the Data
This would be a great problem for companies such as EMC to have, but it’s just not the case; it keeps getting cheaper to store more data. Kenneth Cukier points out that companies such as Wal-Mart store more than 2.5 petabytes of data (the equivalent of 167 times the books in the Library of Congress), fed by more than 1 million customer transactions per hour. Facebook stores over 40 billion photos. It’s a safe bet that Facebook’s “available storage” curve closely hugs its “information created” curve, because Facebook obviously sees economic value in storing its users’ data. In fact, its storage curve is probably above its creation curve, since Facebook likely keeps at least two mirrors of each piece of data.
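A quick back-of-the-envelope calculation shows why mirroring would push the storage curve above the creation curve. The 40 billion photo count comes from the report, but the average photo size and replication factor below are my assumptions:

```python
# Back-of-the-envelope: raw photo data vs. storage actually provisioned
# once redundancy is accounted for. Sizes and copy counts are assumptions.
PHOTOS = 40e9              # ~40 billion photos (figure from the report)
AVG_PHOTO_MB = 1.0         # assumed average size per photo, in megabytes
REPLICATION = 3            # assumed copies kept (original plus two mirrors)

raw_pb = PHOTOS * AVG_PHOTO_MB / 1e9          # MB -> PB (10**9 MB per PB)
provisioned_pb = raw_pb * REPLICATION

print(f"Information created: ~{raw_pb:,.0f} PB")
print(f"Storage provisioned: ~{provisioned_pb:,.0f} PB")
```

Under those assumptions, roughly 40 petabytes of photos would require on the order of 120 petabytes of provisioned storage, so a company that truly values its data ends up with more storage than data, not less.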
There’s no doubt that data is becoming a more valuable commodity, even to businesses that have traditionally been less data-intensive than the Facebooks of the world. The bottom line is that it’s relatively expensive to store data (vs. discard it), so we need to have a good reason to store it. Perhaps the solution is to create better tools that can make data more useful for people who lack interest or training in statistics and data mining. This may be another aspect of The Facebook Imperative that Marc Benioff recently wrote about. Companies such as Oracle, SAS, SAP, Salesforce, and Tibco already offer software tools to help make data more useful, so there’s got to be something else pulling down the growth in data storage. Maybe there’s just a lack of will to implement and use these tools? What do you think?