Tuesday, May 31, 2016

The era of Big Data has quietly come to an end

With relatively little fanfare or notice, lost in the furore over FBI vs. Apple earlier this year, came a significant announcement: as of April 7, 2016, all messages on the WhatsApp network are end-to-end encrypted. WhatsApp has been embroiled in a battle with the Brazilian government, which moved to block access to the network in late 2015 after the company failed to provide wiretaps on certain accounts; the service was blocked again earlier this month. The previously insatiable demand for data, and the assumption that astronomical growth rates would continue in perpetuity, have come to a screeching halt. And with that, the era of Big Data as we have come to know it is coming to an end.

Welcome to the era of Secure Data


While the decision to encrypt messages on the WhatsApp network was presented to the public as a privacy-protecting measure, and it is undoubtedly something privacy advocates should welcome, this was clearly not the primary motivation. Viewed from the "store everything" perspective that was common in previous years, end-to-end encrypting the content of messages effectively destroys it: the operator can no longer read, mine or monetize what passes through its servers. Start-ups are now actively seeking ways to rid themselves of data that may become a liability at some point in the future. Some have even gone so far as to call data a toxic asset.
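To make that "effectively destroys it" point concrete, here is a minimal Python sketch using the cryptography package's Fernet construction as a simple symmetric stand-in for the Signal protocol WhatsApp actually deployed. The mechanics differ, but the consequence for the operator is the same: the keys live only on users' devices, so the ciphertext the service stores is worthless for analytics.

```python
# Minimal sketch of why end-to-end encryption "destroys" data from the
# operator's point of view. Fernet (symmetric) is a stand-in here for
# the Signal protocol WhatsApp actually uses; the key point is the same.
from cryptography.fernet import Fernet

# The key exists only on the users' devices, never on the server.
device_key = Fernet.generate_key()
cipher = Fernet(device_key)

ciphertext = cipher.encrypt(b"meet at noon")

# This is all the service (or a wiretap, or a breach) ever sees:
print(ciphertext)  # opaque bytes, useless for mining or wiretaps

# Only a device holding the key can recover the plaintext.
print(cipher.decrypt(ciphertext))  # b'meet at noon'
```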
The tech industry is laying its Big Data goose to rest rather than share its golden eggs


The $19 billion price tag Facebook paid for WhatsApp was met with concern by privacy advocates at the time. Despite an impassioned plea by the founders after the acquisition, it was not difficult to imagine a softening of that stance at some point in the future. How else to justify such a price tag? Whether the decision to implement end-to-end encryption was a reflection of core values, or a pragmatic decision to kill the goose rather than share its golden eggs, is something we can only speculate about.

Indeed, many tech startups over the past decade have operated, either explicitly or implicitly, on the assumption that part of their future value would lie in the vast trove of data they would collect about their users, and that they would be able to keep this data to themselves. Unfortunately, both cyber-criminals and governments have proven far more motivated to share in the spoils than was previously anticipated.

Of course we will continue to store more data than ever, but growth rates will slow as organizations begin to weigh the value of ever more granular data about their users against the risks that come with holding it: government requests on one side, and data breaches by cyber-criminals on the other.

Adding a V to the Big Four


The four V's of Big Data - volume, velocity, variety and veracity - should really include a fifth "V": value. We should ask how valuable (and how risky) each individual piece of information we collect actually is. At OpenStack Tokyo last year, I presented the graph below, which illustrates my view of the relationship between value and volume, the tendencies of Big Data in a typical organization, and how we would eventually hit a point of absurdity in what we choose to store. I could not have imagined, however, that external factors would force this shift so soon:

The exponentially diminishing value of data we can store and update on a daily basis

The four points on the graph above illustrate four different types of data an organization might collect, and the relative value or importance of each. For a B2B business, your primary customer table may contain as few as 10,000 rows, stored in a robust RDBMS. Ideally the data is encrypted at rest, and is replicated and backed up in a number of locations for disaster recovery purposes.
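As a hypothetical illustration of that first, high-value point on the graph, here is a sketch of such a customer table using Python's built-in sqlite3 as a stand-in for a production RDBMS; the schema and values are invented, and encryption at rest and multi-site replication would be handled by the database and its deployment rather than by application code.

```python
# Hypothetical sketch of the small, high-value customer table described
# above. sqlite3 stands in for a production RDBMS; encryption at rest
# and replication are deployment concerns, not shown here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE,
        plan  TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO customers (name, email, plan) VALUES (?, ?, ?)",
    ("Acme Corp", "ops@acme.example", "enterprise"),
)
conn.commit()
```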

An entry in your visitor summary table is an aggregate of raw data flowing in from your big data systems. Having been processed, it is certainly more valuable to a manager viewing a dashboard than any individual log entry.
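A hedged sketch of what that roll-up might look like, with invented field names: raw request log entries are collapsed into one summary row per day and page, which is the shape of data a dashboard actually needs.

```python
# Sketch: rolling raw request logs up into the kind of visitor summary
# row a dashboard would read. Fields and values are illustrative.
from collections import Counter

raw_log = [
    {"date": "2016-05-31", "ip": "203.0.113.7",   "path": "/pricing"},
    {"date": "2016-05-31", "ip": "203.0.113.7",   "path": "/signup"},
    {"date": "2016-05-31", "ip": "198.51.100.23", "path": "/pricing"},
]

# One summary row per (date, path): far fewer rows, far more meaning.
visitor_summary = Counter((e["date"], e["path"]) for e in raw_log)

for (date, path), hits in sorted(visitor_summary.items()):
    print(date, path, hits)
# 2016-05-31 /pricing 2
# 2016-05-31 /signup 1
```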

Less is more


The electron-spin point at the right of the graph is an example that I hope is obviously ridiculous. More interestingly, it is no longer a given that the raw log entry in the graph above should be stored for any period of time at all. ETL and stream-processing systems exist that can anonymize, aggregate and cleanse such data, making it more useful, less of a liability and less costly to store. If the data must reside on a disk somewhere, it is better stored in aggregate form unless there is clear value in keeping the raw records. Given the current regulatory and security environment we find ourselves in, less data is more.
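As one illustration of that anonymize-and-aggregate approach, here is a minimal Python sketch of a stream-style ingest step. The field names and the salted-hash pseudonymization scheme are assumptions for the example, not a prescription; the point is that only the aggregate ever touches disk, while the raw entry is discarded.

```python
# Hedged sketch: pseudonymize on ingest, fold into an aggregate,
# discard the raw entry. Fields and scheme are illustrative.
import hashlib
import os
from collections import Counter

SALT = os.urandom(16)  # held in memory only; rotating it breaks linkability

def pseudonymize(ip: str) -> str:
    # Replace the raw IP with a salted hash so the identifier itself
    # never lands on disk.
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:12]

visits_per_day = Counter()

def ingest(entry: dict) -> None:
    # Fold one raw log entry into the aggregate, then let it go.
    visits_per_day[(entry["date"], pseudonymize(entry["ip"]))] += 1

for raw in [{"date": "2016-05-31", "ip": "203.0.113.7"},
            {"date": "2016-05-31", "ip": "203.0.113.7"},
            {"date": "2016-05-31", "ip": "198.51.100.23"}]:
    ingest(raw)

# Only the aggregate survives: distinct visitors per day, no raw IPs.
print(len(visits_per_day))  # 2 distinct (date, visitor) pairs
```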