Thursday, June 16, 2016

Performance implications of using an RDBMS as a key-value store

The rise of key-value "NoSQL" data stores over the past decade has been a natural outgrowth of the tendency of web-based applications to sidestep the object-relational impedance mismatch by storing complex objects inside blob fields in traditional relational databases. These values sometimes even contained nested structures, much to the horror of old-school DBAs used to strict adherence to third normal form [1]. Once this pattern set in, it was only a matter of time before people began to realize that you don't really need the whole RDBMS if you're going to use such a limited subset of its capabilities.

During the growth days of the LAMP stack in the early 2000s, you had to use the tools available. A pattern I saw frequently in many applications was the use of MySQL or PostgreSQL as, effectively, a key-value store: a single monster table (or a small number of them) storing serialized objects in a format such as JSON.
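To make the pattern concrete, here is a minimal sketch in Go of what this often looked like from the application side. The table, column names and connection string here are hypothetical stand-ins; the point is that the object is serialized before the database ever sees it:

    package main

    import (
        "database/sql"
        "encoding/json"
        "log"

        _ "github.com/go-sql-driver/mysql" // driver registers itself via side effect
    )

    func main() {
        db, err := sql.Open("mysql", "app:secret@tcp(127.0.0.1:3306)/app")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // The "monster table": a primary key plus an opaque blob of state.
        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS objects (
            obj_key   VARCHAR(64) PRIMARY KEY,
            obj_value BLOB
        )`); err != nil {
            log.Fatal(err)
        }

        // The application serializes a complex, possibly nested object...
        user := map[string]interface{}{
            "name":   "Ada",
            "visits": 17,
            "prefs":  map[string]bool{"newsletter": true},
        }
        blob, err := json.Marshal(user)
        if err != nil {
            log.Fatal(err)
        }

        // ...and the RDBMS is used as a glorified key-value store.
        if _, err := db.Exec(
            `REPLACE INTO objects (obj_key, obj_value) VALUES (?, ?)`,
            "user:42", blob,
        ); err != nil {
            log.Fatal(err)
        }
    }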

Surprisingly, this pattern persists even in 2016, despite the availability of mature, specialized, open-source datastores [2] and even add-ons (maybe better called remove-ons?) for MySQL and PostgreSQL [3] that allow a more streamlined interface.

One thing that is easy to forget is that stable and complex RDBMS systems like MySQL are multi-tiered applications in and of themselves, and these tiers can be decoupled. Systems like MoSQL and Vertica essentially took the front ends of MySQL and Postgres and connected them to entirely different backend storage systems. The memcached interface to InnoDB does the inverse: it allows developers who have no need for a full-blown SQL front end to make use of the stable, high-performance InnoDB storage engine that is the most commonly used backend for MySQL.
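Because the plugin speaks the standard memcached protocol (on port 11211 by default, with the backing table configured server-side), any off-the-shelf memcached client works against it. A sketch in Go using the bradfitz/gomemcache client; the key, value and address are assumptions for illustration:

    package main

    import (
        "fmt"
        "log"

        "github.com/bradfitz/gomemcache/memcache"
    )

    func main() {
        // Which InnoDB table backs these calls is configured on the
        // server side (in the innodb_memcache.containers table).
        mc := memcache.New("127.0.0.1:11211")

        // Set and get operate on InnoDB rows while bypassing the SQL layer.
        if err := mc.Set(&memcache.Item{Key: "user:42", Value: []byte(`{"name":"Ada"}`)}); err != nil {
            log.Fatal(err)
        }
        it, err := mc.Get("user:42")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s\n", it.Value)
    }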

My previous research experience in this area was in the construction of MoSQL, a storage engine for MySQL that we connected to a transactional, in-memory key-value store, along the lines of the deprecated BerkeleyDB storage engine for MySQL. One thing we became aware of very quickly in this process was the considerable performance penalty for running simple key-value workloads through MySQL compared to hitting the backend directly through its key-value interface - though our storage engine code was probably not sufficiently optimized.

Comparing vanilla PostgreSQL, MySQL, memcached and memcached+InnoDB


After reading the article discussing Wix's use of MySQL, I became curious about just how expensive this whole SQL layer really is, and how much we are "paying" for it relative to a streamlined interface.

There are a number of sites exploring the performance of the memcached interface to InnoDB, but I wanted to dig a bit deeper and understand the performance relative to a baseline - vanilla memcached itself - and to compare against SQL-based PostgreSQL for good measure. In a follow-up post, I hope to include some comparisons to hstore and JSONB for Postgres.

In other words, I will intentionally compare apples to oranges and pineapples. You've been warned. The reason is that if you are going to "misuse" an apple as an orange, why not try to understand what effect this has?


The Hypothesis


My hypothesis going in was that, for key-value-like workloads, MySQL and PostgreSQL would perform comparably, and that, given its simplicity, memcached would be significantly faster - it's not an RDBMS, after all. What was less clear to me was where memcached+InnoDB would fall in between.


The tool and tests


As I've been looking for excuses to get better with both Go and Docker, I decided to write a simple sysbench-inspired benchmarking tool that supports the above systems, which you can find on GitHub. Why Go? From my perspective, it's like a happy marriage of C and Python, keeping much of the benefit of both and discarding most of what is tedious, dangerous or downright annoying.

Docker-machine is an amazing tool and an underappreciated gem for the benchmarker. With a few commands I can deploy a colocated benchmarking run (my framework plus the system I want to test) on varying types of hardware and on several different platforms. I settled on a KVM-backed VM on my own laptop and a t2.medium instance on Amazon as the two testing grounds.

The workloads I tested are random PK-based INSERT, UPDATE, SELECT and DELETE operations against a relatively small table (160k rows), and the equivalent operations against memcached (both vanilla memcached and the memcached interface for InnoDB). I tried to get at least three runs of each test, and ran tests with 1-16 clients [5]. I sized the table so that the B-trees would be of non-trivial height in the RDBMSs while still fitting comfortably in memory, since what I'm ultimately most interested in understanding is the cost of the SQL layer.
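The actual tool is on GitHub, but a stripped-down sketch of the core loop may help illustrate the methodology. This is not the tool itself; the kv table, its columns and the connection string are stand-ins I've made up for the sketch:

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "math/rand"
        "sync"
        "sync/atomic"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    const (
        tableRows = 160000 // matches the table size used in the tests
        clients   = 8      // varied from 1 to 16 in the actual runs
        duration  = 30 * time.Second
    )

    func main() {
        db, err := sql.Open("mysql", "bench:secret@tcp(127.0.0.1:3306)/bench")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var ops int64
        var wg sync.WaitGroup
        deadline := time.Now().Add(duration)

        for i := 0; i < clients; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                var v []byte
                for time.Now().Before(deadline) {
                    // Random PK-based SELECT; the INSERT, UPDATE and
                    // DELETE workloads follow the same shape.
                    id := rand.Intn(tableRows) + 1
                    err := db.QueryRow(`SELECT v FROM kv WHERE id = ?`, id).Scan(&v)
                    if err != nil && err != sql.ErrNoRows {
                        log.Fatal(err)
                    }
                    atomic.AddInt64(&ops, 1)
                }
            }()
        }
        wg.Wait()
        fmt.Printf("%d clients: %.0f ops/sec\n", clients, float64(ops)/duration.Seconds())
    }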


Benchmark Results


Select


Probably the most interesting result, in terms of illustrating the cost of operating a SQL front end to your data, is random select performance. The test is about as easy as it gets from the perspective of an RDBMS: the data is entirely cached in memory, and there are no expensive operations such as B-tree modifications or disk I/O.

As would be expected, performance for MySQL and PgSQL is comparable. 

Where we see the cost of the SQL front end is in the difference between MySQL and memcached+InnoDB - about a 2x difference for the same test. Also note the difference between memcached+InnoDB and vanilla memcached: when we "unshackle" InnoDB from all the extra work an RDBMS has to do compared to a simple in-memory k-v system, it runs at nearly the same speed as vanilla memcached. This is quite impressive.

I should note that this difference could be even more pronounced were it not for the cost of running the benchmark itself.

[Figure: random SELECT benchmark results, 1-16 clients]

Insert 


The insert performance results highlight one of the dangers of using a BTree-based RDBMS table for a key-value workload, and of evaluating the taste of an orange after you have bitten into an heirloom apple [4].

Our baseline, memcached alone, has no need to maintain a BTree or worry about writing anything to disk, and so writes are not too different from selects above.

All operations in the benchmark are executed as individual transactions: each statement implies its own begin/commit and a sync to disk. This is an apples-to-oranges comparison, but it is exactly why it can be unfair to dismiss "legacy" RDBMS systems such as MySQL or PgSQL as "slow". Given enough parallel clients, and some tuning, we could bring the RDBMS results a bit closer to the memcached result. But if you naively use an RDBMS as a k-v store, this is the sort of performance you can expect.
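One such tuning knob, as a sketch: group writes into an explicit transaction so the commit (and the sync to disk) is amortized over many statements rather than paid per operation. The kv table and its columns are again hypothetical:

    package bench

    import "database/sql"

    // insertBatch writes all rows inside a single transaction, so there is
    // one begin/commit (and one sync to disk) for the whole batch instead
    // of one per statement.
    func insertBatch(db *sql.DB, rows map[int][]byte) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        stmt, err := tx.Prepare(`INSERT INTO kv (id, v) VALUES (?, ?)`)
        if err != nil {
            tx.Rollback()
            return err
        }
        defer stmt.Close()

        for id, v := range rows {
            if _, err := stmt.Exec(id, v); err != nil {
                tx.Rollback()
                return err
            }
        }
        return tx.Commit()
    }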

[Figure: random INSERT benchmark results, 1-16 clients]

Delete 


Delete performance has the interesting distinction of being the fastest of the operations for vanilla memcached. My guess is that not even any memory copying is necessary: a flag is set for the key, and the item is garbage collected at some point - though I am not familiar with the details of the memcached implementation.

Performance for the other three is comparable to the results for insert, given the need to maintain and periodically rebalance the BTree.

[Figure: random DELETE benchmark results, 1-16 clients]

Update


Updates also follow a similar pattern.

[Figure: random UPDATE benchmark results, 1-16 clients]

AWS Performance


The results of the tests against a t2.medium instance on EC2 were largely consistent with the results I got on my laptop; however, the lower I/O performance of a standard EC2 instance, along with 2 CPU cores instead of the one available on my KVM instance, makes the differences observed above even more pronounced. Presented as one big facet_grid:

[Figure: facet grid of all benchmark results on the EC2 t2.medium instance]

Concluding Thoughts


As always, whenever possible, use the right tool for the job. If you have good reasons for using one tool for a number of purposes, as RDBMS systems historically were made to do, expect to have to make some effort before it performs as well as a tool built specifically for that purpose.

Providing a SQL-based interface to your data can reduce performance by 2x or more, and maintaining a B+Tree is expensive. But as the memcached+InnoDB select results show, when unshackled and compared to NoSQL systems in a more apples-to-apples fashion, legacy RDBMS systems run quite fast.

[1] At least those old-school DBAs that were allowed anywhere near these systems by the cool kids of the day

[2] Too numerous to name: Redis, Cassandra, MongoDB, etc.

[3] The InnoDB memcached plugin for MySQL; hstore and JSONB for PostgreSQL

[4] At some point I'd like to expand these tests to include the HASH index type for MySQL, but I wanted to use a more "out of the box" setup for MySQL

[5] This isn't many connections, and many of these systems scale right up to 256+ connections, but on a 1- or 2-core system the extra concurrency doesn't buy much; the declining improvement is already visible at 16 clients for most of the tests

Tuesday, May 31, 2016

The era of Big Data has quietly come to an end

With relatively little fanfare or notice, lost in the furore over FBI vs. Apple earlier this year, was a significant announcement: the end-to-end encryption of all messages on the WhatsApp network as of April 7, 2016. WhatsApp has been embroiled in a battle with the Brazilian government, which moved to block access to the network in late 2015 for failing to provide wiretaps on certain accounts; the service was blocked again earlier this month. The previously insatiable demand for data, and the assumption that astronomical growth rates would continue in perpetuity, have now come to a screeching halt. And with that, the era of Big Data as we have come to know it is coming to an end.

Welcome to the era of Secure Data


While the decision to encrypt messages on the WhatsApp network was presented to the public as a privacy-protecting measure, and it is undoubtedly something privacy advocates should welcome, this was clearly not the primary motivation. To encrypt the content of messages end-to-end is effectively to destroy it, when viewed from the "store everything" perspective that had been common in previous years. Start-ups are now actively seeking ways to rid themselves of data that may become a liability at some point in the future. Some have even gone so far as to call data a toxic asset.
The tech industry is laying its Big Data goose to rest rather than share its golden eggs


The $19 billion price tag Facebook paid for WhatsApp was met with concern by privacy advocates at the time. Despite an impassioned plea by the founders after the acquisition, it was not difficult to imagine a softening of this stance at some point in the future. How else to justify such a price tag? Whether the decision to implement end-to-end encryption was a reflection of core values, or a pragmatic decision to kill the goose rather than share its golden eggs, is something we can only speculate about.

Indeed, many tech startups over the past decade have had, either explicitly or implicitly, an assumption that a part of their future value would be in the vast trove of data they would collect about their users. The assumption was that they would be able to keep this data for themselves. Unfortunately, both cyber-criminals and governments have proven to be far more motivated to share in the spoils than was previously anticipated.

Of course we will continue to store more data than ever, but growth rates will slow as organizations begin to assess the risks of storing ever more granular data about their users: government requests on the one hand, and data breaches by cyber-criminals on the other.

Adding a V to the Big Four


The four V's of Big Data - volume, velocity, variety and veracity - should really include value as a fifth "V". We should ask how valuable (and how risky) each individual piece of information we collect is. At OpenStack Tokyo last year, I presented this graph illustrating my view of the relationship between value and volume, the tendencies of Big Data in a typical organization, and how we would eventually hit a point of absurdity in what we choose to store. I could not have imagined, however, that external factors would force this shift so soon:

The exponentially diminishing value of data we can store and update on a daily basis

The four points on the graph above illustrate four different types of data that might be collected by an organization, and the relative value or importance of each. For a B2B type of business, the rows in your primary customer table may number as few as 10,000, stored in a robust RDBMS. Ideally the data is encrypted at rest, and is replicated and backed up in a number of locations for disaster-recovery purposes.

An entry in your visitor summary table is an aggregate of some other raw data that may be coming from your big data systems. Having been processed, it is certainly more valuable to a manager viewing a dashboard than any individual log entry.

Less is more


The electron-spin point at the right of the graph is an example that I hope is obviously ridiculous. More interestingly, it is no longer a given that the raw log entry in the graph above is something that should be stored for any period of time. ETL and stream-processing systems exist that allow such data to be anonymized, aggregated and cleansed, making it more useful, less of a liability and less costly to store. If the data must reside on a disk somewhere, it is better stored in aggregate form, unless there is clear value to be derived from keeping the raw records. Given the current regulatory and security environment we find ourselves in, less data is more.