Thursday, June 16, 2016

Performance implications of using an RDBMS as a key-value store

The rise of key-value "NoSQL" data stores over the past decade has been a natural evolution of the tendency of web-based applications to sidestep the object-relational impedance mismatch by storing complex objects inside blob fields in traditional relational databases. These values sometimes even contained nested structures, much to the horror of old-school DBAs accustomed to strict adherence to third normal form [1]. Once this pattern set in, it was only a matter of time before people began to realize that you don't really need the whole RDBMS if you're only going to use such a limited subset of its capabilities.

During the growth days of the LAMP stack in the early 2000s, you had to use the tools available. A pattern I saw frequently in many applications was the use of MySQL or PostgreSQL as, effectively, a key-value store, with a single monster table (or a small number of them) storing serialized objects in a format such as JSON.
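
To make the pattern concrete, here is a minimal sketch of what this usually looked like: one table with a primary key and an opaque blob of serialized JSON, accessed only by key. The table, column and struct names are hypothetical, and the Go code (using the common go-sql-driver/mysql driver) is purely illustrative.

```go
package kvstore

import (
	"database/sql"
	"encoding/json"
)

// Profile is a stand-in for whatever complex object the application serializes.
type Profile struct {
	Name  string            `json:"name"`
	Prefs map[string]string `json:"prefs"`
}

func saveProfile(db *sql.DB, id int64, p Profile) error {
	blob, err := json.Marshal(p)
	if err != nil {
		return err
	}
	// The entire object goes into a single opaque column; the RDBMS sees
	// only a key and a blob, so most of its features go unused.
	_, err = db.Exec("REPLACE INTO objects (id, doc) VALUES (?, ?)", id, blob)
	return err
}

func loadProfile(db *sql.DB, id int64) (Profile, error) {
	var blob []byte
	var p Profile
	if err := db.QueryRow("SELECT doc FROM objects WHERE id = ?", id).Scan(&blob); err != nil {
		return p, err
	}
	err := json.Unmarshal(blob, &p)
	return p, err
}
```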

Surprisingly, this pattern persists even in 2016, despite the availability of mature, specialized, open-source data stores [2] and even add-ons (maybe better remove-ons?) for MySQL and PostgreSQL [3] that allow a more streamlined interface.

One thing that is easy to forget is that stable and complex RDBMS systems like MySQL are multi-tiered applications in and of themselves, and these tiers can be decoupled. Systems like MoSQL and Vertica essentially took the front ends of MySQL and Postgres and connected them to entirely different backend storage systems. The memcached interface to InnoDB does the inverse: it allows developers who have no need for a full-blown SQL front end to make use of the stable, high-performance InnoDB storage engine that is the most commonly used backend for MySQL.

My previous research experience in this area was in the construction of MoSQL, a storage engine for MySQL that we connected to a transactional, in-memory key-value store, along the lines of the deprecated BerkeleyDB storage engine for MySQL. One thing we became aware of very quickly in this process was the considerable performance penalty for running simple key-value workloads through MySQL compared to hitting the backend directly through its key-value interface - though our storage engine code was probably not sufficiently optimized.

Comparing vanilla PostgreSQL, MySQL, memcached and memcached+InnoDB


After reading the article discussing Wix's use of MySQL, I became curious about just how expensive this whole SQL layer really is, and how much we are "paying" for it relative to a streamlined interface.

There are a number of sites exploring the performance of the InnoDB memcached plugin, but I wanted to dig a bit deeper and understand the performance relative to a baseline - vanilla memcached itself - and compare with SQL-based PostgreSQL for good measure. In a follow-up post, I hope to include some comparisons with hstore and JSONB for Postgres.

In other words, I will intentionally compare apples to oranges and pineapples. You've been warned. The reason is that if you are going to "misuse" an apple as an orange, why not try to understand what effect this has?


The Hypothesis


My hypothesis going in was that, for key-value-like workloads, MySQL and PostgreSQL would perform comparably, and that, given its simplicity, memcached would be significantly faster - it's not an RDBMS, after all. What was less clear to me was where memcached+InnoDB would fall in between.


The tool and tests


As I've been looking for excuses to get better with both Go and Docker, I decided to write a simple sysbench-inspired benchmarking tool that supports the above systems, which you can find on GitHub. Why Go? From my perspective, it's like a happy marriage of C and Python, keeping much of the benefit of both and discarding most of what is tedious, dangerous or downright annoying.
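
For the curious, the basic shape of such a tool is simple. The sketch below is not the actual code from the repository, just an illustration of the sysbench-like pattern it follows: N client goroutines hammer a single operation for a fixed duration, and we report aggregate throughput.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBenchmark drives `clients` goroutines, each calling op() in a loop
// until the deadline, and prints the aggregate operations per second.
func runBenchmark(clients int, duration time.Duration, op func() error) {
	var ops, errs int64
	var wg sync.WaitGroup
	deadline := time.Now().Add(duration)

	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				if err := op(); err != nil {
					atomic.AddInt64(&errs, 1)
					continue
				}
				atomic.AddInt64(&ops, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d clients: %.0f ops/sec (%d errors)\n",
		clients, float64(ops)/duration.Seconds(), errs)
}

func main() {
	// Dummy operation standing in for a real SELECT/GET against the target system.
	runBenchmark(4, 2*time.Second, func() error { return nil })
}
```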

Docker Machine is an amazing tool, and an underappreciated gem for the benchmarker. With a few commands I can deploy a colocated benchmarking run - my framework plus the system I want to test - on varying types of hardware and on several different platforms. I decided on a KVM-backed VM on my own laptop and a t2.medium instance on Amazon as the two testing grounds.

The workloads I tested are random PK-based INSERT, UPDATE, SELECT and DELETE operations against a relatively small table (160k rows), and the equivalent operations against memcached (both vanilla memcached and the memcached interface for InnoDB). I tried to get at least three runs of each test, and ran tests with 1-16 clients [5]. I kept the table small because I wanted it to be big enough that the B-trees were of non-trivial height in the RDBMSs, but still comfortably within memory, since what I'm ultimately most interested in understanding is the cost of the SQL layer.
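
The operations themselves look roughly like the following: a random primary key per operation against a simple kv(id, val) table, and the equivalent set/get/delete against memcached (here via the widely used bradfitz/gomemcache client). Table and column names are illustrative rather than the exact schema used by the tool.

```go
package bench

import (
	"database/sql"
	"fmt"
	"math/rand"

	"github.com/bradfitz/gomemcache/memcache"
)

const tableSize = 160000

func randomKey() int { return rand.Intn(tableSize) }

// sqlOps issues one of each PK-based operation against the RDBMS.
func sqlOps(db *sql.DB) error {
	k := randomKey()
	if _, err := db.Exec("INSERT INTO kv (id, val) VALUES (?, ?)", k+tableSize, "x"); err != nil {
		return err
	}
	if _, err := db.Exec("UPDATE kv SET val = ? WHERE id = ?", "y", k); err != nil {
		return err
	}
	var val string
	if err := db.QueryRow("SELECT val FROM kv WHERE id = ?", k).Scan(&val); err != nil {
		return err
	}
	_, err := db.Exec("DELETE FROM kv WHERE id = ?", k+tableSize)
	return err
}

// memcachedOps issues the equivalent operations against memcached
// (or the InnoDB memcached plugin, which speaks the same protocol).
func memcachedOps(mc *memcache.Client) error {
	k := fmt.Sprintf("%d", randomKey())
	if err := mc.Set(&memcache.Item{Key: k, Value: []byte("x")}); err != nil {
		return err
	}
	if _, err := mc.Get(k); err != nil {
		return err
	}
	return mc.Delete(k)
}
```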


Benchmark Results


Select


Probably the most interesting result in terms of illustrating the cost of operating a SQL front end to your data is random SELECT performance. The test is about as easy as it gets from the perspective of an RDBMS: the data is entirely cached in memory, and there are no expensive operations such as B-tree modifications or disk I/O.

As would be expected, performance for MySQL and PgSQL is comparable. 

Where we see the cost of the SQL front end is in the difference between MySQL and InnoDB-memcached - about a 2x difference for the same test. Also note the difference between InnoDB-memcached and vanilla memcached: when we "unshackle" InnoDB from the need to do all the extra things an RDBMS has to do compared to a simple in-memory k-v system, it runs at nearly the same speed as vanilla memcached. This is quite impressive.

I should note that this difference could be even more pronounced if not for the cost of running the benchmark itself.







Insert 


The insert performance results highlight one of the dangers of using a B-tree-based RDBMS table for a key-value workload - and of evaluating the taste of an orange after you have bitten into an heirloom apple [4].

Our baseline, memcached alone, has no need to maintain a B-tree or worry about writing anything to disk, so writes are not too different from the selects above.

All operations in the benchmark are done as single transactions - in other words, each operation is its own begin/commit and sync to disk. This is an example of an apples-to-oranges comparison, but it is exactly why it can be unfair to think of "legacy" RDBMS systems such as MySQL or PostgreSQL as "slow". Given enough parallel clients and some tuning, we could bring the RDBMS results a bit closer to the memcached result. But if you naively use an RDBMS as a k-v store, this is the sort of performance you can expect.
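
To make the "some tuning" point concrete, here is a hedged sketch (not what the benchmark does) of the difference between paying the commit-and-sync cost once per row and paying it once per batch. Table and column names are made up.

```go
package bench

import "database/sql"

// insertOnePerTxn: each row pays for a full commit and log sync (autocommit).
func insertOnePerTxn(db *sql.DB, rows map[int]string) error {
	for id, val := range rows {
		if _, err := db.Exec("INSERT INTO kv (id, val) VALUES (?, ?)", id, val); err != nil {
			return err
		}
	}
	return nil
}

// insertBatched: all rows share a single begin/commit, so the sync cost is
// paid once for the whole batch rather than once per row.
func insertBatched(db *sql.DB, rows map[int]string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := tx.Prepare("INSERT INTO kv (id, val) VALUES (?, ?)")
	if err != nil {
		tx.Rollback()
		return err
	}
	defer stmt.Close()
	for id, val := range rows {
		if _, err := stmt.Exec(id, val); err != nil {
			tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}
```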





Delete 


Delete performance has the interesting result of being the fastest of the operations for vanilla memcached. My guess is that not even any memory copying is necessary - the item is simply unlinked or flagged and its memory reclaimed later - though I am not familiar with the details of the memcached implementation.

Performance for the other three is comparable to the results for insert, given the need to maintain the B-tree and periodically rebalance it.




Update


Updates also follow a similar pattern.

 



AWS Performance


The results for the tests against a t2.medium instance on EC2 were largely consistent with those from my laptop; however, the lower I/O performance of a standard EC2 instance, along with having 2 CPU cores instead of the one available on my KVM instance, makes the differences observed above even more pronounced. Presented as one big facet_grid:




Concluding Thoughts


As always, use the right tool for the job whenever possible. If you have good reasons for using one tool for a number of purposes, as RDBMS systems historically were made to do, expect to put in some effort to make it perform as well as a tool built specifically for that purpose.

Providing a SQL-based interface to your data can reduce performance by 2x or more, and maintaining a B+tree is expensive. But as the InnoDB-memcached select results show, when unshackled and compared to NoSQL systems in a more apples-to-apples fashion, legacy RDBMS systems run quite fast.

[1] At least those old-school DBAs that were allowed anywhere near these systems by the cool kids of the day

[2] Too numerous to name: Redis, Cassandra, MongoDB, etc.

[3] The InnoDB memcached plugin, and hstore and JSONB for Postgres

[4] At some point I'd like to expand these tests to include the HASH index type for MySQL, but I wanted to use a more "out of the box" setup for MySQL

[5] This isn't many connections, and many of these systems scale right up to 256+ connections, but with a 1- or 2-core system the extra gains are not substantial, as can be seen from the declining improvement already at 16 clients for most of the tests

Tuesday, May 31, 2016

The era of Big Data has quietly come to an end

With relatively little fanfare or notice, lost in the furore over the FBI vs. Apple case earlier this year, was a significant announcement regarding the end-to-end encryption of all messages on the WhatsApp network as of April 7, 2016. WhatsApp has been embroiled in a battle with the Brazilian government, which moved to block access to the network in late 2015 for failing to provide wiretaps on certain accounts. The service was blocked again earlier this month. The previously insatiable demand for data, and the assumption that astronomical growth rates would continue in perpetuity, have now come to a screeching halt. And with that, the era of Big Data as we have come to know it is coming to an end.

Welcome to the era of Secure Data


While the decision to encrypt messages on the WhatsApp network was presented to the public as a privacy-protecting measure, and it is undoubtedly something privacy advocates should welcome, this was clearly not the primary motivation. End-to-end encryption of the content of messages is, in effect, the destruction of that content when looked at from the "store everything" perspective that had been common in previous years. Start-ups are now actively seeking ways to rid themselves of data that may become a liability at some point in the future. Some have even gone as far as to call data a toxic asset.
The tech industry is laying its Big Data goose to rest rather than share its golden eggs


The $19 billion price tag Facebook paid for WhatsApp was met with concern by privacy advocates at the time. Despite an impassioned plea by the founders after the acquisition, it was not difficult to imagine a softening of this stance at some point in the future. How else to justify such a price? Whether the decision to implement end-to-end encryption was a reflection of core values, or a pragmatic decision to kill the goose rather than share its golden eggs, is something we can only speculate about.

Indeed, many tech startups over the past decade have had, either explicitly or implicitly, an assumption that a part of their future value would be in the vast trove of data they would collect about their users. The assumption was that they would be able to keep this data for themselves. Unfortunately, both cyber-criminals and governments have proven to be far more motivated to share in the spoils than was previously anticipated.

Of course we will still continue to store more data than ever, but growth rates will slow as organizations begin to assess the risks associated with storing ever more granular data about their users - the risks of government requests and of data breaches by cyber criminals.

Adding a V to the Big Four


The four V's of Big Data - volume, velocity, variety and veracity - should really include value as a fifth "V". We should ask how valuable (and how risky) each individual piece of information we collect is. At OpenStack Tokyo last year, I presented a graph illustrating my view of the relationship between value and volume, the tendencies of Big Data in a typical organization, and how we would eventually hit a point of absurdity in what we choose to store. I could not have imagined that external factors would force this shift so soon, however:

The exponentially diminishing value of data we can store and update on a daily basis

The four points on the graph above illustrate four different types of data that could be collected by an organization, and the relative value or importance of each. For a B2B type of business, your primary customer table may hold as few as 10,000 rows, stored in a robust RDBMS. Ideally the data is encrypted at rest and is replicated and backed up in a number of locations for disaster recovery purposes.

An entry in your visitor summary table is an aggregate of some other raw data that may be coming from your big data systems. Having been processed, it is certainly more valuable to a manager viewing a dashboard than any individual log entry.

Less is more


The electron-spin point at the right of the graph is an example that I hope is obviously ridiculous. More interestingly, it is no longer a given that the raw log entry in the graph above should be stored for any period of time. ETL and stream-processing systems exist that allow such data to be anonymized, aggregated and cleansed, making it more useful, less of a liability and less costly to store. If the data must reside on a disk somewhere, it is better stored in aggregate form unless there is clear value to be derived from keeping the raw records. Given the current regulatory and security environment we find ourselves in, less data is more.

Friday, October 3, 2014

It's time to start thinking about databases as multi-tier applications

Relational databases have historically been thought of as single entities, black boxes to developers where SQL goes in and results come out. The dark arts of wizardly, bearded DBAs capable of maintaining this illusion even in large, clustered environments commanded large salaries, armies of budget-busting consultants and the begrudging respect of mere mortal developers.

This remains largely the case today, but the emergence of the cloud in the enterprise is starting to change the picture. Databases will remain the indivisible, difficult-to-hack part of any application stack in the cloud until we can easily think about them not as single, hostile units, but as multi-tier applications in and of themselves. Then we can use the cloud tools at our disposal, just as we do for the higher tiers of typical cloud-based applications.

Multi-tiered DBs


These tiers already exist in databases; it's just that most people other than DBAs aren't aware of them. Indeed, understanding the various tiers of a typical relational database is fundamental to knowing how to tune such systems. Take MySQL as an example:




This is an oversimplification, but MySQL already has several tiers, and it does a good job of hiding this complexity from most developers. It could be argued that MySQL owes its success largely to its ability to perform well with a minimum of tinkering, a testament to a well-architected system. There are various levels of cache, including the MySQL query cache and storage-engine-specific caches and buffers such as InnoDB's buffer pool, and then there is the interaction each storage engine has with the OS for storage and logging purposes. Typically this involves a write to disk, but there is no reason it cannot involve writing data to memory (as the Memory "storage" engine does) or out to the network (as MySQL Cluster does), thanks to MySQL's flexible storage engine interface.
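
As a small illustration of how these tiers surface in practice, each one is sized by ordinary server variables that can be inspected and tuned independently. The variable names below are the usual ones for MySQL 5.x-era servers; treat this as a sketch rather than an exhaustive list.

```go
package tiers

import "database/sql"

// showTierSettings reads a few per-tier sizing knobs from a running MySQL
// server: the SQL-layer result cache, the storage-engine page cache, and the
// redo logging tier.
func showTierSettings(db *sql.DB) (map[string]string, error) {
	vars := []string{
		"query_cache_size",        // SQL-layer result cache
		"innodb_buffer_pool_size", // storage-engine page cache
		"innodb_log_file_size",    // redo logging tier
	}
	out := make(map[string]string)
	for _, v := range vars {
		var name, value string
		// SHOW VARIABLES LIKE returns (Variable_name, Value) rows.
		if err := db.QueryRow("SHOW VARIABLES LIKE '" + v + "'").Scan(&name, &value); err != nil {
			return nil, err
		}
		out[name] = value
	}
	return out, nil
}
```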


Scaling out MySQL traditionally means creating a new instance of this multi-tiered database process on another server and replicating data out to it. This is architecturally unpleasant, introducing all sorts of pain in terms of maintenance headaches and potential inconsistencies between systems that leak up into the higher tiers of applications, depending on just how imperfectly the replication is done. For a typical scale-out of MySQL where reads are offloaded to asynchronously replicated slaves, why should an application server have to know and/or care that one pool of servers is read-only and another can take writes but should avoid reads? Systems like Galera take an important step in the right direction, but they remain monolithic, multi-tiered systems made to work together somewhat awkwardly.


What we really need is the ability to invert this picture: to scale the various tiers of the database up and down depending on our needs and, most importantly, to maintain the abstraction of a single database to the application. For highly cyclical, read-mostly workloads, it should be possible to maintain only a core set of servers for logging and storage, and to add and remove servers from the cache tier with ease to handle workload spikes.

The future of databases in the cloud


Ideally, all of this should be presented to an application as a single, strongly consistent entry point – if an application developer is given login credentials and an IP address, she should have no clue that this is any different from a single-instance database server. Indeed, it is remarkable that we are still so far from this ideal in the database world.

A system that does just that for MySQL is (shameless plug) the MoSQL storage engine, the system I helped develop as part of my research at the University of Lugano in Switzerland:

The benefit of a system like MoSQL is that the caching, storage and logging tiers have been decoupled, allowing an application architect to scale the tiers independently depending on application needs, all while providing the illusion of a single, strongly consistent database no matter which node a client connects to. Not only is this architecturally cleaner, leading to easier development, it also allows easy integration with cloud orchestration tools such as OpenStack Heat or Amazon's CloudFormation, saving potentially significant costs due to overprovisioning, a topic I discussed at length in a previous post.

(A version of this article originally appeared on LinkedIn Pulse)

Tuesday, September 9, 2014

Database Elasticity: It's more important than you think

Databases are notoriously hard to do right: to design, to administer, to make highly available and above all, scalable.

But this discussion will not be about SQL vs. NoSQL, Big Data vs. Not-So-Big Data. This is about elastic vs. inelastic.

Workloads are not only continuing their perpetual climb to the sky, they are becoming less predictable. This means ever-larger worst-case peaks to prepare for. There is an old adage among IT managers: no manager ever got fired for buying IBM. Similarly, no DBA was ever fired for over-provisioning a database server. It's not hard to understand why: paying $10,000 for an unsexy, infrastructure-like service such as a relational database when $5,000 would have been enough to cover any peak in demand is an extra cost that is often trivial in comparison to the potentially disastrous costs - high latency, instability or unavailability - that could be brought on by using an under-provisioned $2,500 service [1]. Systems tend to be at their most unpredictable when approaching peak capacity, and so over-provisioning is a deeply ingrained practice.

Consider the demand and capacity curves for a typical inelastic database service here:

The workload graph above is admittedly contrived and simplified for the purposes of this discussion, but it fairly well represents the peaks and valleys that characterize most workloads on, say, a daily basis. The black line is the capacity of the server/VM you paid for; it could be a physical server in your own data center or a virtual machine deployed through a database-as-a-service such as Amazon RDS or OpenStack Trove. The green area below the capacity line represents wasted computing power, and thus wasted money, given a constant price per unit of time, which is the typical case in most public clouds. Put another way, the cost per operation is highest in the valleys of the workload and lowest at peak. Similarly, the blue area above the capacity line represents the dangerous situation of an under-provisioned service.
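
A quick back-of-the-envelope calculation makes the point. With made-up but plausible numbers - a $1/hour instance serving 1,000 ops/sec at peak and 100 ops/sec in the overnight valley - the cost per operation differs by 10x even though the hourly bill is identical:

```go
package main

import "fmt"

func main() {
	const pricePerHour = 1.0 // $/hour for the instance, regardless of load

	for _, opsPerSec := range []float64{1000, 100} { // peak vs. overnight valley
		opsPerHour := opsPerSec * 3600
		costPerMillionOps := pricePerHour / opsPerHour * 1e6
		fmt.Printf("%6.0f ops/sec -> $%.2f per million operations\n",
			opsPerSec, costPerMillionOps)
	}
	// At 1000 ops/sec the instance costs about $0.28 per million operations;
	// at 100 ops/sec the same instance costs about $2.78 per million - a 10x
	// difference purely from paying per hour rather than per operation.
}
```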

The holy grail of elastic services, database or otherwise, is to make the flat capacity curve fit the demand curve as closely as possible, thereby approaching a constant cost per operation rather than per unit of time. The state of the art in relational database systems falls short of this ideal: services like Amazon RDS with Provisioned IOPS take an indirect step in this direction through cost-per-IO in the underlying block storage. Research systems like the MoSQL storage engine bring elasticity a layer higher, enabling a flexible allotment of servers in a cluster in order to match the workload.

An interesting twist is that public cloud providers will not have much interest in changing this status quo. As long as customers are able and willing to pay to over-provision services in the cloud, whether databases or anything else, public clouds will be able to over-commit their underlying hardware resources and sell more VMs with less hardware.

It is not clear how long this situation will last. As services such as databases become more elastic and capacity begins to more closely match workload, users will be attracted to the cost savings and place downward pressure on the over-commitment ratios for underlying hardware. Public clouds may be forced to either raise prices or move away from an instance-hour pricing model.

(A version of this article originally appeared on LinkedIn Pulse)

[1] The numbers are of course arbitrary; it could be $10,000 for an actual server housed in your data center or a $10/hour VM on a public cloud