This is the first in a series of posts on Oracle’s Exadata Hybrid Columnar Compression (HCC), which is actually a great feature of the Oracle database. It is currently locked to Oracle-only storage (Exadata, ZFS appliance, etc.), and Oracle marketing pushes it hard as it provides “10x” compression to Oracle customers.
Oracle makes bold claims about HCC all over the place. For example, in this whitepaper from November 2012, the first paragraph claims “average storage savings can range from 10x to 15x” and the second paragraph illustrates it with a 100TB DB going down to 10TB, with 90TB of storage savings. After that, the paper switches to a real technical discussion of HCC.
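As a quick sanity check on that arithmetic, here is a minimal Python sketch that turns a compression ratio into stored size and savings. The function name `storage_savings` is my own, made up for illustration; it is not from the whitepaper.

```python
def storage_savings(original_tb, ratio):
    """Return (compressed size, saved space, saved fraction) for a compression ratio."""
    compressed_tb = original_tb / ratio
    saved_tb = original_tb - compressed_tb
    return compressed_tb, saved_tb, saved_tb / original_tb


# The whitepaper's 100TB example at both ends of the claimed 10x-15x range
for ratio in (10, 15):
    compressed, saved, fraction = storage_savings(100, ratio)
    print(f"{ratio}x: 100TB -> {compressed:.1f}TB, saving {saved:.1f}TB ({fraction:.0%})")
```

At 10x this reproduces the whitepaper's numbers (10TB stored, 90TB saved).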
So, what does HCC “10x” compression look like in real life? How much storage savings will Oracle customers see if they move to Exadata and start using HCC?
It is very hard to find any unbiased analysis. So, to find out and start a hype-free discussion, I decided to get some real-world data from Oracle customers. Here are my findings.
To start, I needed access to an undisputed data source. Luckily, one can be found on Oracle’s web site – an impressive 76-page Exadata customer reference booklet from September 2012 containing a sample of 33 customer stories. Obviously, it is not very “representative” – reference customers tend to be more successful than average ones – but I think there is still a lot of value in analyzing it. Hey, maybe we’ll find that their storage savings are even larger than 10x-15x, who knows!
So, once I had data,
This time I’ll discuss “The Google File System” (GFS) by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung from 2003. While GFS is proprietary to Google, it was the direct inspiration for the Hadoop Distributed File System (HDFS), which is the fundamental layer of the popular Hadoop ecosystem.
In a nutshell, GFS takes a cluster of commodity servers with local disks and builds a fault-tolerant distributed file system on top of it. The main innovation was picking a new set of assumptions and optimizing the file system around a specific use case from Google:
- The cluster is designed for low-end, low-cost nodes with local disks. Specifically, failure of disks and servers is handled automatically and transparently, as failures are expected to be a common, normal occurrence, not some rarely tested corner case.
- The file system does not aim to be general-purpose. It is optimized for large, sequential I/O for both reads and writes (high bandwidth, not low latency).
In addition, GFS aims to hold relatively few (millions of) files, mostly large ones (multi-GB).
The architecture of GFS will look very familiar if you know HDFS. In GFS, there is a single master server (similar to the HDFS NameNode) and one chunkserver per server (similar to an HDFS DataNode). Files are broken down into large, fixed-size chunks of 64MB (similar to HDFS blocks), which are stored as local Linux files and replicated for HA (three replicas by default). The master maintains all the metadata of the files and chunks in memory. Clients get metadata from the master, but their read/write communications go directly to the chunkservers. The master logs metadata changes persistently to a local and remote operation log (similar to the HDFS EditLog), but chunk location metadata is not persisted; it is gathered from the chunkservers during master startup, etc.
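To make the metadata flow a bit more concrete, here is a toy Python sketch of the lookup path described above: a byte offset is translated into a chunk index, and an in-memory master maps it to a chunk handle and its replica locations. Everything here (`ToyMaster`, the handle strings, the chunkserver names, the file path) is hypothetical and only for illustration; it is not the real GFS interface, and it ignores the operation log, leases and failures.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB fixed-size chunks, as described in the text


def chunk_index(byte_offset):
    """Which chunk of a file holds the given byte offset."""
    return byte_offset // CHUNK_SIZE


class ToyMaster:
    """Keeps all metadata in memory; a stand-in for the single master server."""

    def __init__(self):
        self.chunk_handles = {}  # (filename, chunk index) -> chunk handle
        self.locations = {}      # chunk handle -> chunkserver names (not persisted)

    def add_chunk(self, filename, index, handle, servers):
        self.chunk_handles[(filename, index)] = handle
        self.locations[handle] = list(servers)

    def lookup(self, filename, byte_offset):
        """Metadata request: return the chunk handle and its replica locations."""
        handle = self.chunk_handles[(filename, chunk_index(byte_offset))]
        return handle, self.locations[handle]


master = ToyMaster()
master.add_chunk("/logs/web.log", 0, "h-0001", ["cs1", "cs2", "cs3"])
master.add_chunk("/logs/web.log", 1, "h-0002", ["cs2", "cs3", "cs4"])

# A client wants to read at offset 100MB, which falls in the second chunk (index 1).
handle, replicas = master.lookup("/logs/web.log", 100 * 1024 * 1024)
print(handle, replicas)
```

After this lookup, the client would talk directly to one of the returned chunkservers for the actual data, so the master stays out of the data path.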
Cool features – surprisingly, back in 2003 GFS already had some features that have yet to appear in HDFS.
Following my previous two posts on concurrency, I’d like to explain why “too much” concurrency always hurts database performance (in any database), and discuss a couple of common database features that were designed to manage it, including an example from Oracle and Greenplum.
What do I mean by “too much” concurrency? Let’s play with an example. Let’s assume a small, simple database system where each SQL is processed by a single-threaded, CPU-bound process (maybe it is an in-memory database or all the relevant data is cached). Let’s further assume that each SQL takes 10 seconds to execute, and that the system can only efficiently execute four parallel SQLs. So, if we fire up four SQLs at a time every 10 seconds, we will get a throughput of 24 queries per minute and an average response time of 10 seconds. So far, life is good.
But what happens if we fire up 24 queries simultaneously once a minute? Let’s assume no interference between the SQLs and a fair scheduler that cycles between the processes many times per second. In that case, we will still get 24 queries per minute, but all queries will finish after about 59-60 seconds, so the average response time will be almost 60 seconds – or six times slower with the same throughput. So, scheduling too many SQLs at once just drove response time through the roof without improving throughput!
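The two scenarios can be checked with a small Python sketch. It models an idealized fair scheduler (perfect time slicing, no context-switch cost) for jobs that all start together, where each single-threaded job can use at most one CPU. The function name and the model are my own simplification of the example, not a real database scheduler.

```python
def fair_share_finish_times(work_seconds, n_cpus):
    """Finish times for jobs that all start at t=0 under an ideal fair scheduler.

    Each job is single-threaded, so it runs at full speed while there are at
    most n_cpus active jobs, and is slowed down proportionally otherwise.
    """
    jobs = sorted(work_seconds)
    now = 0.0
    done = 0.0  # CPU-seconds every still-active job has received so far
    finish = []
    for i, work in enumerate(jobs):
        active = len(jobs) - i
        slowdown = max(1.0, active / n_cpus)
        now += (work - done) * slowdown  # wall time until the next job completes
        done = work
        finish.append(now)
    return finish


batch_of_4 = fair_share_finish_times([10.0] * 4, n_cpus=4)
burst_of_24 = fair_share_finish_times([10.0] * 24, n_cpus=4)

print(sum(batch_of_4) / len(batch_of_4))    # average response time, four at a time
print(sum(burst_of_24) / len(burst_of_24))  # average response time, 24 at once
```

In this idealized model the burst of 24 finishes at exactly 60 seconds rather than “about 59-60”, because the sketch ignores scheduler granularity. Either way, both scenarios complete 24 queries per minute, so only the response time changes.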
Another way that “too much” concurrency hurts performance is
I decided to try writing, once in a while, a post on some of the classic papers and topics that had a major effect on our big data technologies, and there is no better place to start than the CAP Theorem.
The CAP Theorem by Eric Brewer was the philosophical fuel behind the so-called NoSQL movement, the battle cry that for a while united them all (at least in 2010). CAP stands for Consistency, Availability, (network) Partition tolerance, and the theorem claims that in a distributed system, when there is an inevitable network partition (and the cluster breaks into two or more “islands”), you can’t guarantee both availability (for updates) and consistency. However, it was sometimes dumbed down to a “Consistency, Availability, Partition Tolerance – pick any two” slogan to explain why an eventual consistency model for a NoSQL database is legit. The discussion usually classified relational databases as “CA” and typical NoSQL databases as “AP”. Here is one example, and another representative one as an image:
Following my previous post on the various meanings of a customer concurrency requirement, I will try to explain why database (SQL) concurrency is usually the wrong target to set.
My main point is that database SQL concurrency is the result of both the SQL workload’s throughput (like “queries per hour”) and the database-specific latency (SQL response time). For example, I’ll demonstrate how, for a fixed workload, making a query go faster (tuning it) automatically reduces the database concurrency. This is a generic point: it is not specific to any database technology, and it applies beyond the database domain.
Here is a simple example. Let us assume that a database is required to support 1800 similar queries per hour, arriving randomly. That means, on average, one new query every two seconds. Let us also assume that, for now, each query runs for 60 seconds on average, regardless of the database load (just for simplicity’s sake). So, given that specific query throughput and latency, the database will have about 30 concurrent SQLs running on average.
Continuing the example, let’s assume we now somehow tune the database to make this type of SQL faster, so that each query execution takes only 10 seconds. If the workload is still 1800 SQLs per hour, suddenly we will only have about five concurrent SQLs! If we further tune the SQLs to execute in half a second, we will see less than one concurrent SQL on average – as a new SQL arrives only once every two seconds, much longer than each SQL’s run time.
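The numbers in this example follow the relationship known in queueing theory as Little’s Law: average concurrency equals arrival rate times average time in the system. Here is a minimal Python sketch of it; the function name `average_concurrency` is mine, for illustration only.

```python
def average_concurrency(queries_per_hour, avg_seconds_per_query):
    """Little's Law: average concurrency = arrival rate x average latency."""
    arrivals_per_second = queries_per_hour / 3600.0
    return arrivals_per_second * avg_seconds_per_query


# Same workload (1800 queries per hour), three different query latencies
for latency in (60, 10, 0.5):
    print(f"{latency}s per query -> {average_concurrency(1800, latency)} concurrent SQLs")
```

For the 1800 queries per hour workload this gives 30, 5 and 0.25 concurrent SQLs for run times of 60, 10 and 0.5 seconds respectively.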
What this thought exercise nicely demonstrates is that SQL concurrency is a derived metric
“You must demonstrate support for 1000 concurrent SQLs in the database”
During my years at Oracle and Greenplum, I heard similar statements (anywhere between 100 and 5000) numerous times during data warehouse POC planning. In every case, what followed was more or less the same. I worked toward understanding what they meant by that – what the real requirements were – and then tried to adjust the POC metrics to reflect the real customer goals.
Looking back, there seem to be two recurring points of confusion. The first is which type of concurrency we are talking about. The second is how the expected workload eventually translates into database concurrency. In this post, I’ll elaborate on the first point, and a follow-up will discuss the second.
The crucial thing to understand is that at most customers, most of the time, when different people talk about concurrency, each of them likely means a different thing. So, what could they mean? Here are some options: