Industry Standard SQL-on-Hadoop benchmarking?

Earlier today a witty comment I made on Twitter led to a long and heated discussion about vendor exaggerations regarding SQL-on-Hadoop relative performance. It started as a link to this post by Hyeong-jun Kim, CTO and Chief Architect at Gruter. That post discusses some of the ways vendors exaggerate and suggests verifying such claims with your own data and queries.

Anyway, Milind Bhandarkar from Pivotal suggested that joining an industry-standard benchmarking effort might be the right thing. I disagree, and wanted to elaborate on that:

An industry-standard SQL-on-Hadoop benchmark won’t improve a thing

  • These benchmarks (at least in the SQL space) don’t help users pick a technology. I’ve never heard of a customer who picked a solution because it led the performance or price/performance list of TPC-C or TPC-H.
    Customers will always benchmark on their data and their workload…
  • …And they do so because in this space, many small variations in data (e.g. data distribution within a column) or workload (SQL features, concurrency, transaction type mix) will have a dramatic impact on the results.
  • So, the vendors who write the benchmark will fight to the death to have it highlight their (existing) strengths. If the draft benchmark shows them at the bottom, they’ll withdraw and bad-mouth the benchmark for not incorporating their suggestions.
  • Of course, the closed-source players still won’t allow publishing benchmark results – the “DeWitt Clause” (right, Milind? What about HAWQ?)
  • And even with a standard, all vendors will still use micro-benchmarks to highlight the full extent of new features and optimizations, so the rebuttals and flame wars will not magically end.

What I think is the right thing for users

Since vendors (and users) will not agree on a single benchmark, the next best thing is for each player to develop their own benchmark(s), but make them easily reproducible.

For example, share the dataset, the SQL queries, the specific pre-processing (if any), the specific non-default config options (if any), the exact software versions involved and the exact hardware config – and hopefully also provide the script that runs the whole benchmark (a minimal sketch of such a disclosure appears below).

This would allow the community and ecosystem – users, prospects and vendors – to:

  • Quickly reproduce the results (maybe on a somewhat different config).
  • Play with different variations to learn how stable the conclusions are (different file formats, data set sizes, SQL and parameter variations and many more).
  • Share their results in the open, using the same disclosure, for everyone to learn from or respond to.

Short version – community over committee.
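To make this concrete, here is a minimal sketch of what such a reproducible disclosure and runner could look like. Everything in it is a made-up placeholder – the engine version, hardware description, dataset location, config option and query directory – and a real disclosure would of course bundle the actual dataset, schema and complete configuration alongside it.

```python
# Hypothetical sketch of a reproducible benchmark runner and disclosure.
# All names (engine version, dataset path, config options, query files) are
# placeholders -- a real disclosure would pin exact software versions, hardware
# specs, non-default config options and the dataset itself.
import json
import time
from pathlib import Path

def run_query(sql: str) -> float:
    """Time one SQL statement; the engine call is intentionally left abstract here."""
    start = time.monotonic()
    # connection.execute(sql)  # <-- engine-specific call goes here
    return time.monotonic() - start

def main() -> None:
    disclosure = {
        "engine_version": "example-engine 1.2.3",      # exact SW version
        "hardware": "10 x r3.xlarge, 10GbE",            # exact HW config
        "non_default_config": {"shuffle.partitions": 200},
        "dataset": "s3://example-bucket/tpch-sf100/",   # shared dataset location
        "results": [],
    }
    for query_file in sorted(Path("queries").glob("*.sql")):
        elapsed = run_query(query_file.read_text())
        disclosure["results"].append({"query": query_file.name, "seconds": round(elapsed, 3)})
    # Publishing this JSON (plus the queries and dataset) lets anyone re-run the benchmark.
    print(json.dumps(disclosure, indent=2))

if __name__ == "__main__":
    main()
```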

Why that also won’t happen

Sharing a detailed, easily reproducible report is great for smart users who want to educate themselves and choose the right product.

However, there is nearly zero incentive for vendors and projects to share it, especially for the leaders. Why? Because they are terrified that a competitor will use it to show that it is way faster… That could be the ultimate marketing fail (a small niche player can afford it, since it has little to lose).

There are some other reasonable excuses – a dependency on internal testing frameworks, scripts or non-public datasets; not enough resources to clean up and document each micro-benchmark or to follow up with everyone; etc.

Also, such disclosure may prevent marketing / management from highlighting some results out of context (or adding wishful thinking)… I’m not sure many are willing to report to their board – “sales are down since we told everyone that our product is not yet good enough”.
For example – I haven’t yet seen a player claiming “Great performance! We are now 0.7x as fast as the current performance leader!”. Somehow you’ll need to claim that you are great, even if only for a specific scenario under specific (undisclosed?) constraints.

Summary

  • You can’t really optimize picking the right tool by building a universal performance benchmark (and picking a tool is not only about performance).
  • Human interests (commercial and others) will generate friction as everyone competes for mindshare. Social pressure might occasionally push competitors toward civilized technical discussion, but only rarely.
  • When sharing performance numbers, please be nice – share what you can and don’t mislead.
  • As a user, try to learn and verify before making big commitments, and be prepared to screw up occasionally…

Big Data PaaS part 1 – Amazon Redshift

Last month I attended the AWS Summit in Tel Aviv. It was a great event, one of the largest local tech events I’ve seen. Anyway, a session on Redshift got me thinking.

When Amazon Redshift was announced (Nov 2012), I was working at Greenplum. I remember that while Redshift’s pricing was impressive, I mostly dismissed it at the time due to some perceived gaps in functionality and its “limited beta” status. Back at last month’s session, I came to see what’s new and was surprised by the tone of the presentation.

As a database expert, I thought I would hear about the implementation details – the sharding mechanism, the join strategies, the various types of columnar encoding and compression, the pipelined data flow, the failover design, etc. While a few of these were briefly mentioned, they were definitely not the focus. Instead, the main focus was the value of getting your data warehouse on a PaaS platform.

So what is the value? It is a combination of several things. Lower cost is of course a big one – it allows smaller organizations to have a decent data warehouse. But it is much more than cost – being a PaaS means you can immediately start working without worrying about everything from a complex, time-consuming POC to all the infrastructure work needed for a successful production-grade deployment. Just provision a production cluster for a day or two, try it out on your data and queries, and if you like the performance (and cost) – simply continue to use it. It allows a very safe play – or as they put it, a “low cost of failure”. In addition, a PaaS environment can handle all the ugly infrastructure tasks around MPP databases or appliances that typically block adoption – backups, DR, connectivity etc.
Let me elaborate on these points:

I guess anyone who has done a DW POC can relate to the following painful process: you have to book the POC many weeks in advance, pick from several uneasy options (a remote POC, flying to the vendor’s central site or coordinating a delivery to your data center), pick in advance a small subset of data and queries to test, skip your native BI/ETL tools, rely on the vendor’s experts to run and tune the POC, finish with a lot of open questions as you run out of time, pick something, pay a lot of money, wait a couple of months for delivery, start migration / testing / re-tuning (or even DW design and implementation), and several months later you will likely realize how good (or bad) your choice really was… This is always a very high-risk maneuver, and the ability to provision a production environment, try it out and, if you like it, just keep it without a huge upfront commitment is a very refreshing concept.
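As a rough illustration of how lightweight that "provision, try, keep or kill" flow is, here is a hedged sketch using boto3. The cluster identifier, node type, credentials and region are placeholders I made up, and you would check the current Redshift documentation for valid node types and required parameters before running anything like this.

```python
# Hedged sketch: spin up a small Redshift cluster for a trial, then keep or kill it.
# All identifiers, the node type, the region and the credentials are placeholders.
import time
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="poc-cluster",        # placeholder name
    NodeType="dw.hs1.xlarge",               # node type at the time of writing; verify against current docs
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="Example-Passw0rd",  # placeholder; use a real secret
)

# Poll until the cluster is available, then point your BI/ETL tools at its endpoint
# and run your own data and queries against it.
while True:
    cluster = redshift.describe_clusters(ClusterIdentifier="poc-cluster")["Clusters"][0]
    if cluster["ClusterStatus"] == "available":
        print("Endpoint:", cluster["Endpoint"]["Address"])
        break
    time.sleep(60)

# If the trial does not work out, delete the cluster and stop paying for it;
# otherwise just keep using it:
# redshift.delete_cluster(ClusterIdentifier="poc-cluster", SkipFinalClusterSnapshot=True)
```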

The same goes for many “annoying” infrastructure bits, for example backup, recovery and DR. Typically these will not be tested in a real-world configuration as part of a POC due to various constraints, and the real tradeoff between feasible options in cost, RPO and RTO will not be seriously evaluated, or in many cases not even considered. Having a working backup included with Redshift – meaning the backup functionality along with the underlying infrastructure and storage – is again huge. Another similar one is patching – it is nice that it’s Amazon’s problem.
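For illustration, here is a hedged sketch of how snapshots look from the API side, again with boto3 and made-up identifiers. Redshift also takes automated snapshots on its own, so this only shows the manual path on top of that.

```python
# Hedged sketch: a manual snapshot plus a restore into a new cluster.
# Identifiers are placeholders; Redshift's automated snapshots run regardless.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Manual snapshot on top of the automated ones (e.g. before a risky schema change).
redshift.create_cluster_snapshot(
    SnapshotIdentifier="before-schema-change",
    ClusterIdentifier="poc-cluster",
)

# Restore is simply "give me a new cluster from that snapshot" -- the storage,
# infrastructure and copy mechanics are Amazon's problem, not yours.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="poc-cluster-restored",
    SnapshotIdentifier="before-schema-change",
)
```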

Last of these is scaling. With an appliance, scaling out is painful. You order an (expensive) upgrade, wait a couple of months for it to be installed, then try to roll it out (which likely involves many hours of storage re-balancing), then pray it works. Obviously you can’t scale down to reduce costs. With Redshift, they provision a second cluster (while your production cluster is switched to read-only mode) and switch you over to the new one “likely within an hour”. Of course, with Redshift you can scale up or down on demand, or even turn off the cluster when not needed (though there are some pricing gotchas, as there is a strong pricing bias toward 3-year reserved instances).
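And a hedged one-call sketch of a resize, with a placeholder identifier and node count; per the description above, the existing cluster keeps serving reads while the new one is provisioned behind the scenes.

```python
# Hedged sketch: resizing an existing cluster. During the resize Redshift provisions
# a new cluster in the background and switches you over; the old one serves reads only.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Scale out to 8 nodes (identifier and size are placeholders); scaling down works the same way.
redshift.modify_cluster(
    ClusterIdentifier="poc-cluster",
    NumberOfNodes=8,
)
```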

If you’re interested, you can find Guy Ernest’s presentation here. During his presentation, most slides were hidden as he and Intel had only 45 minutes – I see now that the full slide deck actually does have some slides full of implementation details 🙂

BTW – another thing I was curious about was whether the Amazon team can significantly evolve the Redshift code base given its external origin (ParAccel). I assumed that it is not a trivial task. Well, I just read last week that AWS is releasing a big functionality upgrade for Redshift (plus some more bits), so I think that one is also off the table.

It will be interesting to see whether Redshift now gains more traction, especially as it becomes more integrated into the AWS offering and workflow.