Industry Standard SQL-on-Hadoop benchmarking?

Earlier today, a witty comment I made on Twitter led to a long and heated discussion about vendor exaggerations regarding SQL-on-Hadoop relative performance. It started as a link to this post by Hyeong-jun Kim, CTO and Chief Architect at Gruter. That post discusses some of the ways vendors exaggerate and suggests verifying their claims with your own data and queries.

Anyway, Milind Bhandarkar from Pivotal suggested that joining an industry standard benchmarking effort might be the right thing. I disagree, and wanted to elaborate on why:

An industry standard SQL-on-Hadoop benchmark won’t improve a thing

  • These benchmarks (at least in the SQL space) don’t help users pick a technology. I’ve never heard of a customer who picked a solution because it led the performance or price/performance rankings of TPC-C or TPC-H.
    Customers will always benchmark on their data and their workload…
  • …And they do so because in this space, many small variations in data (e.g. data distribution within a column) or workload (SQL features, concurrency, transaction type mix) will have a dramatic impact on the results (a concrete illustration follows this list).
  • So, the vendors who write the benchmark will fight to the death to have it highlight their (existing) strengths. If the draft benchmark shows them at the bottom, they’ll withdraw and bad-mouth the benchmark for not incorporating their suggestions.
  • Of course, the closed-source players still won’t allow publishing benchmark results – the “DeWitt Clause” (right Milind? What about HAWQ?)
  • And even with a standard, all vendors will still use micro-benchmarks to highlight the full extent of new features and optimizations, so the rebuttals and flame wars will not magically end.
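
To make the data-variation point concrete, here is a minimal sketch (in Python with numpy; the parameters and numbers are illustrative, not from any vendor’s suite) of how a skewed key distribution alone can cap parallel speedup under hash partitioning:

```python
# Why data distribution matters: with hash partitioning, parallel runtime
# is bounded by the busiest partition, so a skewed group-by/join key
# quietly erases most of the expected speedup.
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_partitions = 1_000_000, 32

for name, keys in [
    ("uniform", rng.integers(0, 100_000, n_rows)),
    ("skewed (Zipf)", rng.zipf(1.5, n_rows)),
]:
    sizes = np.bincount(keys % n_partitions, minlength=n_partitions)
    # Best-case speedup = total rows / rows on the busiest partition.
    print(f"{name:>13}: busiest partition holds {sizes.max() / n_rows:.1%} "
          f"of rows -> speedup ~{n_rows / sizes.max():.1f}x out of {n_partitions}")
```

With the Zipf-distributed key, a single partition ends up holding a large share of the rows, so 32 workers behave like a handful – same schema, same query shape, wildly different result.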

What I think is the right thing for users

Since vendors (and users) will not agree on a single benchmark, the next best thing is for each player to develop their own benchmark(s), but make them easily reproducible.

For example: share the dataset, the SQL queries, any specific pre-processing, any non-default config options, the exact software versions involved, and the exact hardware config – and hopefully, provide the script that runs the whole benchmark.
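
As an illustration, such a disclosure could be as small as one self-describing script. Here is a minimal sketch (every name, version, and option below is a hypothetical placeholder, not a real published benchmark):

```python
# A hypothetical reproducible-benchmark runner: the full disclosure is
# published right next to the numbers, so anyone can re-run the test.
import json
import subprocess
import time
from pathlib import Path

DISCLOSURE = {
    "dataset": "https://example.com/benchmark-data.tar.gz",  # placeholder URL
    "preprocessing": ["convert CSV to Parquet, snappy compression"],
    "non_default_config": {"hive.exec.parallel": "true"},    # illustrative option
    "software": {"hive": "0.12.0", "hadoop": "2.2.0"},       # exact versions tested
    "hardware": "10 x (2 x 8-core Xeon, 64GB RAM, 12 x 2TB SATA)",
}

def run_query(sql_file: Path) -> float:
    """Run one benchmark query via the engine's CLI; return wall-clock seconds."""
    start = time.time()
    subprocess.run(["hive", "-f", str(sql_file)], check=True)  # swap in your engine
    return time.time() - start

if __name__ == "__main__":
    results = {f.name: run_query(f) for f in sorted(Path("queries").glob("*.sql"))}
    print(json.dumps({"disclosure": DISCLOSURE, "runtimes_sec": results}, indent=2))
```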

This would allow the community and ecosystem – users, prospects and vendors – to:

  • Quickly reproduce the results (maybe on a somewhat different config).
  • Play with different variations to learn how stable the conclusions are (different file formats, data set sizes, SQL and parameter variations, and many more).
  • Share their results in the open, using the same disclosure, for everyone to learn from or respond to.

Short version – community over committee.

Why that also won’t happen

Sharing a detailed, easily reproducible report is great for smart users who want to educate themselves and choose the right product.

However, there is nearly zero incentive for vendors and projects to share it, especially for the leaders. Why? Because they are terrified that a competitor will use it to show that they are way faster… That could be the ultimate marketing fail (only a small niche player can afford it, since they have little to lose).

There are some other reasonable excuses – a dependency on internal testing frameworks, scripts, or non-public datasets; not enough resources to clean up and document each micro-benchmark, or to follow up with everyone; etc.

Also, such disclosure may prevent marketing / management from highlighting some results out of context (or adding wishful thinking)… I’m not sure many are willing to report to their board: “sales are down since we told everyone that our product is not yet good enough”.
For example – I haven’t yet seen a player claiming “Great performance! We are now 0.7x as fast as the current performance leader!”. Somehow you’ll need to claim that you are great, even if only for a specific scenario under specific (undisclosed?) constraints.

Summary

  • You can’t really optimize tool selection by building a universal performance benchmark (and selection is not only about performance).
  • Human interests (commercial and otherwise) will generate friction as everyone competes for mindshare. Social pressure might occasionally push competitors toward civilized technical discussion, but only rarely.
  • When sharing performance numbers, please be nice – share what you can, and don’t mislead.
  • As a user, try to learn and verify before making big commitments, and be prepared to screw up occasionally…