The end of the monolithic database engine dream

It seems I like calling my posts “The End of…” 🙂

Anyway, Rob Klopp wrote an interesting post titled "Specialized Databases vs. Swiss Army Knives". In it, he argues against Stonebraker's claim that the database market will split into three to six categories of databases, each with its own players. Rob counters that data is typically used in several ways, so it is cumbersome to maintain several specialized databases instead of one decent one (like Hana, of course…).

I have a somewhat different perspective. In the past – let's say ten years ago – I was sure that the Oracle database was the right thing to throw at any database challenge, and I think many in the industry shared that feeling (each with his or her favorite database, of course). There was a belief that a single database engine could be smart enough, flexible enough and powerful enough to handle almost everything.

That belief is now history. As I will show, it is now well understood and acknowledged by all the major vendors that a general-purpose database engine just can't compete in every high-end niche. HOWEVER, the existing vendors are, as always, adapting: all of them are extending their databases to offer multiple engines inside a single product, each for a different use case.

The leader here seems to be Microsoft SQL Server. SQL Server 2014 (currently at CTP2) comes with three separate database engines. In addition to the existing engine, they introduced Hekaton – an in-memory OLTP engine that looks very promising. They also delivered a brand new implementation of their columnar format – now called a clustered columnstore index – which is fully updatable and, despite the name, is not really an index: it is a primary table storage format with all the usual plumbing (delta stores plus a tuple mover process that kicks in once enough rows have accumulated).
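To illustrate the delta store / tuple mover idea (conceptually only – this is not SQL Server's actual internals, just a toy sketch of the mechanism), here is a tiny Python example: inserts land in a row-oriented delta store, and once enough rows accumulate, a tuple mover converts them into a columnar row group:

```python
# Conceptual sketch only - illustrates the delta store / tuple mover idea,
# not SQL Server's actual implementation.
ROWGROUP_SIZE = 4        # tiny threshold for the example; real engines use ~1M rows

delta_store = []         # recent rows, stored row-by-row
columnar_rowgroups = []  # closed row groups, stored column-by-column

def insert_row(row):
    """Inserts go to the delta store; the tuple mover kicks in when it fills up."""
    delta_store.append(row)
    if len(delta_store) >= ROWGROUP_SIZE:
        tuple_mover()

def tuple_mover():
    """Convert the accumulated delta rows into a columnar row group."""
    global delta_store
    columns = {key: [r[key] for r in delta_store] for key in delta_store[0]}
    columnar_rowgroups.append(columns)  # a real engine would also encode/compress
    delta_store = []

def scan_column(name):
    """A column scan reads the closed columnar row groups plus the open delta store."""
    for group in columnar_rowgroups:
        yield from group[name]
    for row in delta_store:
        yield row[name]

if __name__ == "__main__":
    for i in range(6):
        insert_row({"id": i, "amount": i * 10})
    # First 4 values come from the columnar row group, last 2 from the delta store.
    print(list(scan_column("amount")))
```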


Industry Standard SQL-on-Hadoop benchmarking?

Earlier today, a witty comment I made on Twitter led to a long and heated discussion about vendor exaggerations regarding SQL-on-Hadoop relative performance. It started with a link to this post by Hyeong-jun Kim, CTO and Chief Architect at Gruter. That post discusses some of the ways vendors exaggerate and suggests verifying these claims with your own data and queries.

Anyway, Milind Bhandarkar from Pivotal suggested that joining an industry-standard benchmarking effort might be the right thing. I disagree, and wanted to elaborate on that:

An industry-standard SQL-on-Hadoop benchmark won't improve a thing

  • These benchmarks (at least in the SQL space) don't help users pick a technology. I've never heard of a customer who picked a solution because it led the performance or price/performance list of TPC-C or TPC-H.
    Customers will always benchmark on their data and their workload…
  • …And they do so because, in this space, many small variations in data (e.g. data distribution within a column) or workload (SQL features, concurrency, transaction type mix) will have a dramatic impact on the results.
  • So, the vendors who write the benchmark will fight to the death to have it highlight their (existing) strengths. If the draft benchmark shows them at the bottom, they'll withdraw and bad-mouth the benchmark for not incorporating their suggestions.
  • Of course, the closed-source players still won't allow publishing benchmark results – the "DeWitt Clause" (right, Milind? What about HAWQ?)
  • And even with a standard, all vendors will still use micro-benchmarks to highlight the full extent of new features and optimizations, so the rebuttals and flame wars will not magically end.

What I think is the right thing for users

Since vendors (and users) will not agree on a single benchmark, the next best thing is for each player to develop their own benchmark(s), but make them easily reproducible.

For example, share the dataset, the SQLs, the specific pre-processing (if any), the specific non-default config options (if any), the exact SW versions involved, the exact HW config and, hopefully, the script that runs the whole benchmark.
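To make that concrete, here is a minimal, hypothetical Python sketch of what such a self-describing, reproducible benchmark could look like – every name, path, option and CLI command below is made up purely for illustration:

```python
# Hypothetical example of a self-describing, reproducible benchmark run.
# Every name, path, option and CLI command here is illustrative only.
import json
import subprocess
import time

manifest = {
    "dataset": "s3://example-bucket/tpch-like-500gb/",   # where to get the data
    "preprocessing": "scripts/convert_to_orc.sh",        # exact pre-processing step
    "queries": ["queries/q01.sql", "queries/q05.sql"],   # the SQL actually run
    "engine": {"name": "some-sql-on-hadoop", "version": "1.2.3"},
    "non_default_config": {"query.max-memory": "20GB"},
    "hardware": "10 x (16 cores, 128GB RAM, 12x2TB SATA, 10GbE)",
}

def run_query(path):
    """Run one query through the engine's CLI (hypothetical) and return wall-clock seconds."""
    start = time.time()
    subprocess.run(["example-sql-cli", "-f", path], check=True)  # hypothetical CLI
    return time.time() - start

if __name__ == "__main__":
    results = {q: run_query(q) for q in manifest["queries"]}
    # Publish the manifest together with the results so anyone can reproduce the run.
    print(json.dumps({"manifest": manifest, "results_seconds": results}, indent=2))
```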

This would allow the community and ecosystem – users, prospects and vendors – to:

  • Quickly reproduce the results (maybe on a somewhat different config).
  • Play with different variations to learn how stable the conclusions are (different file formats, data set sizes, SQL and parameter variations and many more).
  • Share their results in the open, using the same disclosure, for everyone to learn from or respond to.

Short version – community over committee.

Why that also won’t happen

Sharing a detailed, easily reproducible report is great for smart users who want to educate themselves and choose the right product.

However, there is nearly zero incentive for vendors and projects to share it, especially for the leaders. Why? Because they are terrified that a competitor will use it to show that they are way faster… That could be the ultimate marketing fail (a small niche player can afford it, since they have little to lose).

Some other reasonable excuses – there could be a dependency on internal testing frameworks, scripts or non-public datasets, not enough resources to clean up and document each micro-benchmark or to follow up with everyone etc.

Also, such disclosure may prevent marketing / management from highlighting some results out of context (or adding wishful thinking)… I'm not sure many are willing to report to their board – "sales are down since we told everyone that our product is not yet good enough".
For example – I haven't yet seen a player claiming "Great performance! We are now 0.7x as fast as the current performance leader!". Somehow you need to claim that you are great, even if only for a specific scenario under specific (undisclosed?) constraints.

Summary

  • You can't really optimize picking the right tool by building a universal performance benchmark (and picking is not only about performance).
  • Human interests (commercial and otherwise) will generate friction as everyone competes for mindshare. Social pressure might occasionally push competitors toward civilized technical discussion, but only rarely.
  • When sharing performance numbers, please be nice – share what you can and don't mislead.
  • As a user, try to learn and verify before making big commitments, and be prepared to screw up occasionally…

Big Data PaaS part 1 – Amazon Redshift

Last month I attended AWS Summit in Tel-Aviv. It was a great event, one of the largest local tech events I’ve seen. Anyway, a session on Redshift got me thinking.

When Amazon Redshift was announced (Nov 2012), I was working at Greenplum. I remember that while Redshift pricing was impressive, I mostly dismissed it at the time due to some perceived gaps in functionality and its "limited beta" status. Back at last month's session, I came to see what was new and was taken by surprise by the tone of the presentation.

As a database expert, I thought I would hear about the implementation details – the sharding mechanism, the join strategies, the various types of columnar encoding and compression, the pipelined data flow, the failover design etc. While a few of these were briefly mentioned, they were definitely not the focus. Instead, the main focus was the value of getting your data warehouse on a PaaS platform.

So what is the value? It is a combination of several things. Lower cost is of course a big one – it allows smaller organizations to have a decent data warehouse. But it is about much more than cost – being a PaaS means you can immediately start working without worrying about everything from a complex, time-consuming POC to all the infrastructure work needed for a successful production-grade deployment. Just provision a production cluster for a day or two, try it out on your data and queries, and if you like the performance (and cost) – simply continue to use it. It allows a very safe play – or as they put it, a "low cost of failure". In addition, a PaaS environment can handle all the ugly infrastructure tasks around MPP databases or appliances that typically block adoption – backups, DR, connectivity etc.
Let me elaborate on these points:

I guess anyone who has done a DW POC can relate to the following painful process: you have to book the POC many weeks in advance, pick from several uneasy options (a remote POC, flying to the vendor's central site or coordinating a delivery to your data center), pick a small subset of data and queries to test in advance, give up your native BI/ETL tools, rely on the vendor's experts to run and tune the POC, finish with a lot of open questions as you run out of time, pick something, pay a lot of money, wait a couple of months for delivery, start migration / testing / re-tuning (or even DW design and implementation) and, several months later, likely realize how good (or bad) your choice really was… This is always a very high-risk maneuver, and the ability to provision a production environment, try it out and, if you like it, just keep it – without a huge upfront commitment – is a very refreshing concept.

Same goes for many "annoying" infrastructure bits. For example – backup, recovery and DR. Typically these will not be tested in a real-world configuration as part of a POC due to various constraints, and the real tradeoff between feasible options in cost, RPO and RTO will not be seriously evaluated or, in many cases, even considered. Having a working backup included with Redshift – meaning the backup functionality together with the underlying infrastructure and storage – is again huge. Another similar one is patching – it is nice that it's Amazon's problem.

Last of these is scaling. With an appliance, scaling out is painful: you order an (expensive) upgrade, wait a couple of months for it to be installed, then try to roll it out (which likely involves many hours of storage re-balancing), then pray it works. Obviously you can't scale down to reduce costs. With Redshift, they provision a second cluster (while your production cluster is switched to read-only mode) and switch you over to the new one "likely within an hour". Of course, with Redshift you can scale up or down on demand, or even turn off the cluster when it is not needed (though there are some pricing gotchas, as there is a strong pricing bias toward 3-year reserved instances).
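For the curious, here is roughly what that low-commitment lifecycle looks like through the API – a minimal sketch using boto3, the current AWS SDK for Python (cluster names, node types and sizes below are illustrative, not a recommendation):

```python
# Minimal sketch with boto3 (the current AWS SDK for Python).
# Cluster identifier, node type and node counts are illustrative only.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small cluster to try things out.
redshift.create_cluster(
    ClusterIdentifier="poc-cluster",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",  # use a secrets manager in real life
)
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="poc-cluster")

# Scaling out later is a single call; Redshift handles the data redistribution
# (the cluster serves reads only while the resize runs).
redshift.modify_cluster(ClusterIdentifier="poc-cluster", NumberOfNodes=4)

# If the POC doesn't work out, tear the cluster down and keep a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="poc-cluster",
    SkipFinalClusterSnapshot=False,
    FinalClusterSnapshotIdentifier="poc-cluster-final",
)
```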

If you're interested, you can find Guy Ernest's presentation here. During his presentation, most of the slides were hidden, as he and Intel had only 45 minutes – I see now that the full slide deck actually does have some slides full of implementation details 🙂

BTW – another thing I was curious about: can the Amazon team significantly evolve the Redshift code base given its external origin (ParAccel)? I assumed that it was not a trivial task. Well, I read just last week that AWS is releasing a big functionality upgrade for Redshift (plus some more bits), so I think that concern is also off the table.

It will be interesting to see whether Redshift now gains more traction, especially as it becomes more integrated into the AWS offering and workflow.

Highlights from writing a data sheet for JethroData

A couple of weeks ago, my friends at JethroData asked me to help them write a technical data sheet.

Now, the JethroData product is a SQL engine for Hadoop, currently in private beta, with a very different design and architecture from the rest of the crowd. Specifically, it always indexes all the columns of your tables, and it uses HDFS as shared, remote storage instead of running locally on the data nodes. There are very good reasons for working this way, but being radically different makes it harder to explain.
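To give a feel for why indexing every column can make sense, here is a purely conceptual Python sketch of the idea – this is my own illustration, not JethroData's actual implementation: a query only touches the per-column indexes it needs and intersects sets of row ids.

```python
# Purely conceptual sketch of the "index every column" idea.
# NOT JethroData's implementation - just an illustration of the technique.
from collections import defaultdict

rows = [
    {"country": "US", "status": "active", "amount": 120},
    {"country": "IL", "status": "active", "amount": 80},
    {"country": "US", "status": "closed", "amount": 50},
]

# Build an inverted index (value -> set of row ids) for every column.
indexes = defaultdict(lambda: defaultdict(set))
for rid, row in enumerate(rows):
    for column, value in row.items():
        indexes[column][value].add(rid)

def query(**predicates):
    """Answer WHERE col = value AND ... by intersecting per-column row-id sets."""
    matching = None
    for column, value in predicates.items():
        ids = indexes[column][value]
        matching = ids if matching is None else matching & ids
    return [rows[rid] for rid in sorted(matching or set())]

print(query(country="US", status="active"))  # only row 0 matches both predicates
```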

So, my task was to write a technical data sheet that would help technical people switch from "this is insane nonsense" to "that actually makes sense in these cases, but how do you handle this or that…". In other words, tell a consistent technical story that makes sense, highlights their strengths and encourages readers to engage further.

I think of data sheets, in general, as the piece of paper you get from a vendor at their booth and quickly browse through later while stuck in a boring presentation (or is it just me?). So, they have to be interesting and appealing. Also, due to the short format, they should focus tightly on the main value points – neither too high-level nor too low-level. Quite a challenge!

To overcome the challenge, I first needed deeper knowledge of the product. For that, I spent some time with the JethroData team, especially with Boaz Raufman, their co-founder and CTO. I grilled him with questions about architecture and functionality, which led to great discussions. While some functionality is of course still lacking at this early stage, I was impressed to see that the core architecture is designed to handle quite a lot.

Then came the harder part – focus. There are many ways to use JethroData, its implementation has various advantages in different scenarios and, of course, the implementation has many details, each with some logic behind it. It is impossible to cram it all into a two-page data sheet, even if most of it is very important. So, I had several lively discussions with Eli Singer, JethroData's CEO, about everything from positioning to specific strengths and supporting arguments and their relative importance.

At the end of the day, I think we got something pretty sweet. You can read it on the technology page of their web site. If you are attending Strata/Hadoop World, the JethroData team will be at the Startup Showcase on Monday and at booth #75 Tuesday-Wednesday – bring questions and arguments 🙂

How will the rise of stream processing affect the Big Data paradigm?

Even if you've never heard about stream processing so far, it's an easy guess that you'll run into the term in the coming weeks around Hadoop World, given how nicely it illustrates what YARN is all about and, of course, given the Hortonworks announcement endorsing Apache Storm (disclaimer – I love Hortonworks).

Now, before I start fantasizing about the potential popularity of various architectures in the years to come, some quick background…

"Traditional" big data streaming / messaging is just about moving events from place to place – for example, collecting log records from dozens or hundreds of servers in near real time and reliably delivering each one to the relevant target(s), like HDFS files, HBase, a database or another application (with specific delivery semantics). It is a reincarnation of message queuing concepts in a high-throughput, distributed fashion, extended both to actively collect events directly from various sources and to persist them in various targets on the other side.

It seems to me that the most popular system is Apache Kafka (originally from LinkedIn), but there is also Apache Flume (originally from Cloudera), which is bundled with most Hadoop distributions.
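To make the "move events reliably from sources to targets" role concrete, here is a minimal sketch using the kafka-python client – the broker address and topic name are illustrative, and a real pipeline would have the consumer persist each event into HDFS, HBase or a database:

```python
# Minimal Kafka sketch with the kafka-python client.
# Broker address and topic name are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: an app server ships each log record as a JSON event.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("web-logs", {"user": "u42", "path": "/checkout", "status": 200})
producer.flush()

# Consumer side: a sink process reads the stream and would persist each event
# to its target (HDFS files, HBase, a database, ...).
consumer = KafkaConsumer(
    "web-logs",
    bootstrap_servers="localhost:9092",
    group_id="hdfs-sink",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print("would write to HDFS/HBase:", message.value)
```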

Stream processing, on the other hand, focuses on complex parsing, filtering, transformations and computations on the stream as the data flows – for example, maintaining up-to-date counters and aggregated statistics per user, or a rolling-window "top n" (continuous computation). This is opposed to batch processing frameworks, like MapReduce and Tez, which are based on one-time bulk processing. In this space, Apache Storm (originally from BackType / Twitter) is the main player, but since LinkedIn released Apache Samza last month, we might see some healthy competition. There are other initiatives as well – for example, Cloudera contributed a library for building real-time analytics on Flume, which might be good enough for simpler cases.
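As a toy illustration of continuous computation – plain Python, no framework – here is the "per-user counters plus rolling top n" example: the aggregates are updated incrementally as each event arrives, instead of being recomputed by a periodic batch job:

```python
# Toy illustration of continuous computation - no framework, just the idea:
# update aggregates incrementally as each event arrives.
from collections import Counter

page_views = Counter()  # always-up-to-date per-user counters

def on_event(event):
    """Called for every incoming event; the aggregates are never 'stale'."""
    page_views[event["user"]] += 1

def top_n(n=3):
    """The rolling 'top n' is available at any moment, no batch job needed."""
    return page_views.most_common(n)

if __name__ == "__main__":
    stream = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}, {"user": "carol"}]
    for event in stream:
        on_event(event)
    print(top_n())  # [('alice', 2), ('bob', 1), ('carol', 1)]
```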

By the way, these two types of systems are complementary. Specifically, the combination of Kafka and Storm seems to be gaining traction – here is one success story from LivePerson.

Now, stay with me for a minute while I hype… What if you could write most of the common batch jobs as stream processing? What if you could provide your business users with data that is only seconds or tens of seconds behind the present? What if the answers to many common questions were always updated and ready? Could stream processing significantly de-emphasize or even replace batch processing on Hadoop in the future?

Now, that is of course mostly crazy talk today. In reality, stream processing may not be a good fit for many computation types, and it is rather early to fully understand where it fits. Also, the functionality, optimizations, integrations, methodologies and best practices around stream processing will likely take a few years to really mature.

Having said that, it seems to me that, at least in the database / ETL space, stream processing has great potential to evolve into a powerful alternative paradigm for designing systems that add analytics to operational systems.

But, instead of hyping anymore, I’ll ask – how crazy is that? How far are we from such capabilities? What I already see looks promising in theory. Here are a couple of examples:

  • Storm already has its own official abstraction layer to simplify writing complex computations. It is called Trident; it was contributed by Twitter and is part of Storm 0.8. It provides higher-level micro-batch primitives that allow you, for example, to build parallel aggregations and joins. Interestingly, when Twitter released it a year ago, they specifically described it as similar to Pig or Cascading, but for stream processing.
    It seems to me that declarative, SQL-like continuous computation is only one step away from that.
  • Even more fascinating is Twitter's Summingbird (open-sourced last month). It is a programmer-friendly abstraction of stream processing that can be executed either in real time (on Storm) or as a batch (using Scalding, translated to MapReduce, and in the future maybe to Tez). More interestingly, it was designed to allow a hybrid mode, transparently mixing historical data (batch) from Hadoop with last-minute updates from Storm.
    So, batch and streaming can co-exist and even transparently mix, even today (see the small sketch right after this list).
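Here is the promised sketch of that hybrid idea – plain Python, not Summingbird's API: answer a query by merging a precomputed batch view with the increments accumulated from events that arrived since the batch last ran.

```python
# Framework-free sketch of the hybrid batch + streaming idea (not Summingbird's API):
# a query merges the precomputed batch view with the recent streaming increments.
from collections import Counter

# Pretend this came from a nightly Hadoop batch job over historical data.
batch_view = Counter({"alice": 1_000_000, "bob": 750_000})

# Increments accumulated by the stream processor since the last batch run.
streaming_view = Counter()

def on_stream_event(user):
    streaming_view[user] += 1

def total_count(user):
    """Query-time merge: historical batch result + last-minute streaming updates."""
    return batch_view[user] + streaming_view[user]

for user in ["alice", "alice", "bob"]:
    on_stream_event(user)

print(total_count("alice"))  # 1_000_002
```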

Now, it of course remains to be seen how these technologies will evolve, but the current state of affairs already looks not too bad. With the adoption of stream processing by mainstream Hadoop distributions (surely Hortonworks won't be the last), usability, functionality and innovation will likely all accelerate. So, if you are starting to look into "next-gen" Big Data infrastructure and require low latency from event to result, it might be a good time to start looking into stream processing.

Update 30/10/13: I can't believe I didn't spend the time to read about the Lambda Architecture before. Nathan Marz (who led Twitter's Storm) came up with a coherent, end-to-end architecture for a data system that aims to make relational databases obsolete, with both a batch processing component and a stream processing component. Find some time to watch his fascinating presentation from Strange Loop 2012 – you won't regret it.

Oracle In-Memory Option: the good, the bad, the ugly

Last week, Oracle announced the new Oracle Database In-Memory Option. While there is already a great discussion at Rob's blog and further analysis at Curt's, I thought I could add a bit.

The Good

If Oracle does deliver what Larry promises, it will be Oracle's biggest advance in analytics since Exadata V2 in 2009, which introduced Hybrid Columnar Compression and Storage Indexes. I mean, especially for mixed workloads – getting 100x faster analytical queries while OLTP gets faster too… Quite a bold promise.

The Bad

I’ll be brief:

  • How much – in Oracle-speak, "Option" means extra-cost option. Neither the pricing model (per core / per GB) nor the actual price has been announced. Since this is Oracle, both will be decided by Larry a month before GA – so the TCO analysis will have to wait…
  • When – it seems this is a very early pre-announcement of pre-beta code. Since it missed 12c Release 1 (which came out this July), I assume it will have to wait for 12c Release 2, so it will likely arrive at the end of next year. Why? I would assume that a feature this intrusive is too much for a patchset (unless they are desperate).

Andy Mendelsohn says In-Memory option is pre-beta and will be released “some time next year.” #oow13

— Doug Henschen (@DHenschen) September 23, 2013

  • Why now – Oracle is obviously playing catch-up… Even if we put Hana aside, it is lagging behind both DB2 (BLU) and SQL Server (especially 2014 – mostly the updatable columnstore indexes, but also in-memory OLTP). Also, there might be other potential competitors rising in the analytics space (Impala, for starters?). So, this announcement is aimed at delaying customer attrition to faster platforms while Oracle scrambles to deliver something.

The Ugly

So, my DB will have 100x faster analytics and 2x faster OLTP? Just by flipping an option? Sounds much better (and cheaper) than buying Exadata… Or did Larry mean 100x faster than Exadata? Hard to tell.
For some use cases there will surely be cannibalization – for example, apps (EBS, Siebel etc.) with up to a few TBs of hot data (which is almost every enterprise deployment) should seriously reconsider Exadata – replace smart scan with in-memory scan and get flash from their storage array.

BTW – is this the reason why Oracle didn’t introduce a new Exadata model? Still thinking of how to squeeze in the RAM? That would be interesting to watch.

Update: Is Oracle suggesting In-Memory is 10x faster than Exadata? Check the pic:

 

Big Data products/projects types – from proprietary to industry standard

I recently read Merv's excellent post on proprietary vs open-source Hadoop – suggesting that the term distribution-specific is likely more appropriate. It reminded me that I wanted to write about the subject – I had a somewhat more detailed classification in mind.
I'm focusing on Hadoop, but this is also very relevant to the broader Big Data ecosystem, including the NoSQL/NewSQL solutions etc.

But first, why bother? Well, the classification typically hints at a few things decision makers care about regarding a piece of software. For enterprises, which typically rely on outside help, some open questions might be:

  • Skills – How can we get confidence in using this software (people, training etc)? Can we get help from our partners / consultants / someone new if needed? In our region / country?
  • Support  – Can we get commercial support? Global or local? Is there more than one vendor so we can choose / switch if the relationship goes bad?
  • Finally, how much of a "safe bet" is it? Will it be a niche, abandoned codebase or mainstream in a few years? Will it integrate well with the ecosystem over time (e.g. YARN-friendly) or not? And what are the odds that someone will still be around to support it in several years, when we still rely on it in production?

Of course, the state of a piece of software may change over time, but I still think each of the following categories has some immediate implications for these questions. For example, proprietary (closed-source) products should come with decent paid support, optional professional services and training, above-average documentation, and a decent chance of long-term support (though small startups are still a potential risk).

Having said that, here is the way I classify projects and products:

  • A proprietary offering of a specific Hadoop distribution – a closed-source offering bundled with a vendor's Hadoop distribution. Some examples are MapR (with its proprietary implementations of HDFS, MapReduce etc.), Pivotal (with HAWQ as the central attraction of its distribution) and also Cloudera (with Cloudera Manager, although proprietary installation and management tooling is almost universal).
    The proprietary bits are usually considered by the vendor to be a significant part of its secret sauce.
  • Proprietary offering on Hadoop – a closed-source offering that runs on Hadoop. Some are distribution-agnostic (a.k.a. Bring-Your-Own-Hadoop), for example Platfora, which runs on Cloudera, Hortonworks, MapR and Amazon. Some are certified on only a specific distribution (for now?), like Teradata-Hortonworks.
    Also, while some are Hadoop-only, many are extensions of an existing vendor offering to Hadoop – for example, all the ETL vendors, some analytics vendors, some security vendors etc. (the BI vendors typically just interact with Hadoop using SQL).
  • Open source projects, not bundled with a distribution – many great tools have been released as open source and have started building a community around them. While they are not yet bundled with an existing distribution, this doesn’t stop many from using them.
    Some of the projects complement existing functionality – like salesforce.com's Phoenix, a smart, thin SQL layer over HBase, or Netflix's Lipstick for great Pig workflow visualization. Other projects are alternatives to existing projects, like LinkedIn's (now Apache) Kafka for distributed messaging as a popular alternative to Apache Flume, or Apache Accumulo as a security-enhanced alternative to Apache HBase. Yet others are just new kinds of things on Hadoop, like Twitter's Storm for distributed, real-time stream computation or any of the distributed graph processing alternatives (Apache Giraph and others).
  • Open source projects, bundled with a single distribution – a few of these are existing open source projects that got picked up by a distribution, like LinkedIn's DataFu, which adds useful UDFs to Apache Pig and is now bundled with Cloudera. However, most are projects led by the distribution vendor, with varying communities around them – for example, Hortonworks Ambari (Apache Incubator), Cloudera Impala and Cloudera Hue, plus various security initiatives all over the place (Cloudera Sentry, Hortonworks Knox, Intel Rhino etc.).
    As an interesting anecdote of the market dynamics – it seems Cloudera Hue is now also bundled in the Hortonworks 2.0 beta under "third-party components", which might hint that it could evolve into the last category:
  • Industry-standard open source projects – these are projects you can expect to find in most leading distributions, meaning that several vendors provide support.
    While some are obvious (Pig, Hive etc.), some are new or just arriving. For example, HCatalog was created by Hortonworks but is now included in Cloudera and MapR. I would guess that merging it into Hive helped smooth things out (it is likely easier to deliver it than to rip it out…). It will be interesting to see whether Tez follows the same adoption path, as Hortonworks plans to eventually build support for it directly into Hive, Pig and Cascading. A similar fate will likely come to most of the security projects I mentioned, as well as ORC and Parquet support (ORC is already in Hive).

So there you have it. I believe knowing the type of project can help organizations gauge the risk involved (vs. the benefit) and make an informed decision about its use.

Oh, and on a side note – "bundled in a distribution" and "the distribution provides decent support, bug fixes and enhancements" are of course two different things… That's why the "brag about the number of your committers" game resonates with me, but that's a different story.