Big Data PaaS part 1 – Amazon Redshift

Last month I attended AWS Summit in Tel-Aviv. It was a great event, one of the largest local tech events I’ve seen. Anyway, a session on Redshift got me thinking.

When Amazon Redshift was announced (Nov 2012), I was working at Greenplum. I remember that while Redshift's pricing was impressive, I mostly dismissed it at the time due to some perceived gaps in functionality and its "limited beta" status. Back to last month's session: I came to see what's new, and was taken by surprise by the tone of the presentation.

As a database expert, I thought I would hear about the implementation details – the sharding mechanism, the join strategies, the various types of columnar encoding and compression, the pipelined data flow, the failover design etc… While a few of these were briefly mentioned, they were definitely not the focus. Instead, the main focus was the value of getting your data warehouse in a PaaS platform.

So what is the value? It is a combination of several things. Lower cost is of course a big one – it allows smaller organizations to have a decent data warehouse. But it is much more than cost – being a PaaS means you can immediately start working, without worrying about everything from a complex, time-consuming POC to all the infrastructure work needed for a successful production-grade deployment. Just provision a production cluster for a day or two, try it out on your data and queries, and if you like the performance (and the cost) – simply continue to use it. It allows a very safe play – or as they put it, a "low cost of failure". In addition, a PaaS environment can handle all the ugly infrastructure tasks around MPP databases or appliances that typically block adoption – backups, DR, connectivity etc.
Let me elaborate on these points:

I guess anyone who has done a DW POC can relate to the following painful process. You have to book the POC many weeks in advance, pick from several uneasy options (a remote POC, flying to the vendor's central site, or coordinating a delivery to your data center), pick in advance a small subset of data and queries to test, skip your native BI/ETL tools, rely on the vendor's experts to run and tune the POC, finish with a lot of open questions as you ran out of time, pick something, pay a lot of money, wait a couple of months for delivery, start migration / testing / re-tuning (or even DW design and implementation), and several months later you will likely realize how good (or bad) your choice really was… This is always a very high-risk maneuver, and the ability to provision a production environment, try it out and, if you like it, just keep it without a huge upfront commitment is a very refreshing concept.
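To make the "just provision it and try" point concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the region, cluster name, node type and credentials are all made-up placeholders, and the exact parameters you'd pick obviously depend on your data:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small trial cluster (all identifiers and credentials below
# are placeholders, not recommendations).
redshift.create_cluster(
    ClusterIdentifier="dw-trial",
    ClusterType="multi-node",
    NodeType="dc2.large",            # example node type
    NumberOfNodes=4,
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",
    DBName="analytics",
)

# Wait until the cluster is up, then point your existing BI/ETL tools at
# its endpoint and run your real queries against your real data.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="dw-trial")

# If the trial doesn't work out, tear it down and stop paying:
# redshift.delete_cluster(ClusterIdentifier="dw-trial",
#                         SkipFinalClusterSnapshot=True)
```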

Same goes for many "annoying" infrastructure bits. For example – backup, recovery and DR. Typically these will not be tested in a real-world configuration as part of a POC due to various constraints, and the real tradeoff between feasible options in cost, RPO and RTO will not be seriously evaluated – or in many cases not even considered. Having a working backup included with Redshift – meaning the backup functionality together with the underlying infrastructure and storage – is again huge. Another similar one is patching – it is nice that it's Amazon's problem.
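As a rough illustration of how little there is to build yourself, here is a boto3 sketch of taking an extra manual snapshot and "recovering" by restoring it into a fresh cluster – the cluster and snapshot names are placeholders:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Take a manual snapshot on top of the automated ones Redshift already keeps.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="dw-prod-before-big-change",   # placeholder name
    ClusterIdentifier="dw-prod",                      # placeholder name
)
redshift.get_waiter("snapshot_available").wait(
    SnapshotIdentifier="dw-prod-before-big-change",
)

# "Recovery" is just provisioning a new cluster from that snapshot.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="dw-restored",
    SnapshotIdentifier="dw-prod-before-big-change",
)
```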

Last of these is scaling. With an appliance, scaling out is painful. You order an (expensive) upgrade, wait a couple of months for it to be installed, then try to roll it out (which likely involves many hours of storage re-balancing), then pray it works. Obviously you can't scale down to reduce costs. With Redshift, they provision a second cluster (while your production cluster is switched to read-only mode) and switch you over to the new one "likely within an hour". Of course, with Redshift you can scale up or down on demand, or even turn off the cluster when not needed (but there are some pricing gotchas, as there is a strong pricing bias toward 3-year reserved instances).
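For what it's worth, that resize is also just an API call; a boto3 sketch, with a placeholder cluster name and target size:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Request a different node count (and/or node type). Redshift builds the
# target cluster behind the scenes, copies the data over while the source
# keeps serving read-only traffic, and then switches you to the new cluster.
redshift.modify_cluster(
    ClusterIdentifier="dw-prod",   # placeholder name
    NumberOfNodes=8,               # scale out from the current size
)
```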

If you're interested, you can find Guy Ernest's presentation here. During his talk most slides were hidden, as he and Intel had only 45 minutes – I see now that the full slide deck actually does have some slides full of implementation details 🙂

BTW – another thing I was curious about: can the Amazon team significantly evolve the Redshift code base given its external origin (ParAccel)? I assumed that would not be a trivial task. Well, I just read last week that AWS is releasing a big functionality upgrade for Redshift (plus some more bits), so I think that concern is also off the table.

It will be interesting to see whether Redshift now gains more traction, especially as it gets more integrated into the AWS offering and workflow.

The end of the classical MPP database era

Over the years, enterprises realized that their many isolated systems generate a vast amount of data. What if they could put all that massive data into one centralized platform, correlate it and analyze it? Surely that would uncover a wealth of relevant, hidden business insights. Of course, those were the eighties (actually earlier), so this new platform was called the Data Warehouse. And it was good. So, over the years, it became very popular, and by the end of the nineties nearly every large organization had one – or a dozen – of them.

However, with all the goodness, there were some challenges. Data warehousing required a lot of specialized skills and tools. For example, regular databases couldn't really support very large data warehouses, so beyond a certain scale it required specialized and expensive products – like those from Teradata. As a response to the quick adoption of the data warehouse, the "classical" MPP databases arrived on the market about a decade ago – first Netezza, and later the rest of the gang (Vertica, Greenplum etc). They were all relatively low cost, and built parallel processing on top of a shared-nothing cluster architecture. They were the one place where you could throw "petabytes of data" and analyze it over a large, mostly-commodity computing cluster.

So, what happened?

Simply, over time the market gained experience, and its requirements and expectations have evolved. Some examples of challenges not solved by the existing MPP databases:

  • The main challenge of a data warehouse is making sense of the data. The classical DW solution is to clean up and standardize the data as it is being loaded (ETL). This is required because the regular MPP database schema is rigid, and schema evolution is hard and painful (just like in all relational databases). That standard method is, however, complex and very time consuming, which makes it very hard for the DW to adapt to the constant and frequent changes in its source systems.
    Nowadays, a common requirement is to support dynamic schemas / schema-on-read, so that at the very least the frequent schema and data source changes won't block the ingestion of data (see the sketch after this list).
    This also supports the ongoing shift of power from DBAs to developers.
  • The classical MPP databases have relatively rigid HA and scalability – in real life, it is unlikely that you could add a couple of nodes every week or every month, or survive a "once an hour" node failure rate. In other words, MPP databases provided scalability but not elasticity, plus HA focused on surviving a single node failure.
    Today elasticity is a requirement – inspired by the "failure is a common, normal event" mentality of Hadoop and of relying on lower-end commodity servers.
    The next step for this mentality will be to also replace the DR concept with native multi-data-center active-active support as part of the core architecture, which some of the NoSQL and NewSQL players are advocating.
  • Speaking of rigidity, the existing players usually assumed isolation and uniformity – uniform physical nodes, a dedicated high-performance cluster interconnect and a relatively homogeneous workload.
    Nowadays, with cloud deployments or deployments on a shared Hadoop cluster (maybe virtualized in the future), those assumptions need to be revisited, and products need to take into account the never-ending fluctuation of CPU, I/O and network on shared resources.
  • Finally, MPP databases are proprietary, closed-source beasts. That doesn't fit well with the collaborative open-source ecosystem led by the major web giants, which is the main driver behind the current big data boom, innovation and fast pace. This represents a huge mentality shift – just think about why MapR doesn't own the Hadoop market even though it came with a superior offering early on.
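To illustrate the schema-on-read point from the first bullet: with a query engine over open file formats there is no upfront CREATE TABLE / ALTER TABLE cycle – the schema is derived from the files at read time. A minimal PySpark sketch (the HDFS path and column name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the schema is inferred from the JSON documents themselves,
# so a new field added by a source system simply shows up as a new column
# instead of breaking the load.
events = spark.read.json("hdfs:///raw/events/2014/")   # hypothetical path

events.printSchema()                      # schema derived from the data
events.createOrReplaceTempView("events")
spark.sql(
    "SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type"
).show()
```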

Speaking of mentality shift – one more thing to add. Most organizations that start implementing a solution increasingly ask for a single "big data" platform – one that must support complex SQL, but also online/OLTP lookups, batch processing of unstructured data, free-text search, graph analysis and whatever else comes up next (future-proofing). For that, the industry is no doubt standardizing on Hadoop as the unifying infrastructure – and will converge around a YARN + HDFS infrastructure. Now, no one wants to start their big data journey with two big data platforms, so even if it could sometimes have made sense, it will be very hard to promote such a solution at the current industry buzz level. The only exception might be SAP HANA, which built a unique offering by merging OLTP and DW into a single MPP system – skipping the ETL headache altogether (while of course also supporting integration with Hadoop).

So, what will happen?

Well, looking at my crystal ball, it seems that popularity will continue to quickly shift to modern MPP databases on top of Hadoop. They will have flexible schemas (which sucks, but it's the only way to keep up with schema changes and the proliferation of data sources). They will be significantly less monolithic and will leverage the Hadoop ecosystem instead – for example, HDFS for storage management, YARN for resource management, HCatalog for metadata management, maybe even support for various open on-disk file formats etc. So, they will mostly be trimmed down to parallel query optimization and runtime engines.

As for the old MPP players, they will sooner or later try to adapt to this world, but for most of them it is simply too late already. The only player that has committed to such a change is Greenplum (now: Pivotal), which started migrating its MPP database to HDFS (and renamed it HAWQ). Even for them, even though their first release is now out, handling the challenges mentioned above will not be easy. But it is still early in the game, and their lead (in my opinion) in query optimization over the rest of the "SQL on/in Hadoop" players may buy them enough time to evolve into a truly leading "next-gen" MPP-on-Hadoop player.