The end of the classical MPP databases era

Over the years, enterprises realized that their many isolated systems generate a vast amount of data. What if they could put all that massive data into one centralized platform, correlate it and analyze it? Surely that would uncover a wealth of relevant, hidden business insights. Of course, those were the eighties (actually earlier), so this new platform was called the Data Warehouse. And it was good. So, over the years, it became very popular and by the end of the nineties nearly every large organization had one or dozen of them.

However, with all the goodness, there were some challenges. Data warehousing required a lot of specialized skills and tools. For example, regular databases couldn’t really support very large data warehouses, so from a certain scale, it required specialized and expensive products – like those from Teradata. As a response to the quick adoption of the data warehouse, the “classical” MPP databases arrived to the market about a decade ago, first Netezza and later the rest of the gang (Vertica, Greenplum etc). They were all relatively low cost, and built parallel processing on top of shared-nothing cluster architecture. They were the one place were you could throw “petabytes of data” and analyze it over a large, mostly-commodity, computing cluster.

So, what happened?

Simply, over the time the market gained experience and its requirements and expectations have evolved. For example, some examples of challenges not solved by existing MPP databases include:

  • The main challenge of a data warehouse is making sense of the data. The classical DW solution is to clean up and standardize the data as it is being loaded (ETL). This is required as the regular MPP database schema is rigid, and schema evolution is hard and painful (just like all relational databases). That standard method is, however, complex and very time consuming, which makes it very hard for the DW to adapt to the constant and frequent changes in its source systems.
    Nowadays, a common requirement is to support dynamic schemas / schema-on-read, so at the very least the frequent schema and data sources changes wont block the ingestion of data.
    This also supports the on-going shift of power from DBAs to the developers.
  • The classical MPP databases have relatively rigid HA and scalability – unlikely that you could add a couple of nodes every week or every month in real life or survive a “once an hour” node failure rate etc. In other words, MPP databases provided scalability but not elasticity, plus HA that is focused on surviving a single node failure.
    Today elasticity is a requirement – inspired by “failure is a common, normal event” mentality of Hadoop and of relying on lower-end commodity servers.
    The next step for this mentality will be to also replace the DR concept with a native multi-data-center active-active support as part of the core architecture, which some of the NOSQL and NewSQL players are advocating.
  • If talking about rigidity, the existing players usually had an assumption of isolation and uniformity – uniform physical nodes, dedicated high performance cluster interconnect and relatively homogenous workload.
    Nowadays with cloud deployments or deployments on a shared Hadoop cluster (maybe virtualized in the future), those assumptions needs to be revisited and products needs to take into account never-ending fluctuation of CPU, I/O and network on shared resources.
  • Finally, MPP databases are propriety, closed-source beasts. That doesn’t fit well with the collaborative open-source ecosystem led by the major web giants which are the main driver behind the existing big data boom, innovation and fast pace. This represents a huge mentality shift – just think why MapR doesn’t own the Hadoop market even though it came with a superior offering early on.

Speaking of mentality shift – one more thing to add .Most organizations that will start implementing a solution increasingly ask for a single “big data” platform – that must support complex SQL but also online/OLTP lookups, batches processing unstructured data, free-text searches, graph analysis and whatever else comes up next (future-proofing). For that, the industry is no doubt standardizing on Hadoop as the unifying infrastructure – and will converge around YARN + HDFS infrastructure. Now, no one wants start its big data journey with two big data platforms, so even if sometimes it could have made sense, it will be very hard to promote such solution at the current industry buzz level. The only exception might be SAP Hana, which built a unique offering of merging OLTP and DW into a single MPP system – skipping the ETL headache altogether (while also of course supports integration with Hadoop).

So, what will happen?

Well, looking at my crystal ball, it seems that popularity will continue to quickly shift to modern MPP databases on top of Hadoop.They will have flexible schema (which sucks, but it’s the only way to keep up with the schema changes and data sources proliferation). They will be significantly less monolithic and will leverage the Hadoop ecosystem instead – for example HDFS for storage management, YARN for resource management, HCatalog for metadata management, maybe even support for various open on-disk file formats etc. So, they will be mostly trimmed down to parallel query optimization and runtime engines.

As for the old MPP players, they would sooner or later try to adapt to this world, but for most of them it is simply too late already. The only player who have committed to such change is Greenplum (now: Pivotal) which started migrating its MPP database to HDFS (and renamed it to HAWQ). Even for them, even though their first release is now out, handling the challenges mentioned will not be easy. But it is still early in the game, and their lead (in my opinion) in query optimization over the rest of the “SQL on/in Hadoop” players may buy them enough time to evolve into a truly leading “next-gen” MPP-on-Hadoop player.

19 thoughts on “The end of the classical MPP databases era

  1. Ofir good article .The big question is how can you run reports against hadoop and get results quickly ? If it could be done.There will be no need for analytic DB.

  2. Hi Alon,
    thanks for the comment!
    I fully agree – the whole purpose of having an MPP database / query engine on top of Hadoop is to support analytical queries – large scans, joins and aggregations.
    Impala, Hive (+TEZ) and friends ARE already basic MPP SQL engines on top of Hadoop, and HAWQ (amongst others) is joining the crowd to compete with them.
    My point is that it makes no sense to have two separate “big data” clusters – Hadoop and Analytic DB, if the “MPP on Hadoop” solutions would be fast enough. And they are becoming fast enough pretty fast…
    I hope it is clearer

  3. Hi Ofirm, congratulations on your blog – excellent quality. I’ve just simple question that is bothering me. It’s related to the phrase you wrote “The next step for this mentality will be to also replace the DR concept with a native multi-data-center active-active support as part of the core architecture, which some of the NOSQL and NewSQL players are advocating.”

    I constantly see people mentioning “active-active datacenters/systems” concept, but when i start asking more advanced questions they unusually end-up in mentioning serving static (!) content from both DCs, segregating/partitioning loads. There is a lot to consider.

    Have you actually seen/heard of ANY system(NewSQL/NoSQLsolution?) doing active-active work between multiple datacenters at 1000 trx/s at than 200+ km/15+ ms? I’m not talking about laptop database experiments but mission-critical systems like that. NuoDB mentions Geo A/A but when you want to download it is marked as “beta”, Cassandra mentions it too. I’ve heard about possibility doing it via MySQL circular slave replication, Oracle Streams A/A config or doing in Paypal way (via NDB MySQL Cluster)… have you seen at least one live, middle sized system? Or is just talking another IT fashion that everybody is talking about?

    1. Hi Jakub,
      sorry for the delay – the new gmail tabs hid your comment notification…
      That is a very good comment!
      I strongly believe that Active-Active has to be baked into database design from early days not at all… Trying to bake it into MySQL or Oracle (streams or externally with GoldenGate etc) will lead to a lot of pain, complexity and unlikely to success.
      But, there are more and more new databases with Active-Active baked in.
      The best example is Google Spanner (and Google F1 on top of it), which is an Active-Active database deployed across five sites. It mainly runs Google AdWords – not a laptop POC 🙂 We all know, once a tech is proven at google, it will be duplicated as open source within a few years…
      As for open-source example, check out CouchBase:
      Another alternative is using a cloud service that replicates data on multiple data centers and handles DC failure for you, like Amazon SimpleDB and Amaon DynamoDB…
      Having said that, usually active-active is between important to nice to have… as most organizations have bigger issues if a DC fails. However, if you build your company infrastructure in the cloud, it could be important – both for HA and lower latency (serving customers across the world).

      1. Hi and thanks for answer. I’ve read the link you provided about this XDCR but I’m puzzled… how do they handle fundamental replication conflicts that certainly will happen. To be honest implementation of the replication bit doesn’t matter to me (it can be as it is in Streams), but at least Streams provides handlers for resolving conflicts… (e.g. often rewrites data based on timestamp -> data lost)…

        My take on A-A between DCs is that it is just another buzz. Just like distributed computing currently and scale-out ideas are… considering that we’ll be able to put rock solid HA clusters on RAS boxes with nearly 100% uptime with some AIX on Power8 and/or Solaris on M6 CPUs … what’s the point of playing the distributed game? What’s the benefit? because cost is getting huge (in maintaining such distributed systems) — I know it looks cool, but what is the business benefit in typical IT system? (in 99% cases we are not google).

  4. Well, regarding Couchbase – you can continue reading the docs, conflict resolution is section 5.8.9
    But reading your comment, I think you are missing the context. All this “big data” buzz is around processing significantly more data at significantly less cost, mostly in context outside the classical enterprise OLTP systems.
    If you have standard enterprise workload and your requirements can be met with standard technology at reasonable costs – by all mean, the existing technologies are the best fit (although there is no reason in my opinion to ever go back to the historical AIX or Solaris, linux already won more than a decade ago with its price/performance).
    But if your processing requires many hundreds of cores, many TBs of data, and ability to dramatically scale further easily and cost-efficiently, not sure how a big Unix box helps.
    Now, not everyone is Google (it was just the largest, most-mission critical example of multi-DC active-active I came up with), but for example, most web startups aims to reach tens of millions of users in a few years, and scale-out on commodity HW and open source SW is the only way that proved itself again and again.
    Active-Active is not a critical requirement, but if you start from scratch, it is a great thing to have if you build a massive, internet facing service.

    1. OK my fault regarding not finding 5.8.9 section and thanks for pointing it out. But the conflict resolution is on metadata only, so technically speaking, even there is no way of resolving it according to buisness rules (e.g. in Streams you can develop your own handlers and anything you can with PLSQL, even with regexp() on data if needed).

      Most (99%) of companies are not anything even close to Google, but still there is buzz regarding active-active. I’m asking why do you need active-active? People are talking active-active as a “business requirement”, while in my opinion BR can be only specify SLA (e.g 99.99%or higher)… so active-actve is really a service/technology requirement coming out of business (sic!). So really active-active DC problem is nothing more than HA problem… and there are various way to solve it, of which active-active write-anywhere-DB seems to be most problematic.. and actually there is much more do you can do a) in your own single DC to get those SLAs b) IT processes/flows … that active-active gets you.

      The only other reason (BR) for active-active seems to be desire to have low latency to/from user, but it does not always mean that you have do full active-active, wont you agree with me? I do not have accept writes in 2nd location, just provide read only view of that data… + perform quick failover (< 600s?) in case of stuff going wrong…. and everyone's life is much simpler, isn't it?

      And finally, Linux sucks hard when it comes to RAS compared to even low-end AIX/Solaris boxes. Don't get me wrong, but years ago there was HA-by-hardware-mentality, now we have HA-by-software-active/active-mentaility… like nobody would realize that HA is just much more than technology/single idea…

      1. I’ll just touch the main points..
        Regarding active-active… Again, I personally don’t see a big buzz or interest. But still, if you build a new company around a web application that aims to have millions of users, why not? You take a proven, open-source, scalable SW like Cassandra that is built with multiple data center support, deploy your 30 nodes across three different geographical regions of your cloud provider, and “you are done” 🙂
        The app itself will not be out of sync, as you design it to fit the functionality of your database, just like you would do with Streams (to handle its limitations). Also, comparing it to Streams is just plain wrong. With Oracle Streams, you have multiple independent databases that are stitched together by replication. In cassandra for example, there is a single distributed database, that you control its consistency (even per transaction) – including if to wait for quorum in the local DC or globally. Anyway, Oracle Streams “Active-Active” implies Enterprise Edition (nearly 25K$ price list per CORE), and at 30 nodes it will probably be the most absurdly expensive option vs. any alternative.
        Also, you remarked about a quick failover of 600s. Why not pick an architecture where you don’t have to pay that price?
        And please, let’s not start a “linux sucks” discussion. First, it doesn’t. Second, it already killed the Unix market years ago (check market share), so it is overall better. Most of issues you see are low quality HW that some people pick to save extra money. Anyway, failed node is a solved SW issue in modern distributed databases and platform, so even if there was a general massive RAS issue with linux servers, it has no effect on database uptime.
        Finally, here is an example smaller than Google, that fits the spirit of what I said:—different-replication-properties-for-each

        1. OK i’ll touch main points here… you wrote:
          1) “You take a proven, open-source, scalable SW like Cassandra that is built with multiple data center support” — please define “proven”. For me proven is MySQL and PgSQL.. but you are seriously not serious about defining NoSQL as proven to me. Would you use Cassandra in strict OLTP-like env? I would love observing a big NoSQL implementation with all those hassles (single writer locks in MongoDB, sections in MongoDB that state “backups in distributed envs are challenging and might be impossible… just add next replicasets” – what where they smoking? etc, etc). Have you tried to perfom online indexing without impact on rest of stuff, in one of those “proven” DBs? Have you tried to fix 1% of corrupted data without impact to rest of data while maintaining uptime? Have you tried to do a group + sort JSON query or matched ORM-generated data to JSON query? What’s the level of performance instrumentailization of those proven DBs?.. ah i see i need to run it on Solaris to get Dtrace to diagnose problems…

          2) “And please, let’s not start a “linux sucks” discussion. First, it doesn’t. Second, it already killed the Unix market years ago (check market share), so it is overall better. ” — it didn’t. Just like UNIX is immature compared to Mainframe RAS features (and Mainframes are still doing fine in real mission critical envs; the only game changer for me personally is the IBM bringing some features from Mainframes to AIX@POWER). Same for Linux: Linux on commodity is so immature compared to UNIX RAS that it is even hard to describe. Those features are missing, mainly because lack of integration with HW / Vendors.. Linux vs UNIX RAS it almost like apple VS oranges, agreed that most companies didn’t need that uptime…

          3) “Anyway, failed node is a solved SW issue in modern distributed databases and platform, so even if there was a general massive RAS issue with linux servers, it has no effect on database uptime.” …
          Nice theory but still theory:) But you are talking to practitioner (production DBA), so i can actually observe flavors of types of downtime (MySQL, PostgreSQL, Oracle, Oracle RAC, MongoDB)…do you realize that even in RAC, downtime of node -> 1/N loss of availability ??? Be it 50s + time remaster blocks – but still. Too much cool-aid about active/active solutions and distributed systems, and this is my point, people who create buzz about active-active, never ever tried to implement 1000+trx/s over > 150km/15ms+ latency, .. and the truth is that 99% people don’t need need it..

          4) Currently 90% of blogs are about NoSQL/alternative DB solutions, but Google for how many instances EBay used in 2010/2011.. from memory 2000+ Oracle instances, 1000+ MySQL instances, less than 100 Cassandra instances

  5. Hi Jakub,
    thanks for coming back!
    1. I don’t know what is “proven” for you, but hundreds of companies have deployed mission-critical production web-facing apps on Cassandra (BTW – Cassandra, like most NoSQL DBs, is focused on OLTP workloads). So, some percentage of the IT world must perceive Cassandra as proven. This doesn’t imply it has technical feature parity with Oracle or something else, just that it is being widely used in production, successfully… See for example:
    2. Again, I’m not trying to convince you, I’m just saying that MF or Unix are nowadays niche technologies used mostly by the legacy install base. I don’t care if Linux had better / equivalent / lower RAS than Unix – my point is that the market have moved long time ago to Linux – it had way better TCO and was good enough.
    (personally, my experience with x86 servers from Cisco, HP, IBM etc is positive, with occasional, rare local disk failure…)
    3. I happen to have spent many years with Oracle RAC, helping many customers when I worked at Oracle. The truth is that 99% of Enterprise customers don’t need Oracle RAC – they could easily scale up nowadays and also use virtualization for fast HA. But this is not the point…
    Bottom line is that the web (facebook, twitter, google, yagoo, bing, amazon, etc etc etc) don’t run on Oracle, but on various distributed systems, and seems to have enough scalability and availability. You don’t have to learn about it, but it is real.
    4. ebay (and salesforce) are examples of web companies with Oracle legacy systems (both started using Oracle more than a decade ago). I never said you can’t succeed with Oracle – it is just very rare in the web world.

  6. What would you do if there is the need to acquire a faster solution within 6 months to replace a standard RDBMS for analytics that today just is not fast enought four our worload? After several POCs we are stuck with classical MPP options due these 3 factors: a) vertical solution, avoiding the battle in the IT team on HW resource allocation; b) Good SQL support, our Hadoop experience was really bad on this field; c) Need of a single vendor for the whole stack of technology, avoiding risks and the extra time on problem determination and diagnosis…

    1. I would say reasons A+C are a mix of political (in-fighting) and cultural (risk-aversive), not technical. I’m sure you could get end-to-end Hadoop support from Cloudera/Hortonworks/MapR or even IBM/Oracle. Regarding in-fighting – if you lack strong IT leadership,yes, it will be very hard. Regarding B – can’t argue, only mention that the rate of improvements is amazing, even though it doesn’t fit your time constraints. I would guess though that you are looking at MPP not as the ultimate organizational Big Data platform, but as a niche DW solution, right?
      Anyway, if you have an existing RDBMS and you look for a drop-in replacement, you are in a tough place with or without Hadoop. Migration between DB platforms is very hard and painful (including BI, ETL, integrations, backup/recovery, monitoring etc), will likely kill all cost saving.
      Likely you can’t currently scale within a reasonable cost – typically it is about cost of scaling, not the feasibility of scaling. For example, with Oracle it is always feasible to try to throw Exadata racks on the problem….
      Will the alternative MPP option provide you with scalability with reasonable cost? If you’ll have to scale 4x in a couple of years, could you afford it? Can you afford a real-world test/pre-prod environment? some medium-sized dev environments? A DR environment? This is part of the Hadoop attractiveness.
      Also, you are likely not a classical “big data” scenario – you have some incremental scalability challenges in your DW, you are not looking for a revolution. If you grow within Moore’s law, I would also consider sticking with your current technology and upgrading the infrastructure more frequently (ex: new HW, latest fastest CPU, tons of RAM, tons of flash, latest DB version).

      1. Dear Ofir, just to go a little deeper here. Our solution today is a classic RDBMS customized and modelled to act as a classic DW reaching 25 TB compressed (2.5 compression rate). Many of the queries underperform mostly due I/O bottleneck from the corporate storage, user concurrency, and excessive use of temporary space – fault on the query optimizer in my opinion. Scaling will be mostly 2x in the next 4 years, but we are planning to acquire something with extra room by sure. Assynchronous DR is feasible for us, 2-4 hour downtime is acceptable.

        We did test it against column store, SW-only counterparts (same queries, same data, same HW/SW) and it got beaten, but a vertical solution really appealed to us (read classical MPP appliance). We are using a technology stack with HW an SW provided by a single vendor – excepting storage – have on site support from this vendor, and still struggling with issues on very specific details regarding I/O, memory management, buffer areas used by the operating system and so for. We do not want to face the same issues with some inhouse bundled solution like it is today anymore.

        By the time there are more full support (we are not in the US), vertical, hadoop-based solutions (like Pivotal HD) I think we will feel more confortable jumping on the bandwagon, but now It is not even in our dreams to acquire racks, blades and storage to assemble something and install some Hadoop based distribuition on it.

        My guess is that the ecosystem will change a lot in the next 3 or 4 years, classical MPP DW appliance prices will go down – a lot due Hadoop penetration – and we will surely negociate better terms on a future contract – and even consider going for a more mature, enterprise-ready Hadoop solution. But for our short term need, despite all the fuzz about it, I would go for a conservative approach.

        1. Great comment!
          I don’t have a simple solution for you – Hadoop is better suited for new projects, not as a cheaper, “drop in replacement”.
          When considering an MPP appliance, did you evaluate migration costs? In many cases, there are thousands of ETL maps and scripts to be tested and updated, many clients to check, tons of SQL to modify (different SQL dialect, hints, behaviors etc) and outright different / missing functionality… Very painful. Same for the infrastructure – how to monitor, tune, backup etc.
          On the other hand, you mentioned that Hadoop is “build your own” – you can actually buy a pre-built, optimized, tested Hadoop cluster from your favorite vendors (IBM/Oracle/EMC/HP/Cisco etc) or use their blueprints.
          As a large DW that grows slowly, the alternative is to directly tackle your existing bottlenecks for major performance boost, for example:
          – I/O bottleneck from the corporate storage – buy a dedicated enterprise storage with decent amount of flash (TBs or more). Add enough connectivity for high bandwidth (8/16GB FC cards or 10Gb/s NICs) – there are design patterns to reach nGB/s. (For Oracle, Exadata provides similar pain relief).
          – Temp disk usage – your next server should have 1-4TB of RAM…
          – User concurrency: add more capacity (cores). Also, as you solved or improved the I/O performance bottleneck, database concurrency goes down:

          So, in your case, you could likely significantly improve your DW performance with faster infrastructure, by focusing on your bottlenecks. And dedicated storage for your DW is easier to sell to the infra team than an isolated Hadoop or MPP environment 🙂
          Just my two cents…

          1. Great feedback. Like you said, migration costs will be a issue, that will affect both ways, classic MPP or Hadoop-based. We are researching deep into SQL compatibility and application interoperability to at least minimize this effort and get the solution that will have the smoother transition. Dealing with current bottlenecks may be an option, but our management is anger to put effort and money in some definitive solution and not some workaround that may work or may expose another bottleneck. As far I can tell about big vendors supplying Haddop, it may be a reality in the US, but in my country even EMC was afraid to show us a quote for Pivotal HD, saying that it was “cheaper but slower” than Greenplum DB (a way of not discussing the possitilbity for real). The real issue is that these vendors are afraid of loosing sales in their classic MPP areas AND there is lack of skill set from this vendors to provide support for Haddop-based solutions. Right now if we acquire something like that we would work as a test-tube for their local tech people. Sad.

          2. From my experience, there is a dramatic migration cost even from SQL Server to Oracle or Oracle to Greenplum. I find it unlikely that it will be easier to migrate to other MPP players or Hadoop.
            You can explain your management that every solution will have bottlenecks… If you can do a relatively simple infra upgrade and get several years of good performance (ex: 10x perf in day 1) – what else can they ask for? Hadoop for example may run into various bottleneck – data center space, or power limits, or operational cost (power and cooling and floor space of several full racks) etc.
            Anyway, you are right that there is a challenge of getting experienced (trusted) people to implement new technologies, big part of the challenge / risk. This is way it is not yet ready for all to adopt – still for early adopters.

  7. Proper data Analytics with tolerable Latency any open stack or Hadoop stack need to go a long way .
    Talking about structured data mainly .

    If we consider Hadoop stack HDFS,HIve ,spark,Kafka etc. i believe they have different purpose than Traditional MPP systems . They both augment each other and data ingestion is bidirectional.

    Now Impala like systems are MPP in traditional sense can compete with traditional MPP and can be used for new developments but there are many ifs and buts .

    Spark RDD is also hyped. what spark RDD can do one can do with GTT also in MPP system
    Cost wise also any so called open source systems are not cheap if one considers the hardware cost ,development,maintenance etc.
    As per as Dynamic schema is concerned it is over hyped and for dimensional analysis one needs a fixed schema . Key value table does not give all information
    To circumvent the inflexibility of the data model which can only answer some set of questions u create another table with same size, u create materialized views , create secondary indexes on fields as well as Collection .
    Even MongoDB,Cassandra etc. is now moving towards more traditional design .
    All the solutions are hiding behind in memory option . , hardware ,redundancy and functional inflexibility

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s