Oracle In-Memory Option: the good, the bad, the ugly

Last week, Oracle announced the new Oracle Database In-Memory Option. While there is already a great discussion at Rob’s blog and further analysis at Curt’s, I thought I could add a bit.

The Good

If Oracle does deliver what Larry promises, it will be Oracle’s biggest advance for analytics since Exadata V2 in 2009, which introduced Hybrid Columnar Compression and Storage Indexes. I mean, especially for mixed workloads – getting 100x faster analytical queries while OLTP goes faster… Quite a bold promise.

The Bad

I’ll be brief:

  • How much – in Oraclespeak, Option means extra-cost option. Neither the pricing model (per core / per GB) nor the actual price has been announced. Since this is Oracle, both of them will be decided by Larry a month before GA – so the TCO analysis will have to wait…
  • When – it seems this is a very early pre-announcement of pre-beta code. Since it missed 12c release 1 (which came out this July), I assume it will have to wait for 12c release 2, so it will likely be the end of next year. Why? I would assume that a feature so intrusive is too much for a patchset (unless they are desperate).

Andy Mendelsohn says In-Memory option is pre-beta and will be released “some time next year.” #oow13

— Doug Henschen (@DHenschen) September 23, 2013

  • Why now – Oracle is obviously playing catch-up… Even if we put Hana aside, it is lagging behind both DB2 (BLU) and SQL Server (especially 2014 – mostly updatable column store indexes, also in-memory OLTP). Also, there might be other potential competitors rising in the analytics space (Impala for starters?). So, this announcement is aimed at delaying customer attrition to faster platforms while Oracle scrambles to deliver something.

The Ugly

So, my DB will have 100x faster analytics and 2x faster OLTP? Just by flipping an option? Sounds much better (and cheaper) than buying Exadata… Or did Larry mean 100x faster than Exadata? Hard to tell.
For some use cases, there will be cannibalization, for sure – for example, apps (EBS, Siebel etc) with up to a few TBs of hot data (which is almost every enterprise deployment) should seriously reconsider Exadata – replace smart scan with in-memory scan and get flash from their storage.

BTW – is this the reason why Oracle didn’t introduce a new Exadata model? Still thinking of how to squeeze in the RAM? That would be interesting to watch.

Update: Is Oracle suggesting In-Memory is 10x faster than Exadata? Check the pic:


Big Data products/projects types – from proprietary to industry standard

I recently read Merv’s excellent post on proprietary vs open-source Hadoop – suggesting that the term distribution-specific is likely more appropriate. It reminded me that I wanted to write about the subject – I had a somewhat more detailed classification in mind.
I’m focusing on Hadoop but this is also very relevant to the broader Big Data ecosystem, including the NoSQL/NewSQL solutions etc.

But first, why bother? Well, the classification typically hints at some things that matter to decision makers regarding that piece of software. For enterprises, which typically rely on outside help, some open questions might be:

  • Skills – How can we get confidence in using this software (people, training etc)? Can we get help from our partners / consultants / someone new if needed? In our region / country?
  • Support – Can we get commercial support? Global or local? Is there more than one vendor so we can choose / switch if the relationship goes bad?
  • Finally, how much of a “safe bet” is it? In a few years, will it be niche, abandoned code, or mainstream? Will it integrate well with the ecosystem over time (ex: YARN-friendly) or not? And what are the odds that there will be someone around supporting it in several years when we still rely on it in production?

Of course, the state of a piece of software may change over time, but still I think each of the following categories has some immediate implications regarding these questions. For example, proprietary (closed-source) projects should come with decent paid support, optional professional services and training, above-average documentation, and a decent chance of long-term support (but small startups are still a potential risk).

Having said that, here is the way I classify projects and products:

  • A proprietary offering of a specific Hadoop distribution – a closed-source offering bundled with a vendor’s Hadoop distribution. Some examples are MapR (with a proprietary implementation of HDFS, MapReduce etc), Pivotal (with HAWQ as the central attraction of their distribution) and also Cloudera (with Cloudera Manager, although proprietary installation and management is almost universal).
    The proprietary bits are usually considered by the vendor as a significant part of their secret sauce.
  • Proprietary offering on Hadoop – a closed-source offering that runs on Hadoop. Some are distribution-agnostic (A.K.A Bring-Your-Own-Hadoop), for example Platfora, which runs on Cloudera, Hortonworks, MapR and Amazon. Some are certified on only a specific distribution (for now?) like Teradata-Hortonworks.
    Also, while some are Hadoop-only, many are an extension of an existing vendor offering to Hadoop, for example – all ETL vendors, some analytics vendors, some security vendors etc (the BI vendors typically just interact with Hadoop using SQL).
  • Open source projects, not bundled with a distribution – many great tools have been released as open source and have started building a community around them. While they are not yet bundled with an existing distribution, this doesn’t stop many from using them.
    Some of the projects complement existing functionality – like Phoenix for a smart, thin SQL layer over HBase or Netflix Lipstick for great Pig workflow visualization. Other projects are alternatives to existing projects, like Linkedin (now Apache) Kafka for distributed messaging as a popular alternative to Apache Flume or like Apache Accumulo as a security-enhanced alternative to Apache HBase. Yet others are just new types of stuff on Hadoop, like Twitter’s Storm for distributed, real-time, stream computations or any of the distributed graph processing alternatives (Apache Giraph and others).
  • Open source projects, bundled with a single distribution – a few of them are existing open source projects that got picked by a distribution, like Linkedin DataFu that adds useful UDFs to Apache Pig and is now bundled with Cloudera. However, most of them are projects led by the distribution, with varying community around them, for example Hortonworks Ambari (Apache Incubator), Cloudera Impala and Cloudera Hue, plus various security initiatives all over the place (Cloudera Sentry, Hortonworks Knox, Intel Rhino etc).
    As an interesting anecdote of the market dynamics – it seems Cloudera Hue is now also bundled in Hortonworks 2.0 beta under “third-party components”, which might hint it could evolve into the last category:
  • Industry-standard open source projects – these are projects that you can expect to find on most leading distributions, meaning, several vendors provide support.
    While some are obvious (Pig, Hive etc), some are new or just arriving. For example, HCatalog was created by Hortonworks but is now included in Cloudera and MapR. I would guess that merging it into Hive helped smooth things out (likely it is easier to deliver it than rip it out…). It would be interesting to see if Tez will follow the same adoption path, as Hortonworks planned to eventually build support directly in Hive, Pig and Cascading. A similar fate would likely come to most of the security projects I mentioned, as well as ORC and Parquet support (ORC is already in Hive).
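
For readers who prefer code to bullets, here is a rough sketch of the five categories above as a small decision function. It is only an illustration under stated assumptions: the names Project, Category, classify and standard_threshold are made up for this example, the bundle counts in the sample data are illustrative rather than measured, and the cut-off for "most leading distributions" (three here) is my own simplification. A project bundled with more than one distribution but below the threshold is treated as still sitting in the fourth category, on its way to the fifth.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PROPRIETARY_IN_DISTRIBUTION = 1
    PROPRIETARY_ON_HADOOP = 2
    OPEN_SOURCE_UNBUNDLED = 3
    OPEN_SOURCE_SINGLE_DISTRIBUTION = 4
    INDUSTRY_STANDARD_OPEN_SOURCE = 5


@dataclass
class Project:
    name: str
    open_source: bool
    bundled_in: int  # how many distributions ship it (illustrative numbers)


def classify(project, standard_threshold=3):
    """Map a project to one of the five categories.

    standard_threshold is an assumed cut-off for "found on most leading
    distributions"; it is not a figure taken from any vendor.
    """
    if not project.open_source:
        # Closed source: either part of a distribution or running on top of it.
        if project.bundled_in > 0:
            return Category.PROPRIETARY_IN_DISTRIBUTION
        return Category.PROPRIETARY_ON_HADOOP
    if project.bundled_in == 0:
        return Category.OPEN_SOURCE_UNBUNDLED
    if project.bundled_in >= standard_threshold:
        return Category.INDUSTRY_STANDARD_OPEN_SOURCE
    # One distribution, or a few but not yet "most" of them.
    return Category.OPEN_SOURCE_SINGLE_DISTRIBUTION


# Sample data based on the examples in the list above.
examples = [
    Project("Cloudera Manager", open_source=False, bundled_in=1),
    Project("Platfora", open_source=False, bundled_in=0),
    Project("Storm", open_source=True, bundled_in=0),
    Project("Impala", open_source=True, bundled_in=1),
    Project("Hue", open_source=True, bundled_in=2),
    Project("HCatalog", open_source=True, bundled_in=3),
]

for p in examples:
    print(f"{p.name}: {classify(p).name}")
```

The threshold is a parameter on purpose: what counts as "most leading distributions" is a judgment call, and it changes as the market does.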

So there you have it. I believe knowing the type of project can help an organization gauge the risk involved (vs. the benefit) to make an informed decision about its use.

Oh, and on a side note – “bundled in a distribution” and “distribution provides decent support, bug fixes and enhancements” are of course two different things… That’s why the “brag about the number of your committers” argument resonates well with me, but that’s a different story.