I recently read Merv’s excellent post on proprietary vs open-source Hadoop – suggesting that use the term distribution-specific is likely more appropriate. It reminded that I wanted to write about the subject – I had a somewhat more detailed classification in mind.
I’m focusing on Hadoop but this also is very relevant to the broader Big Data ecosystem, including the NoSQL/NewSQL solutions etc.
But first, why bother? Well, the classification typically hints some things to decision makers regarding that piece of software. For enterprises, which typically relies on outside help, some open questions might be:
- Skills – How can we get confidence in using this software (people, training etc)? Can we get help from our partners / consultants / someone new if needed? In our region / country?
- Support – Can we get commercial support? Global or local? Is there more than one vendor so we can choose / switch if the relationship goes bad?
- Finally, how much of a “safe bet” is it? Will it be in a few years a niche, abandoned code or mainstream? Will it integrate well with the ecosystem over time (ex: YARN-friendly) or not? and what are the odds that there will be someone around supporting it in several years when we still rely on it in production?
Of course, the state of piece of software may change over time, but still I think each of the following categories has some immediate implications regarding these questions. For example, proprietary (closed-source) projects should come with decent paid support, optional professional services and training, average plus documentation, and decent chance of long term support (but small startups are still a potential risk).
Having said that, here is the way I classify projects and products:
- A proprietary offering of a specific Hadoop distribution – a closed-source offering bundled with a vendor’s Hadoop distribution. Some examples are MapR (with proprietary implementation of HDFS, MapReduce etc), Pivotal (with HAWQ as the central attraction of their distribution) and also Cloudera (with Cloudera Manager, although proprietary installation and management is almost universal).
The propriety bits are usually considered by the vendor as a significant part of their secret sauce.
- Proprietary offering on Hadoop – a closed-source offering that runs on Hadoop. Some are distribution-agnostic (A.K.A Bring-Your-Own-Hadoop), for example Platfora, which runs on Cloudera, Hortonworks, MapR and Amazon. Some are certfiied on only a specific distribution (for now?) like Teradata-Hortonworks.
Also, while some are Hadoop-only, many are an extension of an existing vendor offering to Hadoop, for example – all ETL vendors, some analytics vendors, some security vendors etc (the BI vendors typically just interact with Hadoop using SQL).
- Open source projects, not bundled with a distribution – many great tools have been released as open source and have started building a community around them. While they are not yet bundled with an existing distribution, this doesn’t stop many from using them.
Some of the projects complement existing functionality – like salesforce.com Phoenix for smart, thin SQL layer over HBase or Netflix Lipstick for great Pig workflow visualization. Other projects are alternatives to existing projects, like Linkedin (now Apache) Kafka for distributed messaging as a popular alternative for Apache Flume or like Apache Accumulo as a security-enhanced alternative to Apache HBase. Yet others are just new types of stuff on Hadoop, like Twitter’s Storm for distributed, real-time, stream computations or any of the distributed graph processing alternatives (Apache Giraph and others).
- Open source projects, bundled with a single distribution – a few of them are existing open source projects that got picked by a distribution, like Linkedin DataFu that adds useful UDFs to Apache Pig and is now bundled with Cloudera. However, most of them are projects led by the distribution, with varying community around them, for example Hortonworks Ambari (Apache Incubator), Cloudera Impala and Cloudera Hue, plus various security initiatives all over the place (Cloudera Sentry, Hortonworks Knox, Intel Rhino etc).
As an interesting anecdote of the market dynamics – it seems Cloudera Hue is now also bundled in Hortonworks 2.0 beta under “third-party components”, which might hint it could evolve into last category:
- Industry-standard open source projects – these are projects that you can expect to find on most leading distributions, meaning, several vendors provide support.
While some are obvious (Pig,Hive etc), some are new or just arriving. For example, HCatalog was created by Hortonworks but is now included in Cloudera and MapR. I would guess that merging it into Hive helped smooth things out (likely it is easier to deliver it than rip it out…). It would be interesting to see if Tez will follow the same adoption path, as Hortonworks planned to eventually build support directly in Hive, Pig and Cacading. A similar fate would likely come to most of the security projects I mentioned, as well as ORC and Parquet support (ORC is already in Hive).
So there you have you it. I believe knowing the type of project can help organization gauge the risk involved (vs. the benefit) to make an informed decision about its use.
Oh, and on a side note – “bundled in a distribution” and “distribution does provides decent support, bug fixes and enhancements” are of course two different things… That’s why the “brag the number of your commiters” resonate well with me, but that’s a different story.