Just a few hours ago, Hortonworks published several posts on their blog about stealth projects that they have developed and (as can be expected from Hortonworks) are working to open source and contribute to the community.
The hottest one, to my mind, is Tez, which is being donated to the ASF as an incubator project. Its main aim is to accelerate Hive, Pig and Cascading jobs. It defines the concept of a task, which comprises a set of MapReduce jobs arranged as a Directed Acyclic Graph (DAG), and treats that set as a whole, streaming data between the jobs without spilling (spooling) back to HDFS after each MapReduce pair.
Why is that good?
- Writing temporary results to HDFS and reading them back (reducer output from the middle of the chain) can be very slow and resource intensive.
- This is a blocking operation, so the whole parallel processing may stall due to a few slow tasks in one MapReduce job (while speculative execution helps here, it doesn’t eliminate the problem).
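To make the difference concrete, here is a toy Python sketch of the two execution styles. It is only an illustration under my own assumptions and not Tez's API: the stage functions, `run_chained`, `run_pipelined` and the `fake_hdfs` list are all hypothetical names, and a plain list stands in for HDFS.

```python
from typing import Callable, Iterable, Iterator, List

# A "stage" is any function that consumes rows and yields rows.
Stage = Callable[[Iterable[int]], Iterator[int]]


def double(rows: Iterable[int]) -> Iterator[int]:
    for r in rows:
        yield r * 2


def keep_multiples_of_four(rows: Iterable[int]) -> Iterator[int]:
    for r in rows:
        if r % 4 == 0:
            yield r


def run_chained(stages: List[Stage], source: Iterable[int],
                store: List[List[int]]) -> List[int]:
    """Chain of separate jobs: every intermediate result is fully
    materialized in `store` (a stand-in for HDFS) before the next job starts."""
    data = list(source)
    for i, stage in enumerate(stages):
        data = list(stage(data))          # barrier: the stage must finish completely
        if i < len(stages) - 1:
            store.append(data)            # intermediate write, read back by the next job
    return data


def run_pipelined(stages: List[Stage], source: Iterable[int]) -> List[int]:
    """Stages treated as one unit: rows stream from stage to stage
    and nothing is written out in between."""
    stream: Iterable[int] = source
    for stage in stages:
        stream = stage(stream)            # lazily wires the stages together
    return list(stream)


STAGES: List[Stage] = [double, keep_multiples_of_four]
fake_hdfs: List[List[int]] = []

chained_result = run_chained(STAGES, range(6), fake_hdfs)
pipelined_result = run_pipelined(STAGES, range(6))

print(chained_result == pipelined_result)  # same answer either way
print(len(fake_hdfs))                      # intermediate writes in the chained run
```

Both runs produce the same rows, but only the chained one pays for an intermediate write and a full barrier between the stages. The real system is distributed and far more involved, so treat this purely as a mental model.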
It is very interesting that Hortonworks chose to publish this as a new infrastructure project. They seem to position it as a sort of next-gen, generalized MapReduce. I guess it started as a way to accelerate Hive, but it’s great that it will also help Pig (which also generates MapReduce chains) and Cascading (which helps developers easily create MapReduce chains).
This is part of Hortonworks’ effort to make Hive 100x faster, the “Stinger Initiative”. Since there is a growing number of startups and products that claim to be 10x-100x faster than Hive, Hortonworks’ work will disrupt the baseline that everyone is using…
Another part of the Stinger Initiative seems to be an optimized row-columnar format (ORCFile); see the presentation in HIVE-3874 for details. They already claim it is superior to Cloudera/Avro Trevni, for example with support for dictionary compression for strings and RLE (run-length encoding) for integers… I guess we’ll start seeing benchmarks later this year of Impala/Trevni vs. the enhanced Hive on Tez and ORCFile.
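For readers who have not met these two column compression tricks, here is a minimal Python sketch of the general ideas. It is my own illustration and makes no claim about ORCFile's actual encoding or on-disk layout; the function names and sample columns are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code that points
    into a dictionary of the distinct values seen so far."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original column."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


country_column = ["US", "US", "IL", "US", "IL"]
dictionary, codes = dictionary_encode(country_column)
print(dictionary, codes)

status_column = [7, 7, 7, 3, 3, 9]
runs = rle_encode(status_column)
print(runs)
```

Both tricks pay off in columnar layouts because the values of a single column sit next to each other, so repeated strings and long runs of identical integers are common.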
The Stinger post has a vague screenshot with three more projects that will come in the middle of next year. It sure seems they have a clear technical roadmap for Hive.
In addition to all that, they also came up with a new security infrastructure called Knox Gateway, a secured gateway that can serve multiple Hadoop clusters of different versions. They mention that Microsoft has joined this effort, so it will likely work well with Active Directory… The vote on accepting it into the incubator should take place tomorrow, with an early release next month.
To sum it up, today is Hortonworks day… A massive contribution and technical vision across products and functionality that will help Hadoop mature in the coming years. I’m constantly amazed by the speed at which the ecosystem moves forward, and it’s great to see that a lot of innovation is being shared with the community. For now, all I can say is thank you, Hortonworks, for the awesome work! Oh, and by the way, next week it’s Greenplum’s move.