Just a few hours ago, Hortonworks published several posts on their blog about stealth projects that they have developed and (as can be expected from Hortonworks) are working to open source and contribute to the community.
The hottest one, to my mind, is Tez, which is being donated to the ASF as an incubator project. Its main aim is to speed up Hive, Pig and Cascading jobs. It defines a concept of a task, which comprises a set (a Directed Acyclic Graph) of MapReduce jobs, and treats that set as a whole, streaming data between the jobs instead of spilling (spooling) it back to HDFS after each MapReduce pair.
Why is that good?
- Writing temporary results (reducer output from the middle of the chain) to HDFS and reading them back can be very slow and resource intensive.
- This is a blocking operation, so the whole parallel processing may stall due to a few slow tasks in one of the MapReduce jobs (while speculative execution helps here, it doesn’t eliminate the problem).
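To make the difference concrete, here is a toy sketch in plain Python. It is not the Tez API and not real MapReduce: the stage functions (`tokenize`, `count`, `keep_frequent`) and the two runners are made-up names for illustration, and a local temp directory stands in for HDFS. One runner writes every intermediate result to disk and reads it back, the other hands records straight from one stage to the next.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1 (hypothetical): emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def count(pairs):
    """Stage 2 (hypothetical): sum the counts per word."""
    totals = {}
    for word, n in pairs:
        totals[word] = totals.get(word, 0) + n
    yield from sorted(totals.items())


def keep_frequent(pairs, minimum=2):
    """Stage 3 (hypothetical): keep words seen at least `minimum` times."""
    for word, n in pairs:
        if n >= minimum:
            yield word, n


def run_materialized(lines, stages, workdir):
    """Chain the stages job by job: each stage's full output is written to
    disk (standing in for HDFS) and read back before the next stage starts."""
    data = list(lines)
    writes = 0
    for i, stage in enumerate(stages):
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(list(stage(data)), f)
        writes += 1
        with open(path) as f:
            # JSON turns tuples into lists, so convert them back.
            data = [tuple(item) for item in json.load(f)]
    return data, writes


def run_streamed(lines, stages):
    """Treat the chain as one unit: each stage consumes the previous
    stage's generator directly, with no intermediate files."""
    data = iter(lines)
    for stage in stages:
        data = stage(data)
    return list(data)


if __name__ == "__main__":
    lines = ["a b a", "b c a"]
    stages = [tokenize, count, keep_frequent]
    with tempfile.TemporaryDirectory() as workdir:
        result, writes = run_materialized(lines, stages, workdir)
    print(result, "intermediate writes:", writes)
    print(run_streamed(lines, stages))
```

Both runners return the same result; the only difference is how many times the data makes a round trip through storage. Note that an aggregation such as `count` still has to see all of its input before emitting anything, so streaming removes the storage round trips, not every synchronization point.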
It is very interesting that Hortonworks chose to publish this as a new infrastructure project. They seem to position it as a sort of next-gen, generalized MapReduce. I guess it started as a way to accelerate Hive, but it’s great that it will also help Pig (which also generates MapReduce chains) and Cascading (which helps developers easily create MapReduce chains).
This is part of Hortonworks’ effort to make Hive 100x faster, the “Stinger Initiative”.

