Hortonworks brings out the heavy artillery

Just a few hours ago, Hortonworks posted several new posts on their blog regarding several stealth projects that they developed and (as can be expected from Hortonworks) are working to open source and contribute to the community.

The hottest one seems to me to be Tez, which is being donated to ASF as an incubator project. Its main aim is to accelerate the runtime of Hive, Pig and Cascading jobs. It defines a concept of a task, which comprises of a set of (Direct Acyclic Graph of) MapReduce jobs and treats it as a whole, streaming data between them without spilling (spooling) back to HDFS after each MapReduce pair.

Why is that good?

  • The actual writing and reading temporary results from HDFS (reducer output from the middle of the chain) can be very slow and resource intensive.
  • This is a blocking operation – so the whole parallel processing may stale due to a few slow tasks of one set of MapReducer (while speculative execution helps here, it doesn’t eliminate the problem).

It is very interesting that Hortonworks chose to publish this as a new infrastructure project. It seems they claim it to be a sort of next-gen, generalized MapReduce. I guess it started as a way to accelerate Hive, but it’s great that it will also help Pig (which also generates MapReduce chains) and Cascading (which helps developers easily create MapReduce chains).

This is part of Hortonworks effort to make Hive 100x faster –  the “Stinger Initiative.

Continue reading

Exadata HCC – storage savings revisited

This is a third post in a series regarding Exadata HCC (Hybrid Columnar Compression) and the storage savings it brings to Oracle customers. In Part 1 I showed the real-world compression ratios of Oracle’s best DW references, in Part 2 I investigated why is that so, and in this part I’ll question the whole saving accounting.

So, we saw in part 1 that most Exadata DW references don’t mention the storage savings of HCC, but those who do show an average 3.4x “storage savings”.  Now let’s see what savings, if any, this brings. It all has to do with the compromises involved in giving up modern storage capabilities and the price to pay to when fulfilling these requirements with Exadata.

Let me start with a somewhat weaker but valid point. Modern storage software allows online storage software upgrade. A mission-critical database (or any database) shouldn’t be down or at risk when upgrading storage firmware or software. In order to achieve similar results with Exadata, the storage layer has to be configured as a three-way mirror (ASM High Redundancy). This is actually Oracle’s best practice, see for example the bottom of page 5 of the Exadata MAA HA paper. This configuration uses significantly more storage than any other solution on the market. This means that while the total size of all the data files might be smaller in Exadata thanks to HCC, you still need a surprisingly large raw volume of storage to support it, or you’ll have to compromise and always use offline storage software upgrades – likely the critical quarterly patch bundle, which could take at least an hour of downtime to apply, from what I read on the blogsphere.

To make it a bit more confusing, the Exadata X3 datasheet only mentions (in page 6) the  usable data capacity with 2-way mirror (ASM normal redundancy), even though the recommended configuration is 3-way mirror. I wonder if that has anything to do with later providing less net storage?

Continue reading

Exadata HCC – where are the savings?

this is a second in a series of posts on Oracle’s Exadata Hybrid Columnar Compression (HCC), which is actually a great feature of Oracle database. It is currently locked to Oracle-only storage (Exadata, ZFS appliance etc) and Oracle marketing pushes it hard as it provides “10x” compression to Oracle customers.

In the previous post, I showed that Oracle’s best data warehouse reference customers gets only an average “storage saving” of at most 3.4x. In this post, I’ll investigate why they don’t get the promised “10x-15x savings” that Oracle marketing occasionally mentions. In the next post, I plan to explain why I use double quotes around storage savings – why even that number is highly inflated.

10x compared to what? I remember that in one of the recent Oracle Openworld (was it this year?), Oracle had a marketing slide claiming Exadata provides 10x compression and non-Exadata provides 0x compression… (BTW – please post a link in the comments if you can share a copy). But leaving the funny / sad ExaMath aside, do non-Exadata customers enjoy any compression?

Well, as most Oracle DBAs will know, Oracle have introduced in 9i Release 2 (around 2002) a new feature that was called Data Segment Compression, which was renamed in 10g to Table Compression, in 11g release 1 to a catchy “compress for direct_load operations” and as of 11g release 2 is called Basic Compression. This feature is included in Enterprise Edition without extra cost. It provides dictionary-based compression at the Oracle’s table data block level. It is most suited for data warehousing, as the compression kicks in only during bulk load or table reorganization – updates and small inserts leaves data uncompressed.

What is the expected (and real world) average compression ratio of tables using this feature? The consensus is around 3x compression. Yes, for data warehousing on non-Exadata in the last decade Oracle provides 3x compression with Enterprise Edition!

Continue reading