Highlights from writing a data sheet for JethroData

A couple of weeks ago, my friends at JethroData asked me to help them write a technical data sheet.

Now, JethroData's product is a SQL engine for Hadoop, currently in private beta, with a very different design and architecture from the rest of the crowd. Specifically, it always indexes all the columns of your tables, and it uses HDFS as shared, remote storage instead of running locally on the data nodes. It has some very good reasons to work this way, but being radically different has made it harder to explain.

So, my task was to write a technical data sheet that would help technical people move from “this is insane nonsense” to “that actually makes sense in these cases, but how do you handle this or that…”. In other words, tell a consistent technical story that makes sense and highlights their strengths, encouraging readers to engage further.

I think of data sheets as the pieces of paper you get from a vendor at their booth and quickly browse through later while stuck at a boring presentation (or is it just me?). So, they have to be interesting and appealing. Also, due to their short format, they should be tightly focused on the main value points, neither too high-level nor too low-level. Quite a challenge!

To overcome the challenge, I first needed deeper knowledge of the product. For that, I spent some time with the JethroData team, especially with Boaz Raufman, their co-founder and CTO. I grilled him with questions on architecture and functionality, which led to great discussions. While some functionality is of course still lacking at this early phase, I was impressed to see that the core architecture is designed to handle quite a lot.

Then came the harder part – focus. There are many ways to use JethroData, its implementation has various advantages in different scenarios, and of course the implementation has many details, each with some logic behind it. It is impossible to cram it all into a two-page data sheet, even if most of it is very important. So, I had several lively discussions with Eli Singer, JethroData's CEO, about everything from positioning to specific strengths, their supporting arguments and their relative importance.

At the end of the day, I think we got something pretty sweet. You can read it on the technology page of their web site. If you are attending Strata/Hadoop World, the JethroData team will be at the Startup Showcase on Monday and in booth #75 on Tuesday-Wednesday – bring questions and arguments 🙂

How will the rise of stream processing affect the Big Data paradigm?

Even if you’ve never heard of stream processing, it is an easy guess that you’ll run into the term in the coming weeks around Hadoop World, given how nicely it illustrates what YARN is all about, and of course given Hortonworks’ announcement that it is endorsing Apache Storm (disclaimer – I love Hortonworks).

Now, before I start fantasizing about the potential popularity of various architectures in the years to come, some quick background…

“Traditional” big data streaming / messaging is just about moving events from place to place. For example, collecting log records from dozens or hundreds of servers in near real-time and reliably delivering each to the relevant target(s) – HDFS files, HBase, a database or another application – with specific delivery semantics. It is a re-incarnation of message queuing concepts in a high-throughput, distributed fashion, extended both to actively collect events directly from various sources and to persist the events in various targets on the other side.
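To make the “message queuing, extended” idea concrete, here is a toy Python sketch of an append-only event log with per-consumer offsets. This is purely illustrative – the class and method names are made up and bear no resemblance to Kafka's or Flume's real APIs – but it shows the at-least-once delivery pattern: a consumer commits its position only after it has processed a batch, so a crash before the commit means the batch is simply re-read.

```python
class MiniLog:
    """Toy append-only event log with per-consumer offsets.
    Illustrative only -- not Kafka's or Flume's actual API."""

    def __init__(self):
        self.events = []    # durable, ordered log of events
        self.offsets = {}   # consumer name -> next index to read

    def publish(self, event):
        self.events.append(event)

    def poll(self, consumer, max_events=10):
        # Read from the consumer's last committed position onward.
        start = self.offsets.get(consumer, 0)
        return self.events[start:start + max_events]

    def commit(self, consumer, n):
        # Committing only after processing gives at-least-once
        # delivery: a crash before commit re-reads the batch.
        self.offsets[consumer] = self.offsets.get(consumer, 0) + n


log = MiniLog()
for rec in ["login alice", "click bob", "logout alice"]:
    log.publish(rec)

batch = log.poll("hdfs-writer")     # a sink consumer reads a batch...
log.commit("hdfs-writer", len(batch))  # ...and commits after writing it out
print(batch)                  # all three events, in order
print(log.poll("hdfs-writer"))  # [] -- nothing new after the commit
```

Note that the same log can feed several independent consumers (an HDFS writer, an HBase writer, another application), each tracking its own offset – which is the "deliver each event to the relevant targets" part.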

It seems to me that the most popular system is Apache Kafka (originally from LinkedIn), but there is also Apache Flume (originally from Cloudera), which is bundled in most Hadoop distributions.

Stream processing, on the other hand, focuses on doing complex parsing, filtering, transformations and computations on the stream as the data flows – for example, maintaining up-to-date counters and aggregated statistics per user, or a rolling-window “top n” (continuous computation). This is opposed to batch processing frameworks, like MapReduce and Tez, that are based on one-time bulk processing. In this space, Apache Storm (originally from BackType / Twitter) is the main player, but as LinkedIn released Apache Samza last month, it might lead to some healthy competition. There are other initiatives as well – for example, Cloudera contributed a library for building real-time analytics on Flume, which might be good enough for simpler cases.
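As an illustration of continuous computation, here is a small Python sketch (my own toy, not Storm or Samza code) of a rolling-window “top n”: the answer is maintained incrementally as each event arrives, instead of being recomputed in a one-time bulk pass.

```python
from collections import Counter, deque


class RollingTopN:
    """Continuous 'top n' over a sliding window of the last
    `window` events -- the kind of always-updated answer a
    stream processor maintains. Illustrative sketch only."""

    def __init__(self, window=1000):
        self.window = window
        self.buffer = deque()     # the events currently in the window
        self.counts = Counter()   # per-key counts inside the window

    def update(self, key):
        # Incremental update per event -- O(1), no bulk recompute.
        self.buffer.append(key)
        self.counts[key] += 1
        if len(self.buffer) > self.window:
            old = self.buffer.popleft()   # event fell out of the window
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n=3):
        # The answer is always ready -- no batch job needed.
        return self.counts.most_common(n)


t = RollingTopN(window=5)
for user in ["a", "b", "a", "c", "a", "b", "a"]:
    t.update(user)
print(t.top(1))  # [('a', 3)] -- computed over the last 5 events only
```

The same shape works for per-user aggregated statistics: keep a small incremental state per key, update it on every event, and the “query” is just a read of that state.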

By the way, these two types of systems are complementary. Specifically, the combination of Kafka and Storm seems to be gaining traction – here is one success story from LivePerson.

Now, stay with me for a minute while I hype… What if you could write most of your common batch jobs as stream processing? What if you could provide your business users updated data that is only seconds, or tens of seconds, behind the present? What if the answers to many common questions were always updated and ready? Could stream processing significantly de-emphasize, or even replace, batch processing on Hadoop in the future?

Now, that is of course mostly crazy talk today. In reality, stream processing may not be a good fit for many computation types, and it is rather early to fully understand its limits. Also, the functionality, optimizations, integrations, methodologies and best practices around stream processing will likely take a few years to really mature.

Having said that, it seems to me that at least in the database / ETL space, stream processing has great potential to evolve into a powerful alternative paradigm for how we design systems that add analytics to operational systems.

But, instead of hyping any further, I’ll ask – how crazy is that? How far are we from such capabilities? What I already see looks promising, at least in theory. Here are a couple of examples:

  • Storm already has its own official abstraction layer to simplify writing complex computations. It is called Trident; it was contributed by Twitter and is part of Storm 0.8. It provides higher-level micro-batch primitives that allow building, for example, parallel aggregations and joins. Interestingly, when Twitter released it a year ago, they specifically described it as similar to Pig or Cascading, but for stream processing.
    It seems to me that declarative, SQL-like continuous computation is only one step away from that.
  • Even more fascinating is Twitter’s Summingbird (open-sourced last month). It is a programmer-friendly abstraction over stream processing that can be executed either in real-time (on Storm) or as batch (using Scalding, translated to MapReduce, and in the future maybe to Tez). More interestingly, it was designed to allow a hybrid mode, transparently mixing historical data (batch) from Hadoop with last-minute updates from Storm.
    So, batch and streaming could co-exist and even transparently mix, even today.
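The hybrid idea above can be sketched in a few lines of Python (an illustrative toy, not Summingbird's actual Scala API): write one associative, commutative aggregation once, run it over both the historical batch data and the recent stream events, and merge the two partial results at read time.

```python
from collections import Counter


def aggregate(events):
    """One associative, commutative reduction (here: summing counts
    per user) -- written once, reusable for both the batch path and
    the streaming path. Illustrative sketch, not Summingbird code."""
    c = Counter()
    for user, n in events:
        c[user] += n
    return c


# Batch layer: everything up to the last batch run
# (in Summingbird's case, the output of a MapReduce job).
historical = aggregate([("alice", 10), ("bob", 7)])

# Speed layer: events that arrived since that run
# (in Summingbird's case, maintained by Storm).
recent = aggregate([("alice", 2), ("carol", 1)])

# Query time: merge the two partial views transparently.
merged = historical + recent
print(merged["alice"])  # 12 -- batch history plus last-minute updates
```

The key design constraint is that the aggregation must be mergeable (associative and commutative, like a sum), so the batch result and the streaming result can be combined without reprocessing anything.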

Now, it of course remains to be seen how those technologies will evolve, but it seems that the current state of affairs is already not too bad. With the adoption of stream processing by mainstream Hadoop distributions (surely Hortonworks won’t be the last), usability, functionality and innovation will likely all accelerate. So, if you are starting to look into “next-gen” Big Data infrastructure and require low latency from event to results, it might be a good time to start looking into stream processing.

Update 30/10/13: I can’t believe I didn’t take the time to read about the Lambda Architecture before. Nathan Marz (who led Twitter’s Storm) came up with a coherent, end-to-end architecture for a data system that aims to make relational databases obsolete, with both a batch processing component and a stream processing component. Find some time to watch his fascinating presentation from Strange Loop 2012 – you won’t regret it.