Even if you haven’t heard of stream processing yet, it is an easy guess that you’ll run into the term in the coming weeks around Hadoop World, given how nicely it illustrates what YARN is all about, and of course given the Hortonworks announcement that it is endorsing Apache Storm (disclaimer – I love Hortonworks).
Now, before I start fantasizing about the potential popularity of various architectures in the years to come, some quick background…
“Traditional” big data streaming / messaging is just about moving events from place to place. For example, collecting log records from dozens or hundreds of servers in near real-time and reliably delivering each one (with specific delivery semantics) to the relevant target(s), such as HDFS files, HBase, a database or another application. It is a reincarnation of message queuing concepts in a high-throughput, distributed fashion, extended both to actively collect events directly from various sources and to persist them on the other side in various targets.
It seems to me the most popular system is Apache Kafka (originally from LinkedIn), but there is also Apache Flume (originally from Cloudera) which is bundled in most Hadoop distributions.
Stream processing, on the other hand, focuses on doing complex parsing, filtering, transformations and computations on the data stream as the data flows, for example maintaining up-to-date counters and aggregated statistics per user, or a rolling-window “top n” (continuous computation). This is opposed to batch processing frameworks, like MapReduce and Tez, that are based on one-time bulk processing. In this space, Apache Storm (originally from BackType / Twitter) is the main player, but as LinkedIn released Apache Samza last month, it might lead to some healthy competition. There are other initiatives – for example, Cloudera contributed a library for building real-time analytics on Flume, which might be good enough for simpler cases.
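To make “continuous computation” a bit more concrete, here is a minimal sketch in plain Python of a rolling-window “top n” per user. It is not Storm or Samza code; the class name `RollingTopN` and its methods are made up for illustration, and it assumes events arrive in timestamp order on a single machine.

```python
from collections import Counter, deque


class RollingTopN:
    """Hypothetical helper: per-key counts over the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs in arrival order
        self.counts = Counter()  # key -> number of events still in the window

    def _expire(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def add(self, timestamp, key):
        # Called once per incoming event (assumes non-decreasing timestamps).
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def top(self, n, now):
        # The answer is always ready: no bulk recomputation is needed.
        self._expire(now)
        return self.counts.most_common(n)


window = RollingTopN(window_seconds=10)
for ts, user in [(0, "alice"), (1, "bob"), (2, "alice"),
                 (3, "alice"), (4, "bob"), (5, "alice")]:
    window.add(ts, user)

print(window.top(2, now=5))  # [('alice', 4), ('bob', 2)]
```

The point of the sketch is the shape of the computation: state is updated incrementally on every event, so a query only reads the current state. A real stream processor adds what this toy leaves out, such as partitioning the keys across machines and recovering state after failures.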
By the way, these two types of systems are complementary. Specifically, the combination of Kafka and Storm seems to be gaining traction – here is one success story from LivePerson.
Now, stay with me for a minute while I hype… What if you could write most of the common batches as stream processing? What if you could provide your business users updated data that is only seconds or tens of seconds behind the present? What if you could just have the answers to many common questions always updated and ready? Could stream processing significantly de-emphasize / replace batch processing on Hadoop in the future?
Now, that is of course mostly crazy talk today. In reality, maybe stream processing would not be a good fit for many computation types, and it is rather early to fully understand it. Also, the functionality, optimizations, integrations, methodologies and best practices around stream processing would likely take a few years to really mature.
Having said that, it seems to me that at least in the database / ETL space, stream processing has great potential to evolve into a powerful alternative paradigm for designing systems that add analytics to operational systems.
But, instead of hyping anymore, I’ll ask – how crazy is that? How far are we from such capabilities? What I already see looks promising in theory. Here are a couple of examples:
- Storm already has its own official abstraction layer to simplify writing complex computations. It is called Trident; it was contributed by Twitter and is part of Storm 0.8. It provides higher-level micro-batch primitives that allow you, for example, to build parallel aggregations and joins. Interestingly, when Twitter released it a year ago, they specifically described it as similar to Pig or Cascading, but for stream processing.
It seems to me that declarative, SQL-like continuous computation is only one step away from that.
- Even more fascinating is Twitter’s Summingbird (open-sourced last month). It is a programmer-friendly abstraction of stream processing that can be executed either in real-time (on Storm) or as batch (using Scalding, translated to MapReduce, and in the future maybe to Tez). More interestingly, it was designed to allow a hybrid mode, transparently mixing historical data (batch) from Hadoop with last-minute updates from Storm.
So, batch and streaming could co-exist and even transparently mix, even today.
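To illustrate the hybrid idea in the second example, here is a rough sketch in plain Python. It is not Summingbird’s API; the names `batch_counts`, `RealtimeCounts` and `merged_count` are hypothetical, and the sketch assumes that a simple per-key count is the metric and that a single timestamp (`batch_cutoff`) cleanly separates what the batch job has already covered from what only the streaming side has seen.

```python
from collections import Counter


def batch_counts(events, batch_cutoff):
    """Bulk job: count events per key with timestamp before the cutoff."""
    return Counter(key for ts, key in events if ts < batch_cutoff)


class RealtimeCounts:
    """Streaming side: counts only events at or after the batch cutoff."""

    def __init__(self, batch_cutoff):
        self.batch_cutoff = batch_cutoff
        self.counts = Counter()

    def on_event(self, ts, key):
        # Older events are ignored here because the batch view covers them.
        if ts >= self.batch_cutoff:
            self.counts[key] += 1


def merged_count(batch_view, realtime_view, key):
    # Counts combine by plain addition, so the two views can simply be summed.
    return batch_view[key] + realtime_view.counts[key]


events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "c")]
cutoff = 10

historical = batch_counts(events, cutoff)  # what the nightly batch produced
live = RealtimeCounts(cutoff)
for ts, key in events:                     # the same events seen as a stream
    live.on_event(ts, key)

print(merged_count(historical, live, "a"))  # 3
```

The merge works here because addition is associative, so it does not matter which side counted which event as long as each event is counted exactly once. Metrics that do not combine this way need more care.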
Now, it remains to be seen of course how those technologies will evolve, but it seems that the current state of events is already not too bad. With the adoption of stream processing by mainstream Hadoop distributions (surely Hortonworks won’t be the last), usability, functionality and innovation will likely all accelerate. So, if you are starting to look into “next-gen” Big Data infrastructure and require low latency from event to results, it might be a good time to start looking into stream processing.
Update 30/10/13: I can’t believe I didn’t take the time to read about Lambda Architecture before. Nathan Marz (who led Twitter’s Storm) came up with a coherent, end-to-end architecture for a data system to make relational databases obsolete, with both a batch processing component and a stream processing component. Find some time to watch his fascinating presentation from Strange Loop 2012; you won’t regret it.