Exploring the HDFS Default Value Behaviour

In this post I’ll explore how HDFS default values really work. I found the behaviour quite surprising and non-intuitive, so there’s a good lesson here.

In my case, I have my local virtual Hadoop cluster and a separate client node outside the cluster. I’ll copy a file from the external client into the cluster under two different configurations.

All the nodes of my Hadoop (1.2.1) cluster have the same hdfs-site.xml file with the same (non-default) value for dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x) of 134217728, which is 128MB. On my external node, I also have the Hadoop executables, with a minimal hdfs-site.xml.
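
Since dfs.block.size is read on the client side when a file is created, the value can also be overridden per command with Hadoop’s generic -D option, instead of editing the client’s hdfs-site.xml. Here is a minimal sketch of that variant, using the same file and paths as the copy below – I haven’t verified it on this exact setup:

 # Hypothetical alternative: override the client block size for this one copy only
 ./hadoop fs -D dfs.block.size=268435456 -copyFromLocal /sw/400MB.file /user/ofir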

First, I set dfs.block.size to 268435456 (256MB) in my client’s hdfs-site.xml and copied a 400MB file to HDFS:

 ./hadoop fs -copyFromLocal /sw/400MB.file /user/ofir

Checking its block size from the NameNode:

./hadoop fsck /user/ofir/400MB.file -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /user/ofir/400MB.file at Thu Jan 30 22:21:29 UTC 2014
/user/ofir/400MB.file 419430400 bytes, 2 block(s):  OK
0. blk_-5656069114314652598_27957 len=268435456 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.122:50010]
1. blk_3668240125470951962_27957 len=150994944 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack01/10.0.1.113:50010]

Status: HEALTHY
 Total size:    419430400 B
 Total dirs:    0
 Total files:    1
 Total blocks (validated):    2 (avg. block size 209715200 B)
 Minimally replicated blocks:    2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    1
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        6
 Number of racks:        3
FSCK ended at Thu Jan 30 22:21:29 UTC 2014 in 1 milliseconds

So far, looks good – the first block is 256MB.



Exploring HDFS Block Placement Policy

In this post I’ll cover how to see where each data block is actually placed. For this exploration I’m using the fully-distributed Hadoop cluster on my laptop and the network topology definitions from my previous post – you’ll need a multi-node cluster to reproduce it.

Background

When a file is written to HDFS, it is split up into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case, left at the default of 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (for most of this post, set to 3, which is the default). Each copy of a block is called a replica.
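
A quick way to double-check both values for a file that is already in HDFS is the shell’s stat command. A minimal example – the path is just a placeholder, and I’m quoting the format characters (%o for block size, %r for replication) from memory:

 # Print the block size and replication factor of an existing HDFS file
 # (the path is a placeholder for whatever file you want to inspect)
 ./hadoop fs -stat "%o %r" /user/ofir/400MB.file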

Regarding my setup – I have nine nodes spread over three racks. Each rack has one admin node and two worker nodes (running DataNode and TaskTracker), using Apache Hadoop 1.2.1. Please note that in my configuration there is a simple mapping between host name, IP address and rack number: the host name is hadoop[rack number][node number] and the IP address is 10.0.1.1[rack number][node number].

Where should the NameNode place each replica?  Here is the theory:

The default block placement policy is as follows:

  • Place the first replica somewhere – either a random node (if the HDFS client is outside the Hadoop/DataNode cluster) or on the local node (if the HDFS client is running on a node inside the cluster).
  • Place the second replica in a different rack.
  • Place the third replica in the same rack as the second replica.
  • If there are more replicas – spread them across the rest of the racks.

That placement algorithm is a decent default compromise – it provides protection against a rack failure while somewhat minimizing inter-rack traffic.

So, let’s put some files on HDFS and see whether my setup behaves according to the theory – try it on yours to validate.
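
The check itself looks like this – copy a file into HDFS, then ask fsck to list every block with its replica locations and racks (the local file and target directory are just examples):

 # Copy a test file in, then list each block with its replica locations and racks
 ./hadoop fs -copyFromLocal /sw/400MB.file /user/ofir
 ./hadoop fsck /user/ofir/400MB.file -files -blocks -racks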


Exploring The Hadoop Network Topology

In this post I’ll cover how to define and debug a simple Hadoop network topology.

Background

Hadoop is designed to run on large clusters of commodity servers – in many cases spanning many physical racks of servers. A physical rack is in many cases a single point of failure (for example, it typically has a single switch to keep costs down), so HDFS tries to place block replicas on more than one rack. Also, there is typically more bandwidth within a rack than between racks, so the software on the cluster (HDFS and MapReduce / YARN) can take that into account. This leads to a question:

How does the NameNode know the network topology?

By default, the NameNode has no idea which node is in which rack. It therefore assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack “/default-rack”.

So, we have to teach Hadoop our cluster’s network topology – the way the nodes are grouped into racks. Hadoop supports a pluggable rack topology implementation, controlled by the parameter topology.node.switch.mapping.impl in core-site.xml, which specifies a Java class implementation. The default implementation uses a user-provided script, specified by topology.script.file.name in the same config file (in Hadoop 2: net.topology.script.file.name): the script gets a list of IP addresses or host names as arguments and prints a matching list of rack names. The script will get up to topology.script.number.args arguments per invocation, by default up to 100 per invocation (in Hadoop 2: net.topology.script.number.args).
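
To make that concrete, here is a minimal sketch of such a topology script for my naming scheme (hosts hadoop[rack][node], IPs 10.0.1.1[rack][node]). The mapping style and the /default-rack fallback are my own choices – Hadoop only requires that the script print one rack name per input argument:

 #!/bin/bash
 # Minimal topology script sketch: print one rack name per argument (IP or host name).
 # The mapping assumes my naming scheme: hosts hadoop[rack][node], IPs 10.0.1.1[rack][node].
 for arg in "$@"; do
   case "$arg" in
     10.0.1.11*|hadoop1*) echo "/rack01" ;;
     10.0.1.12*|hadoop2*) echo "/rack02" ;;
     10.0.1.13*|hadoop3*) echo "/rack03" ;;
     *)                   echo "/default-rack" ;;
   esac
 done

Pointing topology.script.file.name at this file (at least on the NameNode) and marking it executable should be all the wiring needed.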


Creating a virtualized fully-distributed Hadoop cluster using Linux Containers

TL;DR: Why and how I created a working 9-node Hadoop cluster on my laptop.

In this post I’ll cover why I wanted to have a decent multi-node Hadoop cluster on my laptop, why I chose not to use VirtualBox / VMware Player, what LXC (Linux Containers) is and how I set it up. The last part is a bit specific to my desktop O/S (Ubuntu 13.10).

Why install a fully-distributed Hadoop cluster on my laptop?

Hadoop has a “laptop mode” called pseudo-distributed mode. In that mode, you run a single copy of each service (for example, a single HDFS namenode and a single HDFS datanode), all listening under localhost.

This feature is rather useful for basic development and testing – if all you want is to write some code that uses the services and check whether it runs correctly (or at all). However, if your focus is the platform itself – an admin perspective – then that mode is quite limited. Trying out many of Hadoop’s features requires a real cluster – for example replication, failure testing, rack awareness, HA (of every service type) etc.

Why are VirtualBox and VMware Player not a good fit?

When you want to run several or many virtual machines, the overhead of this desktop virtualization software is significant. Specifically, as each guest runs a full-blown guest O/S, they:

  1. Consume plenty of RAM per guest – always a premium on laptops.
  2. Consume plenty of disk space – which is limited, especially on SSD.
  3. Take quite a while to start / stop.

To run many guests on a laptop, a more lightweight infrastructure is needed.

What is LXC (Linux Containers)


Highlights from writing a data sheet for JethroData

A couple of weeks ago, my friends at JethroData asked me to help them write a technical data sheet.

Now, JethroData’s product is a SQL engine for Hadoop, currently in private beta, with a design and architecture very different from the rest of the crowd. Specifically, it actually always indexes all the columns of your tables, and it uses HDFS as shared, remote storage instead of running locally on the data nodes. It has some very good reasons to work this way, but being radically different has made it harder to explain.

So, my task was to write a technical data sheet that would help technical people switch from “this is insane nonsense” to “that actually makes sense in these cases, but how do you handle this or that…”. In other words, tell a consistent technical story that makes sense and highlights their strengths, encouraging readers to engage further.

I think of data sheets, in general, as the piece of paper you get from a vendor at their booth and quickly browse through later while you are stuck at a boring presentation (or is it just me?). So, they have to be interesting and appealing. Also, due to their short format, they should focus on the main values and be neither too high-level nor too low-level. Quite a challenge!

To overcome the challenge, I first needed deeper knowledge of the product. For that, I spent some time with the JethroData team, especially with Boaz Raufman, their co-founder and CTO. I grilled him with questions on architecture and functionality, which led to great discussions. While of course some functionality is still lacking at this early phase, I was impressed to see that the core architecture is designed to handle quite a lot.

Then came the harder part – focus. There are many ways to use JethroData, its implementation has various advantages in different scenarios and, of course, the implementation has many details, each with some logic behind it. It is impossible to cram it all into a two-page data sheet, even if most of it is very important. So, I had several lively discussions with Eli Singer, JethroData’s CEO, about everything from positioning to many specific strengths and supporting arguments and their relative importance.

At the end of the day, I think we got something pretty sweet. You can read it on the technology page of their web site. If you are attending Strata/Hadoop World, the JethroData team will be at the Startup Showcase on Monday and in booth #75 Tuesday-Wednesday – bring questions and arguments 🙂

How will the rise of stream processing affect the Big Data paradigm?

Even if you’ve never heard of stream processing so far, it is an easy guess that you’ll run into the term in the coming weeks around Hadoop World, given the way it illustrates so nicely what YARN is all about and also, of course, given the Hortonworks announcement endorsing Apache Storm (disclaimer – I love Hortonworks).

Now, before I start fantasizing about the potential popularity of various architectures in the years to come, some quick background…

“Traditional” big data streaming / messaging is just about moving events from place to place. For example, collecting log records from dozens or hundreds of servers in near real-time and delivering each to the relevant target(s), like HDFS files, HBase, a database or another application, in a reliable fashion (with specific semantics). It is a reincarnation of message queuing concepts in a high-throughput, distributed fashion, extended both to actively collect events directly from various sources and to persist the events in various targets on the other side.

It seems to me the most popular system is Apache Kafka (originally from LinkedIn), but there is also Apache Flume (originally from Cloudera) which is bundled in most Hadoop distributions.

Stream processing, on the other hand, focuses on doing complex parsing, filtering, transformations and computations on the data stream as the data flows – for example, maintaining up-to-date counters and aggregated statistics per user, or a rolling-window “top n” (continuous computation). This is opposed to batch processing frameworks, like MapReduce and Tez, that are based on one-time bulk processing. In this space, Apache Storm (originally from Backtype / Twitter) is the main player, but as LinkedIn released Apache Samza last month, it might lead to some healthy competition. There are other initiatives as well – for example, Cloudera contributed a library for building real-time analytics on Flume, which might be good enough for simpler cases.

By the way, these two types of systems are complementary. Specifically, the combination of Kafka and Storm seems to be gaining traction – here is one success story from LivePerson.

Now, stay with me for a minute while I hype… What if you could write most of the common batches as stream processing? What if you could provide your business users updated data that is only seconds or tens of seconds behind the present? What if you could just have the answers to many common questions always updated and ready? Could stream processing significantly de-emphasize / replace batch processing on Hadoop in the future?

Now, that is of course mostly crazy talk today. In reality, maybe stream processing would not be a good fit for many computation types, and it is rather early to fully understand it. Also, the functionality, optimizations, integrations, methodologies and best practices around stream processing would likely take a few years to really mature.

Having said that, it seems to me that, at least in the database / ETL space, stream processing has great potential to evolve into a powerful alternative paradigm for how we design systems that add analytics to operational systems.

But, instead of hyping anymore, I’ll ask – how crazy is that? How far are we from such capabilities? What I already see looks promising in theory. Here are a couple of examples:

  • Storm already has its own official abstraction layer to simplify writing complex computations. It is called Trident, was contributed by Twitter and is part of Storm 0.8. It provides higher-level micro-batch primitives that allow you, for example, to build parallel aggregations and joins. Interestingly, when Twitter released it a year ago, they specifically described it as similar to Pig or Cascading, but for stream processing.
    It seems to me that declarative, SQL-like continuous computation is only one step away from that.
  • Even more fascinating is Twitter’s Summingbird (open-sourced last month). It is a programmer-friendly abstraction of stream processing that can be executed either in real time (on Storm) or as batch (using Scalding, translated to MapReduce, and in the future maybe to Tez). More interestingly, it was designed to allow a hybrid mode, transparently mixing historical data (batch) from Hadoop with last-minute updates from Storm.
    So, batch and streaming could co-exist and even transparently mix, even today.

Now, it remains to be seen, of course, how those technologies will evolve, but it seems that the current state of affairs is already not too bad. With the adoption of stream processing by mainstream Hadoop distributions (surely Hortonworks won’t be the last), usability, functionality and innovation will likely all accelerate. So, if you are starting to look into “next-gen” Big Data infrastructure and require low latency from event to results, it might be a good time to start looking into stream processing.

Update 30/10/13: I can’t believe I didn’t spend time to read about Lambda Architecture before. Nathan Marz (who led Twitter’s Storm) came up with a coherent, end-to-end architecture for a data system to make relational databases obsolete, with both a batch processing component and a stream processing component. Find some time to watch his fascinating presentation from Strange Loop 2012, you won’t regret it.

Big Data products/projects types – from proprietary to industry standard

I recently read Merv’s excellent post on proprietary vs. open-source Hadoop – suggesting that using the term distribution-specific is likely more appropriate. It reminded me that I wanted to write about the subject – I had a somewhat more detailed classification in mind.
I’m focusing on Hadoop, but this is also very relevant to the broader Big Data ecosystem, including the NoSQL/NewSQL solutions etc.

But first, why bother? Well, the classification typically hints at a few things about that piece of software for decision makers. For enterprises, which typically rely on outside help, some open questions might be:

  • Skills – How can we get confidence in using this software (people, training etc)? Can we get help from our partners / consultants / someone new if needed? In our region / country?
  • Support  – Can we get commercial support? Global or local? Is there more than one vendor so we can choose / switch if the relationship goes bad?
  • Finally, how much of a "safe bet" is it? In a few years, will it be niche, abandoned code or mainstream? Will it integrate well with the ecosystem over time (e.g. YARN-friendly) or not? And what are the odds that someone will still be around to support it in several years, when we still rely on it in production?

Of course, the state of a piece of software may change over time, but still I think each of the following categories has some immediate implications regarding these questions. For example, proprietary (closed-source) products should come with decent paid support, optional professional services and training, better-than-average documentation, and a decent chance of long-term support (but small startups are still a potential risk).

Having said that, here is the way I classify projects and products:

  • A proprietary offering of a specific Hadoop distribution –  a closed-source offering bundled with a vendor’s Hadoop distribution. Some examples are MapR (with proprietary implementation of HDFS, MapReduce etc), Pivotal (with HAWQ as the central attraction of their distribution) and also Cloudera (with Cloudera Manager, although proprietary installation and management is almost universal).
    The proprietary bits are usually considered by the vendor to be a significant part of their secret sauce.
  • Proprietary offering on Hadoop – a closed-source offering that runs on Hadoop. Some are distribution-agnostic (a.k.a. Bring-Your-Own-Hadoop), for example Platfora, which runs on Cloudera, Hortonworks, MapR and Amazon. Some are certified on only a specific distribution (for now?), like Teradata-Hortonworks.
    Also, while some are Hadoop-only, many are an extension of an existing vendor offering to Hadoop, for example – all ETL vendors, some analytics vendors, some security vendors etc (the BI vendors typically just interact with Hadoop using SQL).
  • Open source projects, not bundled with a distribution – many great tools have been released as open source and have started building a community around them. While they are not yet bundled with an existing distribution, this doesn’t stop many from using them.
    Some of the projects complement existing functionality – like salesforce.com’s Phoenix, a smart, thin SQL layer over HBase, or Netflix’s Lipstick for great Pig workflow visualization. Other projects are alternatives to existing projects, like LinkedIn’s (now Apache) Kafka for distributed messaging as a popular alternative to Apache Flume, or Apache Accumulo as a security-enhanced alternative to Apache HBase. Yet others are just new types of stuff on Hadoop, like Twitter’s Storm for distributed, real-time stream computations, or any of the distributed graph processing alternatives (Apache Giraph and others).
  • Open source projects, bundled with a single distribution – a few of these are existing open source projects that got picked up by a distribution, like LinkedIn’s DataFu, which adds useful UDFs to Apache Pig and is now bundled with Cloudera. However, most of them are projects led by the distribution vendor, with varying communities around them, for example Hortonworks Ambari (Apache Incubator), Cloudera Impala and Cloudera Hue, plus various security initiatives all over the place (Cloudera Sentry, Hortonworks Knox, Intel Rhino etc).
    As an interesting anecdote of the market dynamics – it seems Cloudera Hue is now also bundled in the Hortonworks 2.0 beta under “third-party components”, which might hint that it could evolve into the last category:
  • Industry-standard open source projects – these are projects that you can expect to find in most leading distributions, meaning that several vendors provide support.
    While some are obvious (Pig, Hive etc), some are new or just arriving. For example, HCatalog was created by Hortonworks but is now included in Cloudera and MapR. I would guess that merging it into Hive helped smooth things out (it is likely easier to deliver it than to rip it out…). It would be interesting to see whether Tez will follow the same adoption path, as Hortonworks planned to eventually build support for it directly into Hive, Pig and Cascading. A similar fate would likely come to most of the security projects I mentioned, as well as ORC and Parquet support (ORC is already in Hive).

So there you have it. I believe knowing the type of a project can help organizations gauge the risk involved (vs. the benefit) and make an informed decision about its use.

Oh, and on a side note – "bundled in a distribution" and "the distribution provides decent support, bug fixes and enhancements" are of course two different things… That’s why the "brag about the number of your committers" argument resonates well with me, but that’s a different story.