Exploring The Hadoop Network Topology

In this post I’ll cover how to define and debug a simple Hadoop network topology.

Background

Hadoop is designed to run on large clusters of commodity servers – in many cases spanning many physical racks of servers. A physical rack is in many cases a single point of failure (for example, having typically a single switch for lower cost), so HDFS tries to place block replicas on more than one rack. Also, there is typically more bandwidth within a rack than between the racks, so the software on the cluser (HDFS and MapReduce / YARN) can take it into account. This leads to a question:

How does the NameNode know the network topology?

By default, the NameNode has no idea which node is in which rack. It therefore by default assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack “/default-rack“.

So, we have to teach Hadoop our cluster network topology – the way the nodes are grouped into racks. Hadoop supports a pluggable rack topology implementation – controlled by the parameter topology.node.switch.mapping.impl in core-site.xml, which specifies a java class implementation. The default implementation is using a user-provided script, specified in topology.script.file.name in the same config file, a script that gets a list of IP addresses or host names and returns a list of rack names (in Hadoop2: net.topology.script.file.name). The script will get up to topology.script.number.args parameters per invocation, by default up to 100 requests per invocation (in Hadoop2: net.topology.script.number.args).

In my case, I set it to /usr/local/hadoop-1.2.1/conf/topology.script.sh which I copied from the Hadoop wiki. I just made a few changes – I changed the path to my conf directory in the second line, added some logging of the call in the third line and changed the default rack name near the end to /rack01:

#!/bin/bash          
HADOOP_CONF=/usr/local/hadoop-1.2.1/conf
echo `date` input: $@ >> $HADOOP_CONF/topology.log
while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
#    echo -n "/default/rack "
     echo -n "/rack01"
  else
    echo -n "$result "
  fi
done

The script basically just parses a text file (topology.data) that holds a mapping from IP address or host name to rack name. Here are the content of my file:

hadoop11        /rack01
hadoop12        /rack01
hadoop13        /rack01
hadoop14        /rack01
haddop15        /rack01
hadoop21        /rack02
hadoop22        /rack02
hadoop23        /rack02
hadoop24        /rack02
hadoop25        /rack02
hadoop31        /rack03
hadoop32        /rack03
hadoop33        /rack03
hadoop34        /rack03
hadoop35        /rack03
10.0.1.111      /rack01
10.0.1.112      /rack01
10.0.1.113      /rack01
10.0.1.114      /rack01
10.0.1.115      /rack01
10.0.1.121      /rack02
10.0.1.122      /rack02
10.0.1.123      /rack02
10.0.1.124      /rack02
10.0.1.125      /rack02
10.0.1.131      /rack03
10.0.1.132      /rack03
10.0.1.133      /rack03
10.0.1.134      /rack03
10.0.1.135      /rack03

A bit long, but pretty straight forward. Please note that in my configuration, there is a simple mapping between host name, IP address and rack number – the host name is hadoop[rack number][node number] and IP address is 10.0.1.1[rack number][node number].

You could of course write any logic into the script. For example, using my naming convention, I could have written a simple script that just takes the second-to-last character and translate it to rack number – that would work on both IP addresses and host names. As another example – I know some companies allocate IP addresses for their Hadoop cluster as x.y.[rack_number].[node_number] – so they can again just parse it directly.

Before we test the script, if you have read my previous post on LXC setup, please note that I made  a minor change – I switched to a three-rack cluster, to make the block placement more interesting.  So, my LXC nodes are:

hadoop11 namenode zookeeper
hadoop12 datanode tasktracker
hadoop13 datanode tasktracker
hadoop21 jobtracker zookeeper
hadoop22 datanode tasktracker
hadoop23 datanode tasktracker
hadoop31 secondarynamenode hiveserver2 zookeeper
hadoop32 datanode tasktracker
hadoop33 datanode tasktracker

OK, now that we covered the script, let’s start the NameNode and see see what gets logged:

# bin/hadoop-daemon.sh --config conf/ start namenode
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-root-namenode-hadoop11.out
# cat conf/topology.log 
Mon Jan 6 19:04:03 UTC 2014 input: 10.0.1.123 10.0.1.122 10.0.1.113 10.0.1.112 10.0.1.133 10.0.1.132

As the NameNode started, it asked in a single called what is the rack name of all our nodes. This is what the script returns to the NameNode:

# conf/topology.script.sh 10.0.1.123 10.0.1.122 10.0.1.113 10.0.1.112 10.0.1.133 10.0.1.132
/rack02 /rack02 /rack01 /rack01 /rack03 /rack03

Why did the NameNode send all the IP addresses of the DataNodes in a single call? In this case, I have pre-configured another HDFS parameter called dfs.hosts in hdfs-site.xml. This parameter points to a file with a list of all nodes that are allowed to run a data node. So, when the NameNode started, it just asked for the mapping of all known data node servers.

What happens if you don’t use dfs.hosts? To check, I removed this parameter from my hdfs-site.xml file, restarted the NameNode and started all the data nodes. In this case, the NameNode called the topology script once per DataNode (when they first reported their status to the NameNode):

# cat topology.log 
Mon Jan 6 19:04:03 UTC 2014 input: 10.0.1.123 10.0.1.122 10.0.1.113 10.0.1.112 10.0.1.133 10.0.1.132
Mon Jan 6 19:07:53 UTC 2014 input: 10.0.1.112
Mon Jan 6 19:07:53 UTC 2014 input: 10.0.1.123
Mon Jan 6 19:07:53 UTC 2014 input: 10.0.1.113
Mon Jan 6 19:07:53 UTC 2014 input: 10.0.1.133
Mon Jan 6 19:07:54 UTC 2014 input: 10.0.1.132
Mon Jan 6 19:07:54 UTC 2014 input: 10.0.1.122

I hope this post has enough data for you to hack and QA your own basic network topologies. In the next post I hope to investigate block placement – how to see where HDFS is actually putting each copy of each block under various conditions.

Advertisements

2 thoughts on “Exploring The Hadoop Network Topology

  1. Pingback: Exploring HDFS Block Placement Policy | Big Data, Small Font

  2. Pingback: Exploring The Hadoop Network Topology | Big Data Analytics News

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s