In this post I’ll explore how the HDFS default values really work, which I found to be quite surprising and non-intuitive, so there s a good lesson here.
In my case, I have my local virtual Hadoop cluster, and I have a different client outside the cluster. I’ll just put a file from the external client into the cluster in two configurations.
All the nodes of my Hadoop (1.2.1) cluster have the same hdfs-site.xml file with the same (non-default) value for dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x) of 134217728, which is 128MB. In my external node, I also have the hadoop executables with a minimal hdfs-site.xml.
First, I have set dfs.block.size to 268435456 (256MB) in my client hdfs-site.xml and copied a 400MB file to HDFS:
./hadoop fs -copyFromLocal /sw/400MB.file /user/ofir
Checking its block size from the NameNode:
./hadoop fsck /user/ofir/400MB.file -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /user/ofir/400MB.file at Thu Jan 30 22:21:29 UTC 2014
/user/ofir/400MB.file 419430400 bytes, 2 block(s): OK
0. blk_-5656069114314652598_27957 len=268435456 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.122:50010]
1. blk_3668240125470951962_27957 len=150994944 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack01/10.0.1.113:50010]
Total size: 419430400 B
Total dirs: 0
Total files: 1
Total blocks (validated): 2 (avg. block size 209715200 B)
Minimally replicated blocks: 2 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 3
FSCK ended at Thu Jan 30 22:21:29 UTC 2014 in 1 milliseconds
So far, looks good – the first block is 256MB.
In this post I’ll cover how to see where each data block is actually placed. I’m using my fully-distributed Hadoop cluster on my laptop for exploration and the network topology definitions from my previous post – you’ll need a multi-node cluster to reproduce it.
When a file is written to HDFS, it is split up into big chucks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case – left as the default which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (in most of this post – set to 3, which is the default). Each copy of a block is called a replica.
Regarding my setup – I have nine nodes spread over three racks. Each rack has one admin node and two worker nodes (running DataNode and TaskTracker), using Apache Hadoop 1.2.1. Please note that in my configuration, there is a simple mapping between host name, IP address and rack number -host name is hadoop[rack number][node number] and IP address is 10.0.1.1[rack number][node number].
Where should the NameNode place each replica? Here is the theory:
The default block placement policy is as follows:
- Place the first replica somewhere – either a random node (if the HDFS client is outside the Hadoop/DataNode cluster) or on the local node (if the HDFS client is running on a node inside the cluster).
- Place the second replica in a different rack.
- Place the third replica in the same rack as the second replica
- If there are more replicas – spread them across the rest of the racks.
That placement algorithm is a decent default compromise – it provides protection against a rack failure while somewhat minimizing inter-rack traffic.
So, let’s put some files on HDFS and see if it behaves like the theory in my setup – try it on yours to validate.
In this post I’ll cover how to define and debug a simple Hadoop network topology.
Hadoop is designed to run on large clusters of commodity servers – in many cases spanning many physical racks of servers. A physical rack is in many cases a single point of failure (for example, having typically a single switch for lower cost), so HDFS tries to place block replicas on more than one rack. Also, there is typically more bandwidth within a rack than between the racks, so the software on the cluser (HDFS and MapReduce / YARN) can take it into account. This leads to a question:
How does the NameNode know the network topology?
By default, the NameNode has no idea which node is in which rack. It therefore by default assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack “/default-rack“.
So, we have to teach Hadoop our cluster network topology – the way the nodes are grouped into racks. Hadoop supports a pluggable rack topology implementation – controlled by the parameter topology.node.switch.mapping.impl in core-site.xml, which specifies a java class implementation. The default implementation is using a user-provided script, specified in topology.script.file.name in the same config file, a script that gets a list of IP addresses or host names and returns a list of rack names (in Hadoop2: net.topology.script.file.name). The script will get up to topology.script.number.args parameters per invocation, by default up to 100 requests per invocation (in Hadoop2: net.topology.script.number.args).