Exploring HDFS Block Placement Policy

In this post I’ll cover how to see where each data block is actually placed. For exploration I’m using the fully-distributed Hadoop cluster on my laptop, together with the network topology definitions from my previous post – you’ll need a multi-node cluster to reproduce the results.

Background

When a file is written to HDFS, it is split into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case – left at the default, which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (in most of this post – set to 3, which is the default). Each copy of a block is called a replica.
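
For reference, here is a minimal hdfs-site.xml sketch that sets both parameters explicitly to the defaults I’m using:

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64MB, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>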

Regarding my setup – I have nine nodes spread over three racks. Each rack has one admin node and two worker nodes (running DataNode and TaskTracker), using Apache Hadoop 1.2.1. Please note that in my configuration there is a simple mapping between host name, IP address and rack number – host names are hadoop[rack number][node number] and IP addresses are 10.0.1.1[rack number][node number].
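
For illustration, here is a sketch of a rack-awareness script consistent with that mapping (the actual topology definitions are in my previous post; Hadoop invokes such a script via the topology.script.file.name parameter, passing it one or more addresses):

#!/bin/bash
# Sketch: map 10.0.1.1[rack][node] to /rack0[rack], assuming IP addresses as input
for ip in "$@"; do
  rack=$(echo "$ip" | cut -d. -f4 | cut -c2) # second digit of the last octet is the rack number
  echo "/rack0${rack}"
done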

Where should the NameNode place each replica? Here is the theory:

The default block placement policy is as follows:

  • Place the first replica either on a random node (if the HDFS client is outside the Hadoop cluster) or on the local node (if the client is running on a node inside the cluster).
  • Place the second replica on a different rack.
  • Place the third replica on the same rack as the second replica.
  • If there are more replicas – spread them across the rest of the racks.

That placement algorithm is a decent default compromise – it protects against a rack failure while keeping inter-rack traffic low (the write pipeline crosses racks only once, since the second and third replicas share a rack).

So, let’s put some files on HDFS and see if it behaves like the theory in my setup – try it on yours to validate.

Putting a file into HDFS from outside the Hadoop cluster

As my block size is 64MB, I created a file of about 125MB (called /sw/big) on my laptop for testing – so it consumes two HDFS blocks. The file lives outside the Hadoop cluster, not inside one of the data node containers. I then copied it to HDFS twice:
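
If you need a similar test file, any file of roughly that size will do – for example (a sketch; the path is simply where I keep it):

# dd if=/dev/urandom of=/sw/big bs=1M count=125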

# /sw/hadoop-1.2.1/bin/hadoop fs -copyFromLocal /sw/big /big1
# /sw/hadoop-1.2.1/bin/hadoop fs -copyFromLocal /sw/big /big2

To see the actual block placement of a specific file, I ran hadoop fsck filename -files -blocks -racks on the NameNode server and looked at the top of the report (-racks adds the rack prefix to each replica location):

# bin/hadoop fsck /big1 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big1 at Mon Jan 06 11:20:01 UTC 2014
/big1 130633102 bytes, 2 block(s):  OK
0. blk_3712902403633386081_1008 len=67108864 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.123:50010]
1. blk_381038406874109076_1008 len=63524238 repl=3 [/rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack01/10.0.1.113:50010]

Status: HEALTHY
 Total size:    130633102 B
 Total dirs:    0
 Total files:    1
 Total blocks (validated):    2 (avg. block size 65316551 B)
 Minimally replicated blocks:    2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        6
 Number of racks:        3
FSCK ended at Mon Jan 06 11:20:01 UTC 2014 in 3 milliseconds

The filesystem under path '/big1' is HEALTHY

Near the end you can see that the NameNode correctly reports it knows of six data nodes and three racks. Looking specifically at the block report:

0. blk_37129... len=67108864 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.123:50010]
1. blk_38103... len=63524238 repl=3 [/rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack01/10.0.1.113:50010]

We can see:

  • Block 0 of file big1 is on hadoop23 (rack 2), hadoop32 (rack 3), hadoop33 (rack 3)
  • Block 1 of file big1 is on hadoop13 (rack 1), hadoop32 (rack 3), hadoop33 (rack 3)

Running the same report on big2:

# bin/hadoop fsck /big2 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big2 at Mon Jan 06 11:49:36 UTC 2014
/big2 130633102 bytes, 2 block(s):  OK
0. blk_-2514176538290345389_1009 len=67108864 repl=3 [/rack02/10.0.1.122:50010, /rack02/10.0.1.123:50010, /rack03/10.0.1.133:50010]
1. blk_316929026766197453_1009 len=63524238 repl=3 [/rack01/10.0.1.113:50010, /rack01/10.0.1.112:50010, /rack03/10.0.1.133:50010]
...

We can see that:

  • Block 0 of file big2 is on hadoop33 (rack 3), hadoop22 (rack 2), hadoop23 (rack 2)
  • Block 1 of file big2 is on hadoop33 (rack 3), hadoop12 (rack 1), hadoop13 (rack 1)

Conclusion – when the HDFS client is outside the Hadoop cluster, each block gets one replica on a node in one random rack and the two other replicas on two nodes of a different, random rack, as expected.

Putting a file into HDFS from within the Hadoop cluster

OK, let’s repeat the exercise, now copying the big file from a local path on one of the data nodes (I used hadoop22) to HDFS:

# /usr/local/hadoop-1.2.1/bin/hadoop fs -copyFromLocal big /big3
# /usr/local/hadoop-1.2.1/bin/hadoop fs -copyFromLocal big /big4

We expect the first replica of every block to be local – on node hadoop22.
Running the report on big3:

# bin/hadoop fsck /big3 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big3 at Mon Jan 06 12:10:03 UTC 2014
/big3 130633102 bytes, 2 block(s):  OK
0. blk_-509446789413926742_1013 len=67108864 repl=3 [/rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.122:50010]
1. blk_1461305132201468085_1013 len=63524238 repl=3 [/rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.122:50010]
...

We can see that:

  • Block 0 of file big3 is on hadoop22 (rack 2), hadoop33 (rack 3), hadoop32 (rack 3)
  • Block 1 of file big3 is on hadoop22 (rack 2), hadoop33 (rack 3), hadoop32 (rack 3)

And for big4:

# bin/hadoop fsck /big4 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big4 at Mon Jan 06 12:12:52 UTC 2014
/big4 130633102 bytes, 2 block(s):  OK
0. blk_-8460635609179877275_1014 len=67108864 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.122:50010]
1. blk_8256749577534695387_1014 len=63524238 repl=3 [/rack01/10.0.1.112:50010, /rack02/10.0.1.122:50010, /rack01/10.0.1.113:50010]
...

We can see that:

  • Block 0 of file big4 is on hadoop22 (rack 2), hadoop32 (rack 3), hadoop33 (rack 3)
  • Block 1 of file big4 is on hadoop22 (rack 2), hadoop12 (rack 1), hadoop13 (rack 1)

Conclusion – when the HDFS client is inside the Hadoop cluster, each block gets one replica on the local node (hadoop22 in this case) and the two other replicas on a different, random rack, as expected.

Just for fun – create a file with more than three replicas

We can override Hadoop parameters when invoking the shell, using the -D option. Let’s use it to create a file with four replicas from node hadoop22:

# /usr/local/hadoop-1.2.1/bin/hadoop fs -D dfs.replication=4 -copyFromLocal ~/big /big5
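
To double-check the requested factor before running fsck, hadoop fs -ls works too – the second column of its output is the file’s replication factor:

# /usr/local/hadoop-1.2.1/bin/hadoop fs -ls /big5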

Running the report on big5:

# bin/hadoop fsck /big5 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big5 at Mon Jan 06 12:36:28 UTC 2014
/big5 130633102 bytes, 2 block(s):  OK
0. blk_-4060957873241541489_1015 len=67108864 repl=4 [/rack01/10.0.1.112:50010, /rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.122:50010]
1. blk_8361194277567439305_1015 len=63524238 repl=4 [/rack03/10.0.1.133:50010, /rack01/10.0.1.113:50010, /rack01/10.0.1.112:50010, /rack02/10.0.1.122:50010]
...

We can see that:

  • Block 0 of file big5 is on hadoop22 (rack 2), hadoop33 (rack 3), hadoop32 (rack 3), hadoop12 (rack 1)
  • Block 1 of file big5 is on hadoop22 (rack 2), hadoop13 (rack 1), hadoop12 (rack 1), hadoop33 (rack 3)

So far, it works as expected! That’s a nice surprise…

Finally, let’s try to break something 🙂

What happens when we try to specify more replicas than there are data nodes?

While it may sound silly, it actually happens in a pseudo-distributed config whenever you set dfs.replication > 1, as you have only one DataNode.
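
On a pseudo-distributed setup, for example, even this hypothetical copy would leave the file under-replicated, since there is nowhere to put the second replica:

# bin/hadoop fs -D dfs.replication=2 -copyFromLocal ~/big /big_pseudo

Back on my six-node cluster, let’s ask for seven replicas: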

# /usr/local/hadoop-1.2.1/bin/hadoop fs -D dfs.replication=7 -copyFromLocal ~/big /big7

We have only six data nodes, so at most six replicas are possible. Running the report on big7:

# bin/hadoop fsck /big7 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big7 at Mon Jan 06 12:40:05 UTC 2014
/big7 130633102 bytes, 2 block(s):  Under replicated blk_1565930173693194428_1016. Target Replicas is 7 but found 6 replica(s).
 Under replicated blk_-2825553771191528109_1016. Target Replicas is 7 but found 6 replica(s).
0. blk_1565930173693194428_1016 len=67108864 repl=6 [/rack01/10.0.1.112:50010, /rack01/10.0.1.113:50010, /rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.123:50010, /rack02/10.0.1.122:50010]
1. blk_-2825553771191528109_1016 len=63524238 repl=6 [/rack01/10.0.1.112:50010, /rack01/10.0.1.113:50010, /rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.123:50010, /rack02/10.0.1.122:50010]

Status: HEALTHY
 Total size:    130633102 B
 Total dirs:    0
 Total files:    1
 Total blocks (validated):    2 (avg. block size 65316551 B)
 Minimally replicated blocks:    2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    2 (100.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    6.0
 Corrupt blocks:        0
 Missing replicas:        2 (16.666666 %)
 Number of data-nodes:        6
 Number of racks:        3
FSCK ended at Mon Jan 06 12:40:05 UTC 2014 in 1 milliseconds

We can see that each data node has a single copy of each block, and fsck notifies us that the two blocks are under-replicated. Nice!

Changing the replication factor of an existing file

Since this warning in fsck might be annoying (and, depending on your monitoring scripts, may set off various alarms…), we should change the replication factor of this file:

# bin/hadoop fs -setrep 6 /big7
Replication 6 set: hdfs://hadoop11:54310/big7
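
By the way, setrep also accepts -R, so you can change a whole directory tree recursively – for example, with a hypothetical directory /data:

# bin/hadoop fs -setrep -R 3 /data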

In this case, the NameNode won’t immediately delete the extra replicas:

# bin/hadoop fsck /big7 -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /big7 at Mon Jan 06 12:51:00 UTC 2014
/big7 130633102 bytes, 2 block(s):  OK
0. blk_1565930173693194428_1016 len=67108864 repl=6 [/rack01/10.0.1.112:50010, /rack01/10.0.1.113:50010, /rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.123:50010, /rack02/10.0.1.122:50010]
1. blk_-2825553771191528109_1016 len=63524238 repl=6 [/rack01/10.0.1.112:50010, /rack01/10.0.1.113:50010, /rack03/10.0.1.132:50010, /rack03/10.0.1.133:50010, /rack02/10.0.1.123:50010, /rack02/10.0.1.122:50010]

Status: HEALTHY
 Total size:    130633102 B
 Total dirs:    0
 Total files:    1
 Total blocks (validated):    2 (avg. block size 65316551 B)
 Minimally replicated blocks:    2 (100.0 %)
 Over-replicated blocks:    2 (100.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    6.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        6
 Number of racks:        3
FSCK ended at Mon Jan 06 12:51:00 UTC 2014 in 1 milliseconds

The filesystem under path '/big7' is HEALTHY

If you are in a hurry, adding -w to the command makes it wait until the new replication level actually takes effect:

# bin/hadoop fs -setrep 5 -w /big7
Replication 5 set: hdfs://hadoop11:54310/big7
Waiting for hdfs://hadoop11:54310/big7 .... done

5 thoughts on “Exploring HDFS Block Placement Policy”

  • Well, the policy is just an optimization. If an optimized location is not available, any node in the cluster is OK. For example, it could be that most of the nodes in rack A ran out of space, so blocks will not be written there, etc.