In this post I'll explore how HDFS default values really work. I found the behavior quite surprising and non-intuitive, so there's a good lesson here.
In my case, I have a local virtual Hadoop cluster plus a separate client node outside the cluster, and I'll copy a file from that external client into HDFS under two different configurations.
All the nodes of my Hadoop (1.2.1) cluster have the same hdfs-site.xml file with the same (non-default) value for dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x) of 134217728, which is 128MB. On my external node, I also have the Hadoop executables, with a minimal hdfs-site.xml.
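For reference, the relevant property in each cluster node's hdfs-site.xml looks roughly like this (a sketch; only the dfs.block.size name and value come from my actual setup):

<property>
  <name>dfs.block.size</name> <!-- renamed dfs.blocksize in Hadoop 2.x -->
  <value>134217728</value>    <!-- 128MB -->
</property>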
First, I have set dfs.block.size to 268435456 (256MB) in my client hdfs-site.xml and copied a 400MB file to HDFS:
./hadoop fs -copyFromLocal /sw/400MB.file /user/ofir
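As an aside, since the Hadoop fs shell accepts the generic -D option, the same client-side override should also work per invocation, without editing the client's hdfs-site.xml at all (a sketch, not the command I actually ran here):

./hadoop fs -D dfs.block.size=268435456 -copyFromLocal /sw/400MB.file /user/ofir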
Checking its block size from the NameNode:
./hadoop fsck /user/ofir/400MB.file -files -blocks -racks
FSCK started by root from /10.0.1.111 for path /user/ofir/400MB.file at Thu Jan 30 22:21:29 UTC 2014
/user/ofir/400MB.file 419430400 bytes, 2 block(s):  OK
0. blk_-5656069114314652598_27957 len=268435456 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack02/10.0.1.122:50010]
1. blk_3668240125470951962_27957 len=150994944 repl=3 [/rack03/10.0.1.133:50010, /rack03/10.0.1.132:50010, /rack01/10.0.1.113:50010]

Status: HEALTHY
 Total size:                    419430400 B
 Total dirs:                    0
 Total files:                   1
 Total blocks (validated):      2 (avg. block size 209715200 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          6
 Number of racks:               3
FSCK ended at Thu Jan 30 22:21:29 UTC 2014 in 1 milliseconds
So far, looks good: the first block is 256MB, and the second block holds the remaining 144MB (150994944 bytes) of the 400MB file. The client-side dfs.block.size of 256MB was used, not the cluster's 128MB.
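By the way, a quicker way to check a single file's block size, without running a full fsck, should be hadoop fs -stat with the %o format option, which prints the file's block size in bytes (again a sketch, not part of the original run):

./hadoop fs -stat %o /user/ofir/400MB.file

For this file, that should print 268435456.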