TL;DR Why and how I created a working 9-node Hadoop Cluster on my laptop
In this post I’ll cover why I wanted a decent multi-node Hadoop cluster on my laptop, why I chose not to use VirtualBox/VMware Player, what LXC (Linux Containers) is, and how I set it up. The last part is a bit specific to my desktop O/S (Ubuntu 13.10).
Why install a fully-distributed Hadoop cluster on my laptop?
Hadoop has a “laptop mode” called pseudo-distributed mode. In that mode, you run a single copy of each service (for example, a single HDFS namenode and a single HDFS datanode), all listening on localhost.
This mode is rather useful for basic development and testing – if all you want is to write some code that uses the services and check whether it runs correctly (or at all). However, if your focus is the platform itself – an admin perspective – then that mode is quite limited. Trying out many of Hadoop’s features requires a real cluster – for example replication, failure testing, rack awareness, HA (of every service type) etc.
Why VirtualBox or VMware Player are not a good fit?
When you want to run several or many virtual machines, the overhead of this desktop virtualization software is big. Specifically, as they run a full-blown guest O/S inside them, they:
- Consume plenty of RAM per guest – always a premium on laptops.
- Consume plenty of disk space – which is limited, especially on SSD.
- Take quite a while to start / stop
To run many guests on a laptop, a more lightweight infrastructure is needed.
What is LXC (Linux Containers)
LXC is a lightweight virtualization solution for Linux, sometimes called “chroot on steroids”. It leverages a set of Linux kernel features (control groups, kernel namespaces etc) to provide isolation between groups of processes running on the host. In other words, in most cases the containers (virtual machines) use the host kernel and operating system, but are isolated as if they were running in their own environment (including their own IP addresses etc).
In practical terms, I found that LXC containers start in a sub-second. They have no visible memory overhead – for example, if a container runs the namenode, only the namenode consumes RAM, based on its Java settings. They also consume very little disk space, and their file system is stored directly in the host file system (under /var/lib/lxc/container_name/rootfs), which is very useful. You don’t need to statically decide how much RAM or how many virtual cores to allocate to each (though you could) – they consume just what they need. Actually, while the guests are isolated from each other, you can see all their processes together on the host using the ps command as root.
Some other notes – LXC is already used by Apache Mesos and Docker, amongst others. A “1.0” release is targeted for next month (February). If you are using RHEL/CentOS, LXC is provided as a “technical preview” on RHEL 6 (likely best to use a current update).
If you want a detailed introduction, Stéphane Graber, one of the LXC maintainers, started an excellent ten-post series on LXC a couple of weeks ago.
So, how to set it up?
The following are some of my notes. I only tried it on my laptop, with Ubuntu 13.10 64-bit as a guest, so if you have a different O/S flavor, expect some changes.
All commands below are as root.
Start by installing LXC:
apt-get install lxc
The installation also creates a default virtual network bridge device for all containers (lxcbr0). Next, create your first Ubuntu container called test1 – it will take a few minutes as some O/S packages will be downloaded (but they will be cached, so the next containers will be created in a few seconds):
lxc-create -t ubuntu -n test1
Start the container with:
lxc-start -d -n test1
List all your containers (and their IP addresses) with:
lxc-ls --fancy
and open a shell in one with:
lxc-attach -n test1
Stop the container with:
lxc-stop -n test1
You can also clone a container (lxc-clone) or freeze/unfreeze a running one (lxc-freeze / lxc-unfreeze). LXC also has a web GUI with some nice visualizations (though some of it was broken for me); install it with:
wget http://lxc-webpanel.github.io/tools/install.sh -O - | bash
and access it with http://localhost:5000 (default user/password is admin/admin):
Now, some specific details for a Hadoop container
1. Install a JVM inside the container. The default LXC Ubuntu container has a minimal set of packages. At first, I installed OpenJDK in it:
apt-get install openjdk-7-jdk
However, I decided to switch to the Oracle JDK, as that is what’s typically used. That required some extra steps (the first step adds the “add-apt-repository” command, the others set up a java7 repository to ease installation and updates):
apt-get install software-properties-common
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java7-installer
2. Setting the network
By default, LXC uses DHCP to provide dynamic IP addresses to containers. As Hadoop is very sensitive to IP addresses (it occasionally uses them instead of host names), I decided that I want static IPs for my Hadoop containers.
In Ubuntu, the lxc bridge config file is /etc/default/lxc-net – I uncommented the following line:
LXC_DHCP_CONFILE=/etc/lxc/dnsmasq.conf
and created /etc/lxc/dnsmasq.conf with the following contents:
dhcp-host=hadoop11,10.0.1.111
dhcp-host=hadoop12,10.0.1.112
dhcp-host=hadoop13,10.0.1.113
dhcp-host=hadoop14,10.0.1.114
dhcp-host=hadoop15,10.0.1.115
dhcp-host=hadoop16,10.0.1.116
dhcp-host=hadoop17,10.0.1.117
dhcp-host=hadoop18,10.0.1.118
dhcp-host=hadoop19,10.0.1.119
dhcp-host=hadoop21,10.0.1.121
dhcp-host=hadoop22,10.0.1.122
dhcp-host=hadoop23,10.0.1.123
dhcp-host=hadoop24,10.0.1.124
dhcp-host=hadoop25,10.0.1.125
dhcp-host=hadoop26,10.0.1.126
dhcp-host=hadoop27,10.0.1.127
dhcp-host=hadoop28,10.0.1.128
dhcp-host=hadoop29,10.0.1.129
That should cover more containers than I’ll ever need… My naming convention is that the host name is hadoop[rack number][node number] and the IP is 10.0.1.1[rack number][node number]. For example, host hadoop13 is located at rack 1 node 3 with an IP address of 10.0.1.113. This way it is easy to figure out what’s going on even if the logs only have IPs.
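Given that convention, the dnsmasq entries above don’t have to be typed by hand. Here is a small sketch (my own shorthand, not part of the original setup) that regenerates them for two racks of nine nodes, matching the config above:

```shell
# Sketch: generate the dhcp-host lines from the hadoop[rack][node] naming scheme.
# Two racks of nine nodes, as in the config above; adjust the ranges as needed.
for rack in 1 2; do
  for node in 1 2 3 4 5 6 7 8 9; do
    echo "dhcp-host=hadoop${rack}${node},10.0.1.1${rack}${node}"
  done
done
```

Redirecting the output into /etc/lxc/dnsmasq.conf (and a similar loop for /etc/hosts) keeps the two files consistent.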
I’ve also added all of the potential containers to my /etc/hosts and copied them to hadoop11, which I use as a template to clone the rest of the nodes:
10.0.1.111 hadoop11
10.0.1.112 hadoop12
10.0.1.113 hadoop13
10.0.1.114 hadoop14
10.0.1.115 hadoop15
10.0.1.116 hadoop16
10.0.1.117 hadoop17
10.0.1.118 hadoop18
10.0.1.119 hadoop19
10.0.1.121 hadoop21
10.0.1.122 hadoop22
10.0.1.123 hadoop23
10.0.1.124 hadoop24
10.0.1.125 hadoop25
10.0.1.126 hadoop26
10.0.1.127 hadoop27
10.0.1.128 hadoop28
10.0.1.129 hadoop29
I ran into one piece of ugliness regarding DHCP. I use the hadoop11 container as a template, so as I developed my environment I destroyed the rest of the containers and cloned them again a few times. At that point, they got new dynamic IPs – it seems the DHCP server remembered that the static IPs I wanted were already allocated. Eventually, I added this to the end of my “destroy all nodes” script to solve that:
service lxc-net stop
rm /var/lib/misc/dnsmasq.lxcbr0.leases
service lxc-net start
If you run into networking issues – I recommend checking the Ubuntu-specific docs, as lots of what you find on the internet is scary network hacks that are likely not needed in Ubuntu 13.10 (and didn’t work for me).
Another setup note – if you run on BTRFS (a relatively new copy-on-write filesystem), lxc-clone should detect it and leverage it. That didn’t work for me, but copying files directly into /var/lib/lxc/container_name/rootfs using cp --reflink works as expected (you can verify the disk space used after copying by waiting a few seconds and then issuing btrfs filesystem df /var/lib/lxc/).
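To illustrate the reflink copy in isolation, here is a minimal sketch using a throwaway directory (demo/ stands in for the real /var/lib/lxc/container_name layout). Note that --reflink=auto makes a cheap copy-on-write clone on BTRFS and silently falls back to a regular copy on other filesystems:

```shell
# Minimal reflink-copy sketch; "demo" stands in for /var/lib/lxc/container_name
mkdir -p demo/rootfs
echo data > demo/rootfs/file
# On BTRFS this clones extents (near-zero extra space); elsewhere it falls back to a plain copy
cp -a --reflink=auto demo/rootfs demo/rootfs-copy
cat demo/rootfs-copy/file    # prints "data"
```

On a real BTRFS host, running btrfs filesystem df before and after the copy shows how little new space the clone consumes.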
3. Setting Hadoop
This is pretty standard… You should just set up ssh keys, maybe create some O/S users for Hadoop, download the Hadoop bits from Apache or a vendor and get going… Personally, I used Apache Hadoop (1.2.1 and 2.2.0) – I set things up on a single node and then cloned it to create a cluster.
In addition, to ease my tests, I’ve been scripting the startup, shutdown and full reset of the environment based on a config file that I created. For example, here is my current config file for Apache Hadoop 1.2.1 (the scripting works nicely but is still evolving – ping me if you are curious):
hadoop11 namenode zookeeper
hadoop12 datanode tasktracker
hadoop13 datanode tasktracker
hadoop14 datanode tasktracker
hadoop21 jobtracker zookeeper
hadoop22 datanode tasktracker
hadoop23 datanode tasktracker
hadoop24 datanode tasktracker
hadoop29 secondarynamenode hiveserver2 zookeeper
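As a rough illustration of how such a file can drive the scripting, here is a sketch of the core loop (the file name cluster.conf and the commented-out ssh invocation are my assumptions for this example, not the actual script):

```shell
# Sketch: read a "host role role..." config file and act on each role per host.
# Using the first two lines of the config above as sample data.
cat > cluster.conf <<'EOF'
hadoop11 namenode zookeeper
hadoop12 datanode tasktracker
EOF

while read host roles; do
  for role in $roles; do
    echo "start $role on $host"
    # ssh "$host" "hadoop-daemon.sh start $role"   # roughly what a real start script would run
  done
done < cluster.conf
```

The same loop, with a stop or wipe command substituted, covers shutdown and full reset of the environment.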