There are plenty of tutorials out there about how to install Hadoop in single-node or pseudo-distributed mode, or how to set up your home computers to run a cluster. But why not go for the real thing and unleash the power of a cluster built on AWS machines?
It turns out that you can set up a cluster of over ten machines on Amazon, and if you keep your experiment to a weekend, it will not cost you more than a few tens of bucks. Furthermore, it is an excellent opportunity to play with AWS VPCs (virtual private clouds) and a fully distributed Hadoop installation. So here we go.
We are going to create a VPC network with a gateway node and a number of nodes on a private subnet.
First off, we need two key pairs for the credentials. Let's call them ec2-hadoop and hadoop: the former for the gateway, the latter for the nodes on the private subnet.
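If you prefer the command line to the console, the same key pairs can be created with the AWS CLI. This is a sketch that assumes the CLI is installed and configured with your credentials; the key names follow the walkthrough above.

```shell
# Create the two key pairs and save the private keys locally.
aws ec2 create-key-pair --key-name ec2-hadoop \
    --query 'KeyMaterial' --output text > ec2-hadoop.pem
aws ec2 create-key-pair --key-name hadoop \
    --query 'KeyMaterial' --output text > hadoop.pem

# ssh refuses keys that are world-readable.
chmod 400 ec2-hadoop.pem hadoop.pem
```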
Then, let's create a virtual private cloud on amazon.
We choose a virtual private cloud with two subnets, one public and one private.
We are going to place the gateway on the public subnet, while the Hadoop cluster machines will be located on the private subnet.
Each subnet has an 8-bit host range (a /24).
The public subnet is configured as 10.0.0.0/24 (which means IPs like 10.0.0.x, for a total of 256 addresses, of which AWS reserves five).
The private subnet is configured as 10.0.1.0/24 (which means IPs like 10.0.1.x, again 256 addresses).
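As a sanity check on the subnet sizing, the address count of a /24 follows directly from the prefix length; this is a quick local computation, nothing AWS-specific:

```shell
# A /24 prefix leaves 32 - 24 = 8 bits for the host part,
# so each subnet spans 2^8 = 256 addresses.
prefix=24
hosts=$(( 1 << (32 - prefix) ))
echo "a /$prefix subnet spans $hosts addresses"
```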
Completed! We have our vpc.
As part of this automatic configuration, AWS will provide a public Elastic IP for us to connect to, and will automatically start one instance that will serve as gateway and NAT.
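Under the hood, the wizard points the private subnet's default route at the NAT instance. Done by hand with the CLI it would look roughly like this (the route table and instance IDs are placeholders; use the ones AWS assigned to you):

```shell
# Send the private subnet's outbound traffic through the NAT/gateway instance.
aws ec2 create-route \
    --route-table-id rtb-xxxxxxxx \
    --destination-cidr-block 0.0.0.0/0 \
    --instance-id i-xxxxxxxx

# A NAT instance must forward traffic that is not addressed to itself,
# so source/destination checking has to be disabled on it.
aws ec2 modify-instance-attribute \
    --instance-id i-xxxxxxxx \
    --no-source-dest-check
```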
Switching from the VPC to the EC2 dashboard, we see one instance running (with its EBS volume), the key pairs, and the Elastic IP. So far so good. :)
A quick check on the EC2 gateway machine. You can see the address is in the 10.0.0.0/24 range: the private IP address of the gateway is 10.0.0.223.
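You can run the same check from your own machine by SSHing into the gateway through its Elastic IP and asking for the interface address (ELASTIC_IP is a placeholder for the address AWS gave you):

```shell
# Connect with the gateway key pair and print the private address;
# it should fall in the 10.0.0.0/24 range.
ssh -i ec2-hadoop.pem ubuntu@ELASTIC_IP 'hostname -I'
```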
It's time to spawn some nodes for the cluster! Select the Ubuntu 12.04 LTS 64-bit image.
We need dedicated security groups for the VPC; these are separate from the standalone EC2 security groups.
First, let's create a security group for the Hadoop nodes on the private subnet. We'll call it the "hadoop" security group.
Let's use the default VPC security group for the gateway.
"hadoop" security group, means that all the hadoop nodes will be open for inbound and outbound on all ports/interfaces. Only among each other though. This is good since we are in a trusted zone and hadoop opens up many ports.
Let's open up incoming port 22 from anywhere (0.0.0.0/0) for SSH access on the gateway node.
Then, make sure the two VPC security groups can talk to each other, so that all intra-cluster communication is allowed.
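The rules above translate to CLI calls along these lines. The group IDs are hypothetical placeholders; substitute the IDs AWS assigned to your two groups:

```shell
SG_HADOOP=sg-aaaaaaaa   # the "hadoop" group on the private subnet
SG_GATEWAY=sg-bbbbbbbb  # the default VPC group, used by the gateway

# hadoop nodes: open on all ports/protocols, but only to each other.
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_HADOOP" --protocol -1 --source-group "$SG_HADOOP"

# Gateway: SSH from anywhere.
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_GATEWAY" --protocol tcp --port 22 --cidr 0.0.0.0/0

# Let the two groups talk to each other in both directions.
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_GATEWAY" --protocol -1 --source-group "$SG_HADOOP"
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_HADOOP" --protocol -1 --source-group "$SG_GATEWAY"
```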
Time to spawn the Hadoop nodes. Again, select the Ubuntu 12.04 LTS 64-bit image.
Then go ahead and pick as many machines as you can afford :p
By default you would get an ephemeral disk. In this setup, that disk is removed and replaced with an EBS volume.
All ready. Launch and wait for the hadoop nodes in your virtual cluster to be instantiated and provisioned.
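The equivalent CLI launch, including the EBS root volume in place of the ephemeral disk, would look something like this; the AMI ID, subnet ID, group ID, count, and volume size are all placeholders to adjust:

```shell
# Launch the Hadoop nodes into the private subnet with the "hadoop"
# key pair and security group, root disk on EBS.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --count 10 \
    --instance-type m1.large \
    --key-name hadoop \
    --security-group-ids sg-aaaaaaaa \
    --subnet-id subnet-xxxxxxxx \
    --block-device-mappings \
      '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":20,"DeleteOnTermination":true}}]'
```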
Make sure that the gateway accepts incoming connections from the Hadoop cluster on all ports.
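Once the nodes are up, a quick reachability check confirms the rules work; 10.0.0.223 is the gateway address seen earlier, and the node addresses are examples from the 10.0.1.0/24 range:

```shell
# From a hadoop node: probe the gateway's SSH port.
nc -zv 10.0.0.223 22

# From the gateway: ping each node's private address.
for ip in 10.0.1.10 10.0.1.11; do
    ping -c 1 -W 2 "$ip"
done
```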