Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Friday, May 31, 2013

Create a hadoop cluster on an aws virtual private cloud

There are plenty tutorials out there about how to install hadoop in single node mode, or in pseudo multi-node, or how to set up your home computers to run a cluster. But why not going for the real thing and unleash the power of the cluster using aws machines?

It turns out that you can setup a cluster of over ten machines on amazon, and if you keep you experiment for a weekend, it will not cost you more than a  few tens of bucks. Furthermore, it is an excellent opportunity to play with aws vpc's (virtual private clouds) and a fully distributed hadoop installation. So here we go.

We are going to create a vpc network with a gateway node and a number of node on a private subnet.
First off, we need two keys for the credentials. Let's call them ec2-hadoop and hadoop, respectively for the gateway key credential, and for the private subnet node credentials.

Then, let's create a virtual private cloud on amazon.

We choose for a virtual private cloud with two subnets, one public and one private.
We are going to place the gateway on the public subnet, while the hadoop cluster of machine will be located on the private subnet.

The subnets have a 8bit iprange.
The public subnet is configured on (which means ips like 10.0.0.x, for a total of 255 addresses)
The private subnet is configured on (which means ips like 10.0.1.x , for a total of 255 addresses)

Aws creates the vpc components

Completed! We have our vpc.

As part of this automatic configuration, aws will provide a public elastic ip for us to connect to, and will automatically start 1 instance which is going to be used as gateway and NAT.

Switching from the VPC to the EC2 dashboard: we do see one instance running (and its ebs volume) the keypairs and the elastic IP. So far so good. :)

A quick check on the EC2 gateway machine. You can see the address is in the range - the private ip address of the gateway is

It's time to spawn some nodes for the cluster! Select the Ubuntu 12.04 LTS 64bit image.
We need specialized security groups for the VPC. Those groups are separated from the standalone EC2 groups.
First let's create a security group for the hadoop nodes on the private subnet. Let's call it the "hadoop" security group
Let's use the defaul vpc security group for the gateway.
"hadoop" security group, means that all the hadoop nodes will be open for inbound and outbound on all ports/interfaces. Only among each other though. This is good since we are in a trusted zone and hadoop opens up many ports.
Let's open up incoming port 22 from anywhere (, for ssh access on the gateway node.
Then, make sure that the two vpc security groups can talk between each other. so all intra-cluster communications are allowed.
Time to spawn the hadoop nodes. Again select the image (ubuntu 12 64bit)
Select ubuntu 12.04 LTS 64bit

Thereafter go ahead and pick as many machines as you can afford :p
By default you would get an ephimeral disk. In this setup, This disk is removed and replaced with a ECB disk.

For the hadoop nodes, select the already configured hadoop security group.
All ready. Launch and wait for the hadoop nodes in your virtual cluster to be instantiated and provisioned.
Make sure that the gateway accept incoming connections on all port from the hadoop cluster ports.
As well as the hadoop node, should accept incomint connection on all ports from any other hadoop node and the gateway node
Now, let's install the cloudera manager on the gatway node. This manager has a web interface which we can (temporarily)  enable on port 7180.

Now, download the clodera manager installer and install it on the gateway node (this was done via ssh and scp, not shown here).

You don't need to install it on all the machines just on the gateway. The installer itself we propagate to the other nodes of the cluster. 

You should access this via your public ip address on port 7180. Such us http://<public-ip-address>:7180

In this screen capture the ip address is ....
PS do not use this address. This aws VPC has been destroyed. I am displaying this only to let you not where and what to look for the public ip address. when you will build your own hadoop cluster.
More hadoop nodes!

11 nodes of hadoop big data crunching power coming up. :p
Instantiating and provisioning with ubuntu 12.04 ...
All set. We have our machine with ubuntu installed on them. All in the private subnet 10.0.1.x
Just installing the latest free edition in this demo.

Provide a patter for all the machines belonging to the hadoop cluster. You can use ranges of ip addresses by using the hyphen notation.

The cloudera manager on the gateway node pings the hadoop nodes. Yep, they are reachable

Install the latest cloudera (4.2.0 using parcels)
The manager starts a parallel installation  of the cloudera management framework on the hadoop nodes.

The hadoop cloudera manager has been successfully propagated on all the nodes.
Hadoop software is now installed on all the nodes.
Select the components you would like to install.
In this case just HDFS and MRv1

The cloudera manager will map all the master services on one node and keep all the others for data cruching as slave nodes.
You can however provide more spreading in the service, for instance by putting the namenode and the secondary namenode services on two physical nodes.
Cloudera Manager will start the various services.
Once this is done, the dashboard will appear.

Happy mapreducing on your very own aws hadoop cluster. Enjoy!