First Walk in the Clouds

During the week I tried the Hadoop framework for the first time. I wrote a proof-of-concept prototype for an application that we are likely going to develop. I managed to test my code with unit tests (MRUnit), with a local integration test that starts an embedded Hadoop, and by running it pseudo-distributed on my local Hadoop cluster. The final step was to test it on a real cluster in Amazon EC2.

I had never started an AMI in EC2 before, so everything was brand new to me. Access to your AWS account, as well as to your instances, is well protected. Setting up proper access to my EC2 instances was very bumpy, especially since I made a mistake with one of the private key files. Unfortunately, the error message I got was not very helpful, and I spent quite some time finding the problem.

If you want to use EC2 you need the following security credentials. Sign-In Credentials: this is basically an email address and a password protecting your AWS account. You need to keep these really safe.

Access Credentials: these consist of three different sub-groups.

First there are the Access Keys. Each EC2 user can have up to two Access Keys, and each Access Key has an Access Key ID and a Secret Access Key. You add them to your system environment variables as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

The next subgroup are the X.509 Certificates. Again, you can generate two X.509 Certificates at a time for each EC2 account. Create a new certificate in the AWS Management Console and download the public and the private key. The public key will be in a file that starts with cert-xxxx, the private key in a file that starts with pk-xxxx. Copy these two files somewhere and add their full locations to your system environment variables as EC2_PRIVATE_KEY and EC2_CERT.

The last subgroup are the Key Pairs. A Key Pair is used when you launch an AMI by executing the ec2-run-instances command. This is an additional protection that restricts access to your running instance. A Key Pair also has a private and a public part. Amazon will keep the public key of the Key Pair and store it with your instance. To connect to the instance you need the private key part of the Key Pair.
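To make this concrete, here is roughly what the environment setup might look like in a shell profile. The key values and file names below are placeholders, not real credentials — your actual pk-xxxx/cert-xxxx files have long random suffixes:

```shell
# Access Keys (placeholder values, not real credentials)
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEYID"
export AWS_SECRET_ACCESS_KEY="exampleSecretAccessKey"

# X.509 certificate files downloaded from the AWS Management Console
# (placeholder paths; point these at wherever you copied the files)
export EC2_PRIVATE_KEY="$HOME/.ec2/pk-xxxx.pem"
export EC2_CERT="$HOME/.ec2/cert-xxxx.pem"
```

Put this in something like ~/.bashrc so the EC2 command line tools can pick the variables up in every shell.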

Run the command "ec2-add-keypair foo" to create a Key Pair named foo. This returns the private key part, which you have to copy into a file. This is where I made a stupid mistake: I copied only the part between BEGIN and END into the file, but the file needs to contain the whole output. So it is much better to run this instead: "ec2-add-keypair foo > ~/.ssh/foo.keypair.ssh". This automatically sends the whole output to a new file in the .ssh directory. Finally, give your new file the right permissions using "chmod 0700 ~/.ssh/foo.keypair.ssh". For further reading I recommend this page, which helped me fix my problem.

So if you try to ssh -i into your instance and it asks you for a passphrase, something is not correct with the private key part of your Key Pair. Another manifestation of the same problem appears if you do not connect to your instance with ssh but use the Cloudera scripts instead. If you are following the Cloudera guide for running the Cloudera Distribution for Hadoop AMI, and you are on chapter 2.3 Running Jobs and execute "hadoop fs -ls /", you get this:

WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
10/03/12 15:09:31 INFO ipc.Client: Retrying connect to server: Already tried 0 time(s).

It was the same problem for me. Having the correct private key part of the Key Pair fixed it.
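The whole Key Pair workflow can be sketched like this. Since ec2-add-keypair needs a real AWS account, the block below simulates its output with a placeholder function that has the same shape (a KEYPAIR header line followed by the key block); the point is that the file keeps the entire output, not just a hand-copied fragment:

```shell
#!/bin/sh
# Stand-in for "ec2-add-keypair foo" -- a real run talks to AWS.
# The placeholder mimics the shape of the real output: a KEYPAIR
# line with the fingerprint, then the private key block.
fake_ec2_add_keypair() {
  cat <<'EOF'
KEYPAIR foo 00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff
-----BEGIN RSA PRIVATE KEY-----
(placeholder key material)
-----END RSA PRIVATE KEY-----
EOF
}

mkdir -p ~/.ssh
# Redirect the WHOLE output into the file -- do not copy only the
# part between the BEGIN and END lines by hand.
fake_ec2_add_keypair > ~/.ssh/foo.keypair.ssh
# Restrict the permissions, otherwise ssh will refuse the key file.
chmod 0700 ~/.ssh/foo.keypair.ssh
```

With the real command you would then connect using "ssh -i ~/.ssh/foo.keypair.ssh" and the public DNS name of your instance; if ssh asks for a passphrase at that point, the key file is broken.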

The last bit of EC2 protection are the Account Identifiers. I think they are only relevant if you plan to share AWS resources between different accounts, but I am not sure.