Use Stratosphere with Amazon Elastic MapReduce

18 Feb 2014

Get started with Stratosphere within 10 minutes using Amazon Elastic MapReduce.

This step-by-step tutorial will guide you through the setup of Stratosphere using Amazon Elastic MapReduce.

Background

Amazon Elastic MapReduce (Amazon EMR) is part of Amazon Web Services. EMR allows you to create Hadoop clusters that analyze data stored in Amazon S3 (AWS' cloud storage). Stratosphere runs on top of Hadoop using the recently released cluster resource manager YARN. YARN lets you run many different data analysis tools in your cluster side by side. Tools that run on YARN include, for example, Apache Giraph, Spark, and HBase. Stratosphere also runs on YARN, and that is the approach we take in this tutorial.

Step 1: Log in to AWS and prepare secure access

You need to have SSH keys to access the Hadoop master node. If you do not have keys for your computer, generate them:

  • Select EC2 and click on "Key Pairs" in the "NETWORK & SECURITY" section.
  • Click on "Create Key Pair" and give it a name
  • After pressing "Yes" it will download a .pem file.
  • Change the permissions of the .pem file
  • chmod og-rwx ~/work-laptop.pem
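
To double-check the permissions, you can list the file (the name ~/work-laptop.pem is just the example used throughout this tutorial):

# the permissions should now show something like -rw------- (readable only by you)
ls -l ~/work-laptop.pem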

Step 2: Create your Hadoop cluster in the cloud

  • Select Elastic MapReduce from the AWS console
  • Click the blue "Create cluster" button.
  • Choose a Cluster name
  • You can leave the other settings unchanged (termination protection, logging, debugging).
  • For the Hadoop distribution, it is very important to choose one with YARN support. We use AMI version 3.0.3 (Hadoop 2.2.0); the minor version might change over time.
  • Remove all applications to be installed (unless you want to use them)
  • Choose the instance types you want to start. Stratosphere runs fine with m1.large instances. Core and Task instances both run Stratosphere, but only core instances contain HDFS data nodes.
  • Choose the EC2 key pair you've created in the previous step!
  • That's it! You can now press the "Create cluster" button at the end of the form to boot it!
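
If you prefer the command line over the web console, recent versions of the AWS CLI can start a comparable cluster. This is only a sketch: the flags can differ between CLI versions, and the cluster name and key pair name below are placeholders you have to replace.

# sketch: one master plus three worker m1.large instances on AMI 3.0.3 (Hadoop 2.2.0)
# "stratosphere-test" and "work-laptop" are placeholders for your cluster name and key pair
aws emr create-cluster --name "stratosphere-test" \
    --ami-version 3.0.3 \
    --instance-type m1.large --instance-count 4 \
    --ec2-attributes KeyName=work-laptop \
    --no-auto-terminate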

Step 3: Launch Stratosphere

You might need to wait a few minutes until Amazon has started your cluster. (You can monitor the progress of the instances in EC2.) Use the refresh button in the top right corner.

You can see that the master is up once the Master public DNS field (first line) contains a value. Connect to it using SSH:

ssh hadoop@<your master public DNS> -i <path to your .pem>
# for my example, it looks like this:
ssh hadoop@ec2-54-213-61-105.us-west-2.compute.amazonaws.com -i ~/Downloads/work-laptop.pem
(Windows users have to follow these instructions to SSH into the machine running the master.)

Once connected to the master, download and start Stratosphere for YARN:
  • Download and extract Stratosphere-YARN
  • wget http://stratosphere-bin.s3-website-us-east-1.amazonaws.com/stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
    # extract it
    tar xvzf stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
  • Start Stratosphere in the cluster using Hadoop YARN
  • cd stratosphere-yarn-0.5-SNAPSHOT/
    ./bin/yarn-session.sh -n 4 -jm 1024 -tm 3000
    The arguments have the following meaning:
    • -n number of TaskManagers (=workers). This number must not exceed the number of task instances.
    • -jm memory (heapspace) for the JobManager
    • -tm memory for the TaskManagers
Once the output has changed from
JobManager is now running on N/A:6123
to
JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123
Stratosphere has started the JobManager. It will take a few seconds until the TaskManagers (workers) have connected to it. To see how many TaskManagers have connected, you have to access the JobManager's web interface. Follow the steps below to do that.
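
For a quick check from the command line, you can also ask YARN on the master whether the session is up (the yarn command is already on the PATH on EMR):

# the Stratosphere YARN session should be listed as a RUNNING application
yarn application -list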

Step 4: Launch a Stratosphere Job

This step shows how to submit and monitor a Stratosphere Job in the Amazon Cloud.
  • Open an additional terminal and connect again to the master of your cluster.
  • We recommend creating a SOCKS proxy with SSH that allows you to easily connect into the cluster. (If you already have a VPN set up with EC2, you can probably use that as well.)
    ssh -D localhost:2001 hadoop@<your master dns name> -i <your pem file>
    Notice the -D localhost:2001 argument: it opens a SOCKS proxy on your computer, allowing any application to communicate through an SSH tunnel with the master node. This allows you to access all services in your EMR cluster, such as the HDFS NameNode or the YARN web interface.
  • Configure a browser to use the SOCKS proxy. Open a browser with SOCKS proxy support (such as Firefox). Ideally, do not use your primary browser for this, since ALL traffic will be routed through Amazon.
    • To configure the SOCKS proxy with Firefox, click on "Edit", "Preferences", choose the "Advanced" tab and press the "Settings ..." button.
    • Enter the details of the SOCKS proxy localhost:2001. Choose SOCKS v4.
    • Close the settings; your browser is now talking to the master node of your cluster.

Since you're connected to the master now, you can open several web interfaces:
YARN Resource Manager: http://<masterIPAddress>:9026/
HDFS NameNode: http://<masterIPAddress>:9101/

You can find the masterIPAddress by entering ifconfig into the master's terminal:

[hadoop@ip-172-31-38-95 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:CF:8E:CB:28:B2  
          inet addr:172.31.38.95  Bcast:172.31.47.255  Mask:255.255.240.0
          inet6 addr: fe80::cf:8eff:fecb:28b2/64 Scope:Link
          RX bytes:166314967 (158.6 MiB)  TX bytes:89319246 (85.1 MiB)

Optional: If you want to use the hostnames within your Firefox (that also makes the NameNode links work), you have to enable DNS resolution over the SOCKS proxy. Open the Firefox config about:config and set network.proxy.socks_remote_dns to true.
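
If you want to test the tunnel without a browser, you can also go through the SOCKS proxy from the command line. curl, for example, speaks SOCKS4; the following sketch fetches the YARN web interface through the proxy (replace <masterIPAddress> with the address you determined above):

# fetch the YARN ResourceManager page through the SOCKS proxy on localhost:2001
curl --socks4 localhost:2001 http://<masterIPAddress>:9026/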

The YARN ResourceManager also allows you to connect to Stratosphere's JobManager web interface. Click the ApplicationMaster link in the "Tracking UI" column.

To run the Wordcount example, you have to upload some sample data.

# download a text
wget http://www.gnu.org/licenses/gpl.txt
# upload it to HDFS:
hadoop fs -copyFromLocal gpl.txt /input
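
To verify that the upload worked, you can list the HDFS root directory and peek at the file:

# check that /input exists in HDFS
hadoop fs -ls /
# print the first lines of the uploaded text
hadoop fs -cat /input | head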

To run a Job, enter the following command into the master's command line:

# optional: go to the extracted directory
cd stratosphere-yarn-0.5-SNAPSHOT/
# run the wordcount example
./bin/stratosphere run -w -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar  -a 16 hdfs:///input hdfs:///output

Make sure that all TaskManagers have connected to the JobManager before you submit the job.

Let's go through the command in detail:

  • ./bin/stratosphere is the standard launcher for Stratosphere jobs from the command line
  • The -w flag stands for "wait". It is very useful for tracking the progress of the job.
  • -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar: the -j option sets the jar file containing the job. If you have your own application, place your jar file here (see the sketch after this list).
  • -a 16 hdfs:///input hdfs:///output: the -a option specifies the job-specific arguments. In this case, WordCount expects the following arguments: <numSubTasks> <input> <output>.
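
Submitting your own application follows the same pattern. The path and arguments below are placeholders, not part of the Stratosphere distribution:

# sketch: run your own job jar with its own arguments
./bin/stratosphere run -w -j /path/to/your-job.jar -a <your job arguments>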

You can monitor the progress of your job in the JobManager web interface. Once the job has finished (which should take less than 10 seconds), you can analyze it there. Inspect the result in HDFS using:

hadoop fs -tail /output
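
If you want to keep the result, you can also copy it out of HDFS to the master's local file system. Depending on the parallelism, /output may be a single file or a directory of files:

# copy the result from HDFS to the local file system of the master
hadoop fs -copyToLocal /output ./wordcount-output
# inspect it locally
ls -l ./wordcount-output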

If you want to shut down the whole cluster in the cloud, use Amazon's web interface and click on "Terminate cluster". If you just want to stop the YARN session, press CTRL+C in the terminal. The Stratosphere instances will then be killed by YARN.
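
If you have lost the terminal that runs the YARN session, you can also stop it by killing the corresponding YARN application. The application ID is shown in the ResourceManager web interface:

# stop the Stratosphere session by killing its YARN application (replace the placeholder ID)
yarn application -kill <applicationId>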



Written by Robert Metzger (@rmetzger_).
