Use Stratosphere with Amazon Elastic MapReduce
This step-by-step tutorial will guide you through the setup of Stratosphere using Amazon Elastic MapReduce.
Background
Amazon Elastic MapReduce (Amazon EMR) is part of Amazon Web services. EMR allows to create Hadoop clusters that analyze data stored in Amazon S3 (AWS' cloud storage). Stratosphere runs on top of Hadoop using the recently released cluster resource manager YARN. YARN allows to use many different data analysis tools in your cluster side by side. Tools that run with YARN are, for example Apache Giraph, Spark or HBase. Stratosphere also runs on YARN and that's the approach for this tutorial.
1. Step: Login to AWS and prepare secure access
- Log in to the AWS Console
You need to have SSH keys to access the Hadoop master node. If you do not have keys for your computer, generate them:
- Select EC2 and click on "Key Pairs" in the "NETWORK & SECURITY" section.
- Click on "Create Key Pair" and give it a name
- After pressing "Yes" it will download a .pem file.
- Change the permissions of the .pem file
chmod og-rwx ~/work-laptop.pem
2. Step: Create your Hadoop Cluster in the cloud
- Select Elastic MapReduce from the AWS console
- Click the blue "Create cluster" button.
- Choose a Cluster name
- You can let the other settings remain unchanged (termination protection, logging, debugging)
- For the Hadoop distribution, it is very important to choose one with YARN support. We use 3.0.3 (Hadoop 2.2.0) (the minor version might change over time)
- Remove all applications to be installed (unless you want to use them)
- Choose the instance types you want to start. Stratosphere runs fine with m1.large instances. Core and Task instances both run Stratosphere, but only core instances contain HDFS data nodes.
- Choose the EC2 key pair you've created in the previous step!
- Thats it! You can now press the "Create cluster" button at the end of the form to boot it!
3. Step: Launch Stratosphere
You might need to wait a few minutes until Amazon started your cluster. (You can monitor the progress of the instances in EC2). Use the refresh button in the top right corner.
You see that the master is up if the field Master public DNS contains a value (first line), connect to it using SSH.
ssh hadoop@<your master public DNS> -i <path to your .pem>
# for my example, it looks like this:
ssh hadoop@ec2-54-213-61-105.us-west-2.compute.amazonaws.com -i ~/Downloads/work-laptop.pem
- Download and extract Stratosphere-YARN
- Start Stratosphere in the cluster using Hadoop YARN
-n
number of TaskManagers (=workers). This number must not exeed the number of task instances-jm
memory (heapspace) for the JobManager-tm
memory for the TaskManagers
wget http://stratosphere-bin.s3-website-us-east-1.amazonaws.com/stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
# extract it
tar xvzf stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
cd stratosphere-yarn-0.5-SNAPSHOT/
./bin/yarn-session.sh -n 4 -jm 1024 -tm 3000
JobManager is now running on N/A:6123
JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123
4. Step: Launch a Stratosphere Job
This step shows how to submit and monitor a Stratosphere Job in the Amazon Cloud.- Open an additional terminal and connect again to the master of your cluster. We recommend to create a SOCKS-proxy with your SSH that allows you to easily connect into the cluster. (If you've already a VPN setup with EC2, you can probably use that as well.)
- Configure a browser to use the SOCKS proxy. Open a browser with SOCKS proxy support (such as Firefox). Ideally, do not use your primary browser for this, since ALL traffic will be routed through Amazon.
ssh -D localhost:2001 hadoop@<your master dns name> -i <your pem file>
-D localhost:2001
argument: It opens a SOCKS proxy on your computer allowing any application to use it to communicate through the proxy via an SSH tunnel to the master node. This allows you to access all services in your EMR cluster, such as the HDFS NameNode or the YARN web interface.
Since you're connected to the master now, you can open several web interfaces:
YARN Resource Manager: http://<masterIPAddress>:9026/
HDFS NameNode: http://<masterIPAddress>:9101/
You find the masterIPAddress
by entering ifconfig
into the terminal:
[hadoop@ip-172-31-38-95 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 02:CF:8E:CB:28:B2
inet addr:172.31.38.95 Bcast:172.31.47.255 Mask:255.255.240.0
inet6 addr: fe80::cf:8eff:fecb:28b2/64 Scope:Link
RX bytes:166314967 (158.6 MiB) TX bytes:89319246 (85.1 MiB)
Optional: If you want to use the hostnames within your Firefox (that also makes the NameNode links work), you have to enable DNS resolution over the SOCKS proxy. Open the Firefox config about:config
and set network.proxy.socks_remote_dns
to true
.
The YARN ResourceManager also allows you to connect to Stratosphere's JobManager web interface. Click the ApplicationMaster link in the "Tracking UI" column.
To run the Wordcount example, you have to upload some sample data.
# download a text
wget http://www.gnu.org/licenses/gpl.txt
# upload it to HDFS:
hadoop fs -copyFromLocal gpl.txt /input
To run a Job, enter the following command into the master's command line:
# optional: go to the extracted directory
cd stratosphere-yarn-0.5-SNAPSHOT/
# run the wordcount example
./bin/stratosphere run -w -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar -a 16 hdfs:///input hdfs:///output
Make sure that the number of TaskManager's have connected to the JobManager.
Lets go through the command in detail:
./bin/stratosphere
is the standard launcher for Stratosphere jobs from the command line- The
-w
flag stands for "wait". It is a very useful to track the progress of the job. -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar
the-j
command sets the jar file containing the job. If you have you own application, place your Jar-file here.-a 16 hdfs:///input hdfs:///output
the-a
command specifies the Job-specific arguments. In this case, the wordcount expects the following input<numSubStasks> <input> <output>
.
You can monitor the progress of your job in the JobManager webinterface. Once the job has finished (which should be the case after less than 10 seconds), you can analyze it there. Inspect the result in HDFS using:
hadoop fs -tail /output
If you want to shut down the whole cluster in the cloud, use Amazon's webinterface and click on "Terminate cluster". If you just want to stop the YARN session, press CTRL+C in the terminal. The Stratosphere instances will be killed by YARN.
Written by Robert Metzger (@rmetzger_).