---++ Contents
* Onboarding Steps
* Sample Pipeline
* [[HiveIntegration][Hive Examples]]
---+++ Onboarding Steps
   * Create a cluster definition for the cluster, specifying the name node, job tracker, workflow engine endpoint and messaging endpoint. Refer to [[EntitySpecification][cluster definition]] for details.
   * Create feed definitions for each of the inputs and outputs, specifying frequency, data path and ownership. Refer to [[EntitySpecification][feed definition]] for details.
   * Create a process definition for your job. The process defines the configuration for the workflow job; the important attributes are frequency, inputs/outputs and workflow path. Refer to [[EntitySpecification][process definition]] for details.
   * Define the workflow for your job using the workflow engine (only Oozie is supported as of now). Refer to the [[http://oozie.apache.org/docs/3.1.3-incubating/WorkflowFunctionalSpec.html][Oozie Workflow Specification]]. The libraries required by the workflow should be available in the =lib= folder under the workflow path.
   * Set up the workflow definition, libraries and referenced scripts on hadoop.
   * Submit the cluster definition.
   * Submit and schedule the feed and process definitions (see the CLI sketch after this list).
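For example, assuming the =falcon= client is on the path and the entity XMLs are saved under the illustrative file names below, submission and scheduling look like this:
<verbatim>
# file names are illustrative; use the entity names from your own definitions
falcon entity -type cluster -submit -file cluster.xml

falcon entity -type feed -submit -file input-feed.xml
falcon entity -type feed -submit -file output-feed.xml
falcon entity -type process -submit -file process.xml

falcon entity -type feed -schedule -name SampleInput
falcon entity -type feed -schedule -name SampleOutput
falcon entity -type process -schedule -name SampleProcess
</verbatim>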
---+++ Sample Pipeline
---++++ Cluster
Cluster definition that contains the endpoints for the name node, job tracker, Oozie and the JMS server:
The cluster locations MUST be created on HDFS prior to submitting a cluster entity to Falcon.
   * *staging* must have 777 permissions and its parent dirs must have execute permissions
   * *working* must have 755 permissions and its parent dirs must have execute permissions
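Below is a sketch of such a cluster definition; the colo, endpoint hosts and versions are illustrative and should be replaced with the values for your environment.
<verbatim>
<?xml version="1.0"?>
<!-- Cluster configuration; endpoints and versions are illustrative -->
<cluster colo="ua2" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://name-node.com:50070" version="0.20.2" />
        <interface type="write" endpoint="hdfs://name-node.com:54310" version="0.20.2" />
        <interface type="execute" endpoint="job-tracker.com:54311" version="0.20.2" />
        <interface type="workflow" endpoint="http://oozie.com:11000/oozie/" version="3.1.4" />
        <interface type="messaging" endpoint="tcp://jms-server.com:61616?daemon=true" version="5.1.6" />
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/projects/falcon/working" />
    </locations>
</cluster>
</verbatim>
The locations can be created upfront with the required permissions using standard =hadoop fs= commands:
<verbatim>
hadoop fs -mkdir -p /projects/falcon/staging /projects/falcon/working   # -p needs Hadoop 2.x; 1.x mkdir creates parents by default
hadoop fs -chmod 777 /projects/falcon/staging
hadoop fs -chmod 755 /projects/falcon/working
</verbatim>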
---++++ Input Feed
Hourly feed that defines feed path, frequency, ownership and validity:
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hourly sample input data -->
<feed description="sample input data" name="SampleInput" xmlns="uri:falcon:feed:0.1">
    <groups>group</groups>

    <frequency>hours(1)</frequency>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="corp" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bootcamp/data/${YEAR}-${MONTH}-${DAY}-${HOUR}/SampleInput" />
        <location type="stats" path="/projects/bootcamp/stats/SampleInput" />
        <location type="meta" path="/projects/bootcamp/meta/SampleInput" />
    </locations>

    <ACL owner="suser" group="users" permission="0755" />

    <schema location="/none" provider="none" />
</feed>
</verbatim>
---++++ Output Feed
Daily feed that defines feed path, frequency, ownership and validity:
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Daily sample output data -->
<feed description="sample output data" name="SampleOutput" xmlns="uri:falcon:feed:0.1">
    <groups>group</groups>

    <frequency>days(1)</frequency>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="corp" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bootcamp/output/${YEAR}-${MONTH}-${DAY}/SampleOutput" />
        <location type="stats" path="/projects/bootcamp/stats/SampleOutput" />
        <location type="meta" path="/projects/bootcamp/meta/SampleOutput" />
    </locations>

    <ACL owner="suser" group="users" permission="0755" />

    <schema location="/none" provider="none" />
</feed>
</verbatim>
---++++ Process
Sample process that runs daily at the 6th hour on the corp cluster. It takes one input, !SampleInput, for the previous day (24 hourly instances), and generates one output, !SampleOutput, for the previous day. The workflow is defined at /projects/bootcamp/workflow/workflow.xml, and any libraries required by the workflow should be under /projects/bootcamp/workflow/lib. The process also defines the properties queueName, ssh.host and fileTimestamp, which are passed to the workflow. In addition, Falcon exposes the following properties to the workflow: nameNode and jobTracker (hadoop properties), and input and output (the input/output feed paths).
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Daily sample process. Runs at the 6th hour every day; validity window is illustrative -->
<process name="SampleProcess" xmlns="uri:falcon:process:0.1">
    <cluster name="corp" />

    <frequency>days(1)</frequency>

    <validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" />

    <inputs>
        <input name="input" feed="SampleInput" start="yesterday(0,0)" end="today(-1,0)" />
    </inputs>

    <outputs>
        <output name="output" feed="SampleOutput" instance="yesterday(0,0)" />
    </outputs>

    <properties>
        <property name="queueName" value="reports" />
        <property name="ssh.host" value="host.com" />
        <property name="fileTimestamp" value="${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}" />
    </properties>

    <workflow engine="oozie" path="/projects/bootcamp/workflow" />
</process>
</verbatim>
---++++ Oozie Workflow
The sample user workflow contains three actions:
   * Pig action - executes the pig script /projects/bootcamp/workflow/script.pig
   * concatenator - Java action that concatenates the part files and generates a single file
   * file upload - ssh action that gets the concatenated file from hadoop and sends it to a remote host
<verbatim>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sample-wf">
    <start to="pig" />

    <action name="pig">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>${nameNode}/projects/bootcamp/workflow/script.pig</script>
            <param>input=${input}</param>
            <param>output=${output}</param>
            <file>lib/dependent.jar</file>
        </pig>
        <ok to="concatenator" />
        <error to="fail" />
    </action>

    <action name="concatenator">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.wf.Concatenator</main-class>
            <arg>${output}</arg>
            <arg>${nameNode}/projects/bootcamp/concat/data-${fileTimestamp}.csv</arg>
        </java>
        <ok to="fileupload" />
        <error to="fail" />
    </action>

    <action name="fileupload">
        <ssh>
            <host>localhost</host>
            <command>/tmp/fileupload.sh</command>
            <args>${nameNode}/projects/bootcamp/concat/data-${fileTimestamp}.csv</args>
            <args>${wf:conf("ssh.host")}</args>
            <capture-output />
        </ssh>
        <ok to="fileUploadDecision" />
        <error to="fail" />
    </action>

    <decision name="fileUploadDecision">
        <switch>
            <case to="end">
                ${wf:actionData('fileupload')['output'] == '0'}
            </case>
            <default to="fail" />
        </switch>
    </decision>

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end" />
</workflow-app>
</verbatim>
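As noted in the onboarding steps, the workflow definition and the libraries it references have to be staged on hadoop under the workflow path before the process is scheduled. A minimal sketch, assuming the files sit in the current local directory:
<verbatim>
hadoop fs -mkdir -p /projects/bootcamp/workflow/lib
hadoop fs -put workflow.xml /projects/bootcamp/workflow/workflow.xml
hadoop fs -put script.pig /projects/bootcamp/workflow/script.pig
hadoop fs -put dependent.jar /projects/bootcamp/workflow/lib/dependent.jar
</verbatim>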
---++++ File Upload Script
The script gets the file from hadoop, rsyncs it to /tmp on the remote host and deletes it from hadoop:
<verbatim>
#!/bin/bash

# Report a non-zero exit code back to the workflow via captured output
trap 'echo "output=$?"; exit $?' ERR INT TERM

echo "Arguments: $@"
SRCFILE=$1
DESTHOST=$2
FILENAME=`basename $SRCFILE`

# Fetch the concatenated file from hadoop to the local /tmp
rm -f /tmp/$FILENAME
hadoop fs -copyToLocal $SRCFILE /tmp/
echo "Copied $SRCFILE to /tmp"

# Push the file to /tmp on the remote host
rsync -ztv --rsh=ssh --stats /tmp/$FILENAME $DESTHOST:/tmp
echo "rsynced $FILENAME to $DESTHOST:/tmp"

# Remove the file from hadoop and clean up locally
hadoop fs -rmr $SRCFILE
echo "Deleted $SRCFILE"

rm -f /tmp/$FILENAME
echo "output=0"
</verbatim>