---++ Contents
* Onboarding Steps
* Sample Pipeline
* [[HiveIntegration][Hive Examples]]
---+++ Onboarding Steps
   * Create a cluster definition for the cluster, specifying the name node, job tracker, workflow engine endpoint and messaging endpoint. Refer to [[EntitySpecification][cluster definition]] for details.
   * Create feed definitions for each of the inputs and outputs, specifying frequency, data path and ownership. Refer to [[EntitySpecification][feed definition]] for details.
   * Create a process definition for your job. The process defines the configuration for the workflow job; the important attributes are frequency, inputs/outputs and workflow path. Refer to [[EntitySpecification][process definition]] for details.
   * Define the workflow for your job using the workflow engine (only Oozie is supported as of now). Refer to the [[http://oozie.apache.org/docs/3.1.3-incubating/WorkflowFunctionalSpec.html][Oozie Workflow Specification]]. The libraries required by the workflow should be available in the =lib= folder under the workflow path.
   * Set up the workflow definition, libraries and referenced scripts on hadoop.
   * Submit the cluster definition.
   * Submit and schedule the feed and process definitions (see the CLI sketch after this list).
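For example, assuming the =falcon= client is on the path and the entity XMLs are saved under the illustrative file names below, submission and scheduling look like this:
<verbatim>
# file names are illustrative; use the entity names from your own definitions
falcon entity -type cluster -submit -file cluster.xml

falcon entity -type feed -submit -file input-feed.xml
falcon entity -type feed -submit -file output-feed.xml
falcon entity -type process -submit -file process.xml

falcon entity -type feed -schedule -name SampleInput
falcon entity -type feed -schedule -name SampleOutput
falcon entity -type process -schedule -name SampleProcess
</verbatim>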
---+++ Sample Pipeline
---++++ Cluster
Cluster definition that contains the endpoints for the name node, job tracker, Oozie and the JMS server:
The cluster locations MUST be created on HDFS prior to submitting a cluster entity to Falcon.
   * *staging* must have 777 permissions and its parent dirs must have execute permissions
   * *working* must have 755 permissions and its parent dirs must have execute permissions
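Below is a sketch of such a cluster definition; the colo, endpoint hosts and versions are illustrative and should be replaced with the values for your environment.
<verbatim>
<?xml version="1.0"?>
<!-- Cluster configuration; endpoints and versions are illustrative -->
<cluster colo="ua2" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://name-node.com:50070" version="0.20.2" />
        <interface type="write" endpoint="hdfs://name-node.com:54310" version="0.20.2" />
        <interface type="execute" endpoint="job-tracker.com:54311" version="0.20.2" />
        <interface type="workflow" endpoint="http://oozie.com:11000/oozie/" version="3.1.4" />
        <interface type="messaging" endpoint="tcp://jms-server.com:61616?daemon=true" version="5.1.6" />
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/projects/falcon/working" />
    </locations>
</cluster>
</verbatim>
The locations can be created upfront with the required permissions using standard =hadoop fs= commands:
<verbatim>
hadoop fs -mkdir -p /projects/falcon/staging /projects/falcon/working   # -p needs Hadoop 2.x; 1.x mkdir creates parents by default
hadoop fs -chmod 777 /projects/falcon/staging
hadoop fs -chmod 755 /projects/falcon/working
</verbatim>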
---++++ Input Feed
Hourly feed that defines feed path, frequency, ownership and validity:
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hourly sample input data -->
<feed description="sample input data" name="SampleInput" xmlns="uri:falcon:feed:0.1">
    <groups>group</groups>

    <frequency>hours(1)</frequency>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="corp" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bootcamp/data/${YEAR}-${MONTH}-${DAY}-${HOUR}/SampleInput" />
        <location type="stats" path="/projects/bootcamp/stats/SampleInput" />
        <location type="meta" path="/projects/bootcamp/meta/SampleInput" />
    </locations>

    <ACL owner="suser" group="users" permission="0755" />

    <schema location="/none" provider="none" />
</feed>
</verbatim>
---++++ Output Feed
Daily feed that defines feed path, frequency, ownership and validity:
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Daily sample output data -->
<feed description="sample output data" name="SampleOutput" xmlns="uri:falcon:feed:0.1">
    <groups>group</groups>

    <frequency>days(1)</frequency>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="corp" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bootcamp/output/${YEAR}-${MONTH}-${DAY}/SampleOutput" />
        <location type="stats" path="/projects/bootcamp/stats/SampleOutput" />
        <location type="meta" path="/projects/bootcamp/meta/SampleOutput" />
    </locations>

    <ACL owner="suser" group="users" permission="0755" />

    <schema location="/none" provider="none" />
</feed>
</verbatim>
---++++ Process
Sample process that runs daily at the 6th hour on the corp cluster. It takes one input, !SampleInput, for the previous day (24 hourly instances), and generates one output, !SampleOutput, for the previous day. The workflow is defined at /projects/bootcamp/workflow/workflow.xml, and any libraries required by the workflow should be under /projects/bootcamp/workflow/lib. The process also defines the properties queueName, ssh.host and fileTimestamp, which are passed to the workflow. In addition, Falcon exposes the following properties to the workflow: nameNode and jobTracker (hadoop properties), and input and output (the input/output feed paths).
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Daily sample process. Runs at the 6th hour every day; validity window is illustrative -->
<process name="SampleProcess" xmlns="uri:falcon:process:0.1">
    <cluster name="corp" />

    <frequency>days(1)</frequency>

    <validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" />

    <inputs>
        <input name="input" feed="SampleInput" start="yesterday(0,0)" end="today(-1,0)" />
    </inputs>

    <outputs>
        <output name="output" feed="SampleOutput" instance="yesterday(0,0)" />
    </outputs>

    <properties>
        <property name="queueName" value="reports" />
        <property name="ssh.host" value="host.com" />
        <property name="fileTimestamp" value="${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}" />
    </properties>

    <workflow engine="oozie" path="/projects/bootcamp/workflow" />
</process>
</verbatim>
---++++ Oozie Workflow
The sample user workflow contains three actions:
   * Pig action - executes the pig script /projects/bootcamp/workflow/script.pig
   * concatenator - Java action that concatenates the part files and generates a single file
   * file upload - ssh action that gets the concatenated file from hadoop and sends it to a remote host
<verbatim>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sample-wf">
    <start to="pig" />

    <action name="pig">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>${nameNode}/projects/bootcamp/workflow/script.pig</script>
            <param>input=${input}</param>
            <param>output=${output}</param>
            <file>lib/dependent.jar</file>
        </pig>
        <ok to="concatenator" />
        <error to="fail" />
    </action>

    <action name="concatenator">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.wf.Concatenator</main-class>
            <arg>${output}</arg>
            <arg>${nameNode}/projects/bootcamp/concat/data-${fileTimestamp}.csv</arg>
        </java>
        <ok to="fileupload" />
        <error to="fail" />
    </action>

    <action name="fileupload">
        <ssh>
            <host>localhost</host>
            <command>/tmp/fileupload.sh</command>
            <args>${nameNode}/projects/bootcamp/concat/data-${fileTimestamp}.csv</args>
            <args>${wf:conf("ssh.host")}</args>
            <capture-output />
        </ssh>
        <ok to="fileUploadDecision" />
        <error to="fail" />
    </action>

    <decision name="fileUploadDecision">
        <switch>
            <case to="end">
                ${wf:actionData('fileupload')['output'] == '0'}
            </case>
            <default to="fail" />
        </switch>
    </decision>

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end" />
</workflow-app>
</verbatim>
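As noted in the onboarding steps, the workflow definition and the libraries it references have to be staged on hadoop under the workflow path before the process is scheduled. A minimal sketch, assuming the files sit in the current local directory:
<verbatim>
hadoop fs -mkdir -p /projects/bootcamp/workflow/lib
hadoop fs -put workflow.xml /projects/bootcamp/workflow/workflow.xml
hadoop fs -put script.pig /projects/bootcamp/workflow/script.pig
hadoop fs -put dependent.jar /projects/bootcamp/workflow/lib/dependent.jar
</verbatim>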
---++++ File Upload Script
The script gets the file from hadoop, rsyncs it to /tmp on the remote host and deletes it from hadoop:
<verbatim>
#!/bin/bash

# Report a non-zero exit code back to the workflow via captured output
trap 'echo "output=$?"; exit $?' ERR INT TERM

echo "Arguments: $@"
SRCFILE=$1
DESTHOST=$2
FILENAME=`basename $SRCFILE`

# Fetch the concatenated file from hadoop to the local /tmp
rm -f /tmp/$FILENAME
hadoop fs -copyToLocal $SRCFILE /tmp/
echo "Copied $SRCFILE to /tmp"

# Push the file to /tmp on the remote host
rsync -ztv --rsh=ssh --stats /tmp/$FILENAME $DESTHOST:/tmp
echo "rsynced $FILENAME to $DESTHOST:/tmp"

# Remove the file from hadoop and clean up locally
hadoop fs -rmr $SRCFILE
echo "Deleted $SRCFILE"

rm -f /tmp/$FILENAME
echo "output=0"
</verbatim>