Class SparkPipeline

  extended by org.apache.crunch.impl.dist.DistributedPipeline
      extended by org.apache.crunch.impl.spark.SparkPipeline
All Implemented Interfaces:
Pipeline

public class SparkPipeline
extends DistributedPipeline

Constructor Summary
SparkPipeline( sparkContext, String appName)
SparkPipeline(String sparkConnect, String appName)
SparkPipeline(String sparkConnect, String appName, Class<?> jarClass)
Method Summary
<T> void cache(PCollection<T> pcollection, CachingOptions options)
          Caches the given PCollection so that it will be processed at most once during pipeline execution.
 PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
<S> PCollection<S> emptyPCollection(PType<S> ptype)
<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
<T> Iterable<T> materialize(PCollection<T> pcollection)
          Materializes the given PCollection and returns the data it contains as a read-only Iterable for client use.
 PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 PipelineExecution runAsync()
          Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
Methods inherited from class org.apache.crunch.impl.dist.DistributedPipeline
cleanup, createIntermediateOutput, createTempPath, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, readTextFile, setConfiguration, write, write, writeTextFile
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public SparkPipeline(String sparkConnect,
                     String appName)


public SparkPipeline(String sparkConnect,
                     String appName,
                     Class<?> jarClass)


public SparkPipeline( sparkContext,
                     String appName)
Method Detail


public <T> Iterable<T> materialize(PCollection<T> pcollection)
Description copied from interface: Pipeline
Materializes the given PCollection and returns the data it contains as a read-only Iterable for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Iterable
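
The read-only contract described above can be illustrated with a minimal, stand-alone sketch. It does not call Crunch: `MaterializeSketch` and `materializeStandIn` are hypothetical names, and a plain `java.util.List` stands in for the computed PCollection. The point is only that the caller may iterate over the returned data but may not modify it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class MaterializeSketch {
    // Hypothetical stand-in for materialize(): copies the computed values
    // and hands back a read-only view. This is NOT the Crunch implementation.
    static <T> Iterable<T> materializeStandIn(List<T> computed) {
        return Collections.unmodifiableList(new ArrayList<>(computed));
    }

    // Client-side use: iterate over the materialized data.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Returns true if the iterator refuses to remove an element.
    static boolean rejectsRemoval(Iterable<?> values) {
        Iterator<?> it = values.iterator();
        if (!it.hasNext()) {
            return false;
        }
        it.next();
        try {
            it.remove();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Iterable<Integer> data = materializeStandIn(Arrays.asList(1, 2, 3));
        System.out.println(sum(data));            // prints 6
        System.out.println(rejectsRemoval(data)); // prints true
    }
}
```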


public <S> PCollection<S> emptyPCollection(PType<S> ptype)
Specified by:
emptyPCollection in interface Pipeline
emptyPCollection in class DistributedPipeline


public <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
Specified by:
emptyPTable in interface Pipeline
emptyPTable in class DistributedPipeline


public <T> void cache(PCollection<T> pcollection,
                      CachingOptions options)
Description copied from interface: Pipeline
Caches the given PCollection so that it will be processed at most once during pipeline execution.

Parameters:
pcollection - The PCollection to cache
options - The options for how the cached data is stored
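
The "processed at most once" guarantee can be pictured with a small stand-alone model. The sketch below does not use Crunch or Spark: `CacheSketch` and `LazyCollection` are hypothetical names, and a `Supplier` stands in for the work needed to compute a collection. It only shows the intended effect, namely that repeated reads of a cached collection do not repeat the computation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class CacheSketch {
    // Hypothetical stand-in for a lazily computed collection; not a Crunch type.
    static final class LazyCollection<T> {
        private final Supplier<List<T>> compute;
        private boolean cacheEnabled;
        private List<T> cached;   // filled on first evaluation when caching is on
        private int evaluations;  // how many times the computation actually ran

        LazyCollection(Supplier<List<T>> compute) {
            this.compute = compute;
        }

        void cache() {
            cacheEnabled = true;
        }

        List<T> evaluate() {
            if (cacheEnabled && cached != null) {
                return cached; // reuse the earlier result, no recomputation
            }
            evaluations++;
            List<T> result = compute.get();
            if (cacheEnabled) {
                cached = result;
            }
            return result;
        }

        int evaluations() {
            return evaluations;
        }
    }

    // Reads the same collection twice and reports how often it was computed.
    static int evaluationsAfterTwoReads(boolean useCache) {
        LazyCollection<Integer> squares = new LazyCollection<>(() -> {
            List<Integer> out = new ArrayList<>();
            for (int i = 1; i <= 3; i++) {
                out.add(i * i);
            }
            return out;
        });
        if (useCache) {
            squares.cache();
        }
        squares.evaluate();
        squares.evaluate();
        return squares.evaluations();
    }

    public static void main(String[] args) {
        System.out.println(evaluationsAfterTwoReads(false)); // prints 2
        System.out.println(evaluationsAfterTwoReads(true));  // prints 1
    }
}
```

How the cached data is actually stored is governed by the `options` argument and is outside the scope of this model.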


public PipelineResult run()
Description copied from interface: Pipeline
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


public PipelineExecution runAsync()
Description copied from interface: Pipeline
Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
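
The difference between blocking on a run and receiving a handle to it can be sketched without Crunch. In the stand-alone example below, `java.util.concurrent.CompletableFuture` is used as a stand-in for the returned future, and `AsyncRunSketch`, `startJobs`, and `runAndWait` are hypothetical names. It is an illustration of the calling pattern only, not of how `SparkPipeline` schedules work.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncRunSketch {
    // Hypothetical stand-in for runAsync(): starts the work on another
    // thread and returns immediately with a handle to the result.
    static CompletableFuture<Boolean> startJobs(int jobCount) {
        return CompletableFuture.supplyAsync(() -> {
            int completed = 0;
            for (int i = 0; i < jobCount; i++) {
                completed++; // placeholder for running one job
            }
            return completed == jobCount;
        });
    }

    // Hypothetical stand-in for run(): starts the work and waits for it.
    static boolean runAndWait(int jobCount) {
        return startJobs(jobCount).join();
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> handle = startJobs(3);
        // The caller keeps control here: it could do other work,
        // poll handle.isDone(), or call handle.cancel(true).
        boolean succeeded = handle.join(); // wait only when the result is needed
        System.out.println("succeeded: " + succeeded); // prints succeeded: true
    }
}
```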



public PipelineResult done()
Description copied from interface: Pipeline
Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.

Specified by:
done in interface Pipeline
done in class DistributedPipeline

Copyright © 2014 The Apache Software Foundation. All Rights Reserved.