Concepts

Component

Components are the classes that do the work within a stream. Components are assembled into pipelines and executed using a runtime. There are several core types of components, each implementing a specific Java interface:

Provider

A Provider is a component that provides data to the stream from external systems.

Processor

A Processor is a component that processes data flowing through the stream. Transformations, filters, and enrichments are common processors.

PersistWriter

A PersistWriter is a component that writes data exiting the stream.

PersistReader

A PersistReader is a component that reads data, often previously written by a PersistWriter.
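
To make the four roles concrete, here is a minimal sketch of how they might fit together. The interfaces below are simplified, hypothetical stand-ins written for illustration; their names and method signatures are assumptions and are not the project's actual Java interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ComponentSketch {

    /** Hypothetical: provides data to the stream from an external system. */
    interface Provider<T> {
        List<T> readCurrent();
    }

    /** Hypothetical: transforms, filters, or enriches one item (zero or more outputs). */
    interface Processor<I, O> {
        List<O> process(I input);
    }

    /** Hypothetical: writes data exiting the stream. */
    interface PersistWriter<T> {
        void write(T item);
    }

    /** Hypothetical: reads data, often previously written by a PersistWriter. */
    interface PersistReader<T> {
        List<T> readAll();
    }

    /** In-memory store that plays both the writer and the reader role. */
    static class InMemoryStore<T> implements PersistWriter<T>, PersistReader<T> {
        private final List<T> items = new ArrayList<>();

        @Override
        public void write(T item) {
            items.add(item);
        }

        @Override
        public List<T> readAll() {
            return new ArrayList<>(items);
        }
    }

    /** Wires provider -> processor -> writer, then reads the results back. */
    static List<String> runDemo() {
        Provider<String> provider = () -> Arrays.asList("alpha", "", "beta");
        Processor<String, String> processor = s -> {
            if (s.isEmpty()) {
                return Collections.emptyList();                      // filter: drop empty items
            }
            return Collections.singletonList(s.toUpperCase());      // transform: upper-case
        };
        InMemoryStore<String> store = new InMemoryStore<>();

        for (String item : provider.readCurrent()) {
            for (String out : processor.process(item)) {
                store.write(out);
            }
        }
        return store.readAll();
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

In this sketch the processor both filters and transforms, and the same in-memory store is read back through the reader role after being filled through the writer role.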

Schema

A Schema defines the expected shape of the documents that will be passed from step to step within a stream. Defining the schema for a type of document allows source files and resource files to be generated by the build process, relieving your team of the need to maintain these files by hand.

Schemas can include other schemas, whether in the same repo or available via HTTP, allowing for full or partial reuse within or across organizations.

Datum

A Datum is a single piece of data within a stream. A datum typically has an identifier, a timestamp, a document (which may be any Java object), and additional metadata, kept apart from the document, related to upstream or downstream processing.
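
The shape described above can be sketched as a small value class. The `Datum` class below is a hypothetical illustration with assumed field and method names; it is not the project's own datum class.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class DatumSketch {

    /** Hypothetical stand-in for a datum; field names are assumptions. */
    static final class Datum {
        private final String id;
        private final Instant timestamp;
        private final Object document;                       // may be any Java object
        private final Map<String, Object> metadata = new HashMap<>();

        Datum(String id, Instant timestamp, Object document) {
            this.id = id;
            this.timestamp = timestamp;
            this.document = document;
        }

        String getId() {
            return id;
        }

        Instant getTimestamp() {
            return timestamp;
        }

        Object getDocument() {
            return document;
        }

        /** Metadata is kept apart from the document itself. */
        Datum withMetadata(String key, Object value) {
            metadata.put(key, value);
            return this;
        }

        Map<String, Object> getMetadata() {
            return Collections.unmodifiableMap(metadata);
        }
    }

    /** Builds an example datum with made-up values. */
    static Datum example() {
        return new Datum("post-1", Instant.parse("2020-01-01T00:00:00Z"), "hello world")
                .withMetadata("source", "example-provider");
    }

    public static void main(String[] args) {
        Datum datum = example();
        System.out.println(datum.getId() + " " + datum.getTimestamp() + " " + datum.getMetadata());
    }
}
```

Keeping metadata in its own map means processing hints can be attached or dropped without touching the document.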

Activity

Apache Streams prefers messages in the ActivityStreams format. These messages may be passed using the ‘Activity’ class or one of its subclasses.

ActivityObject

An activity has several sub-object fields:

  • actor (required)
  • object (optional)
  • target (optional)
  • generator (optional)
  • provider (optional)

Streams containing details of actors, objects, and so on may be created using the ‘ActivityObject’ class or one of its subclasses.
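
As a rough illustration of the sub-object fields listed above, the sketch below models an activity whose actor is mandatory and whose other sub-objects may be absent. The two classes are simplified, hypothetical versions written for this example; the `id` and `objectType` fields and the fluent setters are assumptions, not the real ‘Activity’ and ‘ActivityObject’ classes.

```java
import java.util.Objects;

public class ActivitySketch {

    /** Hypothetical, simplified activity object. */
    static class ActivityObject {
        final String id;
        final String objectType;

        ActivityObject(String id, String objectType) {
            this.id = id;
            this.objectType = objectType;
        }
    }

    /** Hypothetical, simplified activity holding the five sub-object fields. */
    static class Activity {
        final ActivityObject actor;      // required
        ActivityObject object;           // optional
        ActivityObject target;           // optional
        ActivityObject generator;        // optional
        ActivityObject provider;         // optional

        Activity(ActivityObject actor) {
            // Reject activities without an actor, since that field is required.
            this.actor = Objects.requireNonNull(actor, "actor is required");
        }

        Activity withObject(ActivityObject value) {
            this.object = value;
            return this;
        }

        Activity withProvider(ActivityObject value) {
            this.provider = value;
            return this;
        }
    }

    /** Builds an example activity with made-up identifiers. */
    static Activity example() {
        return new Activity(new ActivityObject("alice", "person"))
                .withObject(new ActivityObject("note-42", "note"))
                .withProvider(new ActivityObject("example-site", "service"));
    }

    public static void main(String[] args) {
        Activity activity = example();
        System.out.println(activity.actor.id + " -> " + activity.object.id);
    }
}
```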

Pipeline

A Pipeline is a set of collection, processing, and storage components structured in a directed graph (cycles may be permitted) which is packaged, deployed, started, and stopped together.
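
The directed-graph structure can be sketched with a plain adjacency map. The `PipelineSketch` class and the component ids below are hypothetical and only illustrate the idea: components are nodes, connections are directed edges, cycles are tolerated, and the whole graph is started and stopped as one unit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PipelineSketch {

    // Component id -> ids of the components it sends data to.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();
    private boolean running;

    public PipelineSketch addComponent(String id) {
        edges.putIfAbsent(id, new ArrayList<>());
        return this;
    }

    public PipelineSketch connect(String from, String to) {
        addComponent(from);
        addComponent(to);
        edges.get(from).add(to);
        return this;
    }

    /** Breadth-first walk; the visited set keeps cycles from looping forever. */
    public List<String> reachableFrom(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (visited.add(current)) {
                queue.addAll(edges.getOrDefault(current, new ArrayList<>()));
            }
        }
        return new ArrayList<>(visited);
    }

    // All components share one lifecycle: they start and stop together.
    public void start() {
        running = true;
    }

    public void stop() {
        running = false;
    }

    public boolean isRunning() {
        return running;
    }

    /** Example graph with a feedback edge from the enricher back to the filter. */
    static PipelineSketch example() {
        return new PipelineSketch()
                .connect("provider", "filter")
                .connect("filter", "enricher")
                .connect("enricher", "writer")
                .connect("enricher", "filter");
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = example();
        pipeline.start();
        System.out.println(pipeline.reachableFrom("provider"));
        pipeline.stop();
    }
}
```

A real runtime would hold component instances rather than string ids; strings keep the sketch focused on the graph shape.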

Runtime

A Runtime is a module containing bindings that help set up and run a pipeline. Runtimes may submit pipeline binaries to an existing cluster, or may launch the process(es) that execute the stream directly.