CAS Push-Pull-Framework

Introduction

This is the developer guide for the Apache OODT Catalog and Archive Service (CAS) Push Pull framework, or Push Pull for short. Primarily, this guide will explain the Push Pull architecture and interfaces, including its tailorable extension points. For information on installation, configuration, and examples, please see our User Guides.

The remainder of this guide is separated into the following sections:

  • Project Description
  • Architecture
  • Extension Points
  • Current Extension Point Implementations
  • Conclusion

Project Description

The Push Pull framework is responsible for downloading remote content (pull), or accepting the delivery of remote content (push) to a local system staging area for use by the CAS Crawler Framework to ingest into the CAS File Manager. The Push Pull framework is extensible and provides a fully tailorable Java-based API for the acquisition of remote content.

Architecture

In this section, we will describe the architecture of the Push Pull framework, including its constituent components, object model, and key capabilities.

Components

The major components of the Push Pull Framework include the Daemon Launcher, the Daemon, the Protocol Layer, and the File Retrieval System. The relationships between these components are shown in the diagram below:

[Figure: Push Pull Framework Architecture]

The Push Pull Framework provides a Daemon Launcher, which is responsible for creating new Daemon instances. Each Daemon has an associated Daemon Configuration and can use a File Retrieval Setup extension point. The File Retrieval Setup class uses both a Protocol and a File Retrieval System to obtain ProtocolFiles. Which files it obtains is determined by a File Restrictions Parser, which ultimately yields a VirtualFileStructure (VFS) model. The VFS defines which files to accept and pull down from a remote site.

Object Model

The critical objects managed by the Push Pull Framework include:

  • Protocol - A pluggable means of obtaining content over some file acquisition method, e.g., FTP, SCP, HTTP, etc.
  • Protocol File - Metadata information about a remote file, including its ProtocolPath.
  • Protocol Path - A pointer to a remote Product file's (or files') location, which can be used to derive metadata and determine where to place the file in the local staging area built by the Push Pull Framework.
  • Remote Site - Descriptive information about a remote site, including the username/password combination, as well as an origin directory from which to start interrogating the site.

Each Protocol delivers one or more ProtocolFiles. Each ProtocolFile is associated with a single RemoteSite and with a single ProtocolPath. These relationships are shown in the figure below.

[Figure: Push Pull Framework Object Model]
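
The relationships above can be sketched in Java as follows. This is only an illustration: the class names (RemoteSiteSketch, ProtocolPathSketch, ProtocolFileSketch), their fields, and the staging rule are assumptions made for this example and are not the framework's actual classes or signatures.

```java
// Hypothetical, simplified stand-ins for the objects described above.
// These are NOT the framework's real classes; names and fields are assumptions.

final class RemoteSiteSketch {
    final String originDirectory; // directory from which interrogation starts
    final String username;
    final String password;

    RemoteSiteSketch(String originDirectory, String username, String password) {
        this.originDirectory = originDirectory;
        this.username = username;
        this.password = password;
    }
}

final class ProtocolPathSketch {
    final String path; // pointer to the remote file's location

    ProtocolPathSketch(String path) {
        this.path = path;
    }

    // Last path segment, e.g. "granule.dat" for "/pub/data/granule.dat".
    String fileName() {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }
}

final class ProtocolFileSketch {
    final RemoteSiteSketch site;   // exactly one remote site per file
    final ProtocolPathSketch path; // exactly one path per file

    ProtocolFileSketch(RemoteSiteSketch site, ProtocolPathSketch path) {
        this.site = site;
        this.path = path;
    }

    // One possible staging rule (an assumption): place the file
    // directly under the local staging root, keyed by its file name.
    String stagingLocation(String stagingRoot) {
        return stagingRoot + "/" + path.fileName();
    }
}

class ObjectModelDemo {
    public static void main(String[] args) {
        RemoteSiteSketch site = new RemoteSiteSketch("/pub/data", "anonymous", "guest");
        ProtocolFileSketch file =
            new ProtocolFileSketch(site, new ProtocolPathSketch("/pub/data/granule.dat"));
        System.out.println(file.stagingLocation("/local/staging"));
    }
}
```

Keeping the path as its own object, rather than a bare string on the file, mirrors the description of the Protocol Path as the place from which metadata and the local staging location are derived.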

Key Capabilities

The Push Pull Framework has been designed with a number of key capabilities in mind. These capabilities include:

  • Flexibility - Ability to plug in different Metadata Extractors, Data Protocols, Content Types, etc.
  • Support for Push/Pull - Support for both "Push" and "Pull" style data transfers.
  • Extensibility - Ability to add new, previously undiscovered Data Protocols, and "plug" them into the framework.
  • Java-based - Use of the Java programming language and development kit for multi-platform deployment (using the Java Virtual Machine).
  • Fast Data Transfer - Support for parallel file transfers and data downloads.
  • Email-based Push - Support for email-based push data acceptance using the IMAP and SMTP protocols.
  • Modeling of remote data sites - Ability to configure "Virtual" remote directories (based on Metadata such as Date/Time) to download files from.
  • Integration with other CAS components - Ability to "plug in" to the CAS File Manager and CAS Crawler Framework components for Data Ingestion.
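
To make the idea of date-based "Virtual" remote directories more concrete, the following sketch expands a base directory into one directory per day. It is a minimal illustration under stated assumptions: the class VirtualDirectorySketch, its methods, and the year/month/day layout are hypothetical and do not reflect how the framework's own configuration expresses such directories.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Push Pull API.
class VirtualDirectorySketch {
    // Assumed remote layout: <base>/<year>/<month>/<day>
    private static final DateTimeFormatter DAY_PATH =
        DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Resolves the remote directory for a single day.
    static String resolve(String baseDir, LocalDate date) {
        return baseDir + "/" + date.format(DAY_PATH);
    }

    // Resolves the directories for the given number of days ending on lastDay,
    // oldest first.
    static List<String> resolveRange(String baseDir, LocalDate lastDay, int days) {
        List<String> paths = new ArrayList<>();
        for (int i = days - 1; i >= 0; i--) {
            paths.add(resolve(baseDir, lastDay.minusDays(i)));
        }
        return paths;
    }

    public static void main(String[] args) {
        // prints [/pub/data/2024/02/29, /pub/data/2024/03/01]
        System.out.println(resolveRange("/pub/data", LocalDate.of(2024, 3, 1), 2));
    }
}
```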

Extension Points

The Push Pull Framework uses the factory method pattern to provide multiple extension points. An extension point is an interface within the Push Pull Framework that can have many implementations. This is particularly useful for software component configuration because it allows different implementations of an existing interface to be selected at deployment time.

The factory method pattern is a creational pattern common to object-oriented design. Each Push Pull Framework extension point involves the implementation of two interfaces: an extension factory and an extension implementation. At run-time, the Push Pull Framework loads a properties file that specifies the factory class to use during extension point instantiation. For example, the Push Pull Framework may communicate with a remote FTP site to obtain content, or it may use an IMAPS protocol plugin to accept email-push notifications of available files.
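
The following sketch shows one way a factory class named in a properties file could be instantiated at run-time. It is an illustration of the pattern only: the interfaces ProtocolSketch and ProtocolFactorySketch, the class FtpProtocolFactorySketch, and the property key example.protocol.factory are all hypothetical names chosen for this example, not the framework's actual interfaces or property names.

```java
import java.util.Properties;

// Hypothetical extension implementation interface.
interface ProtocolSketch {
    String scheme();
}

// Hypothetical extension factory interface.
interface ProtocolFactorySketch {
    ProtocolSketch createProtocol();
}

// One pluggable factory; another could be swapped in by changing the property value.
class FtpProtocolFactorySketch implements ProtocolFactorySketch {
    @Override
    public ProtocolSketch createProtocol() {
        return () -> "ftp";
    }
}

class ExtensionLoaderSketch {
    // Assumed property key, for illustration only.
    static final String FACTORY_KEY = "example.protocol.factory";

    // Reads the factory class name from the properties and instantiates it
    // through its no-argument constructor.
    static ProtocolFactorySketch loadFactory(Properties props) {
        String className = props.getProperty(FACTORY_KEY);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + FACTORY_KEY);
        }
        try {
            Class<?> clazz = Class.forName(className);
            return (ProtocolFactorySketch) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // In a deployment this value would come from a properties file on disk.
        props.setProperty(FACTORY_KEY, FtpProtocolFactorySketch.class.getName());
        ProtocolSketch protocol = loadFactory(props).createProtocol();
        System.out.println(protocol.scheme());
    }
}
```

Because the caller only depends on the two interfaces, selecting a different implementation at deployment time requires changing the property value and nothing else.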

Using extension points, it is fairly simple to support many different types of what are typically referred to as "plug-in architectures". Each of the core extension points for the Push Pull Framework is described below:

  • Protocol - The Protocol extension point is the heart of the Push Pull Framework, responsible for modeling remote sites and for obtaining their content via different Retrieval Methods, using different File Restrictions Parsers.
  • Retrieval Method - The Retrieval Method extension point is responsible for orchestrating the download (pull) and acceptance (push) of remote content.
  • File Restrictions Parser - The File Restrictions Parser extension point is responsible for defining how to accept or decline files encountered by a Retrieval Method, in essence modeling remote file and directory structures.
  • System - The extension point that provides the external interface to the Push Pull Framework services. This includes the Daemon Launcher interface, as well as the associated Daemon interface that is managed by the Daemon Launcher.

Current Extension Point Implementations

There are at least two implementations of each of the aforementioned extension points for the Push Pull Framework. Each extension point implementation is detailed in this section.

Protocol

  • CoG JGlobus FTP. An implementation of the Protocol extension point for FTP using CoG jglobus.
  • Commons Net FTP. An implementation of the Protocol extension point for FTP using the Commons Net FTP client.
  • HTTP. An implementation of the Protocol extension point using Java's URL class, as well as Apache Tika's HTMLParser.
  • IMAPS. An implementation of the Protocol extension point using IMAPS javax.mail classes from Apache Geronimo and HTML parsing from Apache Tika.
  • Local. An implementation of the Protocol extension point using Java NIO for local data acquisition.
  • SFTP. An implementation of the Protocol extension point using JCraft's JSch library.

Retrieval Method

  • Remote Crawler. An implementation of the Retrieval Method interface that uses an XML-based set of policy files to determine which remote directories and files to crawl and obtain.
  • List Retriever. An implementation of the Retrieval Method interface that accepts a list of URLs that point to content to obtain.

File Restrictions Parser

  • DirStructXml Parser. An implementation of the File Restrictions Parser interface that interprets an XML file specifying the remote directories and files to obtain.
  • FileList Parser. An implementation of the File Restrictions Parser interface that interprets an ASCII, newline-separated list of URLs pointing to remote directories and files to obtain.
  • Class NOAA Email Parser. An implementation of the File Restrictions Parser interface that reads email files from NOAA's CLASS archive which specify lists of directory and file URLs to obtain.
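
As an illustration of the kind of input the FileList Parser handles, the sketch below reads a newline-separated list of URLs into URI objects. It is not the framework's parser: the class FileListSketch and its parse method are hypothetical, and skipping blank lines is an assumption made for this example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the framework's FileList Parser.
class FileListSketch {
    // Turns a newline-separated listing into URIs, ignoring blank lines.
    // URI.create throws IllegalArgumentException for a malformed entry.
    static List<URI> parse(String listing) {
        List<URI> uris = new ArrayList<>();
        for (String line : listing.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            uris.add(URI.create(trimmed));
        }
        return uris;
    }

    public static void main(String[] args) {
        String listing = "ftp://example.org/pub/data/granule.dat\n"
                       + "\n"
                       + "http://example.org/pub/data/\n";
        for (URI uri : parse(listing)) {
            System.out.println(uri.getScheme() + " -> " + uri.getPath());
        }
    }
}
```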

Daemon Launcher (Daemon client and Daemon server)

  • Java RMI-based server. An implementation of the external server interface for the Push Pull Framework that uses RMI as the transportation medium to launch Push Pull Daemons.
  • Push Pull Daemon. An implementation of the client interface for the Java RMI-based server that uses RMI as the transportation medium to manage and control the Push Pull services.

Conclusion

The aim of this document is to provide information relevant to developers about the CAS Push Pull Framework. Specifically, this document has described the Push Pull Framework's architecture, including its constituent components, object model, and key capabilities. Additionally, this document has provided an overview of the current implementations of the Push Pull Framework's extension points.

In the Basic User Guide and Advanced User Guide, we will cover topics like installation, configuration, and example uses as well as advanced topics like scaling and other tips and tricks.