Scope of Project

Import/export problems come in several kinds. One class of problem is defining the set of SequenceFile formats that a "Mahout Job" will import and export. This page is limited to that SequenceFile problem.

Purpose

This feature would make the suite of Mahout jobs far more useful, because jobs could feed their output into one another. Right now each job is a large, complex beast that does everything a use case might need. A shared set of formats would allow smaller, modular job designs.

The feature should not create more "I am a confused beginner" traffic on the mahout-user list.

Use Cases

Lucene "Bag-of-words" vector

This is a NamedVector file containing a String key and a sparse-encoded vector. There may be an external dictionary defining documents and/or terms.
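As a rough illustration of the record shape described above, the following sketch pairs a String key with a sparse vector and uses an external dictionary to map terms to indices. It is written in plain Python rather than against the Mahout or Hadoop APIs; the class name NamedSparseVector, the encode helper, and the sample dictionary are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NamedSparseVector:
    """Hypothetical stand-in for one record: a String key plus a sparse vector."""
    name: str                 # document identifier (the String key)
    cardinality: int          # number of terms in the external dictionary
    weights: Dict[int, float] = field(default_factory=dict)  # term index -> weight

    def get(self, index: int) -> float:
        # Positions that were never set are implicitly zero in a sparse encoding.
        return self.weights.get(index, 0.0)


def encode(name: str, tokens: List[str], dictionary: Dict[str, int]) -> NamedSparseVector:
    """Build a term-count vector, skipping tokens that are not in the dictionary."""
    vec = NamedSparseVector(name=name, cardinality=len(dictionary))
    for token in tokens:
        index = dictionary.get(token)
        if index is not None:
            vec.weights[index] = vec.weights.get(index, 0.0) + 1.0
    return vec


# Assumed sample data: a three-term dictionary and one tokenized document.
dictionary = {"mahout": 0, "hadoop": 1, "vector": 2}
doc = encode("doc-1", ["mahout", "vector", "vector", "unknown"], dictionary)
```

Only the non-zero entries are stored, which is what makes the encoding sparse; the dictionary stays outside the vector file, as the description above allows.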

Import

The various Bayes text classification jobs, such as the Wikipedia example, import Lucene bag-of-words vector files.

Export

Feature vectors derived from text vectors are useful to text-oriented machine learning research. An example:

Compare a feature vector to all of the original text vectors. This searches for "exemplar" documents, the ones that most comprehensively match the given feature. Several papers discuss this approach for creating document abstracts from sentence vectors.
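The exemplar search can be sketched as a nearest-neighbor lookup. The page does not name a similarity measure, so cosine similarity is an assumption here, as are the function names and the sample vectors; sparse vectors are represented as plain index-to-weight dictionaries.

```python
import math
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # term index -> weight


def cosine(a: SparseVector, b: SparseVector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_exemplars(feature: SparseVector,
                   documents: Dict[str, SparseVector],
                   top_n: int = 1) -> List[Tuple[str, float]]:
    """Score every document against the feature and return the best matches."""
    scored = [(name, cosine(feature, vec)) for name, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Assumed sample data: three text vectors and one derived feature vector.
documents = {
    "doc-a": {0: 2.0, 1: 1.0},
    "doc-b": {1: 3.0},
    "doc-c": {2: 1.0},
}
feature = {0: 1.0, 1: 0.5}
best = find_exemplars(feature, documents, top_n=2)
```

With this data, doc-a ranks first because it points in the same direction as the feature vector, and doc-c scores zero because it shares no terms with it.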

Confusion Matrix

A classification job creates, among other things, a confusion matrix. The current example jobs log a text version of the confusion matrix.
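For readers unfamiliar with the structure, here is a minimal sketch of how a confusion matrix can be counted from (actual, predicted) label pairs and rendered as text. The function names, the tab-separated layout, and the sample labels are assumptions for illustration, not the format the current example jobs log.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ConfusionMatrix = Dict[str, Dict[str, int]]  # actual label -> predicted label -> count


def build_confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> ConfusionMatrix:
    """Count how often each actual label was classified as each predicted label."""
    counts = defaultdict(lambda: defaultdict(int))
    for actual, predicted in pairs:
        counts[actual][predicted] += 1
    return {actual: dict(row) for actual, row in counts.items()}


def format_matrix(matrix: ConfusionMatrix, labels: List[str]) -> str:
    """Render the matrix as tab-separated text, one row per actual label."""
    lines = ["actual/predicted\t" + "\t".join(labels)]
    for actual in labels:
        row = matrix.get(actual, {})
        cells = [str(row.get(predicted, 0)) for predicted in labels]
        lines.append(actual + "\t" + "\t".join(cells))
    return "\n".join(lines)


# Assumed sample data: four classified items.
pairs = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("spam", "spam")]
matrix = build_confusion_matrix(pairs)
text = format_matrix(matrix, ["spam", "ham"])
```

Off-diagonal cells, such as the single spam item classified as ham, are the misclassification events.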

Import

Comparing confusion matrices from different classification runs lets you evaluate tuning knobs for a classifier.
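One way to make such a comparison concrete is to compute the overall accuracy of each run and the per-cell change in counts. The sketch below assumes a matrix is a nested dictionary of actual label to predicted label to count; the function names and the two sample runs are hypothetical.

```python
from typing import Dict, Tuple

ConfusionMatrix = Dict[str, Dict[str, int]]  # actual label -> predicted label -> count


def accuracy(matrix: ConfusionMatrix) -> float:
    """Fraction of items on the diagonal (actual label equals predicted label)."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    return correct / total if total else 0.0


def cell_deltas(before: ConfusionMatrix, after: ConfusionMatrix) -> Dict[Tuple[str, str], int]:
    """Per-cell change in count between two runs, omitting unchanged cells."""
    labels = set(before) | set(after)
    for row in list(before.values()) + list(after.values()):
        labels |= set(row)
    deltas = {}
    for actual in labels:
        for predicted in labels:
            change = (after.get(actual, {}).get(predicted, 0)
                      - before.get(actual, {}).get(predicted, 0))
            if change != 0:
                deltas[(actual, predicted)] = change
    return deltas


# Assumed sample data: the same test set classified with two parameter settings.
run_a = {"spam": {"spam": 8, "ham": 2}, "ham": {"spam": 3, "ham": 7}}
run_b = {"spam": {"spam": 9, "ham": 1}, "ham": {"spam": 1, "ham": 9}}
deltas = cell_deltas(run_a, run_b)
```

Negative deltas in off-diagonal cells indicate that the second setting removed misclassifications of that kind.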

Export

Classification jobs export a confusion matrix defining misclassification events. (Recommender jobs have an analogous output: the user/item matrix of preference deltas when comparing training and test data. I would use the same tool to visualize both matrices.)

Contract

All "Mahout Jobs" have to honor a contract around SequenceFile types.

Proposal #1

There is a small list of simple SequenceFiles. All jobs are required to accept at least some of these file types. The job must have its own interpretation of what it means to import each one. There can be several interpretations for a particular type.

The job must log a list of what file types it imports and exports, and descriptions of each interpretation.
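The logging requirement could look roughly like the following. This is only a sketch of the idea under this proposal: the Interpretation record, the ExampleClassifierJob class, and the interpretation names are all hypothetical, and nothing here corresponds to an existing Mahout interface.

```python
import logging
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interpretation:
    """One way a job reads or writes a common file type (hypothetical record)."""
    direction: str    # "import" or "export"
    file_type: str    # one of the common SequenceFile types
    name: str         # short identifier a user would select
    description: str  # human-readable meaning, written to the log


class ExampleClassifierJob:
    """Hypothetical job that declares what it can import and export."""

    def interpretations(self) -> List[Interpretation]:
        return [
            Interpretation("import", "NamedVector", "bag-of-words",
                           "each vector is one document of term weights"),
            Interpretation("export", "Matrix", "confusion",
                           "rows are actual labels, columns are predicted labels"),
        ]


def describe(job) -> List[str]:
    """One log line per supported interpretation."""
    return [f"{i.direction} {i.file_type} [{i.name}]: {i.description}"
            for i in job.interpretations()]


def log_contract(job, logger: logging.Logger) -> None:
    for line in describe(job):
        logger.info(line)


lines = describe(ExampleClassifierJob())
```

Keeping the declaration as data, rather than free-form log text, would also let a driver check whether one job's exports match another job's imports.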

Limits

All jobs still have the current parameters and their meanings. Participation in the Import/Export feature is separate from a job's "native" input and output file formats.
A job is not limited to the list of file types. It can import and export any other types. For example, the FPGrowth job exports a complex tree structure.

Types of SequenceFiles

There should be a very short list of SequenceFile types.

Matrix with optional row and column labels
NamedVector
??
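To show what "optional" labels could mean for the matrix type, here is a small sketch in which cells can be addressed by position always and by label only when labels are present. The LabeledMatrix class and its validation rules are assumptions for illustration; the page does not define the layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Key = Union[int, str]


@dataclass
class LabeledMatrix:
    """Hypothetical dense matrix whose row and column labels are both optional."""
    rows: List[List[float]]
    row_labels: Optional[List[str]] = None
    column_labels: Optional[List[str]] = None

    def __post_init__(self):
        width = len(self.rows[0]) if self.rows else 0
        if any(len(row) != width for row in self.rows):
            raise ValueError("all rows must have the same length")
        if self.row_labels is not None and len(self.row_labels) != len(self.rows):
            raise ValueError("need exactly one label per row")
        if self.column_labels is not None and len(self.column_labels) != width:
            raise ValueError("need exactly one label per column")

    @staticmethod
    def _resolve(key: Key, labels: Optional[List[str]]) -> int:
        # Integer keys are positions; string keys require labels to be present.
        if isinstance(key, int):
            return key
        if labels is None:
            raise KeyError(f"matrix has no labels, cannot look up {key!r}")
        return labels.index(key)

    def get(self, row: Key, column: Key) -> float:
        return self.rows[self._resolve(row, self.row_labels)][
            self._resolve(column, self.column_labels)]


labeled = LabeledMatrix([[8, 2], [3, 7]], ["spam", "ham"], ["spam", "ham"])
unlabeled = LabeledMatrix([[1.0, 0.0], [0.0, 1.0]])
```

The same structure covers both use cases above: a confusion matrix uses class names on both axes, while a purely numeric matrix omits the labels.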

Information structure

Data exported to the common formats is not expected to include full information. Under this proposal, a job would export various "flattened" versions of its main data structure.

Parameters

There is a common set of parameters for the Import/Export service. It should be as simple as possible.

Whether to use the import service
Where is the file or Hadoop directory?
Which interpretation should the job use?
The same three parameters for export
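The parameter set above is small enough to sketch as a command-line parser. The flag names (--import, --import-path, --import-interpretation and their export counterparts) and the validation rule are hypothetical and do not correspond to existing Mahout options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Three parameters per direction: enable, location, interpretation."""
    parser = argparse.ArgumentParser(prog="example-job")
    for direction in ("import", "export"):
        parser.add_argument(f"--{direction}", action="store_true",
                            dest=f"use_{direction}",
                            help=f"use the {direction} service")
        parser.add_argument(f"--{direction}-path",
                            help="file or Hadoop directory")
        parser.add_argument(f"--{direction}-interpretation",
                            help=f"which interpretation the job should use for {direction}")
    return parser


def validate(args: argparse.Namespace) -> None:
    """Assumed rule: an enabled service needs both a path and an interpretation."""
    for direction in ("import", "export"):
        if getattr(args, f"use_{direction}"):
            if not getattr(args, f"{direction}_path"):
                raise ValueError(f"--{direction}-path is required with --{direction}")
            if not getattr(args, f"{direction}_interpretation"):
                raise ValueError(
                    f"--{direction}-interpretation is required with --{direction}")


args = build_parser().parse_args([
    "--import",
    "--import-path", "/data/vectors",
    "--import-interpretation", "bag-of-words",
])
validate(args)
```

Because these flags are separate from a job's existing options, a job that never sees --import or --export would behave exactly as it does today.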