Import Export Sequence File Formats
This is a talk page.
There are different kinds of import/export problems. One class of problem is defining a set of SequenceFile formats that a "Mahout Job" will import and export. This page is limited to the SequenceFile problem.
This feature would make the suite of Mahout jobs far more useful, because jobs could then be chained together. Right now each job is a large, complex beast that does everything a use case might need. A shared set of formats would allow smaller, modular job designs.
The feature should not create more "I am a confused beginner" traffic on the mahout-user list.
This is a NamedVector file containing a String key and a sparse-encoded vector. There may be an external dictionary defining documents and/or terms.
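As a rough illustration of such a record, the sketch below models a named sparse vector and an external term dictionary in plain Python. It does not use Mahout's NamedVector class or Hadoop's SequenceFile API. The `NamedSparseVector` type, the `encode` helper and the sample dictionary are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NamedSparseVector:
    """Hypothetical stand-in for a named, sparse-encoded vector record."""
    name: str                      # the String key, e.g. a document id
    cardinality: int               # total number of dimensions
    entries: Dict[int, float] = field(default_factory=dict)  # index -> non-zero value

    def to_dense(self) -> List[float]:
        """Expand the sparse entries into a full list of values."""
        dense = [0.0] * self.cardinality
        for index, value in self.entries.items():
            dense[index] = value
        return dense


def encode(name: str, tokens: List[str], dictionary: Dict[str, int]) -> NamedSparseVector:
    """Count tokens found in the external dictionary; unknown tokens are skipped."""
    vector = NamedSparseVector(name=name, cardinality=len(dictionary))
    for token in tokens:
        index = dictionary.get(token)
        if index is not None:
            vector.entries[index] = vector.entries.get(index, 0.0) + 1.0
    return vector


# External dictionary mapping terms to vector indices (sample data, assumed).
dictionary = {"apple": 0, "banana": 1, "cherry": 2, "date": 3}
doc = encode("doc-1", ["banana", "apple", "banana", "fig"], dictionary)
```

Only the non-zero entries are stored, so the record stays small when the dictionary is large and each document uses few terms.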
The various Bayes text classification jobs, such as the Wikipedia example, import Lucene bag-of-words Vector files.
Feature vectors derived from text vectors are useful to text-oriented machine learning research. An example:
A classification job creates among other things a Confusion Matrix. The current example jobs log a text version of the confusion matrix.
Comparing confusion matrices from different classification runs lets you evaluate tuning knobs for a classifier.
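One way such a comparison could work is sketched below in plain Python. The matrix is held as a counter keyed by (actual, predicted) label pairs. The function names and the two sample runs are invented for illustration and are not part of any Mahout job.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (actual label, predicted label)


def confusion_matrix(pairs: Iterable[Pair]) -> Counter:
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)


def accuracy(matrix: Counter) -> float:
    """Fraction of events on the diagonal, i.e. correctly classified."""
    total = sum(matrix.values())
    correct = sum(count for (actual, predicted), count in matrix.items()
                  if actual == predicted)
    return correct / total if total else 0.0


def delta(before: Counter, after: Counter) -> Dict[Pair, int]:
    """Per-cell change in counts between two runs; unchanged cells are omitted."""
    cells = set(before) | set(after)
    return {cell: after[cell] - before[cell]
            for cell in cells if after[cell] != before[cell]}


# Two made-up runs of the same classifier with different settings.
run_a = confusion_matrix([("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "spam")])
run_b = confusion_matrix([("spam", "spam"), ("spam", "spam"), ("ham", "ham"), ("ham", "spam")])
changes = delta(run_a, run_b)
```

The per-cell delta shows which misclassifications a tuning change fixed or introduced, which a single accuracy number hides.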
Classification jobs export a confusion matrix defining misclassification events. (Recommender jobs have an analogous output: the user/item matrix of preference deltas when comparing training and test data. I would use the same tool to visualize both matrices.)
All "Mahout Jobs" have to honor a contract around SequenceFile types.
There is a small list of simple SequenceFiles. All jobs are required to accept at least some of these file types. The job must have its own interpretation of what it means to import each one. There can be several interpretations for a particular type.
The job must log a list of what file types it imports and exports, and descriptions of each interpretation.
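A minimal sketch of what this part of the contract might look like follows, in plain Python. Everything here is an assumption made for illustration: the `FormatDeclaration` record, the `declared_formats` method, the `ExampleClassifierJob` class and the "Matrix" file type name are not existing Mahout APIs or agreed format names.

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("import_export_sketch")


@dataclass(frozen=True)
class FormatDeclaration:
    direction: str        # "import" or "export"
    file_type: str        # name of a common SequenceFile type (assumed names)
    interpretation: str   # what the job takes this file to mean


class ExampleClassifierJob:
    """Hypothetical job that declares which common formats it supports."""

    def declared_formats(self) -> List[FormatDeclaration]:
        # The same file type may appear more than once, with different interpretations.
        return [
            FormatDeclaration("import", "NamedVector", "one document per vector, key is the document id"),
            FormatDeclaration("import", "NamedVector", "one class per vector, key is the class label"),
            FormatDeclaration("export", "Matrix", "confusion matrix of actual versus predicted labels"),
        ]


def describe(job) -> List[str]:
    """Format one line per declaration, log each at INFO level, and return them."""
    lines = [f"{d.direction} {d.file_type}: {d.interpretation}"
             for d in job.declared_formats()]
    for line in lines:
        logger.info(line)
    return lines


report = describe(ExampleClassifierJob())
```

Keeping the declarations as data rather than free-form log text would also let a driver check whether the export of one job matches an import of the next.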
All jobs keep their current parameters and their meanings. Participation in the Import/Export feature is separate from a job's "native" input and output file formats.
A job is not limited to the list of common file types. It can import and export any other types as well. For example, the FPGrowth job exports a complex tree structure.
There should be a very short list of SequenceFile types.
Data exported to the common formats is not expected to include full information. Under this proposal, a job would create various "flattened" versions of its main data structure.
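To make "flattened" concrete, here is a small sketch in plain Python that turns a nested pattern tree into simple key/value records. The tree layout and the `flatten` function are assumptions for illustration. They do not reflect the structure the FPGrowth job actually writes.

```python
from typing import Dict, Iterator, List, Tuple

# A nested pattern tree: item -> (count, children). Sample layout, assumed.
Tree = Dict[str, Tuple[int, "Tree"]]


def flatten(tree: Tree, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, int]]:
    """Yield one (space-joined item path, count) record per node, depth first."""
    for item, (count, children) in tree.items():
        path = prefix + (item,)
        yield " ".join(path), count
        yield from flatten(children, path)


tree: Tree = {
    "milk": (5, {"bread": (3, {"eggs": (2, {})}), "eggs": (1, {})}),
    "bread": (2, {}),
}
records: List[Tuple[str, int]] = list(flatten(tree))
```

Each record is a plain string key with a numeric value, so a downstream job can read it without knowing anything about the tree it came from.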
There is a common set of parameters for the Import/Export service. It should be as simple as possible.