Parser
Parsing
Parser Tool
The easiest way to try out the Parser is the command line tool.
The tool is only intended for demonstration and testing.
Download the English chunking parser model from our website and start the Parser
Tool with the following command.
Loading the large parser model can take several seconds; be patient.
Copy this sample sentence to the console.
The parser should now print the following to the console.
With the following command the input can be read from a file and be written to an output file.
article-parsed.txt.
The article-tokenized.txt file must contain one sentence per line, tokenized
with the English tokenizer model from our website.
See the Tokenizer documentation for further details.
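To make the "one tokenized sentence per line" requirement concrete, here is a minimal sketch that splits a line on whitespace the way a whitespace-tokenized input is meant to be read. The class WhitespaceTokens and its split method are hypothetical helpers written for this illustration; they are not part of OpenNLP, and the sample sentence is only an example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of OpenNLP: splits a pre-tokenized line on whitespace. */
final class WhitespaceTokens {

    /** Returns the tokens of a line, treating any run of whitespace as one separator. */
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {   // an empty or blank line yields no tokens
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Not tokenized: the final period stays attached to the last word.
        System.out.println(split("The quick brown fox jumps over the lazy dog."));
        // Tokenized: the period is a token of its own, separated by a space.
        System.out.println(split("The quick brown fox jumps over the lazy dog ."));
    }
}
```

In the first call the last token is "dog.", while in the second the period is a separate token. The second form is the kind of input described above.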
Parsing API
The Parser can be easily integrated into an application via its API.
To instantiate a Parser, the parser model must be loaded first.
Unlike the other components, the Parser should be instantiated through a factory
method rather than with the new operator.
A parser model is trained for either the chunking parser or the tree
insert parser, so the matching parser implementation must be chosen.
The factory method will read a type parameter from the model and create
an instance of the corresponding parser implementation.
Right now the tree insert parser is still experimental and there is no pre-trained model for it.
The parser expects a whitespace-tokenized sentence. A utility method from the command
line tool can parse the sentence String. The following code shows how the parser can be called.
The topParses array contains only one parse because the number of parses is set to 1.
The Parse object contains the parse tree.
To display the parse tree, call the show method. Similar to Exception.printStackTrace,
it prints the parse either to the console or into a provided StringBuffer.
TODO: Extend this section with more information about the Parse object.
Parser Training
OpenNLP offers two different parser implementations, the chunking parser and the
treeinsert parser. The latter is still experimental and not recommended for production use.
(TODO: Add a section which explains the two different approaches)
Training can be done either with the command line tool or with the training API.
In the first case the training data must be available in the OpenNLP format, which is
the Penn Treebank format with the restriction of one sentence per line.
The Penn Treebank annotation guidelines can be found on the
Penn Treebank home page.
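As an illustration of the one-sentence-per-line restriction, the following sketch checks that a line holds exactly one balanced bracketed tree and lists its words. The class TreebankLine and both of its methods are hypothetical helpers for this example and are not part of OpenNLP. The sample tree, including its root label, is made up for illustration, and the sketch assumes well-formed labels without handling any special Penn Treebank symbols.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of OpenNLP: inspects one line of bracketed training data. */
final class TreebankLine {

    /** True if the line is exactly one tree: brackets balance and close only at the very end. */
    static boolean isSingleTree(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.charAt(0) != '(') {
            return false;
        }
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false;                      // more closing than opening brackets
                }
                if (depth == 0 && i != s.length() - 1) {
                    return false;                      // a second tree or trailing text follows
                }
            }
        }
        return depth == 0;                             // false if the tree is cut off
    }

    /** Collects the words: in "(TAG word)" the word is the token right before the ")". */
    static List<String> words(String line) {
        String[] parts = line.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 2; i < parts.length; i++) {
            boolean closes = parts[i].equals(")");
            boolean wordBefore = !parts[i - 1].equals("(") && !parts[i - 1].equals(")");
            boolean tagBeforeWord = !parts[i - 2].equals("(");
            if (closes && wordBefore && tagBeforeWord) {
                out.add(parts[i - 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "(TOP (S (NP (DT The) (NN dog)) (VP (VBZ barks)) (. .)))";
        System.out.println(isSingleTree(line));   // one complete tree on the line
        System.out.println(words(line));          // the sentence's tokens in order
    }
}
```

A check like this can catch treebank files in which a tree is still spread over several lines, or in which two trees ended up on the same line, before they are passed to training.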
A parser model also contains a POS tagger model. Depending on the amount of available
training data, it is recommended to replace this tagger model with one that
was trained on a larger corpus. The pre-trained parser model provided on the website
does this to achieve better performance. (TODO: On which data is the model on
the website trained, and say on which data the tagger model is trained)
Training Tool
OpenNLP has a command line tool which is used to train the models available from the
model download page on various corpora. The data must be converted to the OpenNLP parser
training format, which is briefly explained above.
To train the parser, a head rules file is also needed. (TODO: Add documentation about the head rules file)
Usage of the tool:
The model on the website was trained with the following command:
It is also possible to specify the cutoff and the number of iterations; these parameters
are used for all trained models. The -parserType parameter is optional;
to use the tree insertion parser, specify TREEINSERT as the type. The TaggerModelReplacer
tool replaces the tagger model inside the parser model with a new one.
Note: The original parser model will be overwritten with the new parser model which
contains the replaced tagger model.
Additionally, there are tools to retrain just the build model or the check model.
Training API
The Parser training API supports the training of a new parser model.
Four steps are necessary to train it:
1. Instantiate a HeadRules class; currently EnglishHeadRules and AncoraSpanishHeadRules are available.
2. Open a sample data stream.
3. Call a Parser train method, for either the CHUNKING or the TREEINSERT parser.
4. Save the ParserModel to a file or database.
The following code snippet shows how to instantiate the HeadRules:
The following code illustrates the three other steps, namely, opening the data, training
the model and saving the ParserModel into an output file.
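// inputStreamFactory, params, rules, mlParams and modelOutFile are assumed to be
// defined by the surrounding code; parseParserType is a helper of the trainer tool.
ParserModel model = null;
ObjectStream<Parse> sampleStream = null;
try {
    // Read the training data line by line and turn each line into a Parse.
    ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
    sampleStream = new ParseSampleStream(stringStream);

    // Train with the implementation that matches the requested parser type.
    ParserType type = parseParserType(params.getParserType());
    if (ParserType.CHUNKING.equals(type)) {
        model = opennlp.tools.parser.chunking.Parser.train(
            params.getLang(), sampleStream, rules, mlParams);
    } else if (ParserType.TREEINSERT.equals(type)) {
        model = opennlp.tools.parser.treeinsert.Parser.train(
            params.getLang(), sampleStream, rules, mlParams);
    }
}
catch (IOException e) {
    throw new TerminateToolException(-1, "IO error while reading training data or indexing data: "
        + e.getMessage(), e);
}
finally {
    try {
        if (sampleStream != null) {
            sampleStream.close();
        }
    }
    catch (IOException e) {
        // closing the stream can fail; nothing more can be done here
    }
}
// Save the trained model to the output file.
CmdLineUtil.writeModel("parser", modelOutFile, model);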
Parser Evaluation
The built-in evaluation can measure the parser performance. The
performance is measured on a test dataset.
Parser Evaluation Tool
The following command shows how the tool can be run:
A sample invocation, assuming you have a data sample named
en-parser-chunking.eval
and a trained model called en-parser-chunking.bin:
Here is a sample output:
The Parser Evaluation tool reimplements the PARSEVAL scoring method
as implemented by the EVALB script, which is the most widely used evaluation
tool for constituent parsing. Note, however, that the Parser Evaluation tool
currently does not allow exceptions to be made in the constituents to be
evaluated, in the way Collins or Bikel usually do.
Contributions are very welcome. If you want to contribute, please contact us on
the mailing list or comment on the Jira issue
OPENNLP-688.
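The idea behind PARSEVAL-style scoring can be sketched in a few lines: collect the labeled brackets of the gold and the predicted tree, count the matches, and derive precision, recall and F-measure. This is a simplified illustration and not the code of the Parser Evaluation tool or of EVALB. The class ParsevalSketch is hypothetical, the trees are made up, and the choices to skip preterminals and to count the root bracket are assumptions of this sketch. The input is assumed to be well-formed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of labeled bracket scoring; not the OpenNLP or EVALB implementation. */
final class ParsevalSketch {

    /** Returns "LABEL:start-end" for every node that contains another node (no preterminals). */
    static List<String> brackets(String tree) {
        String[] tok = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        List<String> spans = new ArrayList<>();
        List<String> labels = new ArrayList<>();    // stack of open node labels
        List<Integer> starts = new ArrayList<>();   // word index at which each open node starts
        List<Boolean> phrase = new ArrayList<>();   // whether the open node has a child node
        int words = 0;
        for (int i = 0; i < tok.length; i++) {
            if (tok[i].equals("(")) {
                if (!phrase.isEmpty()) {
                    phrase.set(phrase.size() - 1, true);   // the parent now has a child node
                }
                labels.add(tok[++i]);                      // the label follows the bracket
                starts.add(words);
                phrase.add(false);
            } else if (tok[i].equals(")")) {
                int top = labels.size() - 1;
                String label = labels.remove(top);
                int start = starts.remove(top);
                boolean isPhrase = phrase.remove(top);
                if (isPhrase) {
                    spans.add(label + ":" + start + "-" + words);
                }
            } else {
                words++;                                   // a word of the sentence
            }
        }
        return spans;
    }

    /** Returns {precision, recall, F-measure} of the predicted tree against the gold tree. */
    static double[] score(String gold, String predicted) {
        List<String> goldSpans = brackets(gold);
        List<String> predSpans = brackets(predicted);
        int goldCount = goldSpans.size();
        int matched = 0;
        for (String span : predSpans) {
            if (goldSpans.remove(span)) {                  // each gold bracket matches only once
                matched++;
            }
        }
        double p = predSpans.isEmpty() ? 0 : (double) matched / predSpans.size();
        double r = goldCount == 0 ? 0 : (double) matched / goldCount;
        double f = (p + r == 0) ? 0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        String gold = "(TOP (S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat)))))";
        String pred = "(TOP (S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a))) (NP (NN cat))))";
        double[] prf = score(gold, pred);
        System.out.printf("P=%.3f R=%.3f F=%.3f%n", prf[0], prf[1], prf[2]);
    }
}
```

In this example the gold tree has five brackets and the predicted tree six, of which three match, giving a precision of 0.5 and a recall of 0.6.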
Evaluation API
The evaluation can be performed on a pre-trained model and a test dataset or via cross validation.
In the first case the model must be loaded and a Parse ObjectStream must be created (see code samples above).
Assuming these two objects exist, the following code shows how to perform the evaluation:
In the cross validation case all the training arguments must be
provided (see the Training API section above).
To perform cross validation the ObjectStream must be resettable.
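// inputStreamFactory, trainParameters, headRules, parserType and listeners are
// assumed to be defined by the surrounding code.
ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
ObjectStream<Parse> sampleStream = new ParseSampleStream(stringStream);
ParserCrossValidator evaluator = new ParserCrossValidator("en", trainParameters, headRules,
    parserType, listeners.toArray(new ParserEvaluationMonitor[listeners.size()]));
// Run a 10-fold cross validation and print the resulting F-measure.
evaluator.evaluate(sampleStream, 10);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());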
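The requirement that the stream be resettable follows from how k-fold cross validation works: every fold makes another pass over the full data set to pick out its own training and test parts. The sketch below illustrates this with a plain list. The class KFoldSketch is hypothetical, and the modulo-based assignment of samples to folds is just one common scheme chosen for the illustration, not a description of how ParserCrossValidator partitions the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of k-fold partitioning; not the OpenNLP implementation. */
final class KFoldSketch {

    /**
     * Makes one pass over all samples and returns either the test part (test == true)
     * or the training part (test == false) of the given fold.
     */
    static <T> List<T> part(List<T> samples, int folds, int fold, boolean test) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            boolean inTest = i % folds == fold;   // every folds-th sample is held out
            if (inTest == test) {
                out.add(samples.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> samples = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            samples.add(i);
        }
        int folds = 5;
        // Each fold reads the complete data again, which a one-shot stream could not offer.
        for (int fold = 0; fold < folds; fold++) {
            System.out.println("fold " + fold
                + " train=" + part(samples, folds, fold, false)
                + " test=" + part(samples, folds, fold, true));
        }
    }
}
```

With ten samples and five folds, fold 0 holds out samples 0 and 5 and trains on the remaining eight.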