opennlp.tools.tokenize
Class TokenizerME
java.lang.Object
opennlp.tools.tokenize.TokenizerME
- All Implemented Interfaces:
- Tokenizer
public class TokenizerME
- extends Object
A Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage.
This tokenizer needs a statistical model to tokenize text. The model reproduces
the tokenization observed in the training data used to create it.
The TokenizerModel class encapsulates the model and provides
methods to create it from its binary representation.
A tokenizer instance is not thread safe. Each thread must instantiate its
own tokenizer; all of them can share one TokenizerModel instance
to save memory.
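The sharing pattern described above can be sketched with stand-in types, so that it runs without OpenNLP on the classpath. SharedModel and PerThreadTokenizer are hypothetical placeholders for TokenizerModel and TokenizerME, and whitespace splitting stands in for the statistical decisions; only the threading structure is the point here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for TokenizerModel: immutable, so it is safe to share.
final class SharedModel {
    final String delimiterRegex;
    SharedModel(String delimiterRegex) { this.delimiterRegex = delimiterRegex; }
}

// Hypothetical stand-in for TokenizerME: keeps per-call state, so it is not thread safe.
final class PerThreadTokenizer {
    private final SharedModel model;
    private String[] lastTokens = new String[0]; // mutable state overwritten on every call

    PerThreadTokenizer(SharedModel model) { this.model = model; }

    String[] tokenize(String text) {
        String trimmed = text.trim();
        lastTokens = trimmed.isEmpty() ? new String[0] : trimmed.split(model.delimiterRegex);
        return lastTokens;
    }
}

final class ThreadConfinedTokenizers {
    // One model instance for the whole application.
    static final SharedModel MODEL = new SharedModel("\\s+");

    // One tokenizer per thread, each built from the same shared model.
    static final ThreadLocal<PerThreadTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new PerThreadTokenizer(MODEL));

    static List<String[]> tokenizeAll(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String[]>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<String[]> task = () -> TOKENIZER.get().tokenize(text);
                futures.add(pool.submit(task));
            }
            List<String[]> results = new ArrayList<>();
            for (Future<String[]> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real classes, the same shape would presumably apply: load one TokenizerModel, then let each worker thread construct its own TokenizerME from it.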
To train a new model, the train(String, ObjectStream, boolean, TrainingParameters)
method can be used.
Sample usage:
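import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");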
- See Also:
Tokenizer, TokenizerModel, TokenSample
Method Summary
Type | Method | Description
double[] | getTokenProbabilities() | Returns the probabilities associated with the most recent call to AbstractTokenizer.tokenize(String) or tokenizePos(String).
String[] | tokenize(String s) | Splits a string into its atomic parts.
Span[] | tokenizePos(String d) | Tokenizes the string.
static TokenizerModel | train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) | Trains a model for the TokenizerME.
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) | Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, int cutoff, int iterations) | Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams) | Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams) | Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
boolean | useAlphaNumericOptimization() | Returns the value of the alpha-numeric optimization flag.
SPLIT
public static final String SPLIT
- Constant indicating a token split.
- See Also:
- Constant Field Values
NO_SPLIT
public static final String NO_SPLIT
- Constant indicating no token split.
- See Also:
- Constant Field Values
alphaNumeric
@Deprecated
public static final Pattern alphaNumeric
- Deprecated. As of release 1.5.2, replaced by
Factory.getAlphanumeric(String)
- Alpha-Numeric Pattern
TokenizerME
public TokenizerME(TokenizerModel model)
TokenizerME
public TokenizerME(TokenizerModel model,
Factory factory)
- Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
getTokenProbabilities
public double[] getTokenProbabilities()
- Returns the probabilities associated with the most recent
call to AbstractTokenizer.tokenize(String) or tokenizePos(String).
- Returns:
- the probability of each token returned by the most recent
call to tokenize. If not applicable, an empty array is
returned.
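Because the returned array holds one probability per token, it can be read in parallel with the token array. The following is a minimal sketch of flagging low-confidence tokens. The token and probability values are hard-coded placeholders rather than real model output, and the lowConfidence helper and its threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenConfidence {
    // Returns the tokens whose probability falls below the threshold.
    // tokens[i] and probs[i] are assumed to describe the same token.
    static List<String> lowConfidence(String[] tokens, double[] probs, double threshold) {
        List<String> flagged = new ArrayList<>();
        if (probs.length == 0) {
            return flagged; // "not applicable" case: no probabilities available
        }
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("tokens and probabilities must be parallel arrays");
        }
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Placeholder values; with OpenNLP they would come from tokenize(String)
        // followed immediately by getTokenProbabilities() on the same instance.
        String[] tokens = {"Mr.", "Smith", "arrived", "."};
        double[] probs = {0.62, 1.0, 1.0, 0.98};
        System.out.println(lowConfidence(tokens, probs, 0.9));
    }
}
```

Since the probabilities belong to the most recent call, they should be read before the same tokenizer instance is used again.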
tokenizePos
public Span[] tokenizePos(String d)
- Tokenizes the string.
- Parameters:
d - The string to be tokenized.
- Returns:
- A span array containing individual tokens as elements.
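Unlike tokenize(String), this method reports where each token sits in the input rather than returning the token text. The sketch below illustrates that idea with a hypothetical Offsets class standing in for Span and plain whitespace splitting standing in for the model. It assumes half-open ranges, where end is the index just past the last character of the token.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetSketch {
    // Hypothetical stand-in for a span: half-open range [start, end) into the input.
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Recovers the token text from the original input.
        String coveredText(String input) {
            return input.substring(start, end);
        }
    }

    // Whitespace-only stand-in for tokenizePos: records the position of each token.
    static List<Offsets> whitespaceSpans(String d) {
        List<Offsets> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < d.length(); i++) {
            boolean whitespace = Character.isWhitespace(d.charAt(i));
            if (!whitespace && start < 0) {
                start = i; // a token begins here
            } else if (whitespace && start >= 0) {
                spans.add(new Offsets(start, i)); // the token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Offsets(start, d.length())); // token running to the end of input
        }
        return spans;
    }
}
```

Keeping offsets instead of strings lets later processing steps refer back to exact positions in the original text, including any whitespace between tokens.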
train
public static TokenizerModel train(ObjectStream<TokenSample> samples,
TokenizerFactory factory,
TrainingParameters mlParams)
throws IOException
- Trains a model for the TokenizerME.
- Parameters:
samples - the samples used for the training.
factory - a TokenizerFactory to get resources from
mlParams - the machine learning train parameters
- Returns:
- the trained TokenizerModel
- Throws:
IOException - if an IOException is thrown during IO operations on a temp file
which is created during training, or if reading from the ObjectStream fails.
train
public static TokenizerModel train(String languageCode,
ObjectStream<TokenSample> samples,
boolean useAlphaNumericOptimization,
TrainingParameters mlParams)
throws IOException
- Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
and pass in a TokenizerFactory
- Trains a model for the TokenizerME.
- Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true alpha numerics are skipped
mlParams - the machine learning train parameters
- Returns:
- the trained TokenizerModel
- Throws:
IOException - if an IOException is thrown during IO operations on a temp file
which is created during training, or if reading from the ObjectStream fails.
train
public static TokenizerModel train(String languageCode,
ObjectStream<TokenSample> samples,
Dictionary abbreviations,
boolean useAlphaNumericOptimization,
TrainingParameters mlParams)
throws IOException
- Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
and pass in a TokenizerFactory
- Trains a model for the TokenizerME.
- Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
abbreviations - an abbreviations dictionary
useAlphaNumericOptimization - if true alpha numerics are skipped
mlParams - the machine learning train parameters
- Returns:
- the trained TokenizerModel
- Throws:
IOException - if an IOException is thrown during IO operations on a temp file
which is created during training, or if reading from the ObjectStream fails.
train
@Deprecated
public static TokenizerModel train(String languageCode,
ObjectStream<TokenSample> samples,
boolean useAlphaNumericOptimization,
int cutoff,
int iterations)
throws IOException
- Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
and pass in a TokenizerFactory
- Trains a model for the TokenizerME.
- Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true alpha numerics are skipped
cutoff - number of times a feature must be seen to be considered
iterations - number of iterations to train the maxent model
- Returns:
- the trained TokenizerModel
- Throws:
IOException - if an IOException is thrown during IO operations on a temp file
which is created during training, or if reading from the ObjectStream fails.
train
public static TokenizerModel train(String languageCode,
ObjectStream<TokenSample> samples,
boolean useAlphaNumericOptimization)
throws IOException,
ObjectStreamException
- Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
and pass in a TokenizerFactory
- Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.
- Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true alpha numerics are skipped
- Returns:
- the trained TokenizerModel
- Throws:
IOException - if an IOException is thrown during IO operations on a temp file
which is created during training.
ObjectStreamException - if reading from the ObjectStream fails
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
- Returns the value of the alpha-numeric optimization flag.
- Returns:
- true if the tokenizer should use alpha-numeric optimization, false otherwise.
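The parameter descriptions say only that alpha numerics are skipped when this flag is true. One plausible reading, sketched below, is that a whitespace-delimited chunk made up solely of letters and digits is emitted as a single token without asking the model for split decisions. The regular expression and the needsModel helper are assumptions made for illustration, not the library's code.

```java
import java.util.regex.Pattern;

final class AlphaNumericShortcut {
    // Assumed pattern: a chunk consisting only of ASCII letters and digits.
    static final Pattern ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Returns true when a whitespace-delimited chunk would still need
    // per-character split decisions from the statistical model.
    static boolean needsModel(String chunk, boolean useAlphaNumericOptimization) {
        if (chunk.length() < 2) {
            return false; // a single character cannot be split any further
        }
        if (useAlphaNumericOptimization && ALPHANUMERIC.matcher(chunk).matches()) {
            return false; // purely alphanumeric: kept whole, the model is skipped
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(needsModel("hello", true));
        System.out.println(needsModel("hello", false));
        System.out.println(needsModel("don't", true));
    }
}
```

Under this reading the flag trades a little flexibility for speed, since most chunks in ordinary text are purely alphanumeric and never reach the classifier.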
tokenize
public String[] tokenize(String s)
- Description copied from interface: Tokenizer
- Splits a string into its atomic parts.
- Specified by:
tokenize in interface Tokenizer
- Parameters:
s - The string to be tokenized.
- Returns:
- The String[] with the individual tokens as the array elements.
Copyright © 2013 The Apache Software Foundation. All Rights Reserved.