java.lang.Object org.apache.nutch.analysis.lang.NGramProfile
public class NGramProfile
This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental. You have been warned. Methods are provided to build new NGramProfile instances.
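To make the idea of an ngram profile concrete, here is a minimal standalone sketch of counting character ngrams between a minimum and maximum length in one word. It does not use the Nutch classes; `NGramSketch` and `countNGrams` are hypothetical names, and details such as word padding or case folding that the real class may apply are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramSketch {

    /**
     * Counts every character ngram of length minLen..maxLen in a single word.
     * Hypothetical helper for illustration, not part of the Nutch API.
     */
    public static Map<String, Integer> countNGrams(String word, int minLen, int maxLen) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            // Slide a window of width n across the word.
            for (int i = 0; i + n <= word.length(); i++) {
                counts.merge(word.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {b=1, a=3, n=2, ba=1, an=2, na=2}
        System.out.println(countNGrams("banana", 1, 2));
    }
}
```

The `minLen` and `maxLen` arguments play the role that `minlen` and `maxlen` play in the constructor documented below.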
Field Summary | |
---|---|
static org.apache.commons.logging.Log |
LOG
|
Constructor Summary | |
---|---|
NGramProfile(String name,
int minlen,
int maxlen)
Construct a new ngram profile |
Method Summary | |
---|---|
void |
add(StringBuffer word)
Add ngrams from a single word to this profile |
void |
add(Token t)
Add ngrams from a token to this profile |
void |
analyze(StringBuilder text)
Analyze a piece of text |
static NGramProfile |
create(String name,
InputStream is,
String encoding)
Create a new language profile from a (preferably quite large) text file |
String |
getName()
|
float |
getSimilarity(NGramProfile another)
Calculate a score for how well two NGramProfiles match each other |
List<org.apache.nutch.analysis.lang.NGramProfile.NGramEntry> |
getSorted()
Return a sorted list of ngrams (sort done by 1. |
void |
load(InputStream is)
Loads an ngram profile from an InputStream (assumes UTF-8 encoded content) |
static void |
main(String[] args)
main method used for testing only |
protected void |
normalize()
Normalize the profile (calculates the ngram frequencies) |
void |
save(OutputStream os)
Writes NGramProfile content to an OutputStream using UTF-8 encoding |
String |
toString()
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final org.apache.commons.logging.Log LOG
Constructor Detail |
---|
public NGramProfile(String name, int minlen, int maxlen)
name
- is the name of the profile
minlen
- is the min length of ngram sequences
maxlen
- is the max length of ngram sequences
Method Detail |
---|
public String getName()
public void add(Token t)
t
- is the Token to be added
public void add(StringBuffer word)
word
- is the word to add
public void analyze(StringBuilder text)
text
- the text to be analyzed
protected void normalize()
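The method summary describes normalize() as calculating the ngram frequencies. A minimal standalone sketch of one way to do that follows. It assumes each count is divided by the total count of ngrams of the same length; that choice is an assumption of this sketch, not something this page confirms. `NormalizeSketch` is a hypothetical class.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizeSketch {

    /**
     * Turns raw ngram counts into relative frequencies.
     * Assumption: frequencies are relative to ngrams of the same length.
     */
    public static Map<String, Float> normalize(Map<String, Integer> counts) {
        // First pass: total number of occurrences per ngram length.
        Map<Integer, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            totals.merge(e.getKey().length(), e.getValue(), Integer::sum);
        }
        // Second pass: divide each count by the total for its length.
        Map<String, Float> freqs = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int total = totals.get(e.getKey().length());
            freqs.put(e.getKey(), e.getValue() / (float) total);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 3);
        counts.put("b", 1);
        counts.put("an", 2);
        counts.put("na", 2);
        // prints {a=0.75, b=0.25, an=0.5, na=0.5}
        System.out.println(normalize(counts));
    }
}
```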
public List<org.apache.nutch.analysis.lang.NGramProfile.NGramEntry> getSorted()
public String toString()
toString
in class Object
public float getSimilarity(NGramProfile another)
another
- ngram profile to compare against
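This page does not say how the score is computed, and the class description calls the calculation experimental. As a rough illustration only, the sketch below compares two frequency maps by summing absolute frequency differences over the union of their ngrams, so identical profiles score 0. The measure, the direction of the score, and the names `SimilaritySketch` and `distance` are all assumptions of this sketch rather than the documented behavior of getSimilarity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimilaritySketch {

    /**
     * Sum of absolute frequency differences over all ngrams seen in either
     * profile. Lower means more alike. Illustrative measure only.
     */
    public static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        float sum = 0f;
        for (String k : keys) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(k, 0f) - b.getOrDefault(k, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> first = new HashMap<>();
        first.put("th", 0.5f);
        first.put("he", 0.5f);
        Map<String, Float> second = new HashMap<>();
        second.put("th", 0.5f);
        second.put("de", 0.5f);
        System.out.println(distance(first, first));  // prints 0.0
        System.out.println(distance(first, second)); // prints 1.0
    }
}
```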
public void load(InputStream is) throws IOException
is
- the InputStream to read
IOException
public static NGramProfile create(String name, InputStream is, String encoding)
name
- is the name of the profile
is
- is the stream to read
encoding
- is the encoding of the stream
public void save(OutputStream os) throws IOException
os
- the Stream to output to
IOException
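Both load and save are documented as working with UTF-8. The standalone sketch below shows a UTF-8 round trip of ngram counts through an OutputStream and an InputStream. The one-line-per-ngram layout (`ngram count`) and the class `ProfileIoSketch` are assumptions made for illustration; the actual file format is not described on this page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileIoSketch {

    /** Writes one "ngram count" line per entry, encoded as UTF-8 (assumed layout). */
    public static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + " " + e.getValue() + "\n");
        }
        w.flush();
    }

    /** Reads back the lines written by save, decoding them as UTF-8. */
    public static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader r = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = r.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // skip lines that do not match the assumed layout
            }
            counts.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }

    /** Saves to an in-memory buffer and loads the result back. */
    public static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            save(counts, buf);
            return load(new ByteArrayInputStream(buf.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("\u00e9t", 4); // non-ASCII ngram to exercise the encoding
        counts.put("th", 7);
        System.out.println(roundTrip(counts).equals(counts)); // prints true
    }
}
```

Fixing the charset explicitly on both the writer and the reader keeps the round trip independent of the platform default encoding.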
public static void main(String[] args)
args
-