org.apache.nutch.analysis.lang
Class NGramProfile

java.lang.Object
  extended by org.apache.nutch.analysis.lang.NGramProfile

public class NGramProfile
extends Object

This class runs an ngram analysis over submitted text; the results may be used for automatic language identification. The similarity calculation is still experimental. You have been warned. Methods are provided to build new ngram profiles.

Author:
Sami Siren, Jerome Charron - http://frutch.free.fr/

Field Summary
static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
NGramProfile(String name, int minlen, int maxlen)
          Construct a new ngram profile
 
Method Summary
 void add(StringBuffer word)
          Add ngrams from a single word to this profile
 void add(Token t)
          Add ngrams from a token to this profile
 void analyze(StringBuilder text)
          Analyze a piece of text
static NGramProfile create(String name, InputStream is, String encoding)
          Create a new language profile from a (preferably quite large) text file
 String getName()
           
 float getSimilarity(NGramProfile another)
          Calculate a score for how well two NGramProfiles match each other
 List<org.apache.nutch.analysis.lang.NGramProfile.NGramEntry> getSorted()
          Return a sorted list of ngrams (sort done by 1. frequency 2. sequence)
 void load(InputStream is)
          Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
static void main(String[] args)
          main method used for testing only
protected  void normalize()
          Normalize the profile (calculates the ngram frequencies)
 void save(OutputStream os)
          Writes the NGramProfile content to an OutputStream using UTF-8 encoding
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

NGramProfile

public NGramProfile(String name,
                    int minlen,
                    int maxlen)
Construct a new ngram profile

Parameters:
name - is the name of the profile
minlen - is the min length of ngram sequences
maxlen - is the max length of ngram sequences
Method Detail

getName

public String getName()
Returns:
the name of the profile

add

public void add(Token t)
Add ngrams from a token to this profile

Parameters:
t - is the Token to be added

add

public void add(StringBuffer word)
Add ngrams from a single word to this profile

Parameters:
word - is the word to add
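
As an illustration of what adding the ngrams of a single word can involve, here is a minimal sketch. It is not the Nutch implementation: the class NGramSketch and its extract method are hypothetical, and the sketch assumes that every contiguous substring of length minlen to maxlen counts as one ngram, with no padding or case folding. The real class may prepare words differently.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Hypothetical helper: collects every substring whose length lies in minlen..maxlen.
    static List<String> extract(CharSequence word, int minlen, int maxlen) {
        List<String> ngrams = new ArrayList<>();
        for (int n = minlen; n <= maxlen; n++) {
            // Slide a window of width n across the word.
            for (int i = 0; i + n <= word.length(); i++) {
                ngrams.add(word.subSequence(i, i + n).toString());
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [t, e, x, t, te, ex, xt]
        System.out.println(extract(new StringBuffer("text"), 1, 2));
    }
}
```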

analyze

public void analyze(StringBuilder text)
Analyze a piece of text

Parameters:
text - the text to be analyzed

normalize

protected void normalize()
Normalize the profile (calculates the ngram frequencies)
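
One plausible reading of "calculates the ngram frequencies" is turning raw counts into relative frequencies. The sketch below shows that idea only. FrequencySketch is a hypothetical class, and dividing each count by the total over all ngrams is an assumption; the actual method may, for example, normalize separately per ngram length.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrequencySketch {
    // Hypothetical helper: maps each ngram to count / (sum of all counts).
    static Map<String, Float> normalize(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Float> frequencies = new LinkedHashMap<>();
        if (total == 0) {
            return frequencies; // empty profile: nothing to normalize
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (float) total);
        }
        return frequencies;
    }
}
```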


getSorted

public List<org.apache.nutch.analysis.lang.NGramProfile.NGramEntry> getSorted()
Return a sorted list of ngrams (sort done by 1. frequency 2. sequence)

Returns:
sorted list of ngrams
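
A two-level ordering of this kind can be sketched with a comparator. The class SortSketch is hypothetical, and the sort directions are assumptions: the documentation names the keys (frequency, then sequence) but not their direction, so this sketch puts higher counts first and breaks ties by ascending character sequence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SortSketch {
    // Hypothetical helper: returns ngrams ordered by count (descending), then text (ascending).
    static List<String> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // higher count first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getKey());
        }
        return result;
    }
}
```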

toString

public String toString()
Overrides:
toString in class Object

getSimilarity

public float getSimilarity(NGramProfile another)
Calculate a score for how well two NGramProfiles match each other

Parameters:
another - ngram profile to compare against
Returns:
similarity score, where 0 means an exact match
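
Because 0 means an exact match, the returned value behaves like a distance rather than a similarity. The sketch below illustrates one distance with that property, the sum of absolute frequency differences over the ngrams of both profiles. This formula and the class SimilaritySketch are assumptions made for illustration; the documentation does not state which formula the experimental calculation uses.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimilaritySketch {
    // Hypothetical helper: sums |freqA - freqB| over every ngram seen in either profile.
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        float sum = 0f;
        for (String ngram : ngrams) {
            // An ngram missing from one profile contributes its full frequency.
            sum += Math.abs(a.getOrDefault(ngram, 0f) - b.getOrDefault(ngram, 0f));
        }
        return sum;
    }
}
```

With this sketch, identical profiles score 0 and profiles that share no ngrams score the sum of all their frequencies, so lower values indicate a closer match.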

load

public void load(InputStream is)
          throws IOException
Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)

Parameters:
is - the InputStream to read
Throws:
IOException

create

public static NGramProfile create(String name,
                                  InputStream is,
                                  String encoding)
Create a new language profile from a (preferably quite large) text file

Parameters:
name - is the name of the profile
is - is the stream to read
encoding - is the encoding of stream

save

public void save(OutputStream os)
          throws IOException
Writes the NGramProfile content to an OutputStream using UTF-8 encoding

Parameters:
os - the Stream to output to
Throws:
IOException
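
Both save and load work with UTF-8 content, so a profile written by one should be readable by the other. The round trip below is a sketch of that idea using only the standard library. The class ProfileIoSketch is hypothetical, and the one-entry-per-line "ngram count" layout is an assumption for illustration, not the documented file format.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileIoSketch {
    // Writes one "ngram count" line per entry, always encoded as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            writer.write(e.getKey() + " " + e.getValue() + "\n");
        }
        writer.flush(); // the caller owns the stream, so flush but do not close
    }

    // Reads the lines back, decoding as UTF-8 regardless of the platform default.
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            int space = line.lastIndexOf(' ');
            if (space <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, space), Integer.parseInt(line.substring(space + 1)));
        }
        return counts;
    }

    // Convenience for trying the pair in memory.
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            save(counts, buffer);
            return load(new ByteArrayInputStream(buffer.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Naming the charset explicitly on both sides matters because the platform default encoding can differ between the machine that writes a profile and the one that reads it.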

main

public static void main(String[] args)
main method used for testing only

Parameters:
args -


Copyright © 2006 The Apache Software Foundation