org.apache.nutch.analysis.lang
Class LanguageIdentifier

java.lang.Object
  extended by org.apache.nutch.analysis.lang.LanguageIdentifier

public class LanguageIdentifier
extends Object

Identifies the language of a piece of content, based on statistical analysis.
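
This page does not say which statistics the class relies on. Purely as an illustration of the general idea, the sketch below assumes a character-trigram comparison, a common technique for language guessing. It is not the Nutch implementation: the class name TrigramSketch, the two tiny training sentences, and the scoring rule are all hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy illustration only; not the algorithm used by LanguageIdentifier.
public class TrigramSketch {
    // Hypothetical training samples; a real identifier would need far larger corpora.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", trigrams("the quick brown fox jumps over the lazy dog and then the cat"));
        PROFILES.put("fi", trigrams("minun nimeni on matti ja asun suomessa talo on iso ja auto on punainen"));
    }

    // Collects the distinct character trigrams of the lower-cased text.
    public static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Returns the code of the profile that shares the most trigrams with the
    // content, or null when no profile shares any trigram.
    public static String identify(String content) {
        Set<String> grams = trigrams(content);
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            Set<String> shared = new HashSet<>(grams);
            shared.retainAll(profile.getValue());
            if (shared.size() > bestScore) {
                bestScore = shared.size();
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(identify("the dog and the cat"));
        System.out.println(identify("auto on iso"));
    }
}
```

Returning a two-letter code mirrors the return value documented for the identify methods below; returning null for unmatched input is a choice made for this sketch only.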

Author:
Sami Siren, Jérôme Charron
See Also:
ISO 639 Language Codes

Constructor Summary
LanguageIdentifier(Configuration conf)
          Constructs a new Language Identifier.
 
Method Summary
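 String identify(InputStream is)
          Identifies the language of the content read from an input stream, using the platform default encoding.
 String identify(InputStream is, String charset)
          Identifies the language of the content read from an input stream, using the specified charset.
 String identify(String content)
          Identifies the language of the given content.
 String identify(StringBuilder content)
          Identifies the language of the given content.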
static void main(String[] args)
          Entry point for command-line use.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LanguageIdentifier

public LanguageIdentifier(Configuration conf)
Constructs a new Language Identifier.

Method Detail

main

public static void main(String[] args)
Entry point for command-line use.
Usage:
 LanguageIdentifier [-identifyrows filename maxlines]
                    [-identifyfile charset filename]
                    [-identifyfileset charset files]
                    [-identifytext text]
                    [-identifyurl url]
 

Parameters:
args - the command-line arguments, in one of the forms listed in the usage above.

identify

public String identify(String content)
Identifies the language of the given content.

Parameters:
content - the content to analyze.
Returns:
The two-letter ISO 639 language code (en, fi, sv, ...) of the language that best matches the specified content.

identify

public String identify(StringBuilder content)
Identifies the language of the given content.

Parameters:
content - the content to analyze.
Returns:
The two-letter ISO 639 language code (en, fi, sv, ...) of the language that best matches the specified content.

identify

public String identify(InputStream is)
                throws IOException
Identifies the language of the content read from an input stream. This method reads the stream using the platform default encoding. To use a specific encoding, call identify(InputStream, String) instead.

Parameters:
is - the input stream to analyze.
Returns:
The two-letter ISO 639 language code (en, fi, sv, ...) of the language that best matches the content of the specified input stream.
Throws:
IOException - if an error occurs while reading the input stream.
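
Why the encoding matters can be shown without Nutch at all: the same bytes decode to different text under different charsets, and any analysis then works on whatever text the decoding produced. The sketch below uses only the JDK; the class name CharsetDemo and the helper decode are hypothetical and are not part of this API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Reads the whole stream as text, decoding it with the named charset.
    public static String decode(InputStream is, String charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(is, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Finnish greeting, written with Unicode escapes so the source file stays ASCII.
        String text = "Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String right = decode(new ByteArrayInputStream(utf8), "UTF-8");
        // Decoding the same bytes as ISO-8859-1 turns each two-byte letter into two wrong characters.
        String wrong = decode(new ByteArrayInputStream(utf8), "ISO-8859-1");

        System.out.println(right.equals(text));
        System.out.println(wrong.equals(text));
        System.out.println(right.length() + " vs " + wrong.length());
    }
}
```

If the platform default happens to differ from the encoding of the data, the single-argument method will analyze text of the second kind, which is why the two-argument overload exists.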

identify

public String identify(InputStream is,
                       String charset)
                throws IOException
Identifies the language of the content read from an input stream, decoding it with the specified charset.

Parameters:
is - the input stream to analyze.
charset - the name of the charset used to read the input stream.
Returns:
The two-letter ISO 639 language code (en, fi, sv, ...) of the language that best matches the content of the specified input stream.
Throws:
IOException - if an error occurs while reading the input stream.
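
This page does not describe how the method reacts to a charset name the JVM does not recognize. A caller who receives the name from an untrusted source (an HTTP header, for instance) may prefer to check it first. The following is a minimal sketch using only java.nio.charset.Charset; the class CharsetNames and the method orFallback are hypothetical helpers, not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetNames {
    // Returns the given name if this JVM supports it, otherwise the fallback.
    public static String orFallback(String name, String fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(orFallback("ISO-8859-1", "UTF-8"));
        System.out.println(orFallback("no-such-charset-xyz", "UTF-8"));
    }
}
```

The checked name could then be passed as the charset argument, for example identify(is, CharsetNames.orFallback(declared, "UTF-8")), where declared is whatever name the caller was given.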


Copyright © 2006 The Apache Software Foundation