Class LanguageIdentifier

java.lang.Object
org.apache.tika.langdetect.tika.LanguageIdentifier

public class LanguageIdentifier extends Object
Identifier of the language that best matches a given content profile. The content profile is compared to generic language profiles based on material from various sources.
Since:
Apache Tika 0.5
  • Constructor Details

    • LanguageIdentifier

      public LanguageIdentifier(LanguageProfile profile)
      Constructs a language identifier based on a LanguageProfile
      Parameters:
      profile - the language profile
    • LanguageIdentifier

      public LanguageIdentifier(String content)
      Constructs a language identifier based on a String of text content
      Parameters:
      content - the text
  • Method Details

    • addProfile

      public static void addProfile(String language, LanguageProfile profile)
      Adds a single language profile
      Parameters:
      language - an ISO 639 code representing language
      profile - the language profile
    • initProfiles

      public static void initProfiles()
      Builds the language profiles. The list of languages is fetched from a property file named "tika.language.properties". If a file called "tika.language.override.properties" is found on the classpath, it is used instead. The property file contains a key "languages" whose value is a comma-separated list of language codes.
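
The property-file format described above can be illustrated with a minimal sketch. It uses only `java.util.Properties` and is not Tika's own loading code: the class `LanguageListSketch` and the method `parseLanguages` are hypothetical names, and the classpath lookup of the default and override files is not reproduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration; not part of Apache Tika.
public class LanguageListSketch {

    // Reads the "languages" key and splits its value on commas.
    static List<String> parseLanguages(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> codes = new ArrayList<>();
        for (String code : props.getProperty("languages", "").split(",")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                codes.add(trimmed); // skip empty entries such as a trailing comma
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Sample file content, assumed for illustration only.
        System.out.println(parseLanguages("languages=en,fr, de\n"));
    }
}
```

A missing "languages" key yields an empty list in this sketch; how the real class reports that case is not stated here, though hasErrors() and getErrors() exist for initialization problems.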
    • initProfiles

      public static void initProfiles(Map<String,LanguageProfile> profilesMap)
      Initializes the language profiles from a user-supplied, already initialized Map. This overrides the default set of profiles initialized at startup and provides an alternative to configuring profiles through a property file.
      Parameters:
      profilesMap - map of language profiles
    • clearProfiles

      public static void clearProfiles()
      Clears the current map of language profiles
    • hasErrors

      public static boolean hasErrors()
      Tests whether there were errors initializing language config
      Returns:
      true if there were errors; use getErrors() to retrieve them
    • getErrors

      public static String getErrors()
      Returns a string of error messages related to initializing language profiles
      Returns:
      the String containing the error messages
    • getSupportedLanguages

      public static Set<String> getSupportedLanguages()
      Returns the languages that are supported for language identification
      Returns:
      the set of supported ISO 639 language codes
    • getLanguage

      public String getLanguage()
      Gets the identified language
      Returns:
      an ISO 639 code representing the detected language
    • getRawScore

      public float getRawScore()
      Gets the raw score, computed as 1 minus the vector distance between the language model and the content
      Returns:
      the raw score
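
To make "1 minus the vector distance" concrete, here is a rough sketch. It assumes character-trigram profiles and a Euclidean distance between relative-frequency vectors; that choice is an assumption made for illustration, not a statement of how this class computes its score. `RawScoreSketch` and all of its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of Apache Tika.
public class RawScoreSketch {

    // Counts overlapping character trigrams in the text.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the two relative-frequency vectors (assumed metric).
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, a.values().stream().mapToInt(Integer::intValue).sum());
        double totalB = Math.max(1, b.values().stream().mapToInt(Integer::intValue).sum());
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double diff = a.getOrDefault(ngram, 0) / totalA - b.getOrDefault(ngram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Raw score as described above: 1 minus the distance.
    static float rawScore(String content, String reference) {
        return (float) (1.0 - distance(trigramCounts(content), trigramCounts(reference)));
    }

    public static void main(String[] args) {
        System.out.println(rawScore("the quick brown fox", "the quick brown fox"));
        System.out.println(rawScore("the quick brown fox", "zzzzzzzz"));
    }
}
```

In this sketch identical inputs have distance 0 and therefore a raw score of 1, and the score drops as the two profiles diverge.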
    • isReasonablyCertain

      public boolean isReasonablyCertain()
      Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amounts of input text.
      Returns:
      true if the distance is smaller than 0.022, false otherwise
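
The threshold rule above, and one way a caller might act on it, can be sketched as follows. This is a standalone illustration, not the class's own code: `CertaintySketch`, `languageOrFallback`, and the fallback code "und" are hypothetical choices, and only the 0.022 threshold comes from the description.

```java
// Hypothetical illustration; not part of Apache Tika.
public class CertaintySketch {

    // Threshold quoted in the method description.
    static final double CERTAINTY_THRESHOLD = 0.022;

    // Mirrors the documented rule: certain only when strictly below the threshold.
    static boolean isReasonablyCertain(double distance) {
        return distance < CERTAINTY_THRESHOLD;
    }

    // Caller-side pattern: trust the detected code only when the check passes.
    static String languageOrFallback(String detected, double distance, String fallback) {
        return isReasonablyCertain(distance) ? detected : fallback;
    }

    public static void main(String[] args) {
        System.out.println(languageOrFallback("en", 0.010, "und"));
        System.out.println(languageOrFallback("en", 0.300, "und"));
    }
}
```

Because short inputs never pass the check, a caller that handles short text probably needs a fallback of this kind rather than relying on getLanguage() alone.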
    • toString

      public String toString()
      Overrides:
      toString in class Object