Package org.apache.solr.update.processor
Class LanguageIdentifierUpdateProcessor
java.lang.Object
  org.apache.solr.update.processor.UpdateRequestProcessor
    org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor
All Implemented Interfaces:
Closeable, AutoCloseable, LangIdParams
Direct Known Subclasses:
LangDetectLanguageIdentifierUpdateProcessor, OpenNLPLangDetectUpdateProcessor, TikaLanguageIdentifierUpdateProcessor
public abstract class LanguageIdentifierUpdateProcessor extends UpdateRequestProcessor implements LangIdParams
Identifies the language of a set of input fields. Also supports mapping of field names based on the detected language. See "Detecting Languages During Indexing" in the reference guide.
Since:
3.5
WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Field Summary
Fields
Modifier and Type                 Field
protected HashSet<String>         allMapFieldsSet
protected String                  docIdField
protected boolean                 enabled
protected boolean                 enableMapping
protected boolean                 enforceSchema
protected String[]                fallbackFields
protected String                  fallbackValue
protected String[]                inputFields
protected HashSet<String>         langAllowlist
protected String                  langField
protected Pattern                 langPattern
protected String                  langsField
protected HashMap<String,String>  lcMap
protected String[]                mapFields
protected boolean                 mapIndividual
protected HashSet<String>         mapIndividualFieldsSet
protected boolean                 mapKeepOrig
protected HashMap<String,String>  mapLcMap
protected boolean                 mapOverwrite
protected Pattern                 mapPattern
protected String                  mapReplaceStr
protected int                     maxFieldValueChars
protected int                     maxTotalChars
protected boolean                 overwrite
protected IndexSchema             schema
protected double                  threshold
protected Pattern                 tikaSimilarityPattern
-
Fields inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
next
-
Fields inherited from interface org.apache.solr.update.processor.LangIdParams
DOCID_FIELD_DEFAULT, DOCID_LANGFIELD_DEFAULT, DOCID_LANGSFIELD_DEFAULT, DOCID_PARAM, DOCID_THRESHOLD_DEFAULT, ENFORCE_SCHEMA, FALLBACK, FALLBACK_FIELDS, FIELDS_PARAM, LANG_ALLOWLIST, LANG_FIELD, LANG_WHITELIST, LANGS_FIELD, LANGUAGE_ID, LCMAP, MAP_ENABLE, MAP_FL, MAP_INDIVIDUAL, MAP_INDIVIDUAL_FL, MAP_KEEP_ORIG, MAP_LCMAP, MAP_OVERWRITE, MAP_PATTERN, MAP_PATTERN_DEFAULT, MAP_REPLACE, MAP_REPLACE_DEFAULT, MAX_FIELD_VALUE_CHARS, MAX_FIELD_VALUE_CHARS_DEFAULT, MAX_TOTAL_CHARS, MAX_TOTAL_CHARS_DEFAULT, OVERWRITE, THRESHOLD
-
-
Constructor Summary
Constructors
LanguageIdentifierUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
-
Method Summary
protected String concatFields(org.apache.solr.common.SolrInputDocument doc)
    Concatenates content from input fields defined in langid.fl.
protected abstract List<DetectedLanguage> detectLanguage(Reader solrDocReader)
    Detects language(s) from a reader, typically based on some fields in SolrInputDocument. Classes wishing to implement their own language detection module should override this method.
protected List<DetectedLanguage> detectLanguage(org.apache.solr.common.SolrInputDocument doc)
    Detects language(s) from all configured fields.
protected String getMappedField(String currentField, String language)
    Returns the name of the field to map the current contents into, so that they are properly analyzed.
boolean isEnabled()
    Tells if this processor is enabled or not.
protected String normalizeLangCode(String langCode)
    Looks up language code in map (langid.lcmap) and returns mapped value.
protected void process(org.apache.solr.common.SolrInputDocument doc)
    This is the main process method called from processAdd().
void processAdd(AddUpdateCommand cmd)
protected String resolveLanguage(String language, String fallbackLang)
    Chooses a language based on the list of candidates detected.
protected String resolveLanguage(List<DetectedLanguage> languages, String fallbackLang)
    Chooses a language based on the list of candidates detected.
void setEnabled(boolean enabled)
protected SolrInputDocumentReader solrDocReader(org.apache.solr.common.SolrInputDocument doc, String[] fields)
    Returns a reader that streams String content from fields.
Methods inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
close, doClose, finish, processCommit, processDelete, processMergeIndexes, processRollback
-
-
-
-
Field Detail
-
enabled
protected boolean enabled
-
inputFields
protected String[] inputFields
-
mapFields
protected String[] mapFields
-
mapPattern
protected Pattern mapPattern
-
mapReplaceStr
protected String mapReplaceStr
-
langField
protected String langField
-
langsField
protected String langsField
-
docIdField
protected String docIdField
-
fallbackValue
protected String fallbackValue
-
fallbackFields
protected String[] fallbackFields
-
enableMapping
protected boolean enableMapping
-
mapKeepOrig
protected boolean mapKeepOrig
-
overwrite
protected boolean overwrite
-
mapOverwrite
protected boolean mapOverwrite
-
mapIndividual
protected boolean mapIndividual
-
enforceSchema
protected boolean enforceSchema
-
threshold
protected double threshold
-
schema
protected IndexSchema schema
-
maxFieldValueChars
protected int maxFieldValueChars
-
maxTotalChars
protected int maxTotalChars
-
tikaSimilarityPattern
protected final Pattern tikaSimilarityPattern
-
langPattern
protected final Pattern langPattern
-
-
Constructor Detail
-
LanguageIdentifierUpdateProcessor
public LanguageIdentifierUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
-
-
Method Detail
-
processAdd
public void processAdd(AddUpdateCommand cmd) throws IOException
Overrides:
processAdd in class UpdateRequestProcessor
Throws:
IOException
-
process
protected void process(org.apache.solr.common.SolrInputDocument doc)
This is the main process method called from processAdd().
Parameters:
doc - the SolrInputDocument to modify
-
detectLanguage
protected List<DetectedLanguage> detectLanguage(org.apache.solr.common.SolrInputDocument doc)
Detects language(s) from all configured fields.
Parameters:
doc - the Solr document
Returns:
List of detected language(s) according to RFC-3066
-
detectLanguage
protected abstract List<DetectedLanguage> detectLanguage(Reader solrDocReader)
Detects language(s) from a reader, typically based on some fields in SolrInputDocument. Classes wishing to implement their own language detection module should override this method.
Parameters:
solrDocReader - a reader serving the text from the document to detect
Returns:
List of detected language(s) according to RFC-3066
-
resolveLanguage
protected String resolveLanguage(String language, String fallbackLang)
Chooses a language based on the list of candidates detected.
Parameters:
language - language code as a string
fallbackLang - the language code to use as a fallback
Returns:
a string of the chosen language
-
resolveLanguage
protected String resolveLanguage(List<DetectedLanguage> languages, String fallbackLang)
Chooses a language based on the list of candidates detected.
Parameters:
languages - a List of DetectedLanguages with certainty score
fallbackLang - the language code to use as a fallback
Returns:
a string of the chosen language
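The selection step described above can be illustrated with a standalone sketch. It is not the Solr implementation: the `Candidate` class is a hypothetical stand-in for `DetectedLanguage`, and the way the `threshold` and `langAllowlist` fields take part in the decision (first candidate wins if it is allowed and certain enough, otherwise the fallback is used) is an assumption made for illustration.

```java
import java.util.List;
import java.util.Set;

// Standalone sketch; no Solr classes are used and all names are hypothetical.
final class ResolveLanguageSketch {

    /** Hypothetical stand-in for DetectedLanguage: a language code plus a certainty score. */
    static final class Candidate {
        final String langCode;
        final double certainty;

        Candidate(String langCode, double certainty) {
            this.langCode = langCode;
            this.certainty = certainty;
        }
    }

    /**
     * Returns the first candidate's code when it is allowed and certain enough,
     * otherwise the fallback. Assumptions: candidates are ordered best-first,
     * and an empty allowlist means "allow every language".
     */
    static String resolve(List<Candidate> candidates, String fallbackLang,
                          Set<String> allowlist, double threshold) {
        if (candidates.isEmpty()) {
            return fallbackLang;
        }
        Candidate best = candidates.get(0);
        boolean allowed = allowlist.isEmpty() || allowlist.contains(best.langCode);
        return (allowed && best.certainty >= threshold) ? best.langCode : fallbackLang;
    }

    public static void main(String[] args) {
        List<Candidate> detected = List.of(new Candidate("fr", 0.92), new Candidate("en", 0.05));
        System.out.println(resolve(detected, "en", Set.of(), 0.5));
        System.out.println(resolve(detected, "en", Set.of("en", "de"), 0.5));
    }
}
```

In this sketch the first call selects the detected code, while the second falls back because the detected code is not in the allowlist.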
-
normalizeLangCode
protected String normalizeLangCode(String langCode)
Looks up language code in map (langid.lcmap) and returns mapped value.
Parameters:
langCode - the language code string returned from detector
Returns:
the normalized/mapped language code
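As a rough illustration of this lookup, the sketch below parses a mapping specification and normalizes a code against it. It is standalone and does not call Solr. The space-separated `from:to` format used for the specification, and returning the input unchanged when no mapping exists, are assumptions; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch; names and the "from:to" spec format are assumptions.
final class LangCodeMapSketch {

    /** Parses a space-separated list of colon-delimited pairs, e.g. "ja:cjk zh:cjk". */
    static Map<String, String> parseLcMap(String spec) {
        Map<String, String> map = new HashMap<>();
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                map.put(kv[0], kv[1]); // malformed entries without a colon are skipped
            }
        }
        return map;
    }

    /** Returns the mapped code, or the original code when no mapping exists (assumed behavior). */
    static String normalize(String langCode, Map<String, String> lcMap) {
        return lcMap.getOrDefault(langCode, langCode);
    }

    public static void main(String[] args) {
        Map<String, String> lcMap = parseLcMap("ja:cjk zh:cjk ko:cjk");
        System.out.println(normalize("ja", lcMap));
        System.out.println(normalize("en", lcMap));
    }
}
```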
-
getMappedField
protected String getMappedField(String currentField, String language)
Returns the name of the field to map the current contents into, so that they are properly analyzed. For instance, if the currentField is "text" and the code is "en", the new field would by default be "text_en". This method also performs custom regex pattern replace if configured. If enforceSchema=true and the resulting field name doesn't exist, then null is returned.
Parameters:
currentField - the current field name
language - the language code
Returns:
The new schema field name, based on pattern and replace, or null if illegal
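The mapping rule can be sketched without Solr on the classpath. In the example below, the pattern `(.*)`, the replace template `$1_{lang}` and the `{lang}` placeholder are assumptions chosen to reproduce the "text" to "text_en" example; the `schemaFields` set is a hypothetical stand-in for a lookup against the IndexSchema.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Standalone sketch; pattern, template, placeholder and schemaFields are assumptions.
final class MappedFieldSketch {

    /**
     * Builds the target field name by substituting the language code into the
     * replace template and applying it to the first match of the pattern.
     * Returns null when enforceSchema is true and the name is not a known field.
     */
    static String mappedField(String currentField, String language, Pattern mapPattern,
                              String mapReplace, boolean enforceSchema, Set<String> schemaFields) {
        // Plain-text substitution of the placeholder; assumes the language code
        // contains no '$' or '\' characters, which are special in regex replacements.
        String replacement = mapReplace.replace("{lang}", language);
        String newField = mapPattern.matcher(currentField).replaceFirst(replacement);
        if (enforceSchema && !schemaFields.contains(newField)) {
            return null;
        }
        return newField;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(.*)");
        System.out.println(mappedField("text", "en", pattern, "$1_{lang}", false, Set.of()));
        System.out.println(mappedField("text", "en", pattern, "$1_{lang}", true, Set.of("text_de")));
    }
}
```

`replaceFirst` is used rather than `replaceAll` because a pattern such as `(.*)` also matches the empty string at the end of the input, which would append the suffix twice.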
-
isEnabled
public boolean isEnabled()
Tells if this processor is enabled or not.
Returns:
true if enabled, else false
-
setEnabled
public void setEnabled(boolean enabled)
-
solrDocReader
protected SolrInputDocumentReader solrDocReader(org.apache.solr.common.SolrInputDocument doc, String[] fields)
Returns a reader that streams String content from fields. This is more memory efficient than building a full string buffer.
Parameters:
doc - the Solr document
fields - the field names to read
Returns:
a reader over the fields
-
concatFields
protected String concatFields(org.apache.solr.common.SolrInputDocument doc)
Concatenates content from input fields defined in langid.fl. For test purposes only.
-
-