Contents - Introduction - Description of resources - lvg.properties - LVG database - Running the LVG annotator - LvgAnnotator.xml - AggregateAE.xml ############ Introduction ############ This annotator wraps the National Library of Medicine (NLM) SPECIALIST lexical tools. See the documentation for the SPECIALIST lexical tools at http://lexsrv2.nlm.nih.gov/LexSysGroup/Projects/ctakes-lvg/current/index.html And also see the documentation for Lvg and Norm at http://lexsrv2.nlm.nih.gov/LexSysGroup/Projects/ctakes-lvg/current/docs/userDoc/index.html This annotator generates a canonical form for each word and also generates a list of lemma entries with Penn Treebank tags. These tags could be useful for a part of speech (POS) tagger. However, for the OpenNLP POS tagger, we use a tag dictionary rather than lemma information. See the documentation for the POS tagger annotator. ######################## Description of resources ######################## %%%%%%%%%%%%%%%% lvg.properties %%%%%%%%%%%%%%%% The LVG config file resources/ctakes-lvg/data/config/lvg.properties defines the location and attributes of the LVG database and the jdbc driver used. %%%%%%%%%%%%%%%% LVG database %%%%%%%%%%%%%%%% The database engine used is hsqldb. The LVG database available from the NLM is hundreds of megabytes. To keep this project relatively small, the database tables included with this project have a relatively small number of rows. To use the full lvg2008 database, download either lvg2008lite.tgz or lvg2008.tgz from http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/ctakes-lvg/2008/index.html and extract data/HSqlDb from that download to a temporary location. Then replace the resources/ctakes-lvg/data/HSqlDb directory in this project with the data/HSqlDb directory from that download. ######################### Running the LVG annotator ######################### %%%%%%%%%%%%%%%% LvgAnnotator.xml %%%%%%%%%%%%%%%% The parameters are: UseSegments - controls whether only certain sections will be annotated by this annotator SegmentsToSkip - list of sections not to be processed by this annotator UseCmdCache - controls whether to look up information in a cache before using norm CmdCacheFileLocation - location of norm cache file CmdCacheFrequencyCutoff - ExclusionSet - words for which canonicalForm is never set and Lemma entries are never posted XeroxTreebankMap - mapping of part of speech tags, used to POS tags from lexical tools to Penn Treebank tags PostLemmas - controls whether any lemma entries are posted to the CAS UseLemmaCache - controls whether to look up lemma information in a cache before using lvg LemmaCacheFileLocation - the location of the cache file LemmaCacheFileFrequencyCutoff - Note: as distributed, PostLemmas is set to false. This is done to reduce the size of the CAS. Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma annotations added to the CAS.