cTakes Dictionary Lookup Module 2 Help 3-04-2014 Lookup Workflow No Lookup Initializer Unlike the current dictionary module, the new one has no lookup initializer that sits apart from the lookup annotator. All functionality of the current module’s Initializers resides in the new module’s Annotator. This removes some of the “pluggability” that exists for the current module, but greatly reduces complexity and overhead. Lookup per Registered Dictionary per Lookup Window For each proper lookup token encountered in the lookup window, the new dictionary lookup code runs through a loop that performs lookup per each registered Dictionary, finding terms based upon the lookup token per dictionary. All Lookups are performed in the same manner, just using different term repositories (dictionaries). Term Match Given the full text for all candidate terms in a dictionary, a match is attempted using that text on the full text of the lookup window. Single Lookup Consumer After all terms have been appropriated from all dictionaries, a single Consumer takes the collection of terms and performs some action(s) upon them, such as storing them in the Cas. A Consumer (or delegates) can be created to differently handle terms with different Cuis, Tuis, words, entity types, schemes, etc. if that is necessary, but there is one consumer entry point, called once with the entire collection of discovered terms from all dictionaries. Lookup Methods Lookup by Rare Word Each dictionary entry, UMLS or other, contains the rare word used for the lookup key, a count of words (tokens) within the term, and an index of the rare word within the term. For each lookup token (word of proper part of speech and span length) in the note, the lookup code queries the dictionary for terms with a matching rare word. For each returned term, a match is attempted using the term and the appropriate token coverage in the note based upon the offset indicated by the rare word index and length of the term. This creates a forward and backward matching method instead of the forward-only matching method of the current dictionary (first word) lookup. Using a term’s rare word as an index instead of a first word decreases, on average, the number of terms for which matching must be performed. Text Exact Match Because the UMLS dictionary contains rows with different combinations of lexical elements per term, using a direct string match of text in note to text of term is a valid candidate for term matching. This is different from the complex mechanism in the current (first word) lookup, and makes for simpler code and greater accuracy. This precise specification (and improved lookup speed) enables the use of an entire sentence as a lookup window rather than just a noun phrase. Usage of Sentence as a lookup window allows all possible tokens to be used for not only lookup keys, but also for term matching. For proper accuracy, custom dictionaries should also contain multiple entries for variations of term syntax. Note that term matching is attempted using the actual text in the note and also per-token cTakes-generated lexical variants of the text in the note. Text Overlap Match To better approximate the current (first word) lookup annotator, one lookup method finds overlapping terms in addition to exact matching terms. This allows matches on discontiguous spans. For instance, for the text “blood, urine test” the exact match will find only one procedure: “urine test”. The overlap match will find both “urine test” and “blood test”. Persistence Methods (Consumer) All Terms Persistence All terms discovered by the matchers can be stored in the CAS by a consumer, regardless of any property of the term. This means that for the text “lung cancer” the specific disease term “lung cancer” and broader term “cancer”. This can be useful for future searches on general concepts, e.g. searching via the CUI for “cancer” and getting all instances of “cancer” found in texts “lung cancer”, “skin cancer”, “stomach cancer”, etc. Most Precise Term Persistence Matched terms can be stored only by the longest overlapping span discovered for a semantic group. This keeps, for instance, the disease “lung cancer” but not “cancer”. Using semantic groups means that both the disease “lung cancer” and the anatomical site “lung” are persisted even though the spans overlap. When using the overlap matching method, any discontiguous spans are accounted for. So, for “blood, urine test” both the discontiguous spanned term “blood test” and the contiguous spanned term “urine test” are valid. Dictionary Indexing Lookup by Rare Word No matter what text Match algorithm is used, it must run for every Term’s text returned from the dictionary lookup of a Lookup Token. To minimize the number of Terms returned from the dictionary lookup, a Dictionary should be indexed by a Rare Word for each Term. The Rare Word would be a token of proper Lookup Token Part of Speech in the Term Text which has the minimum number of instances in the entire Dictionary. Some examples (‘-‘ denotes a non-word): Term Text Word Counts in Dictionary Texts Rare Word in Index Rare Word Count in Index Medical care vascular complications 11 3 1 1 vascular 1 Medical / dental care 11 - 2 3 dental 1 Medical / dental care treatments and procedures 11 - 2 3 1 - 1 treatments 1 Medical imaging 11 2 imaging 1 Medical imaging , ultrasound 11 2 - 1 ultrasound 1 Medical identification band on 11 3 1 - band 1 Medical identification bracelet on 11 1 2 - bracelet 1 Medical identification bracelet present 11 3 2 1 present 1 Medical procedure on the nervous system 11 3 - - 3 3 procedure 1 Medical procedure on the central nervous system 11 3 - - 1 3 3 central 1 Medical procedure on the peripheral nervous system 11 3 - - 1 3 3 peripheral 1For any Lookup Window and the given Dictionary, any Lookup Token “Medical” would return zero candidate terms and require no Match processing. Only when a Lookup Token exists as a Rare Word in a Term will it exist in the Dictionary index and return one or more candidate Terms upon which to attempt a Match. Sometimes this will be the first Token in a Term, but often it is not. Because a Lookup Token can be in the middle of a Term’s text, the Match must be attempted by checking words both after and before the Lookup Token in the Lookup Window text. Example Annotator Descriptor XML The following Descriptor XML is in the module’s example directory and can be used to perform lookup using a Bar-Separated Value flat file as the dictionary. org.apache.uima.java true org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator The above specifies that the default lookup annotator implementation should be used. The default lookup only uses contiguous text spans. To specify usage of the lookup annotator that allows terms on discontiguous spans, use: org.apache.ctakes.dictionary.lookup2.ae.OverlapJCasTermAnnotator In the previous Dictionary Lookup Module, the type of lookup window and the excluded parts of speech were not specified in the analysis engine descriptor, they were specified in a secondary lookup specification file. For the new module this information is in the analysis engine descriptor. CustomLookupAnnotatorBSV Dictionary Lookup Annotator descriptor for dictionaries which actually lie in a BSV file windowAnnotations Type of window to use for lookup String false true exclusionTags Parts of speech to ignore when considering lookup tokens String false false minimumSpan Minimum required span length of tokens to use for lookup. Default is 3 Integer false false windowAnnotations org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation exclusionTags VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$, RP,TO,WDT,WP,WPS,WRB minimumSpan 3 The above specifies that the lookup window annotation should be a LookupWindowAnnotation, which is supposed to represent noun phrases in the text. A larger, more inclusive lookup window would be a sentence. To specify usage of a sentence as the lookup window, use: windowAnnotations org.apache.ctakes.typesystem.type.textspan.Sentence If the discontiguous span annotator (OverlapJCasTermAnnotator) is used, then two additional configuration parameters are available. These parameters are optional and modify the total number of non-matching tokens that can be skipped and the number of consecutive non-matching tokens that can be skipped. consecutiveSkips Number of consecutive tokens that can be skipped when matching terms. Default is 2 Integer false false totalTokenSkips Total number of tokens that can be skipped when matching terms. Default is 4 Integer false false consecutiveSkips 2 totalTokenSkips 4 What follows is exactly the same as the description used for the previous dictionary lookup module descriptor. org.apache.ctakes.typesystem.type.syntax.BaseToken org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation true true false The resource manager configuration sets up the external resources used by the dictionary lookup module. There must be a url to a dictionary specification descriptor and a url to each dictionary resource. What follows is exactly as it was for the previous lookup module, except that the external resource dependency LookupDescriptor has been replaced by DictionaryDescriptor. DictionaryDescriptor org.apache.ctakes.core.resource.FileResource optional>false CustomBsvDictionary org.apache.ctakes.core.resource.FileResource false DictionaryDescriptorFile file:org/apache/ctakes/dictionary/lookup2/example/bsv/RareWord_BSV.xml org.apache.ctakes.core.resource.FileResourceImpl BsvDictionaryResource file:org/apache/ctakes/dictionary/lookup2/example/bsv/CustomBsvDictionary.bsv org.apache.ctakes.core.resource.FileResourceImpl DictionaryDescriptor DictionaryDescriptorFile CustomBsvDictionary BsvDictionaryResource Example Dictionary Specification Descriptor XML The following Descriptor XML is in the module’s resource example directory and can be used to perform lookup using a Bar-Separated Value flat file as the dictionary. To only persist only the longest overlapping term of any given span for each semantic group, use the PrecisionTermConsumer.