Contents

- Disclaimer
- Introduction
- Overview of PAD term spotter pipeline
- Prerequisite steps
- Getting Started (test sample run)
- Verify Results (test sample run)
- Collection reader (input vehicle for pipeline)
  - RadiologyRecordsCollectionReader.xml
- Analysis engines (Text Analysis Engine <> annotators)
  - DxStatusAnnotator.xml
  - NegationDxAnnotator.xml
  - PAD_Hit.xml
  - Radiology_DictionaryLookupCSVAnnotator.xml
  - Radiology_TermSpotterAnnotatorTAE.xml
  - Radiology_TermSpotterAnnotatorTAEStyleMap.xml
  - SubSectionBoundaryAnnotator.xml
- CAS consumers (post consumption processor)
  - PADOffSetsRecord.xml
- Debugging issues

##########
Disclaimer
##########

This should be considered a beta release of this annotator. See the 'Clinical Text Analysis and Knowledge Extraction System User Guide', located at '/docs/userguide/cTAKES_userguide.htm' in cTAKES, for detailed install and setup information pertaining to all the cTAKES components.

############
Introduction
############

This is version 1.0 of the cTAKES PAD term spotter annotator. The pipeline has been tested on flat text files, and the sample provided uses a tilde (`) delimited file with multiple records (and patients) per file. The dictionary lookup annotator is limited to the PAD term dictionaries provided in the '/resources/lookup/radiology/pad_*.csv' files. Specific details can be found in the 'NLP_OF_RADIOLOGY_REPORTSv1.pdf' file.

#####################################
Overview of PAD term spotter pipeline
#####################################

The 'PAD term spotter' pipeline processes text extracted from radiology notes that pertains to the diagnosis, treatment, etc. of lower limb Peripheral Artery Disease (PAD) (e.g. stenosis/occlusion paired with popliteal/femoral). The main feature is classifying each document for the presence of PAD. Descriptive diagnosis and illness terms are paired with site terms to build a relational tie, indicating a 'hit'.
The pipeline assesses the presence of phrases indicative of peripheral arterial disease (PAD) in one or more sentences contained in radiology related documents. There are four possible classifications provided for PAD:

- positive numeric value (e.g. "3") => 'yes'
- "0" => 'no'
- "prob" => 'probable'
- "-1" => 'unknown'

There are two levels of classification, namely document level and patient level. The precedence for classifying the patient level is PAD-Yes > PAD-prob > PAD-no > unknown (combined category).

The algorithm for classifying at the document level is as follows:

1 - If there is no related exam type OR no positive or negative evidence, classify as UNKNOWN;
2 - If there is positive evidence, classify as POS;
3 - If there is explicit evidence of negation of positive evidence OR if a lower solo extremity exam with the discovery of no stenosis, probable or negated related PAD term exists, classify as NEG;
4 - If there is no positive or negative evidence OR if there is negation of positive evidence in an Ultrasound or Vascular Interventional Radiology report, classify as UNKNOWN;
5 - Otherwise, classify as PROB based on severity modifier.

See the 'NLP_OF_RADIOLOGY_REPORTSv1.pdf' for a more detailed discussion.

##################
Prerequisite steps
##################

- Install all the cTAKES components and dependencies (e.g. JVM, UIMA, and Eclipse (optional)).
- Make sure the ctakes-clinical-pipeline of cTAKES and Apache UIMA are set up correctly by running the CAS Visual Debugger (CVD) as described in the 'Clinical Text Analysis and Knowledge Extraction System User Guide', section 1.2, 'Install from PEAR packages'.
- Press "Run your AE in the CAS Visual Debugger" in the "Local PEAR Installation, Verification and Testing" panel after installing the pipeline via the same process described for the other components.
- Enter related text in the 'Text' panel and select 'Run PAD term spotter' from the 'Run' pull-down menu (Ctrl+R).
You should now be able to navigate in the upper left 'Analysis Results' panel by browsing under the CAS Index Repository via the AnnotationIndex tab.

#################################
Getting Started (test sample run)
#################################

Reference the '/PAD term spotter/doc/PAD_Overview.html' and '/PAD term spotter/doc/NLP OF RADIOLOGY REPORTSv1.pdf' for additional documentation for this pipeline.

This pipeline utilizes three main flows inherent in the UIMA infrastructure to parse, analyze and summarize the PAD classification of the radiology notes. The Collection Processing Engine (CPE) is responsible for putting the different flows together to create, modify, and deploy the system.

Step 0: From a command prompt, set the UIMA_HOME variable to the location where you installed the Apache UIMA API:

Windows: 'set UIMA_HOME=<>'
Linux: 'export UIMA_HOME=<>'

Step 1: From the directory that contains the cTAKES projects (the class path entries below are relative to it), run the command for your platform to bring up the CPE environment.

Windows:

java -cp ^
"%UIMA_HOME%/lib/uima-core.jar;^
%UIMA_HOME%/lib/uima-cpe.jar;^
%UIMA_HOME%/lib/uima-tools.jar;^
%UIMA_HOME%/lib/uima-document-annotations.jar;^
%UIMA_HOME%/lib/uima-examples.jar;^
chunker/bin;^
ctakes-clinical-pipeline/bin;^
context dependent tokenizer/bin;^
core/bin;^
dictionary lookup/bin;^
document preprocessor/bin;^
LVG/bin;^
NE contexts/bin;^
POS tagger/bin;^
PAD term spotter/bin;^
core/lib/log4j-1.2.8.jar;^
core/lib/jdom.jar;^
core/lib/lucene-1.4-final.jar;^
core/lib/opennlp-tools-1.4.3.jar;^
core/lib/maxent-2.5.0.jar;^
core/lib/OpenAI_FSM.jar;^
core/lib/trove.jar;^
LVG/lib/lvg2008dist.jar;^
document preprocessor/lib/xercesImpl.jar;^
document preprocessor/lib/xml-apis.jar;^
document preprocessor/lib/xmlParserAPIs.jar;^
chunker/resources;^
ctakes-clinical-pipeline/resources;^
context dependent tokenizer/resources;^
core/resources;^
dictionary lookup/resources;^
document preprocessor/resources;^
LVG/resources;^
PAD term spotter/resources;^
POS tagger/resources;^
NE contexts/resources" org.apache.uima.tools.cpm.CpmFrame

Linux:

java -cp \
$UIMA_HOME/lib/uima-core.jar:\
$UIMA_HOME/lib/uima-cpe.jar:\
$UIMA_HOME/lib/uima-tools.jar:\
$UIMA_HOME/lib/uima-examples.jar:\
$UIMA_HOME/lib/uima-document-annotations.jar:\
chunker/bin:\
clinical\ documents\ pipeline/bin:\
context\ dependent\ tokenizer/bin:\
core/bin:\
dictionary\ lookup/bin:\
document\ preprocessor/bin:\
LVG/bin:\
NE\ contexts/bin:\
POS\ tagger/bin:\
PAD\ term\ spotter/bin:\
core/lib/log4j-1.2.8.jar:\
core/lib/jdom.jar:\
core/lib/lucene-1.4-final.jar:\
core/lib/opennlp-tools-1.4.3.jar:\
core/lib/maxent-2.5.0.jar:\
core/lib/OpenAI_FSM.jar:\
core/lib/trove.jar:\
LVG/lib/lvg2008dist.jar:\
document\ preprocessor/lib/xercesImpl.jar:\
document\ preprocessor/lib/xml-apis.jar:\
document\ preprocessor/lib/xmlParserAPIs.jar:\
chunker/resources:\
clinical\ documents\ pipeline/resources:\
context\ dependent\ tokenizer/resources:\
core/resources:\
dictionary\ lookup/resources:\
document\ preprocessor/resources:\
LVG/resources:\
NE\ contexts/resources:\
PAD\ term\ spotter/resources:\
POS\ tagger/resources \
org.apache.uima.tools.cpm.CpmFrame

Step 2: From the pull-down menu, open 'Radiology_sample.xml' via 'File' => 'Open CPE Descriptor' => 'PAD term spotter/desc/collection_processing_engine/Radiology_sample.xml'. This loads the sample Collection Processing Engine (CPE) 'Radiology_sample.xml'. The sample CPE is provided along with de-identified patient information as a means to test the pipeline after you have set up the environment. Note that '{inst-root-dir}/data/SampleInputRadiologyNotes.txt' contains 6 sample radiology notes representing two different patients.

Step 3: To run this sample, replace '{inst-root-dir}' in the existing paths provided in the 'Input File Name:' and 'Filter Exam Types:' fields with the path where you installed the cTAKES projects or the Eclipse workspace.
Step 4: Additionally, in the 'Output File Name' field, specify the path and file name for the record-level radiology output on your system ('{inst-root-dir}/PAD term spotter/output/Sample_PAD_record_level.txt' => outfile path and name).

#################################
Verify Results (test sample run)
#################################

After running 'Step 4' from the section above there should be two files in the output directory specified in 'Output File Name': the file you named (e.g. 'sample_pad_run.txt') and a summary results file called 'Summary_PADRadiology.csv'. The output file you specified will contain 6 rows of data, one for each line of data in '{inst-root-dir}/data/SampleInputRadiologyNotes.txt', as follows:

54321|prob|(419-433:419-439)(513-520:521-536)(-1:506-536)(513-520:445-453)|CT_EXAM|aortobifemoral|aortobifemoral graft|CT_EXAM|femoral|crossover graft|CT_EXAM| |Patent femoral crossover graft|CT_EXAM|femoral|occluded
54321|3|(66-83:49-52)(128-144:101-122)(160-175:145-154)(218-233:322-330)(128-144:145-154)(101-114:101-122)|US_LOWER_EXT|femoral-popliteal|ASO|US_LOWER_EXT|posterior tibial|tibial artery disease|US_LOWER_EXT|anterior tibial|occlusion|US_LOWER_EXT|peroneal artery|stenosis|US_LOWER_EXT|posterior tibial|occlusion|US_LOWER_EXT|tibial artery|tibial artery disease
54321|-1|
12345|-1|(-1:298-314)(-1:298-307)|SIMPLE_SEGMENT|**NO TERM**|popliteal artery|SIMPLE_SEGMENT|**NO TERM**|popliteal
12345|-1|
12345|-1|

Each field is delimited using the vertical bar (|). The first and second columns show the patient number and PAD classification, respectively. The third column provides offset information for the terms discovered in the text. The remaining columns are the elements that made up a potential relational tie, which are typically the section where the terms were discovered, the site term, and the disorder term, respectively.
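The record layout described above can be read back with a few lines of script, which is handy when checking a run by eye. The following is a minimal Python sketch, not part of the shipped pipeline: the function names, the label strings, and the grouping of the trailing columns into triples are assumptions made for illustration, and the sample rows are abbreviated from the output listed above. The roll-up uses the patient-level precedence stated in the overview (PAD-Yes > PAD-prob > PAD-no > unknown).

```python
import re

# Offset pairs look like "(419-433:419-439)"; "-1" marks a missing term.
OFFSET_RE = re.compile(r"\(([^:()]+):([^:()]+)\)")

# Patient-level precedence from the overview, highest first.
# The label strings themselves are assumptions for this sketch.
PRECEDENCE = ["yes", "probable", "no", "unknown"]


def label_for(code):
    """Map the second output column to a classification label."""
    code = code.strip()
    if code == "prob":
        return "probable"
    if code == "0":
        return "no"
    if code == "-1":
        return "unknown"
    if code.isdigit() and int(code) > 0:
        return "yes"
    raise ValueError("unexpected PAD classification: %r" % code)


def parse_record(line):
    """Split one record-level output row into its parts."""
    fields = line.rstrip("\n").split("|")
    offsets = OFFSET_RE.findall(fields[2]) if len(fields) > 2 else []
    rest = fields[3:]
    usable = len(rest) - len(rest) % 3
    # Remaining columns are read as (section, site, disorder) triples,
    # following the "typically" wording of the description above.
    ties = [tuple(rest[i:i + 3]) for i in range(0, usable, 3)]
    return {
        "patient_id": fields[0],
        "label": label_for(fields[1]),
        "offsets": offsets,
        "ties": ties,
    }


def patient_level(records):
    """Keep the highest-precedence label seen for each patient."""
    best = {}
    for rec in records:
        pid, label = rec["patient_id"], rec["label"]
        if pid not in best or PRECEDENCE.index(label) < PRECEDENCE.index(best[pid]):
            best[pid] = label
    return best


# Rows abbreviated from the sample output (first two rows shortened).
SAMPLE_ROWS = [
    "54321|prob|(419-433:419-439)|CT_EXAM|aortobifemoral|aortobifemoral graft",
    "54321|3|(66-83:49-52)|US_LOWER_EXT|femoral-popliteal|ASO",
    "54321|-1|",
    "12345|-1|(-1:298-314)(-1:298-307)|SIMPLE_SEGMENT|**NO TERM**|popliteal artery|SIMPLE_SEGMENT|**NO TERM**|popliteal",
    "12345|-1|",
    "12345|-1|",
]

records = [parse_record(row) for row in SAMPLE_ROWS]
summary = patient_level(records)
for pid, label in summary.items():
    print(pid, label)
```

Comparing the printed patient-level labels with 'Summary_PADRadiology.csv' gives a quick consistency check on a run; the actual roll-up is performed by the CAS consumer and may differ in detail.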
Note that there is not necessarily a direct correlation between the number of terms found and the classification represented. Factors such as negation, context words and other features play a role in how the 'hits' are interpreted. See the 'NLP_OF_RADIOLOGY_REPORTSv1.pdf' for a more detailed discussion of how these factors are determined.

The 'Summary_PADRadiology.csv' should contain the following:

clinic,case_type
54321,yes
12345,unknown

##############################################
Collection reader (input vehicle for pipeline)
##############################################

Radiology notes can be input into the system as a simple flat file containing one or more radiology notes, or in a wrapped/encapsulated CDA or similar clinical note structure (see the ctakes-clinical-pipeline for examples and direction for writing Collection Readers and/or CAS Initializers for this purpose). This project includes one specialized input collector, 'RadiologyRecordsCollectionReader.xml', located under 'desc/collection_reader'.
Parameters:

Input File Name (Required): the name and path of the file that contains the records making up the radiology notes being processed
Language (Optional): explicitly sets a language if used
Comment String (Optional): filters/skips lines that begin with this case-sensitive literal string
Ignore Blank Lines (Optional): prevents blank rows, which may interrupt subsequent processes, from being processed
Id Delimiter (Optional): specifies the character used to delimit the identification column (first column) or all fields, depending on whether values are specified for the remaining fields in this panel
Column Count (Optional): indicates the number of columns, delimited using the value in 'Id Delimiter', that should be skipped to locate the actual contents of the radiology examination
Filter Exam Types (Optional): provides a list of valid examination codes that acts as a filter, eliminating the need to parse records not related to PAD
Filter Exam Column Number (Optional): the column of the radiology record, delimited using the value in 'Id Delimiter', whose value is compared against the 'Filter Exam Types' provided above

#############################
Analysis engines (annotators)
#############################

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Radiology_TermSpotterAnnotatorTAE.xml

This annotator assesses the presence of phrases indicative of peripheral arterial disease (PAD) in one or more sentences contained in radiology related documents.
- The file "desc/analysis_engine/Radiology_TermSpotterAnnotatorTAE.xml" provides a working example of the PAD term spotter pipeline using the aggregated TAEs:
  o SimpleSegmentAnnotator (see core project)
  o TokenizerAnnotator (see core project)
  o SentenceDetectorAnnotator (see core project)
  o SubSectionBoundaryAnnotator (this project - see below)
  o ContextDependentTokenizerAnnotator (see context dependent tokenizer)
  o POS Tagger (see POS Tagger project)
  o Chunker (see Chunker project)
  o Radiology_DictionaryLookupCSVAnnotator (this project - see below)
  o NegationAnnotator (see NE contexts)
  o PAD_Hit (this project - see below)
  o DxStatusAnnotator (this project - see below)
  o NegationDxAnnotator (this project - see below)

*Parameters*::
ChunkerCreatorClass;; the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator. See the Chunker project for information pertaining to the Chunker analysis engine.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
SubSectionBoundaryAnnotator.xml

The file 'desc/analysis_engine/SubSectionBoundaryAnnotator.xml' drives the java class 'org.apache.ctakes.padtermspotter.ae.SubSectionAnnotator', whose primary task is to identify sections within the text that indicate paragraphs, subsections, and groups of text that should be handled as one entity. The file resource 'lookup/radiology/ExamTitleWords.txt' provides the site specific terms used to indicate the start of subsections. A subsection spans until either another subsection beginning is discovered or the end of the document is reached.

[NOTE]
===============================================================================================================
The class 'core/src/edu/mayo/bmi/fsm/machine/SubSectionPadIdFSM' identifies section headers and classifies them as CONFIRMED_STATUS, NEGATED_STATUS, or PROBABLE_STATUS.
However, only 'NEGATED_STATUS' is implemented in the logic ('PAD term spotter/src/main/java/org/apache/ctakes/pad/impl/PADConsumerImpl.java' will not add 'hits' in sections tagged with this status).
================================================================================================================

[NOTE]
================================================================================================================
- The following classes and files contain Mayo specific site and terminology terms, especially as they pertain to the subsection handling:

1) 'PAD term spotter/src/edu/mayo/bmi/fsm/machine/SubSectionPadIdFSM.java'
- terms; "smh","rmh","gonda","romayo" are names of buildings on the Mayo campus which are used to mark subsection begin/end
- terms; "indications","bleindications","exam","showing" are special terms which often contain the terms being screened for relating to PAD, but since they are titles of examinations, revision sections, and generalized screenings they are to be ignored in the Mayo cohort.

2) 'PAD term spotter/src/main/java/org/apache/ctakes/pad/impl/PADConsumerImpl.java'
- terms; "indications:" and "showing" are special terms which often contain the terms being screened for relating to PAD, but since they are titles of examinations, revision sections, and generalized screenings they are to be ignored in the Mayo cohort.
- "maxSubsectionSize" is used to limit the overall scope of where the subsection tokens will be searched. It has been hard-coded to 300 in the shipped class.

3) 'PAD term spotter/resources/lookup/radiology/ExamTitleWords.txt'
- comma delimited terms which represent key values to distinguish the type of radiology examination being utilized: US_EXAM (ultrasound), LOWER_EXT (lower extremity), US_LOWER_EXT (ultrasound lower extremity), US_LOWER_SOLO (ultrasound lower extremity one side only), CT_EXAM (CAT scan), CT_EXAM_SOLO (CAT scan one side only).
4) 'PAD term spotter/resources/lookup/radiology/ExamsForPAD.csv'
- provides a list of valid examination codes that acts as a filter, eliminating the need to parse records not related to PAD.
=======================================================================

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Radiology_DictionaryLookupCSVAnnotator.xml

The file 'desc/analysis_engine/Radiology_DictionaryLookupCSVAnnotator.xml' drives the java class 'org.apache.ctakes.dictionary.lookup.ae.DictionaryLookupAnnotator', which resides in the 'dictionary lookup' project. It is set up to access the '/resources/lookup/radiology/LookupDesc_PAD.xml' file, which specifies what resources will be accessed; in this case these are the PAD term dictionaries. Specifically, the files 'PAD term spotter\resources\lookup\radiology\pad_anatomical_sites.csv' and 'PAD term spotter\resources\lookup\radiology\pad_disorders.csv' are loaded into memory.

The terms being looked up have been added to the dictionaries, e.g. 'PAD term spotter\resources\lookup\radiology\pad_anatomical_sites.csv', using the following format:

"first-word|first-word plus terms|value"

'value' indicates how the term should be used when paired with the relational term, a context value, or some other special case, which is discussed in more detail in the 'doc/NLP OF RADIOLOGY REPORTSv1.txt'.

For example, the entry "common|common iliac artery|6" would be a valid entry in the 'pad_anatomical_sites.csv' file. This entry would be used to pair up with terms contained in the sister term dictionary, the 'pad_disorders.csv' file.
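To make the entry format concrete, the following minimal Python sketch parses entries of the form "first-word|first-word plus terms|value" and indexes them by first word. It is only an illustration of the file layout and is not how the dictionary lookup annotator itself loads the files; the function names are hypothetical, and the second sample entry is made up for the example (only "common|common iliac artery|6" comes from the text above).

```python
from collections import defaultdict


def parse_entry(line):
    """Split one dictionary line into (first_word, full_term, value)."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        raise ValueError("expected 3 '|'-separated fields: %r" % line)
    first_word, full_term, value = parts
    # Prefix check rather than exact first-token equality, because the
    # tokenization of hyphenated first words can vary.
    if not full_term.lower().startswith(first_word.lower()):
        raise ValueError("term %r does not start with %r" % (full_term, first_word))
    return first_word, full_term, value


def build_index(lines):
    """Group (full_term, value) pairs under their lower-cased first word."""
    index = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        first_word, full_term, value = parse_entry(line)
        index[first_word.lower()].append((full_term, value))
    return dict(index)


entries = [
    "common|common iliac artery|6",    # example entry from the text above
    "common|common femoral artery|6",  # hypothetical entry, for illustration only
]
index = build_index(entries)
print(index["common"])
```

Keying the index on the first word mirrors the purpose of the first column: a lookup only needs to consider the full terms that share the token currently being examined.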
Keep in mind that the 'first-word' term can be hyphenated, but that term must exist in the "core\resources\tokenizer\hyphenated.txt" file contained in the 'core' project.

[Note]
=================================================================================
If the hyphenated version didn't exist then the above example should be expressed as "first|first-word plus terms|value"
==================================================================================

%%%%%%%%%%%
PAD_Hit.xml

The file 'desc/analysis_engine/PAD_Hit.xml' drives the java class 'org.apache.ctakes.padtermspotter.ae.PADHitAnnotator', which operates on the terms discovered by the dictionary annotator.

Params: WINDOW_SIZE, ANNOTATION_TYPE, ANNOTATION_PART_ONE_OF_PAIR, ANNOTATION_PART_TWO_OF_PAIR, may be others

a) If the annotations of type part_one and part_two fall in the window_size of type annotation_type, it is considered a hit.
b) If the annotation is defined as stand alone, then it does not need to be part of a pair to be considered a hit.

%%%%%%%%%%%%%%%%%%%%%
DxStatusAnnotator.xml

This is similar to the NE context (see 'Clinical Text Analysis and Knowledge Extraction System User Guide', section 4.8, 'NE contexts'). Context terms relating to disease and illness are identified via 'PAD term spotter/src/edu/mayo/bmi/fsm/pad/machine/DxIndicatorFSM.java'.

%%%%%%%%%%%%%%%%%%%%%%%
NegationDxAnnotator.xml

This is similar to the NE context (see 'Clinical Text Analysis and Knowledge Extraction System User Guide', section 4.8, 'NE contexts'). Negation terms relating to disease and illness are identified via 'PAD term spotter/src/edu/mayo/bmi/fsm/pad/machine/NegDxIndicatorFSM.java'.
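Returning to the pairing rule given for PAD_Hit.xml (items a and b above), the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of PADHitAnnotator: it assumes the windows are spans of the ANNOTATION_TYPE annotation (for example sentences), that WINDOW_SIZE counts how many consecutive windows a pair may span, and that part one and part two are site and disorder terms. The names Term, window_index and find_hits, and all offsets, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    kind: str                 # "site" or "disorder"
    begin: int
    end: int
    text: str
    standalone: bool = False  # True if the term counts as a hit on its own


def window_index(windows, term):
    """Return the index of the window span that contains the term, or None."""
    for i, (begin, end) in enumerate(windows):
        if begin <= term.begin and term.end <= end:
            return i
    return None


def find_hits(terms, windows, window_size=1):
    """Pair sites with disorders that lie within window_size windows of each other."""
    hits = []
    sites = [t for t in terms if t.kind == "site"]
    disorders = [t for t in terms if t.kind == "disorder"]
    for site in sites:
        site_idx = window_index(windows, site)
        for disorder in disorders:
            dis_idx = window_index(windows, disorder)
            if site_idx is None or dis_idx is None:
                continue
            if abs(site_idx - dis_idx) < window_size:
                hits.append((site, disorder))
    # Rule b: stand-alone terms are hits without a partner.
    for term in terms:
        if term.standalone:
            hits.append((term, None))
    return hits


# Two hypothetical sentence windows and three hypothetical terms.
windows = [(0, 39), (40, 80)]
terms = [
    Term("disorder", 0, 9, "Occlusion"),
    Term("site", 22, 38, "popliteal artery"),
    Term("site", 50, 57, "femoral"),
]
hits = find_hits(terms, windows, window_size=1)
for site, disorder in hits:
    print(site.text, "<->", disorder.text if disorder else None)
```

With a window size of 1 only terms inside the same window are paired; widening the window lets a site in one sentence pair with a disorder in a neighbouring one, which trades precision for recall.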
##########################################
CAS consumers (post consumption processor)
##########################################

The CAS consumer in '/desc/cas_consumer/PADOffSetsRecord.xml' is provided as a means to post process the results and offers the following features:

1 - Record by record level classification for PAD
2 - Site and disorder terms along with offset information (useful for debugging)
3 - Overall patient level classification based on record classification

Parameters:

Output File Name (required): specifies the location of the detail and summary report.
UsingAlternateAlgorithm (false by default): Boolean value which indicates whether an alternate algorithm should be used for the post processing (PAD>NEG>POS>UNK)

Record by record level classification for PAD:

The private method 'calculateRecordLevelClassification' is responsible for providing the PAD classification given the following factors:

* Type of examination: US_EXAM (ultrasound), LOWER_EXT (lower extremity), US_LOWER_EXT (ultrasound lower extremity), US_LOWER_SOLO (ultrasound lower extremity one side only), CT_EXAM (CAT scan), CT_EXAM_SOLO (CAT scan one side only)
* Number of hits and, if applicable, the difference if both exist ("numberOfHits=total number of confirmed positive PAD hits" and "differentialHitCount=total positive PAD minus total negative PAD hits or, if no PAD hits, the number of terms only plus number of sites only found")
* Number of term only mentions, site only mentions, probable evidence, mentions of vein related terms, and mentions of stent related terms ("numberOfTermOnly=terms found outside of hit", "numberOfLocationOnly=sites found outside of hit", "numberOfProbable=count probable elements", "numberVeinMentions=count of vein type terms", and "noMentionOfStents=count of stent type terms")
* Whether there is a mention of veins or stenosis related terms ("noMentionOfVeins=void of mention of vein type terms", "locationTermsOnly=location terms only", and "noMentionOfStenosis=void of stenosis type terms")
* Whether there is evidence of probable, negation and possible negation ("foundEvidenceOfProbable=at least one probable element found", "haveNegativeCases=at least one negative term or site found", "possibleNegativeCases=negation is found, but not adjacent to stenosis, patent, or a site excluded term")
* noPadMentionThreshold is a somewhat arbitrary value used to measure a relative weight of evidence against other counts.

Algorithm for 'calculateRecordLevelClassification': see the document level algorithm in the 'Overview of PAD term spotter pipeline' section.

Site and disorder terms along with offset information (useful for debugging):

Provides offset information for the 'hits' found related to PAD, which is helpful when implementing evaluation tools that highlight the actual text 'hits'.

o If no matching term is found then it is represented as a '-1'

For example, the following indicates a site term was found with no corresponding disorder:

(-1:325-344)|SIMPLE_SEGMENT|**NO TERM**|right thigh and leg

Overall patient level classification based on record classification:

Since each patient will typically have several radiology notes processed by the system, conflicts between two or more record level classifications need to be resolved by a precedence order. A post processing step is provided that shows how the classifier combines several record level classifications into one patient level classification. The set precedence is PAD-Yes > PAD-prob > PAD-no/unknown (combined category).

################
Debugging issues
################

Description: CPE gives 'null pointer exception'.
Issue: AWT and browser components become out of scope or corrupted.
Workaround: Use the CPE 'File' => 'Clear All' option and reload the respective annotators and values in all three panels.

Description: Get "CASRuntimeException: JCas type "org.apache.uima.examples.SourceDocumentInformation" used in Java code, but was not declared in the XML type descriptor." type of error when deploying from CPE.
Issue: Accessing type systems, annotators and/or resources outside of the project class path.
Workaround: Make sure the CPE is started with the 'bin' and 'resources' directories of the corresponding projects specified in the class path.

Description: A disproportionate number of records are being classified as unknown by the system.
Issue: Since the unknown classification is determined first in the algorithm (If there is no related exam type OR no positive or negative evidence, classify as UNKNOWN), there needs to be a thorough representation of the examination types that are valid.
Workaround 1: Populate 'resources/lookup/radiology/ExamTitleWords.txt' with pertinent section/subsection header information for your radiology clinical domain.
Workaround 2: From the CPE panel, check 'UsingAlternateAlgorithm' to have the post processing logic consider other classifications prior to the unknown classification.