Contents - Disclaimer - Introduction - Overview of spoking status pipeline - Prerequisite steps - Getting Started (test sample run) - Verify Results (test sample run) - How to use the trained libSVM model - Description of shipped resources - Debugging issues - Release Notes ########## Disclaimer ########## This should be considered a beta release of this annotator. See the 'Clinical Text Analysis and Knowledge Extraction System User Guide' documentation located in the cTAKES '/docs/userguide/cTAKES_userguide.htm' for detailed install and setup information pertaining to all the cTAKES components. ############ Introduction ############ This is version 1.2 of the cTAKES smoking status annotator. The pipeline has been tested on flat text files and CDA documents. The sample provided uses a bar (|) delimited file with multiple records (and patients) per file. The dictionary lookup annotator is limited to smoking status dictionaries provided in the '/resources/ss/data/*.dictionary' files. Specific details can be found in the 'smokingStatus_AMIA2009.pdf' and the 'smokingStatus_Overview.html' files. ##################################### Overview of spoking status pipeline ##################################### The smoking status pipeline processes patient records into five pre-determined categories - past smoker (P), current smoker (C), smoker (S), non-smoker (N), and unknown (U). The definition of smoking status was adapted from I2B2 Natural Language Processing Challenges for Clinical Records [1]: " PAST SMOKER (P): A patient whose record asserts either that they are a past smoker or that they were a smoker a year or more ago but who have not smoked for at least one year. " CURRENT SMOKER (C): A patient whose record asserts that they are a current smoker (or that they smoked without indicating that they stopped more than a year ago) or that they were a smoker within the past year. " SMOKER (S): A patient who is either a CURRENT or a PAST smoker but, whose medical record does not provide enough information to classify the patient as either a CURRENT or a PAST smoker. " NON-SMOKER (N): A patient whose record indicates that they have never smoked. " UNKNOWN (U): The patient's record does not mention anything about smoking. The algorithm for classifying at the document level is as follows: If (exist any sentence classified as C) Label that doc as C Else If (exist a sentence classified as P) Label that doc as P Else If (exist a sentence classified as S) Label that doc as S Else If (exist a sentence classified as N) Label that doc as N Else (i.e., all sentences are classified as U) Label that doc as U The algorithm for classifying at the patient level is as follows: If (exist a doc classified as C or P) If (exist C but no P) Label that patient as C Else If (exist P but no C) Label that patient as P Else (i.e., exist both C and P) If (freq of C >= freq of P) Label that patient as C Else Label that patient as P Else If (exist a doc classified as S) Label that patient as S Else If (exist a doc classified as N) Label that patient as N Else (i.e., all docs are classified as U) Label that patient as U See the 'smokingStatus_AMIA2009.pdf' for a more detailed discussion. ################## Prerequisite steps ################## - Install all the cTAKES components, dependencies (e.g. JVM, UIMA, and Eclipse(optional)). - Make sure the clinical documents pipeline of cTAKES and the Apache UIMA are setup correctly by running the CAS Visual Debugger (CVD) as described in the 'Clinical Text Analysis and Knowledge Extraction System User Guide' 1.2. Install from PEAR packages. - Install the 'smoking status.pear' after completing the installation of the core and related pears, as per above. NOTE: Do not utilize the 'Run your AE in the CAS Visual Debugger'(CVD) step as in other cTAKES module installations. Instead use the 'Getting Started' section below for verification purposes. The AE infrastructure of the 'SimulatedProdSmokingAE' indirectly utilizes other AEs not directly in the flow, therefore, the results of running the CVD from the pear installer are very unpredictable. ################################# Getting Started (test sample run) ################################# Reference the '/smoking status/doc/smokingStatus_Overview.html' and '/smoking status/doc/smokingStatus_AMIA2009.pdf.pdf' for additional documentation for this pipeline. This pipeline utilizes three main flows inherent in the UIMA infrastructure to parse, analyze and summarize the radiology notes PAD classification. The Component Processing Engine (CPE) responsible for putting the different flows together to create, modify, and deploy the system. Step 0: From command prompt set the UIMA_HOME class path to the location where you installed the Apache UIMA api: 'set UIMA_HOME=<>' 'export UIMA_HOME=<>' Step 1: From the in a terminal or command window run the following command to bring up the CPE environment: java -cp ^ "%UIMA_HOME%/lib/uima-core.jar;^ %UIMA_HOME%/lib/uima-cpe.jar;^ %UIMA_HOME%/lib/uima-tools.jar;^ %UIMA_HOME%/lib/uima-document-annotation.jar;^ %UIMA_HOME%/lib/uima-examples.jar;^ smoking status/bin;^ chunker/bin;^ clinical documents pipeline/bin;^ context dependent tokenizer/bin;^ core/bin;^ dictionary lookup/bin;^ document preprocessor/bin;^ LVG/bin;^ NE contexts/bin;^ POS tagger/bin;^ core/lib/log4j-1.2.8.jar;^ core/lib/jdom.jar;^ core/lib/lucene-1.4-final.jar;^ core/lib/opennlp-tools-1.4.3.jar;^ core/lib/maxent-2.5.0.jar;^ core/lib/OpenAI_FSM.jar;^ core/lib/trove.jar;^ smoking status/lib/libsvm-2.91.jar;^ smoking status/lib/lucene-core-3.0.2.jar;^ LVG/lib/lvg2008dist.jar;^ document preprocessor/lib/xercesImpl.jar;^ document preprocessor/lib/xml-apis.jar;^ document preprocessor/lib/xmlParserAPIs.jar;^ chunker/resources;^ clinical documents pipeline/resources;^ context dependent tokenizer/resources;^ core/resources;^ dictionary lookup/resources;^ document preprocessor/resources;^ LVG/resources;^ smoking status/resources;^ POS tagger/resources;^ NE contexts/resources" org.apache.uima.tools.cpm.CpmFrame java -cp \ $UIMA_HOME/lib/uima-core.jar:\ $UIMA_HOME/lib/uima-cpe.jar:\ $UIMA_HOME/lib/uima-tools.jar:\ $UIMA_HOME/lib/uima-examples.jar:\ $UIMA_HOME/lib/uima-document-annotation.jar:\ smoking\ status/bin:\ chunker/bin:\ clinical\ documents\ pipeline/bin:\ context\ dependent\ tokenizer/bin:\ core/bin:\ dictionary\ lookup/bin:\ document\ preprocessor/bin:\ LVG/bin:\ NE\ contexts/bin:\ POS\ tagger/bin:\ core/lib/log4j-1.2.8.jar:\ core/lib/jdom.jar:\ core/lib/lucene-1.4-final.jar:\ core/lib/opennlp-tools-1.4.3.jar:\ core/lib/maxent-2.5.0.jar:\ core/lib/OpenAI_FSM.jar:\ core/lib/trove.jar:\ LVG/lib/lvg2008dist.jar:\ smoking\ status/lib/libsvm-2.91.jar:\ smoking\ status/lib/lucene-core-3.0.2.jar:\ document\ preprocessor/lib/xercesImpl.jar:\ document\ preprocessor/lib/xml-apis.jar:\ document\ preprocessor/lib/xmlParserAPIs.jar:\ chunker/resources:\ clinical\ documents\ pipeline/resources:\ context\ dependent\ tokenizer/resources:\ core/resources:\ dictionary\ lookup/resources:\ document\ preprocessor/resources:\ LVG/resources:\ NE\ contexts/resources:\ smoking\ status/resources:\ POS\ tagger/resources \ org.apache.uima.tools.cpm.CpmFrame Step 2: From the pull down menu open the “Sample_SmokingStatus_output_flatfile.xml” via 'File' => 'Open CPE Descriptor' => 'smoking status/desc/collection_processing _engine/Sample_SmokingStatus_output_flatfile.xml'. This will load the sample Collection Processing Engine (CPE) 'Sample_SmokingStatus_output_flatfile.xml'. A sample CPE has been provided along with de-identified patient information to provide a means to test the pipeline after you have set up the environment. FYI. '{inst-root-dir}/data/test' contains 5 sample notes representing one patients multiple visits to a health service provider. Step 3: In order to run this sample you will need to specify the Eclipse workspace in place of the '{inst-root-dir}' () or the path where you installed 'smoking status' project. Step 4: Additionally, within the 'Output File:' field specify the path and file name to write the record level smoking status output on your system ('{inst-root-dir}/smoking status/data/test/output/record_resolution.txt' => outfile path and name ) or the path where you installed will be used by default. Step 5: By taking defaults for the rest of the values and, specifically, making sure the 'Run Patient Level Classification:' box is checked, an output file called 'record_resolution.txt_patientLevel.txt' ('{inst-root-dir}/ 'Clear All' option and reload all the three panels respective annotators and values Description: Get "CASRuntimeException: JCas type "org.apache.uima.examples.SourceDocumentInformation" used in Java code, but was not declared in the XML type descriptor." type of error when deploying from CPE. Issue: Accessing typesytems, annotators and/or resources outside of the project class path. Workaround: Make sure the CPE is kicked off with the corresponding projects 'bin' and 'resources' specified in the class path ############# Release Notes ############# One of the next releases will look into addressing the following: - class 'edu.mayo.bmi.fsm.smoking.machine.NegationFSM.java' must inherit from 'core/edu.mayo.bmi.fsm.machine.NegationFSM.java' - edu.mayo.bmi.smoking.context.negation.NegationContextAnalyzer - needs to use 'NE Contexts/edu.mayo.bmi.uima.context.negation.NegationContextAnalyzer' - ResolutionAnnotator.java references 'TypeSystemConst.NE_CERTAINTY_NEGATED' which has been commented out, uncomment in both places for next major release - FIXED -Parameters probably need to be changed? /smoking status/desc/analysis_engine/ClassifiableEntriesAnnotator.xml - FIXED - CdaCasInitializer.xml must be used from 'document preprocessor' - FIXED - SimulatedProdSmokingTAE.xml -> ExternalBaseAggregateTAE.xml -> ./SimpleSegmentAnnotator.xml -- need to reference the one in core edu.mayo.bmi.smoking.ae.SimpleSegmentAnnotator - must use core/edu.mayo.bmi.uima.core.ae.SimpleSegmentAnnotator.java Version 1.2 Addendum - FIXED - The LibSVM features in the previous cTAKES version are not fed into the system in correct order and cause incorrect "past smoker", "current smoker", and "smoker" classification. The new version fixed this problem. - FIXED - Subsections not being correctly handled causing misclassifications. - FIXED - Remove the type, SmokingSentence in SmokingProductionTypeSystem.xml.