Document Analyzer User's Guide

The Document Analyzer is a tool provided by the UIMA SDK for testing annotators and Text Analysis Engines (TAEs). It reads text files from your disk, processes them using a TAE, and lets you view the results. It is designed to work with text files and cannot be used with Analysis Engines that process other types of data.

For an introduction to developing annotators and Analysis Engines, read Chapter 4, Annotator and Analysis Engine Developer's Guide. This chapter explains how to use the Document Analyzer tool; it does not describe how to develop annotators or Analysis Engines.

To run the Document Analyzer, execute the documentAnalyzer script that is in the bin directory of your UIMA SDK installation, or, if you are using the example Eclipse project, execute the "UIMA Document Analyzer" run configuration supplied with that project.

Note that if you're planning to run an Analysis Engine other than one of the examples included in the UIMA SDK, you'll first need to update your CLASSPATH environment variable to include the classes needed by that Analysis Engine.
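
Before launching the tool, it can help to confirm that your annotator class is actually visible on the classpath. The following is a minimal sketch of such a check, not part of the UIMA SDK; the class name ClasspathCheck and the helper isOnClasspath are hypothetical names introduced here for illustration.

```java
// Hypothetical helper (not part of the UIMA SDK): reports whether a class
// can be loaded from the classpath this program was started with.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class (or one of its dependencies) could not be found or linked
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass your annotator's fully qualified class name as the first argument.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

Run it with the same CLASSPATH setting that you intend to use for the Document Analyzer; a result of false suggests the variable still needs to be updated.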

When you first run the Document Analyzer, you should see a screen that looks like this:

To run a TAE, you must first configure the six fields on the main screen of the Document Analyzer.

Input Directory: Browse to or type the path of a directory containing text files that you want to analyze. Some sample documents are provided in the UIMA SDK under the docs/examples/data directory.

Output Directory: Browse to or type the path of a directory where you want output to be written. (As we'll see later, you won't normally need to look directly at these files, but the Document Analyzer needs to know where to write them.) The files written to this directory will be an XML representation of the analyzed documents. If this directory doesn't exist, it will be created. If you leave this field blank, your TAE will be run but no output will be generated.

Location of TAE XML Descriptor: Browse to or type the path of the descriptor for the TAE that you want to run. There are some example descriptors provided in the UIMA SDK under the docs/examples/descriptors/analysis_engine and docs/examples/descriptors/tutorial directories.

XML Tag containing Text: This is an optional feature. If you enter a value here, it specifies the name of an XML tag, expected to be found within the input documents, that contains the text to be analyzed. For example, the value TEXT would cause the TAE to only analyze the portion of the document enclosed within <TEXT>...</TEXT> tags.
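
To illustrate the effect of this setting, the sketch below pulls out the text enclosed in a named tag using the XML parser that ships with the JDK. It is only an approximation of the behavior described above, not the Document Analyzer's own implementation; it assumes the input is well-formed XML, and the class name TagTextExtractor and the sample document are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration: returns the text found inside every <tagName> element.
final class TagTextExtractor {

    static String extract(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            StringBuilder text = new StringBuilder();
            // Concatenate the text content of each matching element, in document order
            for (int i = 0; i < nodes.getLength(); i++) {
                text.append(nodes.item(i).getTextContent());
            }
            return text.toString();
        } catch (Exception e) {
            // Assumption of this sketch: malformed input is reported as an argument error
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<DOC><TITLE>Memo</TITLE><TEXT>Meet in room 5 at noon.</TEXT></DOC>";
        // Only the content of <TEXT> is kept; the <TITLE> content is ignored.
        System.out.println(extract(sample, "TEXT")); // prints "Meet in room 5 at noon."
    }
}
```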

Language: Specify the language in which the documents are written. Some Analysis Engines, but not all, require that this be set correctly in order to do their analysis. You can select a value from the drop-down list or type your own. The value entered here must be an ISO language identifier, the list of which can be found here: http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt
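
If you type your own value, you may want to check it first. The sketch below tests whether the primary part of a value is one of the two-letter ISO 639 codes known to the Java runtime, using java.util.Locale.getISOLanguages(). It is an assumption of this example that checking only the part before any "-" or "_" is sufficient; the class name LanguageCodeCheck is hypothetical, and the Document Analyzer itself may accept values that this check rejects.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: checks a language value against the JDK's list of ISO 639 codes.
final class LanguageCodeCheck {

    static boolean isIsoLanguage(String value) {
        if (value == null) {
            return false;
        }
        // Keep only the primary subtag, e.g. "en" from "en-US"
        String primary = value.split("[-_]")[0].toLowerCase(Locale.ROOT);
        return Arrays.asList(Locale.getISOLanguages()).contains(primary);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"en", "de", "en-US", "xx"}) {
            System.out.println(candidate + " -> " + isIsoLanguage(candidate));
        }
    }
}
```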

Character Encoding: The character encoding of the input files. The default, UTF-8, also works fine for ASCII text files. If you have a different encoding, enter it here. For more information on character sets and their names, see the JavaDocs for java.nio.charset.Charset.
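
As a quick check before entering a different encoding, the sketch below asks the Java runtime whether it supports a given charset name, and shows why the UTF-8 default is safe for ASCII files: ASCII bytes decode to the same text under both encodings. The class name EncodingCheck and its method names are hypothetical and not part of the UIMA SDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates charset names using java.nio.charset.Charset.
final class EncodingCheck {

    static boolean isUsableEncoding(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name
            return false;
        }
    }

    static boolean asciiDecodesSameAsUtf8(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        String viaUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String viaAscii = new String(bytes, StandardCharsets.US_ASCII);
        return viaUtf8.equals(viaAscii);
    }

    public static void main(String[] args) {
        // Pass the encoding name you plan to enter as the first argument.
        String name = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(name + " supported: " + isUsableEncoding(name));
        System.out.println("ASCII reads the same as UTF-8: "
                + asciiDecodesSameAsUtf8("Plain ASCII text"));
    }
}
```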

Once you've filled in the appropriate values, press the "Run" button.

If an error occurs, a dialog will appear with the error message. (A stack trace will also be printed to the console, which may help you if the error was generated by your own annotator code.) Otherwise, an "Analysis Results" window will appear.

After a successful analysis, the "Analysis Results" window appears, listing the files that were analyzed.

The "Results Display Format" options at the bottom of this window offer four ways to view your analysis: Java Viewer, Java Viewer (JV) with User Colors, HTML, and XML. The default, Java Viewer, is recommended.

Once you have selected your desired Results Display Format, you can double-click on one of the files in the list to view the analysis done on that file.

For the Java viewer, the results display looks like this (for the TAE descriptor docs/examples/descriptors/tutorial/ex4/MeetingDetectorTAE.xml):

Click one of the highlighted annotations to see a list of all its features in the frame on the right.

If there are multiple annotation types in the view, you can control which ones are selected by using the checkboxes in the legend, the Select All button, or the Deselect All button.

If you are viewing a CAS that contains multiple subjects of analysis (Sofas), a selector appears at the bottom right of the Annotation Viewer window, allowing you to choose the Sofa that you wish to view. Note that only text Sofas containing a non-null document are available for viewing.

The "JV User Colors" and HTML viewers allow you to specify exactly which colors are used to display each of your annotation types. For the Java Viewer, you can also specify which types should be initially selected, and you can hide types entirely.

To configure the viewer, click the "Edit Style Map" button on the "Analysis Results" dialog. You should see a dialog that looks like this:

To change the color assigned to a type, simply click on the colored cell in the "Background" column for the type you wish to edit. This will display a dialog that allows you to choose the color. For the HTML viewer only, you can also change the foreground color.

If you would like the type to be initially checked (selected) in the legend when the viewer is first launched, check the box in the "Checked" column. If you would like the type to never be shown in the viewer, check the box in the "Hidden" column. These settings affect only the Java Viewer, not the HTML viewer.

When you are done editing, click the "Save" button. This will save your choices to a file in the same directory as your TAE descriptor. From now on, when you view analysis results produced by this TAE using the "JV User Colors" or "HTML" options, the viewer will be configured as you have specified.

Interactive Mode allows you to analyze text that you type or cut-and-paste into the tool, rather than requiring that the documents be stored as files.

In the main Document Analyzer window, you can invoke Interactive Mode by clicking the "Interactive" button instead of the "Run" button. This will display a dialog that looks like this:

You can type or cut-and-paste your text into this window, then choose your Results Display Format and click the "Analyze" button. Your TAE will be run on the text that you supplied and the results will be displayed as usual.

If you have previously run a TAE and saved its analysis results, you can use the Document Analyzer's View mode to view those results without re-running your analysis. To do this, in the main Document Analyzer window, enter the location of your analyzed documents in the "Output Directory" field and click the "View" button. You can then view your analysis results as described in Section 17.3, Viewing the Analysis Results.

The documentation for the CAS Visual Debugger component is found in a separate file in the docs/ directory, called CASVisualDebugger.pdf.