// Define some global attributes
include::_globattr.adoc[]

[[cd_dep_parser]]
=== Dependency Parser (optional) ===

Dependency parsers provide syntactic information about sentences.
Unlike deep parsers, they do not explicitly find phrases (e.g., NP or VP);
rather, they find the dependencies between words.  
For example, ``hormone replacement therapy'' would 
have the deep (phrase) structure:

	(NP (NML (NN hormone) (NN replacement)) (NN therapy))

but its dependency structure would show that ``hormone'' depends on
``replacement'' and ``replacement'' in turn depends on ``therapy''.
Below, the first column of numbers gives the ID of each word, and
the second column of numbers gives the ID of the word it depends on.

	23 hormone     hormone     NN 24 NMOD
	24 replacement replacement NN 25 NMOD
	25 therapy     therapy     NN 22 PMOD

Dependency parses can be _labeled_ as well; e.g., we could specify
that ``hormone'' is in a noun-modifier (i.e., NMOD) relationship with ``replacement'' in
the example above (the last column).
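
As an illustration only, the rows above can be read into a small table and the chain of heads followed from ``hormone''.
This is a minimal Python sketch, not part of ClearParser; the helper names +read_rows+ and +head_chain+ are hypothetical, and the sketch assumes the fragment contains no cycles.

```python
# Rows copied from the example above: ID, FORM, LEMMA, POS, HEAD, DEPREL.
ROWS = """\
23 hormone     hormone     NN 24 NMOD
24 replacement replacement NN 25 NMOD
25 therapy     therapy     NN 22 PMOD
"""


def read_rows(text):
    """Return {id: (form, head_id, deprel)} for whitespace-separated rows."""
    tokens = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tok_id, form, _lemma, _pos, head, deprel = line.split()
        tokens[int(tok_id)] = (form, int(head), deprel)
    return tokens


def head_chain(tokens, tok_id):
    """Follow HEAD links until the head lies outside this fragment."""
    chain = []
    while tok_id in tokens:
        form, head, _deprel = tokens[tok_id]
        chain.append(form)
        tok_id = head
    return chain


tokens = read_rows(ROWS)
print(" -> ".join(head_chain(tokens, 23)))  # hormone -> replacement -> therapy
```

The chain stops at ``therapy'' because its head (word 22) is not part of the three-word fragment.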

This project provides a UIMA wrapper and some utilities for
ClearParser (http://code.google.com/p/clearparser/), a
transition-based dependency parser that achieves state-of-the-art
accuracy and speed.
footnoteref:[clearparser-code,
The implementation in {osp-short} v{rev-short} is based on revision 75.]
****************************************************
ClearParser is described in:

K-best, Locally Pruned, Transition-based Dependency Parsing 
Using Robust Risk Minimization. Jinho D. Choi, Nicolas Nicolov, 
Collections of Recent Advances in Natural Language Processing V, 
205-216, John Benjamins, Amsterdam & Philadelphia, 2009.
****************************************************

Dependency parsers often assume lemmas (normalized word forms) and
POS tags as input.  This {osp-short} component takes lemmas
and POS tags from upstream LVG and POS tagger components. 

[[sec_depparser_ae]]
==== Analysis Engines and other Descriptors ====

- *analysis_engine/ClearParserAE.xml*
+
This analysis engine wraps the dependency parser's prediction
function (i.e., finding dependency trees from text).
It takes lemmas and POS tags from the 
+normalizedForm+ and +partOfSpeech+ attributes
of BaseTokens that have been found in {osp-short} (i.e., are in the CAS).
This is the analysis engine that should be dropped into 
any new pipelines to get dependency parses.  
+
*Parameters*::
  DependencyModelFile;; the file that contains the ClearParser transition model
  FeatureTemplateFile;; the file that contains the features (leave this as +feature.xml+ unless you want to modify the internals of the dependency parser)
  LexiconDirectory;; the directory that contains ClearParser tagsets, etc.
  MorphDictionaryDirectory;; leave empty to use POS tags from {osp-short}.  Enter +en_dict+ to use the Clear Morphological Analyzer.
+
- *analysis_engine/ClearParserPlaintextAggregate.xml*
+
An aggregate engine appropriate for use with CVD or other tools that act directly on plain text.
+
- *analysis_engine/ClearParserTokenizedAggregate.xml* and *ClearParserTokenizedInfPosAggregate.xml*
+
Aggregate engines appropriate for use in CPEs.  
The first of these assumes that POS tags have been
given in an upstream component or directly from data.  
The second infers POS tags using the {osp-short} POS tagger.
+
- *analysis_engine/ClearTrainerAE.xml* and *ClearTrainerAggregate.xml*
+
These analysis engines train models for use in ClearParserAE.  
See <<sec_depparser_train>> for further details.
+
- *analysis_engine/LemAssigner.xml*, *LvgBaseTokenAnnotator.xml*, and *PosAssigner.xml*
+
These analysis engines are upstream components that complement ClearTrainerAE.
See <<sec_depparser_train>> for further details.
+
- *collection_reader/DependencyFileCollectionReader.xml*
+
Reads in a single file with dependency data in the formats described
in <<sec_depparser_format>>.  The file is treated as a single document
with many sentences that are separated by blank lines.
+
- *cas_consumer/DependencyNodeWriter.xml*
+
Writes ConllDependencyNode objects (the internal form used for dependency parses)
to the .dep format (see <<sec_depparser_format>>).


[[sec_depparser_resources]]
==== Resources and Models ====

- *clinques.mod*
+
The main ClearParser model packaged with {osp-short} v{rev-short}.
This is trained on a corpus of 1600 clinical questions.
+
- *pass:[lexicon*/]*
+
A directory of additional files for a ClearParser model.  
For example, ``deprel.txt'' contains the set of dependency labels;
``pos.txt'' contains the set of POS tags.
+
IMPORTANT: When training within this project, 
the specified lexicon directory
must first be created separately.
+
- *en_dict/*
+
A directory used by the Clear Morphological Analyzer to create lemmas.
This analyzer is an alternative to using LVG output from {osp-short}.
Descriptor files in the project are set up to use the
Clear Morphological Analyzer if a valid location for this directory is passed
as the ``Morph Dictionary Directory'' parameter to Analysis Engines. 
+
- *feature.xml*
+
Tells the dependency parser what features to base its dependency 
decisions on.
+
- *config_en.xml*
+
This file is not used when {osp-short} is running ClearParser.
You can follow the manual on the ClearParser website to
run tests or training from the command line, which would make use of this file.
+
- *en_clinques.headrules*
+
When converting from standard phrase-structure trees, head rules specify
which child of a tree node is the _head_ (loosely, the most important child).
These are used by +clear.engine.PhraseToDep+ .
 
 
[[sec_depparser_format]]
==== Data Format ====
The format of training data read by DependencyFileCollectionReader
and of output data written by DependencyNodeWriter is the same.
Files should have one word per line alongside several other tab-delimited attributes.
Sentences are separated by a blank line.

An example snippet from +data/sample.dep+ of dependency data is shown in
<<dep_parser_data_eg>> (the first line is for reference).

[[dep_parser_data_eg]]
.Dependency parser data: +.dep+ format
===================================================================
------------------------------------------------------------------
ID  FORM         LEMMA        POS   HEAD DEPREL
1   The          the          DT    3    NMOD
2   study        study        NN    3    NMOD
3   physician    physician    NN    4    SBJ
4   called       call         VBD   0    ROOT
5   a            a            DT    6    NMOD
6   hematologist hematologist NN    4    OBJ
------------------------------------------------------------------
===================================================================

Description of +.dep+ format fields::
	ID;; The word number within a sentence.  There is an implied ROOT word that has an ID of 0.
	FORM;; The word itself.
	LEMMA;; A normalized form of the word, stripping suffixes, etc.
	POS;; The part-of-speech associated with the word.
	HEAD;; The ID of the word that the word on the current line is dependent on.
	DEPREL;; The relationship between the current word and its head.  This may be syntactic or semantic.
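
The field layout above can be sketched as a small reader.
The following Python is an illustration under stated assumptions, not the code used by DependencyFileCollectionReader: the names +parse_dep+ and +check_heads+ are hypothetical, the input is assumed to be tab-delimited with exactly six columns, and the reference header line is assumed to be absent.

```python
FIELDS = ["ID", "FORM", "LEMMA", "POS", "HEAD", "DEPREL"]

# The example sentence from above, tab-delimited, followed by a blank line.
SAMPLE = (
    "1\tThe\tthe\tDT\t3\tNMOD\n"
    "2\tstudy\tstudy\tNN\t3\tNMOD\n"
    "3\tphysician\tphysician\tNN\t4\tSBJ\n"
    "4\tcalled\tcall\tVBD\t0\tROOT\n"
    "5\ta\ta\tDT\t6\tNMOD\n"
    "6\thematologist\thematologist\tNN\t4\tOBJ\n"
    "\n"
)


def parse_dep(text):
    """Split .dep text into sentences; each token is a dict keyed by FIELDS."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError("expected %d columns: %r" % (len(FIELDS), line))
        current.append(dict(zip(FIELDS, values)))
    if current:                       # file may not end with a blank line
        sentences.append(current)
    return sentences


def check_heads(sentence):
    """Raise if a HEAD is neither 0 (the implied ROOT) nor a word ID."""
    ids = {int(tok["ID"]) for tok in sentence}
    for tok in sentence:
        head = int(tok["HEAD"])
        if head != 0 and head not in ids:
            raise ValueError("bad HEAD %d for word %s" % (head, tok["FORM"]))


sentences = parse_dep(SAMPLE)
for sentence in sentences:
    check_heads(sentence)
    forms = {int(tok["ID"]): tok["FORM"] for tok in sentence}
    forms[0] = "ROOT"
    for tok in sentence:
        print(tok["FORM"], "--" + tok["DEPREL"] + "->", forms[int(tok["HEAD"])])
```

A check such as +check_heads+ is a cheap way to catch column shifts before the data is used for training.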

The popular CoNLL format is also supported for input into {osp-short}; this requires several
extra columns.  However, not all of those columns will be used for ClearParser parsing.

===== Derivative Formats =====
The parser will use formats derived from this +.dep+ format, as well.

Alternative formats::
	*+.mlem+*;; ID, FORM, LEMMA, HEAD, DEPREL
	*+.mpos+*;; ID, FORM, POS, HEAD, DEPREL
	*+.min+*;; ID, FORM, HEAD, DEPREL
	*+.tok+*;; ID, FORM

Of these, +.tok+ is typically used for testing the actual parses, and the rest are used for training new models.

[[dep_data_conversion]]
===== Conversion between formats =====
Data from resources such as WSJ or Genia typically come as trees
that were originally used for deep parsers (as in the example above).
The dependency parser project comes with a tool to convert between those trees
and the +.dep+ format.

Run:
_________________________________________________________________________
+*java -cp lib/args4j-2.0.12.jar:bin clear.engine.PhraseToDep \
                        -h resources/en_clinques.headrules \
                        -i <input-file> \ <1>
                        -m resources/en_dict \
                        -o <output-file>*+ <2>
_________________________________________________________________________

<1> A file in bracketed tree format, e.g., +data/sample.tree+ 
<2> A file to be written out in +.dep+ format
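
To give a feel for what head rules do during such a conversion, here is a toy Python sketch.
It is not the algorithm in +clear.engine.PhraseToDep+ and does not read +en_clinques.headrules+; it assumes a single made-up rule (the rightmost child of every phrase is its head), and the function names are hypothetical.
Real head rules are specific to each phrase label.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree into (label, children); a leaf is (POS, [word])."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # the word under a POS tag
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return node()


def to_dependencies(tree):
    """Return (ID, FORM, HEAD) triples using the toy rightmost-head rule."""
    words = []    # word forms in sentence order
    heads = {}    # 1-based word ID -> head ID (0 is the implied ROOT)

    def visit(n):
        _label, children = n
        if isinstance(children[0], str):       # POS tag over a single word
            words.append(children[0])
            return len(words)
        child_heads = [visit(child) for child in children]
        head = child_heads[-1]                 # assumption: rightmost child is head
        for other in child_heads[:-1]:
            heads[other] = head                # the other children depend on it
        return head

    heads[visit(tree)] = 0
    return [(i, word, heads[i]) for i, word in enumerate(words, start=1)]


tree = parse_tree("(NP (NML (NN hormone) (NN replacement)) (NN therapy))")
for row in to_dependencies(tree):
    print(*row, sep="\t")
```

For this phrase the toy rule makes ``hormone'' depend on ``replacement'' and ``replacement'' depend on ``therapy'', matching the dependency example given earlier; the sketch does not produce lemmas or dependency labels.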

To down-convert from +.dep+ to the other formats, use +cut+ in Linux:
_____
*+cut -f 1,2,3,5,6 data/sample.dep > data/sample.mlem+*
*+cut -f 1,2,4,5,6 data/sample.dep > data/sample.mpos+*
*+cut -f 1,2,5,6 data/sample.dep > data/sample.min+*
*+cut -f 1,2 data/sample.dep > data/sample.tok+*
_____ 

In Windows, this operation can be done in a spreadsheet application such as Excel by:

- using Data -> Import -> From Text File,
- selecting the +.dep+ file, 
- identifying tabs as the delimiter,
- deleting the unwanted columns, and
- using File -> Export to save in a tab-delimited format.
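
A platform-independent alternative is a few lines of Python that keep the same columns as the +cut+ commands.
This is a sketch, not a tool shipped with the project; the name +down_convert+ is hypothetical and the input is assumed to be tab-delimited +.dep+ text with six columns.

```python
# Zero-based column indexes to keep for each derivative format
# (.dep columns: ID, FORM, LEMMA, POS, HEAD, DEPREL).
COLUMNS = {
    "mlem": [0, 1, 2, 4, 5],
    "mpos": [0, 1, 3, 4, 5],
    "min": [0, 1, 4, 5],
    "tok": [0, 1],
}


def down_convert(dep_text, fmt):
    """Keep only the columns of `fmt`; blank sentence separators are preserved."""
    keep = COLUMNS[fmt]
    out = []
    for line in dep_text.splitlines():
        if not line.strip():
            out.append("")
            continue
        fields = line.split("\t")
        out.append("\t".join(fields[i] for i in keep))
    return "\n".join(out) + "\n"


sample = "1\tThe\tthe\tDT\t3\tNMOD\n2\tstudy\tstudy\tNN\t3\tNMOD\n"
print(down_convert(sample, "min"), end="")
```

To convert a file, read it with +open("data/sample.dep", encoding="utf-8").read()+, pass the text to +down_convert+, and write the result to the target file.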


[[sec_depparser_train]]
==== Training a model ====


===== Training data =====
The packaged model is trained on clinical questions,
in part because it is small enough to package with {osp-short}.
If this is not your domain, you may want to train
other models.  At the time of {osp-short} release {rev-short},
however, there are no clinical document treebanks to train on.

The other options are:

- GENIA
+
link:{genia-home}[GENIA] is a literature mining project in molecular
biology from the University of Tokyo. Its corpus, a collection of
biomedical literature, has been annotated with phrase-structure trees. 
You can download a copy at
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/corpus/GENIA_treebank_v1.tar.gz
+
- [[get_ptb]] Penn Treebank
+
The link:{ptb-home}[Penn Treebank] project annotates
naturally occurring text for linguistic structure. 
Obtaining a copy of Release 2 is non-trivial; please read
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7.


===== Training a model in Eclipse =====
There are many types of models that can be trained, based on
how much you want to rely on {osp-short} components and how much
you want to rely on other components.

. Download and install the C++ version of liblinear at http://www.csie.ntu.edu.tw/~cjlin/liblinear/; 
this requires much less memory than the default Java version.

. Train a model
** To create a model using {osp-short} POS tags and lemmas with Eclipse:
	.. Create a +<your-data>.min+ file from +<your-data>.dep+ (see <<dep_data_conversion>>)
	.. Use the +UIMA_CPE_GUI--dependency parser+ launch.
	.. Load +desc/collection_processing_engine/ClearTrainerPosLemTestCPE.xml+
	.. Put your filename under ``Dependency File''
	.. Make sure ``Training Mode'' is checked
	.. Set the ``Dependency Model File'' and ``Lexicon Directory'' to the names you want.
	.. Make sure ``Trainer Path'' is a valid relative path from
	    +{inst-root-dir}/dependency parser+ to a valid liblinear +train+ binary.

** To create a model using gold standard POS tags and Clear Morphological Analyzer lemmas,
run:

____________________________________________________________
+*java -cp lib/hppc-0.3.1.jar:bin:resources clear.engine.DepTrain \
                     -t <training_data> \ <1>
                     -c <configuration_file>*+ <2>
____________________________________________________________

<1> a training data file in one of the dependency formats described in <<sec_depparser_format>>.
<2> a ClearParser configuration file such as +config_en.xml+ (see <<sec_depparser_resources>>).