>java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator2

Dictionary Creator2: Creates a flat file Cui|Text or Cui|Tui|Text or Database Dictionary from UMLS
Database Dictionary can be indexed by each Text's First Word or Rarest Word (for the dictionary)
Minimal Usage: DictionaryCreator2 -umls pathToUmlsRoot -ol pathToFlatFileOutput

-fw             Create First Word Index
-umls           Umls Root Directory
-ob             Orangebook Path
-fd             Format Data Directory
-tui            Input Tui List Path
-src            Source Type List Path
-ol             Output Cui and Term List Path
-bsv            Output Cui and Tui and Term List Path
-db             Output Database Url
-tbl            Output Database Table

The UMLS Root Directory must be specified
One form of output must be specified using either -ol or -db and -tbl
The default index type for databases is Rare Word Index
If an Orangebook Path is not specified then (orangebook) medication terms are not written
If a Format Data Directory is not specified then the default is used: ./data/default
If an Input Tui List Path is not specified then the cTakes Tuis are used: ./data/default/CtakesAllTuis.txt
If a Source Type List Path is not specified then Snomed is used: ./data/default/CtakesSources.txt

Important: Dictionary entries are appended to the output file or database; existing content is not overwritten.
Running the same command twice will result in an output file or database that contains every term twice.
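
If a flat file has already picked up duplicates from a repeated run, one possible cleanup is sketched below.
This is only an illustration and not part of the tool: the sample entries are made up, and it assumes that the order
of lines in the flat file does not matter, because sort reorders them. It does not apply to database output.

```shell
#!/bin/sh
# Hypothetical sample standing in for a flat file written with -ol (Cui|Text).
dict_file=$(mktemp)
printf '%s\n' 'C0000001|sample term a' 'C0000002|sample term b' > "$dict_file"
# Simulate a second run of the same command appending the same entries.
printf '%s\n' 'C0000001|sample term a' 'C0000002|sample term b' >> "$dict_file"

before=$(wc -l < "$dict_file" | tr -d ' ')
# sort -u keeps a single copy of each identical line; -o writes back to the same file.
sort -u -o "$dict_file" "$dict_file"
after=$(wc -l < "$dict_file" | tr -d ' ')

echo "lines before: $before, lines after: $after"
rm -f "$dict_file"
```

A simpler safeguard is to delete or rename the old output file before running the command again.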

The data/default/ directory does include non-default possibilities, such as files listing only single cTakes groups:
e.g. CtakesAnatTuis.txt
and all UMLS groups:
UmlsAllTuis.txt
that can be used with the option -tui ./data/default/UmlsAllTuis.txt

There is also a file with all UMLS sources:
UmlsAllSources.txt
that can be used with the option -src ./data/default/UmlsAllSources.txt
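
As a rough sketch, the -tui and -src options can be combined with the minimal usage in one command.
The script below only assembles and prints the command line instead of running it; pathToUmlsRoot and
pathToFlatFileOutput are the same placeholders used above and must be replaced with real paths.

```shell
#!/bin/sh
# Placeholders taken from the usage text; substitute real paths before use.
umls_root="pathToUmlsRoot"
out_file="pathToFlatFileOutput"

# Collect the options as positional parameters so each one stays on its own line.
set -- -umls "$umls_root" \
       -tui ./data/default/UmlsAllTuis.txt \
       -src ./data/default/UmlsAllSources.txt \
       -ol "$out_file"

# Build the full command line as text and print it for review.
cmd="java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator2 $*"
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the shell and run from the directory that holds DictionaryTool.jar and data/.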

Remember that if you want to output to a database you must specify both the url and table name:
-db jdbc:hsqldb:file:pathToMyDatabase -tbl myTableName

Also remember that hsqldb requires the entire url to be lowercase.
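
One way to avoid stray capitals is to build the url in the shell and lowercase it before passing it to -db.
This is a minimal sketch; the database path is hypothetical. Note that on a case-sensitive file system the
directory on disk must then really be named in lowercase as well.

```shell
#!/bin/sh
# Hypothetical database location containing capitals.
db_path="/tmp/MyDictionary/UmlsDb"

# Prefix the hsqldb file scheme, then lowercase the entire url.
db_url=$(printf 'jdbc:hsqldb:file:%s' "$db_path" | tr '[:upper:]' '[:lower:]')

echo "$db_url"
```

The resulting value can then be used as: -db "$db_url" -tbl myTableName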

"Format Data" refers to the data that is used to format the end-result dictionary by trimming or expanding the umls entries.
It is recommended that the defaults are used, but you are welcome to experiment with your own.


If you are unfamiliar with hsqldb, there are two template / starting point databases in the resource/ directory.
cacheddbtemplate/ contains a template for a disk-cached dictionary, and memdbtemplate/ contains one for a fully in-memory dictionary.
Using an in-memory dictionary is orders of magnitude faster than using a disk-cached one, but it is not a good idea for very large (0.5 GB?) databases.


There are a few other toys that can be found by perusing the source, such as a tool that creates a mapping of codes 
for like terms in different dictionaries:
ICD10|ICD9|RXNORM|SNOMEDCT
Usage: java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.CodeMapCreator -umls pathToUmlsRoot -ol pathToFlatFileOutput

Some of these extra utilities may be experimental or unfinished, so user beware.


At this time the code could use some javadocs and unit tests, plus a little cleanup.  I'm very busy, so volunteer work is appreciated.

Enjoy