
How to create a custom dictionary from the UMLS dataset for cTAKES – Step by Step Guide by Technosoft
This article explains how to download the UMLS dataset and use it to configure a cTAKES dictionary with custom coding systems.
1. Creating Subset of Codes Using MetamorphoSys Tool
This section covers MetamorphoSys, the tool distributed with the UMLS release. It lets us create a subset of codes from the UMLS dataset that matches our requirements.
1.1. Download UMLS Release
- Download the latest full UMLS release from https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
- Extract the archive to a separate folder, e.g. umls-2019AA-full
1.2. Running MetamorphoSys
- Open the folder called 2019AA-full inside the extracted folder.
- Extract the archive mmsys.zip inside this folder.
- Execute the following file per your operating system:
-> Windows (32-bit): run.bat
-> Windows (64-bit): run64.bat
-> Linux: run_linux.sh
-> Macintosh: run_mac.sh
- This launches MetamorphoSys. Click on Install UMLS.
- Choose the source directory as the same directory in which the archive mmsys.zip was extracted.
- Choose a destination directory of your own choice.
- Select the following options:
Metathesaurus
Semantic Network
SPECIALIST Lexicon & Lexical Tools
- Click on OK
- Click on New Configuration..
- Click on Accept
- Select Active Subset as the default subset.
- Before proceeding to the next tab, make sure that the source folder is the same directory in which the archive mmsys.zip was extracted.
- Click on Output Options and make sure that the destination folder begins with the destination directory chosen earlier, with \2019AA\META appended to it.
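The launcher choice listed in this section can be sketched in Python as follows. This is only an illustration, not part of MetamorphoSys: the function names are hypothetical, and the 64-bit check looks at the Python build, which is assumed here to match the operating system.

```python
import platform
import sys


def pick_launcher(system: str, is_64bit: bool) -> str:
    """Return the MetamorphoSys launcher file name for an operating system.

    The file names are the ones listed in the steps above; `system` uses the
    values returned by platform.system().
    """
    if system == "Windows":
        return "run64.bat" if is_64bit else "run.bat"
    if system == "Linux":
        return "run_linux.sh"
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return "run_mac.sh"
    raise ValueError(f"No launcher listed for operating system: {system!r}")


def launcher_for_this_machine() -> str:
    """Pick the launcher for the machine running this script.

    Assumption: a 64-bit Python build (sys.maxsize above 2**32) is treated
    as a stand-in for a 64-bit operating system.
    """
    return pick_launcher(platform.system(), sys.maxsize > 2**32)
```

Calling launcher_for_this_machine() from inside the folder where mmsys.zip was extracted gives the name of the file to run; launching it is left to the reader.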
1.3. Selecting Code Sources for the Subset
- Click on Source List and select the checkbox Select sources to INCLUDE in the subset
- Click on any source to clear the existing selection(s).
- Sort the column Source Family by clicking on its column header. This ensures that sources belonging to the same source family appear next to each other, which makes it harder to miss a source when you want to select a whole source family for your subset.
- For this example, we chose the following source families (you may choose other source families; the succeeding steps still apply either way):
RXNORM
SNOMEDCT
ICD10 (all forms of it)
- Click on the first entry that matches the ICD10 source family.
- A dialog box appears suggesting other sources that are typically included with the ICD10 source. Click on OK to include them as well.
- Hold the CTRL key on your keyboard and do the same for these source families: ICD10AM, ICD10CM, and ICD10PCS.
- Make sure that all sources belonging to this source family have been selected.
- Hold the CTRL key on your keyboard and choose the single entry that matches the RXNORM family.
- Hold the CTRL key on your keyboard and choose the first entry that matches the SNOMEDCT family.
- The same dialog box appears again, suggesting other sources for this family. Click on OK.
- Make sure that all sources belonging to this source family have been selected.
- Scroll through the list and ensure that you have selected all the sources that you intended to.
- Click on Done in the menu bar.
- Click on Begin Subset
- Click on Yes to save this configuration. Anyone who would like to replicate this configuration may do so by choosing the other option instead of New Configuration.. and loading this saved file.
- The tool now starts the process. You can follow the progress in the new window. Please wait patiently, as this might take some time.
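Once the subset has been written, one way to check that the intended sources made it in is to count the rows per source in MRCONSO.RRF, which should appear inside the \2019AA\META folder of the destination directory. The sketch below assumes the standard pipe-delimited Metathesaurus layout, in which the source abbreviation (SAB) is the twelfth field; the function name is hypothetical.

```python
from collections import Counter

# Zero-based position of the source abbreviation (SAB) field in MRCONSO.RRF.
SAB_COLUMN = 11


def count_sources(mrconso_path):
    """Count rows per source abbreviation in a pipe-delimited MRCONSO.RRF file."""
    counts = Counter()
    with open(mrconso_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("|")
            # Skip malformed or empty lines that are too short to hold SAB.
            if len(fields) > SAB_COLUMN:
                counts[fields[SAB_COLUMN]] += 1
    return counts
```

Printing the result of count_sources for the generated MRCONSO.RRF shows which source abbreviations are present. The abbreviations stored in the file may differ slightly from the source family names shown in MetamorphoSys, so compare them against the Source List rather than against the family names used in this guide.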
2. Creating Dictionary Using cTAKES Dictionary Creator
This section covers the cTAKES Dictionary Creator. The files generated by MetamorphoSys are passed as input to this tool, which creates a custom dictionary in the form of an SQL script.
2.1. Download cTAKES
- Download the latest version of Apache cTAKES from https://ctakes.apache.org/downloads.cgi
- Extract the archive to a separate folder, e.g. apache-ctakes-4.0.0
- Open the bin folder.
- Run the script that matches your operating system:
-> Windows (32/64 bit): runDictionaryCreator.bat
-> Linux/Macintosh: runDictionaryCreator.sh
2.2. Creating cTAKES Dictionary
- Select the UMLS installation directory by clicking on Select Directory. Change the selected directory to the destination directory that was chosen in section 1.2, choose 2019AA and click on Open.
- The tool now parses the vocabulary types. This takes some time, depending upon the size of the subset.
- Select every checkbox manually in both of the lists.
- Choose a custom name for the dictionary and type it in the textbox, e.g. my-custom-dictionary.
- Click on Build Dictionary
- You will be notified when the process is completed. The SQL script is generated in the folder:
apache-ctakes-4.0.0\resources\org\apache\ctakes\dictionary\lookup\fast\my-custom-dictionary
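To confirm that the build produced output, the folder above can be inspected with a short Python sketch. It assumes the folder layout given in this guide (with the package folder spelled lookup) and makes no claim about which file names the tool writes; the function names are hypothetical.

```python
from pathlib import Path


def dictionary_output_dir(ctakes_home, dictionary_name):
    """Build the folder in which this guide says the dictionary is generated."""
    return (
        Path(ctakes_home)
        / "resources" / "org" / "apache" / "ctakes"
        / "dictionary" / "lookup" / "fast"
        / dictionary_name
    )


def list_generated_files(ctakes_home, dictionary_name):
    """Return the sorted file names in the output folder, or [] if it is missing."""
    out_dir = dictionary_output_dir(ctakes_home, dictionary_name)
    if not out_dir.is_dir():
        return []
    return sorted(p.name for p in out_dir.iterdir() if p.is_file())
```

For the names used in this guide, list_generated_files("apache-ctakes-4.0.0", "my-custom-dictionary") returns the files found in that folder; an empty list suggests that the build has not finished or that a different cTAKES folder was used.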