
Extraction of Patient Information using cTAKES
Clinical notes contain large amounts of clinical data, but that data is rarely in a format that a graphing tool, or any other tool for that matter, can readily use, because the key clinical values are interspersed with plain narrative text. For example, a hospital may receive narrative data from an MRI assessment in an HL7 ORU^R01 result message. That message may contain diagnosis information, but the information is not actionable while it stays buried in the text of the narrative report. In the past, a human had to read this text and convert it into actionable discrete values. This is a great opportunity for a Machine Learning solution.
Luckily, cTAKES (clinical Text Analysis and Knowledge Extraction System) is available for exactly this purpose. cTAKES is an open-source natural language processing system that extracts information from the clinical free text of electronic medical records. It recognizes diseases, disorders, signs, symptoms, medications, and procedures in the given free text. cTAKES outputs structured data that can be mapped to a relevant FHIR resource containing codes from various coding systems; that FHIR data can then be consumed by any EHR, graphing tool, or alerting system.
cTAKES Overview
Currently, cTAKES is available as a GUI application. Those who want to use cTAKES as an API (Application Programming Interface) will run into some limitations, and we show here how to set it up for API use. By default, cTAKES does not provide a web service for clients to consume. It only provides a GUI application named “CAS Visual Debugger” for processing clinical notes, which lets the user inspect the concepts (and their details) identified within clinical notes in the form of nodes, as can be seen in the image below:
Hence, in order to use cTAKES as an API, Technosoft created a Spring Boot application that exposes a single REST API endpoint. The endpoint requires a clinical note in its request body and returns a tailored response consisting of conditions, observations, medications, and procedures. It uses exactly the same underlying processing mechanism as the CAS Visual Debugger (CVD).
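As a rough illustration of how a client might call such an endpoint, the sketch below builds a POST request with Python's standard library. The URL, the `/analyze` path, the plain-text content type, and the JSON response are all assumptions made for this example; they are not the documented contract of the service described here.

```python
import json
import urllib.request

# Hypothetical endpoint: the real host, port, and path depend on the deployment.
CTAKES_API_URL = "http://localhost:8080/analyze"


def build_request(note_text, url=CTAKES_API_URL):
    """Build a POST request that carries the clinical note in its body."""
    return urllib.request.Request(
        url,
        data=note_text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",  # assumed content type
            "Accept": "application/json",
        },
        method="POST",
    )


def analyze(note_text, url=CTAKES_API_URL, timeout=60.0):
    """Send the note and decode the response, assumed here to be a JSON object."""
    with urllib.request.urlopen(build_request(note_text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `analyze("Patient denies chest pain.")` against a running instance would return the decoded response; the generous timeout reflects that the first request may be slow while the analysis engine loads.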
As soon as a request is received, the application loads the default UMLS analysis engine (employed by cTAKES) into memory. The engine verifies the UMLS credentials passed to the application, processes the clinical note by associating codes from the dictionary database with the concepts identified in the note, and returns its result to the application. The application then reshapes that result to match FHIR resources and returns it to the client.
When a client receives the response from the REST API, the extracted concepts arrive as discrete values in nodes. The client can create FHIR resources by iterating over the array of resources in the response nodes and pushing them to a FHIR server.
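The mapping step can be sketched as follows. The response shape (top-level `conditions`, `observations`, `medications`, and `procedures` arrays whose nodes carry `text`, `system`, and `code`) is an assumption for illustration, as are the function name and the default `status` values; the resource and element names follow FHIR R4.

```python
# Assumed response keys mapped to (FHIR R4 resource type, element holding the code).
RESOURCE_MAP = {
    "conditions": ("Condition", "code"),
    "observations": ("Observation", "code"),
    "medications": ("MedicationStatement", "medicationCodeableConcept"),
    "procedures": ("Procedure", "code"),
}

# These R4 resource types require a status; the defaults here are cautious guesses.
DEFAULT_STATUS = {
    "Observation": "preliminary",
    "MedicationStatement": "unknown",
    "Procedure": "unknown",
}


def to_transaction_bundle(response, patient_ref):
    """Turn the (assumed) API response into a FHIR transaction Bundle."""
    entries = []
    for key, (resource_type, code_element) in RESOURCE_MAP.items():
        for node in response.get(key, []):
            resource = {
                "resourceType": resource_type,
                "subject": {"reference": patient_ref},
                code_element: {
                    "coding": [{"system": node["system"], "code": node["code"]}],
                    "text": node["text"],
                },
            }
            if resource_type in DEFAULT_STATUS:
                resource["status"] = DEFAULT_STATUS[resource_type]
            # Each entry asks the server to create one resource of this type.
            entries.append({
                "resource": resource,
                "request": {"method": "POST", "url": resource_type},
            })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}
```

Posting the resulting Bundle to the base URL of a FHIR server creates all resources in one transaction, so a note is stored either completely or not at all.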
cTAKES Components
cTAKES is built from modular components, each with its own specific capability. Every component includes at least one annotator (analysis engine).
These components include:
- Named section identifier
- Sentence boundary detector
- Rule-based tokenizer
- Formatted list identifier
- Normalizer
- Context-dependent tokenizer
- Part-of-speech tagger
- Phrasal chunker
- Dictionary lookup annotator
- Context annotator
- Negation detector
- Uncertainty detector
- Subject detector
- Dependency parser
- Patient smoking status identifier
- Drug mention annotator
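To give a feel for how a few of these components cooperate, here is a deliberately simplified toy pipeline: a sentence splitter, a tokenizer, a dictionary lookup, and a cue-based negation check. It is not cTAKES code and does not reflect how cTAKES implements these steps; the dictionary entries, negation cues, and function names are all made up for the example.

```python
import re

# Toy dictionary: surface form -> concept type (illustrative, not UMLS data).
DICTIONARY = {"chest pain": "symptom", "fever": "symptom", "aspirin": "medication"}
# Toy negation cues; real negation detection is far more careful about scope.
NEGATION_CUES = ("no", "denies", "without")


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def annotate(text):
    """Find dictionary terms and flag those preceded by a negation cue."""
    mentions = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for term, concept_type in DICTIONARY.items():
            term_tokens = term.split()
            n = len(term_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == term_tokens:
                    # Crude rule: any cue earlier in the same sentence negates the term.
                    negated = any(cue in tokens[:i] for cue in NEGATION_CUES)
                    mentions.append({"term": term, "type": concept_type, "negated": negated})
    return mentions
```

Running `annotate("Patient denies chest pain. He has fever and takes aspirin.")` flags "chest pain" as negated and leaves "fever" and "aspirin" as affirmed, which is the kind of distinction that keeps a denied symptom from being recorded as a finding.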
Process for creating a Custom Dictionary
MetamorphoSys is the Metathesaurus customization tool included in each UMLS release. It lets users build customized subsets of the Metathesaurus, which can then serve as custom dictionaries for cTAKES. cTAKES uses the data in these dictionaries to spot discrete values of interest in the given text.
One can use any of the UMLS vocabularies to create a custom dictionary. (UMLS, the Unified Medical Language System, is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems.) Technosoft suggests using the tool to create a subset of the codes for custom dictionaries. Since the complete UMLS dataset covers a wide range of coding vocabularies, one has to choose the desired ones from it, such as:
- RxNorm – normalized names for clinical drugs
- ICD-10 – The International Classification of Diseases, Tenth Revision
- CPT – Current Procedural Terminology
- SNOMED CT – The Systematized Nomenclature of Medicine Clinical Terms
Once the vocabularies are selected, the tool will produce text files in a specific format to be used as a custom dictionary.
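To make the idea of a vocabulary subset concrete, the sketch below filters concept rows by source vocabulary. It assumes the subset is written in UMLS Rich Release Format and reads rows in the pipe-delimited layout of the `MRCONSO.RRF` file; the source abbreviations in `WANTED_SABS` and the function name are assumptions, so check them against the release you have installed.

```python
# Column order of MRCONSO.RRF in the UMLS Rich Release Format.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Assumed source abbreviations (SAB) for the vocabularies listed above.
WANTED_SABS = {"RXNORM", "ICD10CM", "CPT", "SNOMEDCT_US"}


def filter_concepts(lines, wanted_sabs=WANTED_SABS, language="ENG"):
    """Yield (CUI, SAB, CODE, STR) for rows from the wanted vocabularies."""
    for line in lines:
        # RRF rows end with a trailing "|", so the final split field is empty;
        # zip() simply ignores it because it stops at the shorter sequence.
        fields = line.rstrip("\n").split("|")
        row = dict(zip(MRCONSO_COLUMNS, fields))
        if row.get("SAB") in wanted_sabs and row.get("LAT") == language:
            yield row["CUI"], row["SAB"], row["CODE"], row["STR"]
```

Because the function accepts any iterable of lines, it can be fed an open file handle, for example `with open("MRCONSO.RRF", encoding="utf-8") as fh: rows = list(filter_concepts(fh))`, without loading the whole file into memory first.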
To make these codes usable, cTAKES offers a tool named “Dictionary Creator GUI” that builds a dictionary database out of these files. In “Dictionary Creator GUI”, one specifies the root directory of the files, and the tool creates a custom dictionary tailored to the underlying cTAKES implementation. In essence, the tool writes a configuration XML file that describes the database and a SQL script that populates it.
Screencast for demonstration of the API
The screencast below demonstrates the API and its functionality: