An XML Medical Knowledge Lexicon
David J. Rothwell, MD, Richard Wheeler, MD, and Ngô Thanh Nhàn, Ph.D.
The Structured Health Markup Language (SHML) consists of a set of tags and accompanying lexicon, constructed within the eXtensible Markup Language (XML) formalism, designed to capture the medical, administrative and psychosocial elements of a patient encounter. A markup language consists primarily of a set of labels (tags) developed within XML rules.
SHML consists of a collection of tags that capture and describe the entire content of a medical document across each medical domain. Tags are applied to each term found in a document. SHML tags describe both the traditional biological elements of a medical encounter and its psychosocial aspects, in conformance with the biopsychosocial model of care. Tags for many of the administrative elements of care are also included.
SHML Tag Type
The lexicon of SHML-tagged terms, the XML Medical Knowledge Lexicon, works in conjunction with the English Medical Language Lexicon for Natural Language Processing, the lexicon used by the Medical Language Processor (MLP) for natural language processing of clinical documents. The two lexicons are in concordance with one another and comprise, in effect, a single combined lexicon. This effort is the maturation and marriage of two developments, both aimed at improving access to relevant patient data found in text. See Figure 1.
Figure 1: English medical lexicon in the medical language processor
MLP, in conjunction with SHML tagging, transforms the content of clinical documents into individual clinical facts, referred to as Health Information Units (HIUs). SHML assists in identifying and structuring the medically relevant content of documents. MLP provides a linguistic and broad medical characterization of each term, while SHML tagging provides a more precise medical characterization of these terms.
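As a rough illustration of how one clinical fact might be held in software, the sketch below models an HIU as a small Python record that carries both characterizations of a term. The class name HealthInformationUnit, its field names, and the values "noun phrase" and "DIAGNOSIS" are assumptions made for this example; they are not taken from the SHML tag set or from MLP.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HealthInformationUnit:
    """Hypothetical container for one clinical fact (HIU)."""
    term: str              # text as it appeared in the document
    linguistic_class: str  # broad characterization, as MLP might supply it
    shml_tag: str          # more precise medical label, as SHML might supply it
    modifiers: List[str] = field(default_factory=list)  # e.g. negation, time


def to_hiu(term: str, linguistic_class: str, shml_tag: str) -> HealthInformationUnit:
    """Combine the two characterizations of a term into a single record."""
    return HealthInformationUnit(
        term=term,
        linguistic_class=linguistic_class,
        shml_tag=shml_tag,
    )


# Illustrative values only; the tag name is invented for this sketch.
hiu = to_hiu("congestive heart failure", "noun phrase", "DIAGNOSIS")
print(hiu)
```

Keeping the linguistic class and the medical tag as separate fields mirrors the division of labor described above, in which the two lexicons contribute different information about the same term.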
A primary challenge in developing an Electronic Medical Record (EMR) is to provide both immediate and long-range access to information in the clinical record. Since significant parts of current medical records consist of transcribed or written notes, access to this information demands Natural Language Processing (NLP) techniques (for our purposes, MLP) to isolate and retrieve (i.e., unravel) the informational units from that text. In short, the task is to transform all information captured, whether dictated or written, into retrievable clinical facts.
The goals of this MLP/SHML effort are to retrieve clinical information from all previous encounters for display in a succinct, user-defined manner; to make that information available for subsequent data analysis; and to support clinical prompts-and-alerts software. With the use of Viewer software, displays of the HIUs can be integrated with, and made conformant to, HL7's Clinical Document Architecture (CDA) format.
SHML tags are specifically designed to characterize and label medical knowledge. Tags act as an initial sort and retrieval device for the EMR; they get at the core of what was stated. They provide uniformity to data elements. SHML tagging, when used to its full potential, defines what is worth accessing, viewing, and counting in a medical document. Tags are views of data, not attributes or properties of the data. They are intended to be inclusive of all parts of the record. Tagging provides access to raw (original) data; it is not interpretive. For example, if BUN is elevated, tagging captures only this data element, not the possibility of "renal failure". SHML tagging does not require clinical language to be forced into the categories of a predetermined data model, such as Structured Data Entry menus. Tag classes are illustrated and brief examples are shown below:
The combined lexicon that underlies and drives the MLP/SHML system is sorted by class, and terms are labeled for their linguistic and clinical properties. Classes are shown in the accompanying tables. The lexicon includes all terms encountered in a document: nouns, verbs, and all classes of modifiers. Traditional terminologies are known to contain only a small percentage of the terms encountered in medical text; their focus has been on nouns. The combined lexicon classes and tags each term for its linguistic and medical content.
Of great importance are the terms expressing uncertainty, negation and time. Each of these classes consists of several hundred terms. When present, they modify the data element to which they refer and are included in the appropriate HIU.
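The way modifier terms travel with the data element they qualify can be sketched in a few lines of Python. The miniature MODIFIER_LEXICON and the function build_hiu below are hypothetical; the real classes hold several hundred terms each. The sketch also assumes, for simplicity, that every modifier in a sentence applies to the one data element supplied, whereas a real system would have to determine scope.

```python
from typing import Dict, List

# Hypothetical miniature modifier lexicon, invented for this sketch.
MODIFIER_LEXICON: Dict[str, str] = {
    "no": "negation",
    "denies": "negation",
    "possible": "uncertainty",
    "probable": "uncertainty",
    "yesterday": "time",
    "today": "time",
}


def build_hiu(sentence: str, data_element: str) -> Dict[str, object]:
    """Return the data element together with any modifiers found in the sentence.

    Simplification: every modifier in the sentence is attached to the
    given data element; no scope analysis is attempted.
    """
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    modifiers: Dict[str, List[str]] = {}
    for word in words:
        modifier_class = MODIFIER_LEXICON.get(word)
        if modifier_class is not None:
            modifiers.setdefault(modifier_class, []).append(word)
    return {"data_element": data_element, "modifiers": modifiers}


hiu = build_hiu("Patient denies chest pain today.", "chest pain")
print(hiu)
```

Here the negation term "denies" and the time term "today" are both recorded alongside "chest pain", so a later query for chest pain would not mistake the statement for a positive finding.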
Ambiguity is a particularly difficult issue in medicine. Ambiguous language can be of several types. The first is intentional ambiguity, in which the author is deliberately uncertain. The second is unintentional ambiguity, which arises from improper use of the language. The third lies within the language itself, e.g. homonyms (foot, depression). It has been estimated that up to 60% of statements found in medical records include terms expressing ambiguity. All must be recognized and accounted for to achieve an accurate rendition of a record. MLP/SHML does this.
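The third type, ambiguity within the language itself, can be illustrated with a lookup that returns more than one candidate class for a term. The table CANDIDATE_TAGS and every tag name in it are invented for this sketch and are not drawn from the SHML tag hierarchy; the point is only that a homonym surfaces as a term with several candidates that must be resolved from context.

```python
from typing import Dict, List

# Hypothetical candidate classes per term; tag names are illustrative only.
CANDIDATE_TAGS: Dict[str, List[str]] = {
    "foot": ["BodyPart", "UnitOfLength"],
    "depression": ["MentalState", "PhysicalFinding"],
    "femur": ["BodyPart"],
}


def candidate_tags(term: str) -> List[str]:
    """Return a copy of the candidate classes listed for a term (empty if unknown)."""
    return list(CANDIDATE_TAGS.get(term.lower(), []))


def is_ambiguous(term: str) -> bool:
    """A term is treated as ambiguous when it has more than one candidate class."""
    return len(candidate_tags(term)) > 1


for term in ("foot", "depression", "femur"):
    print(term, candidate_tags(term), is_ambiguous(term))
```

Flagging such terms explicitly, rather than silently picking the first class, leaves the choice to whatever contextual analysis follows.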
The combined lexicon is derived from terms found in actual records and from publicly available sources. The lexicon could be extended to include all terms found in the UMLS (Unified Medical Language System), available from the NLM (National Library of Medicine). Classing and tagging such a source would be a formidable but worthwhile task. SHML tagging of this or other terminology sources would, in effect, make them more operational.
The enumeration of the current version of SHML tags is shown in the accompanying table (see Medical Tag Hierarchy). Each tag class is defined and brief examples are shown. Each term is given a primary class tag and, when deemed useful, additional support tags are added (see Notes on Use of this Lexicon below).
Documents processed by MLP/SHML technology are both human- and machine-readable. The tools needed to adopt MLP/SHML are few, readily available and inexpensive. MLP/SHML markup provides a digital representation of a medical document. The software outlined above can store, display, process, transmit, search, and print each identified informational element.
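A minimal sketch of a term carrying a primary class tag plus optional support tags, and of how such a term might be wrapped in XML, is given here using Python's standard xml.etree.ElementTree module. The class LexiconEntry, the tag names LabResult and Quantity, and the choice to record support tags in a "support" attribute are all assumptions for illustration; the actual SHML tag names and serialization are defined by the tag hierarchy, not by this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexiconEntry:
    """Hypothetical lexicon entry: one term, one primary tag, optional support tags."""
    term: str
    primary_tag: str
    support_tags: List[str] = field(default_factory=list)


def mark_up(entry: LexiconEntry) -> str:
    """Wrap the term in an element named after its primary tag.

    Support tags, if any, are stored in a space-separated 'support'
    attribute; this layout is an assumption of the sketch.
    """
    element = ET.Element(entry.primary_tag)
    if entry.support_tags:
        element.set("support", " ".join(entry.support_tags))
    element.text = entry.term
    return ET.tostring(element, encoding="unicode")


entry = LexiconEntry("elevated BUN", "LabResult", ["Quantity"])
print(mark_up(entry))
# prints: <LabResult support="Quantity">elevated BUN</LabResult>
```

The markup wraps the original wording unchanged, consistent with the idea that tagging exposes the raw data element rather than an interpretation of it.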
NOTES ON USE OF THE XML MEDICAL KNOWLEDGE LEXICON
The XML Medical Knowledge Lexicon is a list of 67,642 English lexical entries compiled in conjunction with the English Medical Lexicon for Natural Language Processing. Each entry contains:
For example, the term congestive heart failure is listed in the Lexicon as
The term congestive heart failure can be found in a patient document sentence such as