An XML Medical Knowledge Lexicon

David J. Rothwell, MD, Richard Wheeler, MD, and Ngô Thanh Nhàn, Ph.D.


Many colleagues have contributed to the development of the XML Medical Knowledge Lexicon and the Delphi Knowledge dBMS, Delphi and PHP Viewers that use it. Naomi Sager has been central in coordinating this work with the English Medical Language Lexicon for Natural Language Processing. We wish to thank Ronald Tarrant, Nancy Wheeler and Jorge Roccatagliata for their significant contributions. We also wish to thank Anita Parmalee, Clara Hager, and Richard Wheeler's family: Michele, Nathan, Ben and Shauna for their support and generosity.

Full Lexicon in Acrobat PDF format
Volume 1: Introduction; Symbols & A - C; D - I
Volume 2: J - Q; R - Z


David J. Rothwell
July 15, 2005

The Structured Health Markup Language (SHML) consists of a set of tags and accompanying lexicon, constructed within the eXtensible Markup Language (XML) formalism, designed to capture the medical, administrative and psychosocial elements of a patient encounter. A markup language consists primarily of a set of labels (tags) developed within XML rules.

SHML consists of a collection of tags that capture and describe the entire content of a medical document across each medical domain. Tags are applied to each term found in a document. SHML tags describe both the traditional biological elements of a medical encounter as well as the psychosocial aspects, conformant with the biopsychosocial model of care. In addition, tags for many of the administrative elements of care are also included.

SHML Tag Type
  • Anatomic structure
  • Body region
  • Sign-symptom
  • Diagnosis
  • Dx-process
  • Dx group by system
  • Procedures
  • Organism
  • Allergies
  • Pt. social behavior
  • Health status (adl...)
  • Activities (sports,...)
  • Medications: (Multum), med-class
  • Chemicals
  • Time: freq, repetition, exact, begin, end
  • Links
  • Modifiers: modal, negation, changes, amount, desc, s-q
  • Person: kin, civil
  • Demographic
  • Socio-cultural
  • Patient direction
  • Patient preference
  • Patient understanding
  • Relationships
  • Beliefs
  • Values
  • Living situation

The lexicon of SHML-tagged terms, the XML Medical Knowledge Lexicon, works in conjunction with the English Medical Language Lexicon for Natural Language Processing, the lexicon developed for natural language processing of clinical documents and used by the Medical Language Processor (MLP). The two lexicons are in concordance with one another and comprise, in effect, a single combined lexicon. This effort is a maturation and marriage of two developments, both aimed at improving access to relevant patient data found in text. See Figure 1.

Figure 1: English medical lexicon in the medical language processor

MLP, in conjunction with SHML tagging, transforms the content of clinical documents into individual clinical facts, which are referred to as Health Information Units (HIU's). SHML assists in identifying and structuring the medically relevant content of documents. MLP provides linguistic and broad medical characterization of each term, while SHML tagging provides more precise medical characterization of these terms.

A challenge and primary task in developing an Electronic Medical Record (EMR) is to provide both immediate and long-range access to information in the clinical record. Since significant parts of current medical records consist of transcribed or written notes, access to this information demands Natural Language Processing (NLP) techniques (for our purposes, MLP) to isolate and retrieve (i.e. unravel) the informational units from that text. In short, the aim is to transform all captured information, whether dictated or written, into retrievable clinical facts.

The goals of this MLP/SHML effort are to retrieve clinical information from all previous encounters for display in a succinct, user-defined manner; to make that information available for subsequent data analysis; and to support clinical prompts-and-alerts software. With the use of Viewer software, displays of the HIU's can be integrated with, and made conformant to, HL7's Clinical Document Architecture (CDA) format.

Mission of the SHML
  • Define a granular representation of terms and phrases that within a given language (domain) unambiguously define clinical concepts
  • Provide for an adequate representation of these terms and concepts in a simple and easily understood architecture
  • Provide for discrete mapping to any other "nomenclature" and/or "code set"
  • Utilize easily available, inexpensive and widely supported tools for authoring, maintenance and use
  • Provide this as a non-proprietary standard under the auspices of a private not-for-profit entity

SHML tags are specifically designed to characterize and label medical knowledge. Tags act as an initial sort and retrieval device for the EMR; they get at the core of what was stated. They provide uniformity to data elements. SHML tagging, when used to its full potential, defines what is worth accessing, viewing, and counting in a medical document. Tags are views of data, not attributes or properties of the data. They are intended to be inclusive of all parts of the record. Tagging provides access to raw (original) data and is not interpretive: if BUN is elevated, for example, tagging captures only this data element, not the possibility of "renal failure". SHML tagging does not require clinical language to be forced into the categories of a predetermined data model, such as Structured Data Entry menus. Tag classes and brief examples are shown below:

SHML Tag System
Body region                      <b-r>
Diagnostic process               <dx-prcss>
Infectious diagnostic process    <dx-prcss_infect>
Immunologic diagnostic process   <dx-prcss_imm>
Neoplastic diagnostic process    <dx-prcss_neopl>
Diagnostic group                 <dx-kind>
Neurologic disease               <dx-kind_neuro>
Reactive Airway Disease          <dx-kind_d-k-resp_r-a-d>
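
The tag names listed above appear to encode a hierarchy: underscores separate levels, and hyphens join abbreviations within a level (for example, <dx-prcss_infect> sits under <dx-prcss>). Assuming that convention holds throughout the tag set, the following Python sketch splits a tag into its levels and tests whether one tag falls under another. The function names are illustrative and are not part of SHML or of the viewers described here.

```python
def tag_levels(tag: str) -> list:
    """Split an SHML tag such as '<dx-prcss_infect>' into its levels.

    Assumption: '_' separates hierarchy levels; '-' stays inside a level.
    """
    name = tag.strip().strip("<>").strip()
    return name.split("_")


def is_under(tag: str, ancestor: str) -> bool:
    """Return True if `tag` equals `ancestor` or lies beneath it."""
    ancestor_levels = tag_levels(ancestor)
    return tag_levels(tag)[:len(ancestor_levels)] == ancestor_levels


if __name__ == "__main__":
    print(tag_levels("<dx-kind_d-k-resp_r-a-d>"))
    print(is_under("<dx-prcss_infect>", "<dx-prcss>"))
    print(is_under("<dx-kind_neuro>", "<dx-prcss>"))
```

A prefix test of this kind is one way a viewer could retrieve every term in a class (say, all diagnostic processes) without listing each subclass explicitly.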

Congestive cardiomyopathy
      <dx-kind_cardiov_cardmy>
      <a-s_cv_hrt_myc>
      Congestive cardiomyopathy
      </a-s_cv_hrt_myc>
      </dx-kind_cardiov_cardmy>
      </dx-kind_d-k-resp_r-a-d >
Pneumonia, right lower lobe
      Pneumonia ,
            right lower lobe
Pneumonia, right lower lobe, superior, due to Klebsiella.
      Pneumonia ,
      right lower lobe
      due to
Diagnosis: Pneumonia
Location: RLL, superior
Organism: Klebsiella

The combined lexicon that underlies and drives the MLP/SHML system is sorted by class, and terms are labeled for their linguistic and clinical properties. Classes are shown in the accompanying tables. The lexicon includes all terms encountered in a document: nouns, verbs, and all classes of modifiers. Traditional terminologies are known to contain only a small percentage of the terms encountered in medical text; their focus has been on nouns. The combined lexicon classes and tags each term for its linguistic and medical content.

Of great importance are the terms expressing uncertainty, negation and time. Each of these classes consists of several hundred terms. When present, they modify the data element to which they refer and are included in the appropriate HIU.

Ambiguity is a particularly difficult issue in medicine. Ambiguous language can be of several types. The first is intentional ambiguity, in which the writer is deliberately uncertain. The second is unintentional ambiguity, which arises when language is used imprecisely. The third lies within the language itself, e.g. homonyms (foot, depression). It has been estimated that up to 60% of statements found in medical records include terms expressing ambiguity. All must be recognized and accounted for to achieve an accurate rendition of a record. MLP/SHML does this.

Terms expressing time
Term          MLP Class   Part of Speech   SHML Tag
antecede      H-TMLOC     TV               <tm_tm-loc>
on admission  H-TMLOC     D                <tm_tm-loc>
initially     H-TMBEG     D                <tm_beg>
emergent      H-TMBEG     ADJ              <tm_beg>
discontinue   H-TMEND     TV               <tm_end>
end-stage     H-TMEND     N                <tm_end>
unrelenting   H-TMDUR     ADJ              <tm_dur>
yearly        H-TMREP     D                <tm_rep>
after         H-TMPREP    P                <tm_tm-prp>
will          FUT *       W                <tm_tense>
09/30/2005                DT *             <tm_tm-exact>

* FUT and DT are provided by the medical language processor.

Terms expressing negation
Term          MLP Class   Part of Speech   SHML Tag
deny          H-NEG       TV               <md_ng>
excepting     H-NEG       P                <md_ng>
exclude       H-NEG       V                <md_ng>
never         H-NEG       D                <md_ng>
not able      H-NEG       ADJ              <md_ng>
nothing       H-NEG       PRO              <md_ng>
rejected      H-NEG       VEN              <md_ng>
without       H-NEG       P                <md_ng>

Terms expressing uncertainty
Term          MLP Class   Part of Speech   SHML Tag
allegedly     H-MODAL     D                <md_modal>
assume        H-MODAL     TV               <md_modal>
assumption    H-MODAL     N                <md_modal>
conceivably   H-MODAL     D                <md_modal>
doubtful      H-MODAL     ADJ              <md_modal>
hypothesis    H-MODAL     N                <md_modal>
hypothesize   H-MODAL     TV               <md_modal>
hypothetical  H-MODAL     ADJ              <md_modal>
The combined lexicon is derived from terms found in actual records and from publicly available sources. The lexicon could in principle be extended to include all terms found in the UMLS, available from the NLM (National Library of Medicine). Classing and tagging such a source would be a formidable but worthwhile task. SHML tagging of this or other terminology sources would, in effect, make them more operational.

The enumeration of the current version of SHML tags is shown in the accompanying table (see Medical Tag Hierarchy). Each tag class is defined and brief examples shown. Each term is given a primary class tag and when deemed useful additional support tags are added (See Notes on Use of this Lexicon below).

Documents processed by MLP/SHML technology are both human- and machine-readable. The tools needed to adopt MLP/SHML are few, readily available and inexpensive. MLP/SHML markup provides a digital representation of a medical document. The software outlined above can store, display, process, transmit, search, and print each identified informational element.


The XML Medical Knowledge Lexicon is a list of 67,642 English lexical entries compiled in conjunction with the English Medical Lexicon for Natural Language Processing. Each entry contains:

  • a syntactic category (see Major lexical categories) with
    — a syntactic medical class (see Syntactic classes)
    — a syntactic number (singular or plural)
  • a primary SHML tag (one of 64 in tag hierarchy) with
    — an anatomic structure tag
    — a body region tag
    — a series of support tags (separated by commas)
    — an SHML lexicon entry tag id.

For example, the term congestive heart failure is listed in the Lexicon as

N:(H-DIAG, SINGULAR), dx:(a-s_cv_hrt, b-r_tk_thx_int-thor_mediast, dx-kind_cardiov_fail, _2674).
SINGULAR                      syntactic number, singular
a-s_cv_hrt                    anatomic system: cardiovascular system: heart
b-r_tk_thx_int-thor_mediast   body region: trunk: thorax: intrathoracic: mediastinum
dx-kind_cardiov_fail          diagnostic group: cardiovascular group: heart failure
_2674                         lexical entry id of congestive heart failure
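
The entry format shown above can be read mechanically. The following Python sketch parses one entry line into its parts, on the assumption that every entry follows the pattern of the congestive heart failure example: a syntactic category with a parenthesized list, then a primary tag with a parenthesized list in which anatomic structure tags begin with a-s_, body region tags begin with b-r_, and the entry id begins with an underscore. The class and function names are hypothetical, and entries with a different layout would need a different pattern.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Pattern inferred from the single example entry; an assumption, not a spec.
ENTRY_RE = re.compile(r"^\s*(\w+):\(([^)]*)\),\s*([\w-]+):\(([^)]*)\)\.?\s*$")


@dataclass
class LexiconEntry:
    category: str                     # e.g. N
    medical_class: str                # e.g. H-DIAG
    number: Optional[str]             # e.g. SINGULAR
    primary_tag: str                  # e.g. dx
    anatomic_structure: Optional[str] = None
    body_region: Optional[str] = None
    support_tags: List[str] = field(default_factory=list)
    entry_id: Optional[str] = None


def parse_entry(line: str) -> LexiconEntry:
    """Parse one lexicon entry line into a LexiconEntry."""
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError("unrecognized entry format: " + line)
    category, syntax, primary, tags = match.groups()
    syntax_parts = [s.strip() for s in syntax.split(",")]
    entry = LexiconEntry(
        category=category,
        medical_class=syntax_parts[0],
        number=syntax_parts[1] if len(syntax_parts) > 1 else None,
        primary_tag=primary,
    )
    for tag in (t.strip() for t in tags.split(",")):
        if tag.startswith("a-s_"):
            entry.anatomic_structure = tag
        elif tag.startswith("b-r_"):
            entry.body_region = tag
        elif tag.startswith("_"):
            entry.entry_id = tag
        elif tag:
            entry.support_tags.append(tag)
    return entry


if __name__ == "__main__":
    print(parse_entry(
        "N:(H-DIAG, SINGULAR), dx:(a-s_cv_hrt, "
        "b-r_tk_thx_int-thor_mediast, dx-kind_cardiov_fail, _2674)."
    ))
```

Sorting the tags by prefix keeps the parser independent of the order in which they appear within the parentheses.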

The term congestive heart failure can be found in a patient document sentence such as

HISTORY OF PRESENT ILLNESS: This is a 70 year old female with a history of congestive heart failure with an ejection fraction of 40% who was in her usual state of good health until she became short of breath on exertion on the day prior to admission but with no associated chest pain and no palpitations.