
A New View of Health Information

By Robert B. Bruegel, PhD; David J. Rothwell, MD; and Richard Wheeler, MD

In a previous series of articles published in ADVANCE (February 1998, January 1999 and March 2000), we presented an overview of some of the major issues that any interim clinical terminology effort must address in order to be considered a serious contender for adoption. These issues include:

* breadth and depth of coverage of clinical terms and concepts;

* flexibility and adaptability of the terminology;

* capture of both ongoing information relevant to the process of care and "snapshot" information at the end stage of care;

* capture and description of a broad range of patient perception issues and concerns that are relevant to utilization, satisfaction and disease management;

* support of both structured data entry and textual data -- whether dictated or captured through voice recognition;

* explicit handling of patient and provider confidentiality;

* support for the rapid adoption of the Internet and Internet-based standards;

* ease (or difficulty) of integration into clinical information systems;

* ongoing cost for development, distribution and maintenance, since these costs will be reflected ultimately in the cost of clinical information systems; and

* flexibility and adaptability to support -- rather than restrict -- the growth and development of medical practice and medical knowledge.

A useful clinical terminology must also incorporate a data representation that not only enables the initial collection of patient data but also lets the computer sort and present the data in multiple ways, including specific summary sheets for patient use. No important data elements may be lost between the views. For example, a surgical view could include the date of the last operation, the wound condition and any evidence of recurrence.

An internal medicine view of this same patient, however, would involve the presentation of a different set of data, organized in a very different way.

In the earlier articles, we also noted the increasing potential of XML, the Extensible Markup Language, which seems well on its way to being adopted as a standard for e-commerce and other major Internet applications. (Note that HL7's adoption of XML as a messaging format in version 3.0 is particularly important in this regard.)

This article presents further evidence of adoption of XML-based technologies as we move toward a new view of the vital information captured throughout the health care continuum.


Our approach

The not-for-profit Health Language Center (HLC) has concentrated not simply on using XML, but on exploiting the convergence of two developments: the Structured Health Markup Language (SHML), a health-related XML, and the maturation and practical availability of the Medical Language Processor (MLP), a set of powerful natural language processing technologies based on the pioneering research and work of Naomi Sager, PhD, and her colleagues.

A key unresolved problem facing all efforts to gain rapid, systematic access to patient care data is that the vast majority of the "data of interest" exists in text form, usually as transcribed reports. The MLP, in concert with a properly structured XML approach, has the potential to resolve this problem -- preserving the freedom, flexibility and provider acceptance of free text while providing the discrete data currently available only from structured data entry.

Based on work to date, we believe that a combination of the MLP, the SHML, and a special XML-based browser developed by InContext Data Systems has considerable potential to address many of the key issues facing the development and use of practical clinical terminologies. The following subsections highlight the progress made with each of these approaches and tools.

Language processing

To understand the approach taken by the HLC, one must first appreciate the role that language, in all its complexity, plays in medical practice and health care.

In the approach taken by Dr. Sager and her colleagues, language is seen as the primary vehicle through which people communicate and record information. Natural Language Processing (NLP) uses computers to generate an accurate and meaningful representation of sentence content. Sentence content, with each of its elements properly identified, can be used to perform certain functions such as retrieving data, or in the case of medicine, prompting a caregiver about a particular clinical event. NLP, often referred to as computational linguistics, attempts to understand language in procedural terms so that computers have the ability to perform these functions. The MLP developed by Dr. Sager focuses on the language and terms used in the provision of medical care.

The goal for the MLP is to generate from each sentence of medical text a clinically meaningful representation of the sentence content, so that it is amenable to further computer processing. The MLP generates a "parse tree" similar to the sentence diagrams we all learned in grammar school.

Medical sentences are often complex, consisting of many information units within a single sentence. A simple example of this complexity is the sentence, "The patient has fever, cough and shortness of breath." The information units contained in this sentence are (1) the patient has fever; (2) the patient has cough; and (3) the patient has shortness of breath. Each information unit deserves its own position in a database, subject to further analysis and use.
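The decomposition described above can be illustrated with a minimal Python sketch. It is not the MLP: it handles only the simple pattern "SUBJECT has a, b and c", and the function name split_information_units is a hypothetical one chosen for this illustration.

```python
import re


def split_information_units(sentence: str) -> list[str]:
    """Expand 'SUBJECT has a, b and c' into one information unit per item.

    Hypothetical helper for illustration only; a real parser would use
    a grammar rather than a single hard-coded pattern.
    """
    text = sentence.strip().rstrip(".")
    subject, verb, object_list = text.partition(" has ")
    if not verb:
        # Pattern not recognized: keep the sentence as a single unit.
        return [text]
    # Split the object list on commas and on the conjunction "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", object_list)
    items = [part.strip() for part in parts if part.strip()]
    return [f"{subject} has {item}" for item in items]


units = split_information_units(
    "The patient has fever, cough and shortness of breath."
)
for unit in units:
    print(unit)
```

Each resulting string corresponds to one of the three information units listed above and could be stored as its own database row.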

In addition, sentences found in medical documents contain expressions related to belief, uncertainty, time, negation and fuzziness (e.g., the patient is young). Each of these expressions must be identified and placed into its correct information unit, obtained by parsing as shown above. Some examples of this process would be: (1) the patient suspected he had fever (uncertainty); (2) the patient denied cough (negation); and (3) the patient occasionally had shortness of breath (time). It is important to note that formal logic approaches (e.g., predicate logic, propositional logic) have difficulty processing such expressions or cannot process them at all.
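To make the idea of attaching such expressions to an information unit concrete, the following Python sketch labels each of the three example units with the kind of modifier it contains. The cue-word lists and the names MODIFIER_CUES, InformationUnit and annotate are assumptions made for this illustration; the MLP itself relies on grammatical analysis, not keyword lookup.

```python
import re
from dataclasses import dataclass, field

# Hypothetical cue words, for illustration only. A real system would
# identify these expressions from the parse, not from a word list.
MODIFIER_CUES = {
    "uncertainty": {"suspected", "wondered"},
    "negation": {"denied", "denies", "no"},
    "time": {"occasionally", "recently", "ago"},
}


@dataclass
class InformationUnit:
    text: str
    modifiers: list = field(default_factory=list)


def annotate(text: str) -> InformationUnit:
    """Attach the modifier kinds whose cue words appear in the unit."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    modifiers = [kind for kind, cues in MODIFIER_CUES.items() if words & cues]
    return InformationUnit(text, modifiers)


examples = [
    "the patient suspected he had fever",
    "the patient denied cough",
    "the patient occasionally had shortness of breath",
]
annotated = [annotate(text) for text in examples]
for unit in annotated:
    print(unit.modifiers, "<-", unit.text)
```

The point of the sketch is the data shape: the modifier stays with the unit it qualifies, so a later query for "fever" can distinguish a suspected fever from a confirmed one.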


Medical meaning

The linguistic analysis performed by the MLP, as previously illustrated, identifies and isolates the information units encountered in text. Once this is done, a second step is required, namely to represent the medical meaning of the information units. This is the role of the SHML, which is not a traditional medical terminology; rather, it is a carefully organized set of terms generally encountered in medical text -- a highly specialized, highly organized dictionary in which each term is tagged (marked up) with its linguistic and medical senses. Currently, more than 40 distinct SHML categories have been created, each a description of medical content in the computer-based patient record (CPR) and each with multiple subcategories. When fully elaborated, the SHML thus should provide a way to catalog all the medically relevant information in a clinical document.

The SHML includes traditional medical and non-medical (natural language) entities used to express signs, symptoms, vital signs, drugs, organisms, anatomic sites, body regions, chemicals, diagnoses and procedures -- each tagged as to its respective position in its relevant hierarchy or hierarchies.

Terms or phrases related to physiologic function, functional status, activities and patient social behavior are also included in the SHML. An illustration of how terms are characterized medically (i.e., tagged) is the term "cigarettes." The SHML tags this term as <patient-social-behavior tobacco> and also records the traditional placement of tobacco in a hierarchy of plants. Each term that references tobacco (e.g., cigar, pipe, smoking, etc.) is treated similarly. If the SHML encounters a negation term, it is placed within that information unit (e.g., "denies smoking"). In a similar fashion, the SHML tags terms related to patient preferences, patient education, patient understanding, compliance, response and risk factors.

The SHML tags are also organized according to other criteria designed to facilitate the medically relevant use and display of information. In the example presented, terms with the tag <patient-social-behavior tobacco> are tagged so that they can be placed in the social history/smoking section of a medical record summary, no matter where they might occur in the actual document.
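The tagging and routing just described can be sketched in a few lines of Python. Only the tag <patient-social-behavior tobacco>, the tobacco-related terms and the "social history/smoking" section come from the text above; the lexicon structure, the function names and the sample sentences are assumptions invented for this illustration and do not reflect the actual SHML dictionary.

```python
import re
from collections import defaultdict

# Tag taken from the article; everything else here is a hypothetical sketch.
TOBACCO = ("patient-social-behavior", "tobacco")
LEXICON = {term: TOBACCO for term in ("cigarettes", "cigar", "pipe", "smoking")}
SECTION_FOR_TAG = {TOBACCO: "social history/smoking"}


def tag_terms(sentence: str) -> list:
    """Return (term, tag) pairs for every lexicon term in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(word, LEXICON[word]) for word in words if word in LEXICON]


def route_to_sections(sentences: list) -> dict:
    """Group sentences under the summary section that their tags map to."""
    sections = defaultdict(list)
    for sentence in sentences:
        for _term, tag in tag_terms(sentence):
            section = SECTION_FOR_TAG[tag]
            if sentence not in sections[section]:
                sections[section].append(sentence)
    return dict(sections)


# Invented sample sentences, scattered as they might be in a document.
document = [
    "Family history is unremarkable.",
    "Patient denies smoking.",
    "He used to smoke a pipe.",
]
summary = route_to_sections(document)
print(summary)
```

Because routing depends on the tag and not on where the sentence appears, both tobacco statements land in the same summary section regardless of their position in the source document.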

SHML tags thus are a way of capturing and characterizing the medical content and meaning found in documents. Though not explored at this time, the medical content of images and graphs could be treated similarly.

In summary, the MLP identifies and isolates linguistically valid information units from text. When SHML tags are applied to these same information units, they are transformed into "health information units" (HIUs). These HIUs provide the meaningful representation of sentence content encountered in medical documents. Their content is retrievable through the SHML tags themselves as well as from the values (terms / phrases) that are tagged.

With a functional combination of the MLP and the SHML, providers will be able to dictate patient care records using their own language -- without being restricted to pull-down menus. The resulting text -- whether dictated and transcribed or converted directly from speech/voice recognition to text -- is then "tagged" to a set of SHML categories that are internally consistent and highly structured. As the MLP identifies each meaningful phrase, medical concept and its modifiers in the text, SHML captures the medical sense(s) of each parse and places the sense(s) into an XML structured database -- while retaining the context of the original text.
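As a rough illustration of what an XML-structured record that retains the original text might look like, the sketch below builds one health information unit with Python's standard xml.etree.ElementTree module. The element and attribute names (hiu, value, source, negated) are hypothetical; the article does not publish the SHML schema, so this should not be read as its actual format.

```python
import xml.etree.ElementTree as ET


def build_hiu(category, subcategory, value, source_sentence, negated=False):
    """Build one health information unit as an XML element.

    Element and attribute names are assumptions for this sketch,
    not the real SHML schema.
    """
    hiu = ET.Element("hiu", {"category": category, "subcategory": subcategory})
    if negated:
        hiu.set("negated", "true")
    # The tagged term or phrase itself.
    ET.SubElement(hiu, "value").text = value
    # The sentence the unit came from, kept so the context is not lost.
    ET.SubElement(hiu, "source").text = source_sentence
    return hiu


hiu = build_hiu(
    "patient-social-behavior",
    "tobacco",
    "smoking",
    "Patient denies smoking.",
    negated=True,
)
xml_text = ET.tostring(hiu, encoding="unicode")
print(xml_text)
```

Storing the source sentence alongside the tagged value is what allows a reader to move from a discrete data element back to the provider's own words.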


Using SHML

SHML can be used to provide highly distributed, multiple views of patient data. As work on the SHML progressed, it became clear that SHML-tagged data could be used for powerful data analysis; however, in order for the SHML to be truly useful, we also needed a way of easily obtaining multiple views of the tagged text. Ideally, this would be an Internet application enabling users to select content from tagged documents and present it for viewing in a number of alternative ways (depending on user, questions of interest, etc.). Another important criterion: The approach would enable users to move immediately from a specific data element on a specific patient or set of patients to the sentence in the text document which contained that information.

Working with a special browser developed by InContext Data Systems, we have been able to demonstrate the feasibility of this approach.

The process works as follows:

1) MLP processing / tagging to SHML. A series of patient documents is first processed using the MLP, and the clinical facts in the documents are tagged using the SHML tag set as HIUs.

2) Access via InContext browser. The SHML-tagged documents are then accessed using the browser developed by InContext.

3) Use of templates to organize / view information. The InContext browser utilizes a series of templates that organize the selection and presentation of HIU information from the tagged documents. These templates are highly flexible and enable different users to view the information relevant to their own interests and requirements. For example, consider the following sentence from an actual patient document -- "She had a tendon injury when living in England 25 years ago and wondered if it has recurred, but she knows of no trauma recently to the ankle at all."

This sentence is first rephrased by the MLP -- "She had a tendon injury when living in England 25 years ago and (She) wondered if it (injury) has recurred, but she knows of no trauma recently to the ankle at all."

In this example, keep in mind that "she" is understood by parallel construction with "and." In addition, "it" (pronoun) refers to "injury" (antecedent).

This sentence is further processed using the MLP. The HIUs are then tagged using the SHML. This structured data can then be displayed in the browser according to views organized for particular relevance to different health care providers.
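The template step can be sketched as follows. This is a simplified Python illustration and not the InContext browser: the category names, the hand-written HIUs and the two templates are all assumptions made for the example. Each view selects a different subset of units from the same sentence, and every row keeps a link back to that sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HIU:
    category: str        # hypothetical category name, not an actual SHML tag
    value: str
    modifier: str
    sentence_index: int  # link back to the source sentence


SENTENCES = [
    "She had a tendon injury when living in England 25 years ago and "
    "wondered if it has recurred, but she knows of no trauma recently "
    "to the ankle at all."
]

# Hand-tagged for illustration; in the process described above these
# would be produced by MLP parsing followed by SHML tagging.
HIUS = [
    HIU("past-history", "tendon injury", "25 years ago", 0),
    HIU("patient-concern", "tendon injury recurred", "uncertain", 0),
    HIU("injury", "trauma to the ankle", "negated, recent", 0),
]

# Each template names the categories a given kind of user wants to see.
TEMPLATES = {
    "orthopedic": ("past-history", "injury"),
    "patient-perception": ("patient-concern",),
}


def render_view(template_name: str) -> list:
    """Return (value, modifier, source sentence) rows for one template."""
    wanted = TEMPLATES[template_name]
    return [
        (hiu.value, hiu.modifier, SENTENCES[hiu.sentence_index])
        for hiu in HIUS
        if hiu.category in wanted
    ]


for name in TEMPLATES:
    print(name, [row[:2] for row in render_view(name)])
```

Nothing is discarded between views: the templates only filter and arrange the same underlying units, and the sentence index lets a user jump from any displayed element to the text that contained it.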


From prototype to production

Taken together, these factors suggest that parsing and tagging dictated text reports with the MLP and SHML, and then making them available for review through an Internet-enabled browser, is both practical and feasible. Such an approach also provides demonstrated clinical utility.

As one would expect, a number of major questions remain to be answered, including the steps required to move this approach from its current status as an alpha prototype to an integrated, scalable, production solution. This effort is currently underway with a number of system designers and vendors evaluating the potential of the system. We will describe the progress of this effort, particularly the lessons learned, in a subsequent article.

Dr. Bruegel, co-founder of Clinical Reference Systems, is president of the Health Language Center (HLC). He can be reached at (303) 499-1685.

Dr. Rothwell, a former co-editor of SNOMED, is chairman of the HLC. He is the chief developer of the SHML.

Dr. Wheeler, a former medical director of HealthMatics, is a principal of InContext Data Systems, Inc.

Gaining Physician Acceptance of the CPR

Capturing the patient-physician clinical encounter electronically has long been -- and continues to be -- the "Holy Grail" of medical informatics. An electronic patient file of key medical terms is critical if accurate retrospective analysis of the care delivered and real-time clinical decision support at the point of care are to become realities.

For the past several decades, the academic and private sectors, using mostly structured data input approaches, have focused their efforts on the computer-based patient record (CPR) as the vehicle to electronically capture the clinical encounter. High costs coupled with poor usability and the lack of an accepted medical vocabulary have created significant barriers to physician adoption of the CPR. Without physician adoption, capturing the clinical encounter electronically is doomed. Physicians are loath to spend extra time, effort and money on technology that cannot produce an immediate return on their investment.

While early in the process of product development, the authors of the accompanying article clearly demonstrate that they understand the barriers to "capturing the physician's desktop." Addressing the need for technology that physicians will use and a rational, comprehensive medical vocabulary are important starting points. Physicians are still more comfortable dictating the results of the clinical encounter than pointing and clicking. They would rather be able to describe in their own terms what the patient said, than be constrained by the typical list of terms presented by most programs. Efforts to circumvent the tedious requirements of structured data entry through transcription have repeatedly underscored the adage that "there is nothing free about free text." Medical transcription, in its current form, cannot generate the "electronic patient file" required for data analysis and point-of-care, real-time clinical decision support -- the two key advantages of the CPR.

Bruegel, Rothwell and Wheeler describe their efforts, in conjunction with a leading natural language expert, Naomi Sager, PhD, to generate clinically meaningful representations from complex medical transcriptions. Their approach, tailoring natural language processing for the medical domain, may accelerate the integration and adoption of information technology into medical practice. The ability to capture clinical information and turn it into "Health Information Units" represents a giant step toward more accurately measuring health care processes and outcomes, and providing real-time clinical decision support to health care providers.

Concurrently, we're seeing continued progress by leading speech recognition vendors. The potential of combining speech recognition and natural language parsing represents an opportunity to overcome some major CPR barriers. All of this points to a significant breakthrough in the CPR story.


-- David A. Trace, MD, Trace Consulting; and William F. Andrew, PE, Andrew & Associates.

You can contact Dr. Trace at (847) 295-6947 or dtrace@6947@aol.com.

You can contact Mr. Andrew at (863) 299-4767 or wfandrew@email.msn.com.

Additional Perspectives

The ADVANCE editorial staff asked representatives from two companies, Synthesys Technologies and A-Life Medical, to comment on the concepts discussed by the authors of this article. The companies' comments are presented below.


The article points out that systems must account for the identification of proper referents. Synthesys Technologies has developed a pronominal reference method for distinguishing between diseases, symptoms, events and statements about medications for a patient and someone other than the patient. This important process must consistently identify such patient references across sentence boundaries and throughout a document.

Synthesys has successfully adopted a tagged transcription-based CPR and has a number of production sites across the country. The largest implementation has over 2,000 users and over 10 million (and growing) documents in the data repository. Documents are tagged at a highly granular level, and a search language supports linguistic queries that retrieve relevant information. Like the SHML, this method allows concepts and groups of terms to be tagged, but it also gives users control over the desired levels of recall and precision. Thus, querying does not rely on any specific quality or quantity of linguistic tagging.

The Synthesys approach to marking text and linguistic analysis also allows for the remediation of natural language uncertainties caused by the inconsistent use of terms in medicine. A combination of clinical and linguistic expertise enables us to correctly identify instances matching the desired search criteria as well as instances that may be ambiguous.


-- Anne-Marie Currie, director of linguistics and clinical data research

Synthesys Technologies, Inc.


Systems that can read and understand natural language well enough to abstract out important facts hold the promise of bridging the gap between the practitioner and a CPR.

The Medical Language Processor (MLP) is a milestone in the development of natural language technology. The information units extracted by the system are the key medical facts of a document. Structuring the output in an XML format allows these facts to be displayed in a browser as well as potentially exported to a back-end application.

To realize the benefit of natural language technology, applications must be built -- or current ones modified -- to take advantage of this new information. These applications include clinical decision support, data mining and medical coding. However, for these products to be successful in the real world, product engineers will need to feed back requirements to the natural language processing systems concerning the type of facts recognized and the attributes associated with each fact. For instance, the severity of an injury or illness may need to be inferred from the patient's general physical profile or from the types of exam and tests that were performed; multiple statements about the same condition may need to be resolved into a single fact; the specific form of treatment may need to be combined with the particular medical problem to infer the precise procedure.

These are all examples of additional inference on top of fact extraction. NLP technology will be mature enough to be integrated into products when the technology can be adapted to meet the specific input requirements of the application.


-- Mark Morsch, director of natural language technology, and

Dan Heinze, chief technology officer

A-Life Medical, Inc.