The Medical Language Processing System

The MLP System
is a process involving three successive stages, as shown in Figure 1. This arrangement allows users to alter the specifications of their "user-oriented" language as their needs change. The system may be adapted for use with an English grammar of very different external appearance by changing the input to the first stage of the process.


Figure 1
The Three Stages of the System

Stage 1
parses the grammar of the grammar of English (the restriction language syntax, or RLS), which is a set of BNF statements describing the syntax of the five components of the English grammar:
  1. The context-free component
  2. The lists
  3. The restrictions
  4. The dictionary canonical forms
  5. The dictionary
Stage 2
parses the grammar and dictionary of English, as interpreted by the grammar of the grammar of English compiled in Stage 1, and generates an object grammar and an object dictionary. The input grammar and dictionary sources consist of:
  1. The BNF declaration, switched on by a statement *BNF: a context-free component describing the grammar of English
  2. The attributes, lists, global functions, and types, switched on by a statement *LISTS
  3. The routines and restrictions, switched on by a statement *RESTR. There are three types of restrictions, distinguished by the first letter of their names: 'D' (disqualifying a BNF generation), 'W' (wellformedness), and 'T' (transformation).
  4. The dictionary canonical forms, switched on by a statement *WDCAN
  5. The dictionary, switched on by a statement *WD
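
The way the section switches partition a source can be pictured with a short sketch. It assumes that each directive stands alone on its own line and that every following line belongs to that section until the next directive; the actual MLP source layout may differ. The names `split_sections` and `restriction_kind`, and the sample statements in the usage note, are hypothetical.

```python
# Directives named in the text. The layout assumed here (one directive per
# line, section body on the lines that follow) is an illustration only.
DIRECTIVES = ("*BNF", "*LISTS", "*RESTR", "*WDCAN", "*WD")


def split_sections(source: str) -> dict[str, list[str]]:
    """Group source lines by the directive that switched them on."""
    sections: dict[str, list[str]] = {name: [] for name in DIRECTIVES}
    current = None
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line in DIRECTIVES:
            current = line                # a new section is switched on
        elif current is not None:
            sections[current].append(line)
    return sections


def restriction_kind(name: str) -> str:
    """Classify a restriction by the first letter of its name (D, W, or T)."""
    kinds = {"D": "disqualify", "W": "wellformedness", "T": "transformation"}
    return kinds.get(name[:1].upper(), "unknown")
```

For example, `split_sections("*BNF\nA ::= B\n*WD\nfever\n")` would place `A ::= B` under `*BNF` and `fever` under `*WD`, leaving the other sections empty.
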
Stage 3
reads in a standardized source text document, performs tokenization and dictionary lookup using the compiled dictionary and lists, and passes the results to the parser, which uses the compiled grammar to map each source sentence into one or more grammatical parse trees.
The Compiler
The Compiler is a particular use of the basic MLP parser. In the first two stages, it is used as a syntax-directed compiler which translates the grammar of the grammar of English (i.e. the grammar of the user-oriented language) or the grammar of English from its input text form to list structure. Hence, the routines invoked by the parser in stage I and stage II programs are code generators, which construct the requisite list structure during the top-down analysis (Figure 2).

Figure 2
Organization of Stages I and II

In the output of stages I and II, the source text and generated list structure are combined in a single file called an object grammar or an object dictionary (named by analogy with the object program produced by a compiler). Once an object grammar (file name with the extension obg) or object dictionary (file name with the extension wdo) has been initially created, the user may specify modifications to it on a statement-by-statement basis. The system will compile the new statements and will insert, delete, or replace the source text and corresponding list structure in parallel.

In Stages I and II, a compiler (directive *COMPILE()) is used to create object grammars or dictionaries. One can also use an updating system (directive *MODIFY()), which was included in the compiler.
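
One way to picture the parallel update is a table keyed by statement, in which each entry holds the source text together with the structure compiled from it. The sketch below is a minimal model under that assumption. The class `ObjectGrammar`, its method names, and the stand-in `compile_statement` are hypothetical; they do not reflect the actual obg or wdo file layout or the behavior of *COMPILE() and *MODIFY().

```python
from dataclasses import dataclass, field


def compile_statement(text: str) -> list[str]:
    """Stand-in for the real code generators: the 'list structure' is
    simply the list of blank-delimited symbols in the statement."""
    return text.split()


@dataclass
class ObjectGrammar:
    # statement name -> (source text, compiled list structure)
    entries: dict[str, tuple[str, list[str]]] = field(default_factory=dict)

    def insert(self, name: str, text: str) -> None:
        if name in self.entries:
            raise KeyError(f"statement {name!r} already exists")
        self.entries[name] = (text, compile_statement(text))

    def replace(self, name: str, text: str) -> None:
        if name not in self.entries:
            raise KeyError(f"statement {name!r} not found")
        # Source and structure are rewritten together, so they cannot drift.
        self.entries[name] = (text, compile_statement(text))

    def delete(self, name: str) -> None:
        del self.entries[name]   # removes source and structure in one step
```

Storing both halves in one entry is the design point: no operation can change the source text without also changing the structure derived from it.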

The function of the various stages, and format of the source input to these stages, will now be described.

The Parser
As indicated in Figures 2 and 3, the parser has "hooks" on it to permit various routines to be invoked during the top-down parse.

The core of the program is a very simple top-down parser for context-free grammars, which generates multiple parses of ambiguous sentences sequentially using a back-up mechanism. This parser, together with a table-driven lexical processor and a directive processor which invokes all the other system components, is present in the program for each of the three stages of the system.
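
The idea of producing the parses of an ambiguous sentence one after another can be sketched with a small backtracking recognizer. This is an illustration only: Python generators stand in for the back-up mechanism, and the toy grammar, the function names, and the sample sentence are all hypothetical rather than taken from the MLP grammar.

```python
# Hypothetical toy grammar: each nonterminal maps to a list of options, and
# each option is a sequence of symbols. Symbols absent from the table are
# treated as terminals (literal words).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["N", "PP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "N": [["patient"], ["fever"], ["rash"]],
    "V": [["has"]],
    "P": [["with"]],
}


def parse(symbol, words, pos):
    """Yield (tree, next_position) for every way symbol matches at pos."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for option in GRAMMAR[symbol]:                 # try options in order
        for children, end in parse_sequence(option, words, pos):
            yield (symbol, children), end


def parse_sequence(symbols, words, pos):
    """Yield (children, next_position) for a sequence of symbols."""
    if not symbols:
        yield [], pos
        return
    for first, mid in parse(symbols[0], words, pos):
        # Resuming this loop after a later failure is the "back-up" step.
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [first] + rest, end


def all_parses(sentence):
    """Yield, one at a time, each tree that covers the whole sentence."""
    words = sentence.split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            yield tree
```

With this toy grammar, `list(all_parses("patient has fever with rash"))` contains two trees: one attaches the prepositional phrase to the noun "fever", the other attaches it to the verb phrase. The sketch would loop forever on a left-recursive grammar, so the toy grammar avoids left recursion.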

Figure 3
Organization of Stage III

For stage III, the generators are replaced by a restriction interpreter (Figure 3). The grammar of English consists of a context-free component plus a set of restrictions, each of which is associated with one or more productions in the context-free component. These restrictions state conditions which the parse tree must meet if the analysis is to be accepted. Each time a node is added to the parse tree, and each time a level in the tree is completed, the parser invokes the restriction interpreter to execute those restrictions appearing in the corresponding production; the restriction interpreter returns a success or failure indication to the parser. If the restriction has succeeded, the parser continues normally (i.e., as if there had been no restriction). If the restriction has failed, the parser must either try an alternate option, or if all options in a production have been exhausted, dismantle part of the parse tree.
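
The interaction between the parser and the restriction interpreter when a level is completed can be sketched as follows. The node representation, the restriction table, the production name ASSERTION, and the agreement restriction `w_agree` are hypothetical examples chosen for illustration; only the control flow (try the options in order, accept on success, report exhaustion on failure) follows the description above.

```python
from typing import Callable, Optional

Tree = tuple                      # (production name, list of children)
Restriction = Callable[[Tree], bool]


def w_agree(node: Tree) -> bool:
    """Hypothetical wellformedness restriction: the first two children
    (subject and verb) must carry the same 'number' attribute."""
    (_, subj_attrs), (_, verb_attrs) = node[1][:2]
    return subj_attrs["number"] == verb_attrs["number"]


# Restrictions associated with productions (hypothetical table).
RESTRICTIONS: dict[str, list[Restriction]] = {"ASSERTION": [w_agree]}


def run_restrictions(node: Tree) -> bool:
    """Restriction interpreter: success only if every restriction holds."""
    return all(r(node) for r in RESTRICTIONS.get(node[0], []))


def complete_level(production: str, options: list[list]) -> Optional[Tree]:
    """Try each option in order and return the first accepted node.
    None means all options are exhausted, so the caller must dismantle
    part of the parse tree."""
    for children in options:
        node = (production, children)
        if run_restrictions(node):
            return node           # continue as if there were no restriction
    return None
```

A production without an entry in the table is accepted unchanged, which mirrors the statement that a successful restriction leaves the parser running as if no restriction existed.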

The parser is written entirely in C++. It consists of approximately 22,000 source lines in about 100 subroutines, only some of which are included in the program for any one stage of the system.

The Dictionary Look-Up
The dictionary lookup (dlookup) was formerly part of the parser. It is now separated and forms an independent function, with three main jobs:
  1. A tokenizer

    dlookup breaks the input sentence along the blank delimiter into a series of 2**n-1 possible sentences. The tokens of these sentences are matched against lexical entries to find the best matches. The candidates are ranked by the largest number of matched lexical entries and the fewest tokens, and one of the best matches is chosen as the sentence to be parsed.

  2. A generator of lexical entries

    dlookup automatically generates appropriate lexical categories and classes for:

    • standard numbers, times and dates
    • medical terms, according to a medical list
    • dose strings, according to a dose pattern list
    • organism terms, according to an organism list
    • geographic nouns, according to a geographic list
    • patient nouns, according to a patient list
    • institution/ward/service nouns, according to an institution list
    • physician/staff nouns, according to a staff list

  3. A reader of lexical entries from the main dictionary
The dictionary lookup creates a list of lexical entries, with all categories and attributes, for the input sentence; the parser then picks up this list for parsing.
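
The tokenizer step can be illustrated with a small sketch that enumerates the ways of grouping adjacent blank-delimited words into tokens and then ranks the candidates. It assumes the two criteria are applied in the order given (most matched lexical entries first, fewest tokens as the tie-break); the actual evaluation in dlookup may differ. The function names and the sample lexicon are hypothetical. With this enumeration, a sentence of n words yields 2**(n-1) candidates.

```python
from itertools import product


def segmentations(sentence: str):
    """Yield every grouping of adjacent blank-delimited words into tokens."""
    words = sentence.split()
    if not words:
        return
    # Each of the n-1 gaps between words is either a token boundary (True)
    # or is absorbed into a multi-word token (False).
    for cuts in product((True, False), repeat=len(words) - 1):
        tokens, current = [], words[0]
        for word, cut in zip(words[1:], cuts):
            if cut:
                tokens.append(current)
                current = word
            else:
                current += " " + word
        tokens.append(current)
        yield tokens


def best_segmentation(sentence: str, lexicon: set[str]) -> list[str]:
    """Pick the candidate with the most matched tokens, then the fewest
    tokens (assumed ranking order)."""
    def score(tokens):
        matched = sum(1 for t in tokens if t in lexicon)
        return (-matched, len(tokens))
    return min(segmentations(sentence), key=score, default=[])
```

For instance, with the hypothetical lexicon `{"no", "pain", "chest pain"}`, `best_segmentation("no chest pain", ...)` prefers the two-token reading "no" + "chest pain" over the three-token reading, because both match two entries and the former has fewer tokens.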