This chapter defines a base tag set for encoding human-oriented
monolingual and polyglot dictionaries (as opposed to computational
lexica, which are intended for use by language-processing software).
Dictionaries are most familiar in their printed form; however,
increasing numbers of dictionaries also exist in electronic forms which
are independent of any particular printed form, but from which various
displays can be produced (CD-ROM dictionaries, for example).
Both typographically and structurally, dictionaries are extremely
complex. In addition, dictionaries interest many communities with
different and sometimes conflicting goals. As a result, many general
problems of text encoding are particularly pronounced here, and more
compromises and alternatives within the encoding scheme may be required.
For previous and current discussions of a common format for encoding
dictionaries, we refer the reader to the following, among others:
Robert A. Amsler and Frank W. Tompa, An SGML-Based Standard for English
Monolingual Dictionaries, in Information in Text: Fourth Annual
Conference of the U[niversity of] W[aterloo] Centre for the New Oxford
English Dictionary, October 26-28, 1988, Waterloo, Canada, pp. 61-79;
Nicoletta Calzolari et al., Computational Model of the Dictionary
Entry: Preliminary Report, Acquilex: Esprit Basic Research Action
No. 3030, Six-Month Deliverable, Pisa, April 1990;
John Fought and Carol Van Ess-Dykema, Toward an SGML Document Type
Definition for Bilingual Dictionaries, TEI working paper TEI AIW20
(available from the TEI);
Nancy Ide and Jean Veronis, Encoding Print Dictionaries, Computers and
the Humanities (special TEI issue, to appear);
Nancy Ide, Jacques Le Maitre, and Jean Veronis, Outline of a Model for
Lexical Databases, Information Processing and Management, 29, 2
(1993), 159-186;
Nancy Ide, Jean Veronis, Susan Warwick-Armstrong, and Nicoletta
Calzolari, Principles for Encoding Machine Readable Dictionaries,
Proceedings of the Fifth EURALEX International Congress, EURALEX'92
(to appear), University of Tampere, Finland;
and
The DANLEX Group, Descriptive tools for electronic processing of
dictionary data, in Lexicographica, Series Maior (Tübingen: Niemeyer,
1987).
Two problems are particularly prominent.
First, because the structure of dictionary entries varies widely both
among and within dictionaries, the simplest way for an encoding scheme
to accommodate the entire range of structures actually encountered is to
allow virtually any element to appear virtually anywhere in a dictionary
entry. It is clear, however, that strong and consistent structural
principles do govern the vast majority of conventional
dictionaries, as well as many or most entries even in more
exotic dictionaries; ideally, a set of encoding
guidelines should capture these structural principles. We therefore
define two distinct elements for dictionary entries, one
(entry) which captures the regularities of most conventional
dictionary entries, and a second (entryFree) which uses the
same elements, but allows them to combine much more freely. It is
recommended that entry be used in preference to
entryFree wherever the structure of the entry allows it.
These elements and their contents are described in later sections of
this chapter.
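The contrast between the two entry elements can be illustrated with a minimal sketch. The dictionary content below is invented for illustration, and the child elements used (form, orth, gramGrp, pos, sense, def) are assumed to be those defined later in this chapter; the fragment is not drawn from any particular dictionary.

```xml
<!-- A conventional entry: information is grouped into form,
     grammatical, and sense blocks in a regular hierarchy -->
<entry>
  <form>
    <orth>demigod</orth>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <sense n="1">
    <def>a being who is part mortal, part god</def>
  </sense>
</entry>

<!-- The same information in a free entry: the same elements are
     used, but may appear in any order, without grouping, and
     mixed with untagged text -->
<entryFree>
  <orth>demigod</orth>, <pos>n</pos>:
  <def>a being who is part mortal, part god</def>
</entryFree>
```

The first form is preferable wherever the source permits it, since the grouping elements make the structure of the entry explicit for later processing.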
Second, since so much of the information in printed dictionaries is
implicit or highly compressed, their encoding requires clear thought
about whether it is to capture the precise typographic form of
the source text or the underlying structure of the information it
presents. Since both of these views of the dictionary may be of
interest, it proves necessary to develop methods of recording both, and
of recording the interrelationship between them as well. Users
interested mainly in the printed format of the dictionary will require
an encoding to be faithful to an original printed version. However,
other users will be interested primarily in capturing the lexical
information in a dictionary in a form suitable for further processing,
which may demand the expansion or rearrangement of the information
contained in the printed form. Further, some users wish to encode
both of these views of the data, and retain the links
between related elements of the two encodings. Problems of recording
these two different views of dictionary data are discussed later in
this chapter, together with mechanisms for retaining both views
when this is desired.
Whichever view is adopted, a parameter entity
TEI.dictionaries must be declared within the
document type subset of any document using this base tag set. This
should have the value INCLUDE, as
further described elsewhere in these Guidelines. A document using this
base tag set and no other additional tag sets will thus begin as
follows:
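    <!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [
      <!ENTITY % TEI.dictionaries 'INCLUDE'>
    ]>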
Dictionary Body and Overall Structure
Overall, dictionaries have the same structure of front matter, body,
and back matter familiar from other texts. The base tag set for
dictionaries uses the same front-matter and back-matter elements as
other TEI base tag sets; these are documented elsewhere in these
Guidelines. In addition, dictionaries define the elements
entry, entryFree, and superEntry as
component-level elements which can occur directly within a text division
or the text body.
The following tags should be used to mark the gross structure of a
printed dictionary; the dictionary-specific tags are discussed further
in the following section.
text: contains a single text of any kind, whether unitary or
    composite, for example a poem or drama, a collection of essays,
    a novel, a dictionary, or a corpus sample.
front: contains any prefatory matter (headers, title page, prefaces,
    dedications, etc.) found before the start of a text proper.
body: contains the whole body of a single unitary text, excluding any
    front or back matter.
back: contains any appendixes, etc. following the main part of a text.
div: contains a subdivision of the front, body, or back of a text.
div0: contains the largest possible subdivision of the body of a text.
div1: contains a first-level subdivision of the front, body, or back
    of a text (the largest, if div0 is not used, the second largest
    if it is).
entry: contains a reasonably well-structured dictionary entry.
entryFree: contains a dictionary entry which does not necessarily
    conform to the constraints imposed by the entry element.
superEntry: groups successive entries for a set of homographs.
The text-division elements div2 through div7 may also
be used, as described elsewhere in these Guidelines.
As members of the class entries,
entry and entryFree share the following attributes:
type: indicates type of entry, in dictionaries with multiple types.
    Suggested values include:
    main: a main entry (default).
    hom: a homograph with a separate entry.
    xref: a reduced entry whose only function is to point to another
        main entry (e.g. for forms of an irregular verb or for
        variant spellings: was pointing to be, or esthete to
        aesthete).
    affix: an entry for a prefix, infix, or suffix.
    abbr: an entry for an abbreviation.
    supplemental: a supplemental entry (for use in dictionaries which
        issue supplements to their main work in which they include
        updated information about entries).
    foreign: an entry for a foreign word in a monolingual dictionary.
key: contains a (sortable) character sequence reflecting the
    entry's alphabetical position in the printed dictionary.
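As a hedged illustration of the type attribute, the following sketch encodes the variant-spelling case mentioned above as a reduced entry. The value xref and the cross-reference elements xr and ref are assumptions about how such a pointer would be tagged; they are not specified in this section.

```xml
<!-- A reduced entry whose only function is to point to the
     main entry for "aesthete" -->
<entry type="xref">
  <form>
    <orth>esthete</orth>
  </form>
  <xr>see <ref>aesthete</ref></xr>
</entry>
```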
The front and back matter of a dictionary may well contain
specialized material like lists of common and proper nouns, grammatical
tables, gazetteers, a guide to the use of the
dictionary, etc. These may be tagged using elements defined in
the core tag set or using specialized dictionary
elements as defined in this chapter.
The body element consists of a set of entries,
optionally grouped into one or several div, div0, or
div1 elements. These text divisions might correspond, for
example, to sections for different languages in a bilingual
dictionary, sections for different letters of the alphabet, etc.
(It is unlikely that many conventional dictionaries will
require smaller divisions, but all the usual division elements
div2 through div7 may be used.) In print
dictionaries, entries are typically typographically distinct entities,
each headed by some morphological form of the lexical item described
(the headword), and sorted in alphabetical order or
(for non-alphabetic scripts) in some other conventional sequence.
Dictionary entries should be encoded as distinct successive items, each
marked as an entry element. The type attribute may
be used to distinguish different types of entries, for example main
entries, related entries, run-on entries, or entries for
cross-references.
Some dictionaries provide distinct entries for homographs, on the
basis of etymology, part-of-speech, or both, and typically provide a
numeric superscript on the headword identifying the homograph number.
In these cases each homograph should be encoded as a separate entry; the
superEntry element may optionally be used to group such
successive homograph entries. In addition to a series of entry
elements, the superEntry may contain a preliminary form
group when information about
hyphenation, pronunciation, etc., is given only once for two or more
homograph entries. If the homograph number is to be recorded, the
global attribute n should be used. In
some dictionaries, homographs are treated in distinct parts of the same
entry; in these cases, they may be separated by use of the hom
element, described later in this chapter.
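The two treatments of homographs can be sketched as follows. The headword and definitions are invented for illustration, and the child elements (form, orth, hyph, gramGrp, pos, sense, def) are assumed to behave as defined later in this chapter.

```xml
<!-- Treatment 1: each homograph is a separate entry; the
     superEntry carries the form information given only once -->
<superEntry>
  <form>
    <orth>lighter</orth>
    <hyph>light|er</hyph>
  </form>
  <entry n="1">
    <gramGrp><pos>n</pos></gramGrp>
    <sense><def>a device for lighting cigarettes</def></sense>
  </entry>
  <entry n="2">
    <gramGrp><pos>n</pos></gramGrp>
    <sense><def>a flat-bottomed barge</def></sense>
  </entry>
</superEntry>

<!-- Treatment 2: the homographs are distinct parts of a
     single entry, separated with hom -->
<entry>
  <form>
    <orth>lighter</orth>
    <hyph>light|er</hyph>
  </form>
  <hom n="1">
    <gramGrp><pos>n</pos></gramGrp>
    <sense><def>a device for lighting cigarettes</def></sense>
  </hom>
  <hom n="2">
    <gramGrp><pos>n</pos></gramGrp>
    <sense><def>a flat-bottomed barge</def></sense>
  </hom>
</entry>
```

In both cases the homograph number is carried by the global n attribute; which treatment to use depends on how the source dictionary itself presents the homographs.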
A sort key, given in the key attribute, is often required
for superentries and entries, especially in cases where the order of
entries does not follow the local character-set collating sequence (as,
for example, when an entry for 3D appears at the place where
three-D would appear).
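A minimal sketch of the case just described might look as follows. The exact form of the key value (here simply the spelled-out headword) is an assumption; an encoder may prefer a normalized form such as lower case without hyphens, provided it is applied consistently.

```xml
<!-- The entry is headed "3D" in print but is filed
     where "three-D" would appear -->
<entry key="three-D">
  <form>
    <orth>3D</orth>
  </form>
  <gramGrp><pos>adj</pos></gramGrp>
  <sense><def>three-dimensional</def></sense>
</entry>
```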
The body of a bilingual dictionary with two parts will thus have an
overall structure resembling the following:
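    <body>
      <div0>
        <entry> ... </entry>
        <entry> ... </entry>
        ...
      </div0>
      <div0>
        <entry> ... </entry>
        <entry> ... </entry>
        ...
      </div0>
    </body>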