TEI P2, chapter 34: Base Tag Set for Transcriptions of Spoken Texts (document TEI AI2 W1, and much more besides)

Revision history:
18 Apr 92 : LB : final pre-publication check
16 Apr 92 : MSM : substantial revision of parts
2 Apr 92 : WP, MSM : rename p234driv, delete invalid /div1 tag
30 Mar 92 : LB : minor typos and other corrections for P2 release
18 Mar 92 : LB : substantial reordering and revision following JE/SJ
13-15 Jan 92 : LB : revised for new ODD format
26-31 Dec 91 : LB : recast into ODD form and revised
5 Dec 91 : SJ : original draft
Base Tag Set for Transcriptions of Spoken Texts

There is no such thing as a simple conversion of speech to a transcription. All transcription schemes make a selection of the features to be encoded, taking into account the intended uses of the material. The goal of an electronic representation is to provide a text which can be manipulated by computer to study the particular features which the researcher wants to focus on. At the same time, the text must reflect the original as accurately as possible. We can sum this up by saying that an electronic representation must strike a balance between two partially conflicting requirements: authenticity and computational tractability.

A workable transcription scheme must also be convenient to use (write and read) and reasonably easy to learn. One advantage of using the scheme proposed here is that it can be used by any TEI-aware software, and can thus benefit from work done in developing such software by researchers in other fields. The scheme is both systematic and flexible, serving equally well as a working encoding or storage scheme and as an intermediate language for the translation of other systems. Because it makes no assumptions about the way transcriptions will be displayed, texts using it may be converted to whatever visual display format is desired by the user. And finally the scheme uses many of the same basic mechanisms as the rest of the TEI encoding scheme; this should make it easier to reuse software and also facilitate the comparative analysis of written and spoken forms of language.

The system defined here is based on a survey of about twenty widely used and documented systems for the transcription of spoken materials taken from such diverse fields as journalism, large reference corpora, language-acquisition studies, and dialect studies. The tags provided here mark all the features most commonly encoded in these systems. The framework developed here for timing, annotation, and synchronization of events has been carefully designed to be general enough for extension to most areas of research involving transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction. We believe the framework provided here to be general enough to support whatever types of extension individual researchers may deem necessary, and to suffice as it stands for most requirements of those working with a wide range of transcribed spoken material.

This chapter begins with a brief overview. This is followed by a detailed section in which each of the basic structural elements of a spoken text is introduced and formally defined. The third section discusses alternative ways of segmenting and aligning spoken texts, in particular for the representation of synchrony or overlap between sections of a transcript. The fourth section considers a number of problems specific to the representation of spoken language and makes specific recommendations.

General Considerations and Overview

There is great variation in the ways different researchers have chosen to represent speech using the written medium. For a discussion of several of these see J. A. Edwards and M. D. Lampert, eds., Talking Language: Transcription and Coding of Spoken Discourse (Hillsdale, N.J.: Lawrence Erlbaum Associates, forthcoming); Stig Johansson, Encoding a Corpus in Machine-Readable Form, in Computational Approaches to the Lexicon: An Overview, ed. B. T. S. Atkins et al. (Oxford: Oxford University Press, forthcoming); and Stig Johansson et al. Working Paper on Spoken Texts, document TEI AI2 W1, 1991. This reflects the special difficulties which apply to the encoding or transcription of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.

Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or turns of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes.

Spoken texts transcribed according to the guidelines presented here are organized as follows. Speech is regarded as being composed of arbitrary high-level units called texts. This name has been chosen to match that used in other TEI tag sets. Where both spoken and written texts are to be treated as forming part of the same SGML document, the mechanisms described in section for combining different bases should be used. A spoken text might typically be a conversation between a small number of people, a lecture, a broadcast TV item, and so forth. Each such unit has associated with it a TEI.header which provides detailed contextual information, such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, and so forth. Full details of this header are not given here, but at the appropriate place in section .

Within a text it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. A neutral div element is provided for this purpose. It may also be found useful for representing subdivisions relating to discourse structure, speech act theory, transactional analysis, etc., provided that these divisions are hierarchically well-behaved. Where they are not, the mechanisms discussed in section should be used instead. The div is also the unit within which other components of the transcript are aligned with respect to time.

A spoken text may contain any of the following components:
- utterances
- pauses
- vocalized but non-lexical phenomena, such as coughs
- kinesic (non-verbal, non-lexical) phenomena, such as gestures
- entirely non-linguistic events occurring during and possibly influencing the course of speech
- writing, regarded as a special class of event in that it can be transcribed, for example captions or overheads displayed during a lecture

An utterance may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic events may occur and written materials may be presented. The u element can thus contain any of the other elements listed, interspersed with a transcription of the lexical items of the utterance; the other elements may all appear between utterances or next to each other, but except for writing they do not contain any other elements, nor any data. More precise alignment of individual components and of subdivisions or segments of individual utterances is also possible by using the pointer mechanism discussed in more detail below. A similar mechanism allows for the multiple segmentation of a spoken text into units which are not well behaved with respect to the basic structural hierarchy described here, such as macrosyntagms (for which see section ).
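For example, a minimal sketch of such a structure (the speaker identifiers and descriptions shown are illustrative, not taken from the original text):

<u who=A>so we <pause> saw you <vocal desc='cough'> yesterday</u>
<event desc='door slams'>
<u who=B>yes we were <vocal desc='laughter'> in town</u>

Here a pause and a vocal occur within the first utterance, while an event occurs between the two utterances.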

The overall structure of a TEI-conformant spoken text is thus as follows:

<TEI.2>
<TEI.header> ... </TEI.header>
<text>
<div>
<u> ... </u>
<u> ... </u>
</div>
</text>
</TEI.2>

Overall Structure of Spoken Texts

A spoken text transcribed according to these Guidelines has two major components: a header in which all the contextual information relating to the spoken text is recorded, and the text itself. The latter may have no substructure, or it may be divided into logical units known here as divisions. If components of the text (e.g. utterances) are to be aligned, for example to mark their synchronization, one or more alignment maps may appear. A single such map may be defined for the whole text, or for one or more divisions of it.

The overall structure of a TEI spoken text is identical to that of any other TEI text: the TEI.2 element for a spoken text contains a TEI.header and a text, in that order. Differences of structure begin to appear within the text itself; the text element and some other elements are defined differently for written and spoken materials. The proper declarations must be made available by defining the base tag set of the TEI document as the tag set for spoken texts rather than that for written texts. The document type declaration at the beginning of the document will thus look something like the example below. This substitutes the base tag set in the file teispok2.dtd for that in the default base file. For further discussion, see section .
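A plausible form of such a declaration (the driver file name tei2.dtd and the parameter entity name TEI.base are illustrative assumptions):

<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.base SYSTEM "teispok2.dtd">
]>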

The Header

Like any other TEI-conformant text, a spoken text transcription must begin with a TEI header, as described in section . In addition to the usual bibliographic information about the electronic text itself, the header of a spoken text will normally include information about the recording or transcription from which a text derives, the setting and participants involved, etc. It is strongly recommended that the header for a spoken text include information about at least the following items:
- title and responsibility statements (section )
- recording details (section )
- editorial principles (section )
- setting and participants (section )

The Text

Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a lecture, a broadcast item, a meeting, etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:
- it is internally cohesive
- it is describable by a single header
- it represents a single stretch of time with no significant discontinuities
Deviation from these requirements may be specified (for example, the org attribute on the text element may take the value compos to specify that the components of the text are discrete) but is not recommended.

A spoken text itself may have no substructure; that is, it may consist simply of units such as utterances, pauses, etc., not grouped together in any way, or it may be subdivided into one or more divisions as described in the following section. The spoken text element text is at the top of the hierarchy within which all the elements described in this chapter fall; its formal definition is given in the base tag set for spoken texts (file teispok2.dtd).

Divisions and Their Components

If the notion of what constitutes a text in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a text is even more debatable. Nevertheless, such divisions may be useful for such types of discourse as debates, broadcasts, etc., where structural subdivisions can easily be identified, or more generally wherever it is desired to aggregate utterances or other parts of a transcript into units smaller than a complete text. Examples might include conversations or discourse fragments, or more narrowly, that part of the conversation where topic x was discussed, provided only that the union of all such divisions is coextensive with the text.

Each such division of a spoken text should be represented by a div element, with the following description and attributes: any arbitrary subdivision of a spoken text, comprising one or more utterances etc. which are to be treated as a unit for analytic purposes. Attributes include:
- type: categorizes the division in some respect.
- dur: specifies the total duration of the division in time.
- org: specifies how the content of the division is organized. The value compos indicates composite content: no claim is made about the sequence in which elements below this one are to be processed, or their inter-relationships. The default is sequential content: elements below this are regarded as forming a logical unit, to be processed in the sequence given.
- decls: specifies one or more declarations in the TEI.header element associated with this text which are understood to apply to this element occurrence only.

The type attribute may be used to characterize divisions in any way that is convenient; no specific recommendations are made in these Guidelines. The dur attribute may specify temporal duration in any convenient units; for further discussion of time alignment see section . The org attribute should be used only exceptionally, to specify whether or not the contents of the division have been artificially combined together and may have no internal cohesion. The decls attribute should likewise be used only exceptionally, where the contents of the division do not all share the same set of contextual declarations specified in the TEI header. For a general discussion of the elements contained by divisions, see the next section.
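For example, a division might be sketched as follows (the type value and duration shown are illustrative assumptions):

<div type=conversation dur='15 mins'>
<u who=A> ... </u>
<pause>
<u who=B> ... </u>
</div>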

For some detailed kinds of analysis it may be found useful to divide the divisions of a text into further structural subdivisions: this is catered for by allowing nested div elements within each div. The formal definition of the div element is given in teispok2.dtd.

Basic Structural Elements

Spoken texts, or their divisions if they are subdivided, are composed of the following elements. Together, these form the basic structural elements of spoken language, as represented in these Guidelines.
- u: a stretch of speech usually preceded and followed by silence or by a change of speaker.
- pause: a pause either between or within utterances.
- vocal: any vocalized but non-lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
- kinesic: any non-vocalized but communicative phenomenon, for example a gesture, frown, etc.
- event: any non-vocalized non-communicative phenomenon, for example incidental noises or other events affecting communication.
- writing: a passage of written text revealed to participants in the course of a spoken text.

Each of these is further discussed and specified below in sections to . We can show the relationship between these constituents of speech using the features eventive, communicative, anthropophonic (for sounds produced by the human vocal apparatus), and lexical, as sketched below. The differences are not always clear-cut. Among events we might include actions like slamming the door, which can certainly be communicative. Vocals include coughing and sneezing, which are usually involuntary noises. Equally, the distinction between utterances and vocals is not always clear (as implied by our use of the term semi-lexical), although for many analytic purposes it will be convenient to regard them as distinct. Individual scholars may differ in the way borderlines are drawn and should declare their definitions in the editorial.decl element of the header (see ).
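One possible feature matrix (a reconstruction: the assignment of values is inferred from the element descriptions above; writing, as noted earlier, is treated as a special, transcribable class of event):

            eventive  communicative  anthropophonic  lexical
u              +           +              +             +
vocal          +           +              +             -
kinesic        +           +              -             -
event          +           -              -             -
pause          -           -              -             -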

In addition to these constituents, spoken texts require the encoding of contextual and temporal information, that is, information concerning the origin or context of the speech, and information concerning its performance in time, respectively.

Contextual Information

Contextual information, as already noted, is generally provided by the header to a text. All contextual information about the script, the recording, and the transcription applicable to a given spoken text is provided there and, by default, is understood to apply to the whole of the associated text. The decls attribute is provided to enable this association to be over-ridden for any div, u, or writing element within it. Its value is a string consisting of the ID values of the script, recording, or transcription declarations to be applied.

Defaults are inherited in the following manner:
- if there is only a single declaration in the header, it applies to the whole text
- if there is more than one declaration in the header, then the one specified as the default applies to the whole text
- if any element contained by the text specifies a declaration, then that declaration applies to that element and to all elements contained by it

For example:

<TEI.header>
 ...
 <script.decl id=SD1> [information about script SD1, the default] </script.decl>
 <script.decl id=SD2> [information about script SD2] </script.decl>
 ...
 <editorial.decl id=ED1> [information about editorial practice ED1, the default] </editorial.decl>
 <editorial.decl id=ED2> [information about editorial practice ED2] </editorial.decl>
 ...
</TEI.header>
<text>
<u>This utterance is associated with script SD1 and editorial practice ED1</u>
<u decls=ED2>this one with SD1 and ED2</u>
<u decls=SD2>This one is associated with SD2 and ED1</u>
<u>This one is associated with SD1 and ED1</u>
...
<div decls=ED2>
<u>this utterance is associated with script SD1 and editorial practice ED2</u>
<u decls=SD2>this one with SD2 and ED2</u>
<u decls='SD2 ED1'>this utterance is associated with editorial practice ED1 and script SD2</u>
<u>but this utterance is associated with editorial practice ED2 and script SD1</u>
</div>
</text>

The decls attribute may be used in this way with the elements text, div, u, and writing only. It should be used only to over-ride declarations in the encoding.decl, script.decl, and recording.decl components of the TEI header. In general, TEI recommended practice is to avoid heterogeneity in the contents of divisions: for this reason the decls attribute should only exceptionally be needed.

Temporal Information

In addition to the global attributes n, id, and lang, the u, vocal, pause, kinesic, event, and writing elements may all take a common set of attributes providing information about their position in time. For this reason, these elements are regarded as forming a class, referred to here as timed. The following attributes are common to all elements in this class:
- start: indicates the location within a temporal alignment at which this element begins.
- end: indicates the location within a temporal alignment at which this element ends.
- dur: indicates the length of this element in time, using either specific units or the units specified on the associated temporal alignment.
Note that if start and end point to loc elements whose temporal distance from each other is specified in the alignment map, then dur is ignored.
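For instance, a minimal sketch (P2 and P3 here are identifiers of loc elements in an alignment map, as discussed in the section on pointers and alignment below; the duration value is illustrative):

<u who=A start=P2 end=P3>this is my turn</u>
<pause dur='2 secs'>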

The ptr tag (see ) may be used as an alternative means of aligning the start and end of timed elements, and is required where the temporal alignment is with points within an element.

The parameter entity corresponding to this class, which declares the start, end, and dur attributes described above, is given in the base tag set for spoken texts (teispok2.dtd).

Utterances

Each distinct utterance in a spoken text is represented by a u element, described as follows: a stretch of speech usually preceded and followed by silence or by a change of speaker. Attributes include:
- who: supplies an identifier for the speaker or group of speakers. Its value is the identifier of a participant or participant.grp element in the TEI header.
- trans: indicates the nature of the transition between this utterance and the previous one. Values: smooth (no noticeable pause or overlap), latching (overlapping or latching transition), and pause (noticeable pause).
Use of the who attribute to associate the utterance with a particular speaker is recommended, but not required. Its use implies as a further requirement that all speakers be identified by a participant or participant.grp element in the TEI header (see section ). The trans attribute is provided as a means of characterizing the transition from one utterance to the next at a simpler level of detail than that provided by the general alignment mechanism discussed in section below. For example:

<u who=A>Have you heard the the election results?</u>
<u trans=latching who=B>yes</u>
<u trans=pause who=A>it's a disaster</u>

An utterance may contain only running text, or text within which other basic structural elements are nested. Where such nesting occurs, the who attribute is considered to be inherited for the elements pause, vocal, shift and kinesic; that is, a pause (etc.) within an utterance is regarded as being a pause by that speaker only, while a pause between utterances applies to all speakers.
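To illustrate this inheritance (the speaker identifiers and timings are illustrative):

<u who=A>I was just <pause> thinking</u>
<pause dur='3 secs'>
<u who=B>about what?</u>

The first pause belongs to speaker A alone; the second, standing between utterances, applies to both speakers.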

Occasionally, an utterance may contain other utterances, for example where there is a change in the script associated with it. This may occur when a speaker changes script in mid-utterance. For example:

<u who=A>Listen to this <u decls=S1>The government is confident, he said, that the current economic problems will be completely overcome by June</u> what nonsense</u>

Here speaker A interrupts his own utterance with another nested one, which is read from a newspaper. The decls attribute on the nested utterance is used to indicate that its script is S1, rather than the default. Alternatively, the embedded utterance might be regarded as a new (non-nested) one, or represented as an event without transcribing the read material:

<u who=A>Listen to this</u>
<event desc='reads from newspaper'>
<u who=A>what nonsense</u>

The formal definition of the u element and its attributes is given in teispok2.dtd.

Pause

The pause empty element tag is used to indicate a silent pause, either between or within utterances. A pause contained by an utterance applies to the utterer of that utterance. A pause between utterances applies to all utterers. The type attribute may be used to categorize the pause, for example as short, medium or long; alternatively the attribute dur may be used to indicate its length more exactly. If detailed synchronization of pausing with other vocal phenomena is required, the alignment mechanism discussed at section should be used. Note that the trans attribute mentioned in the previous section may also be used to characterize the degree of pausing between utterances.
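Both styles of description can be sketched as follows (the values shown are illustrative):

<pause type=long>
<pause dur='4.5 secs'>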

The formal definition of the pause element is given in teispok2.dtd.

Vocal, Kinesic, Event

These three empty elements are used to indicate the presence of non-transcribed semi-lexical or non-lexical phenomena either between or within utterances. vocal is used for vocalizations such as filled pauses, kinesic for gestures, and event for other types of event.

The who attribute should be used to specify the person or group responsible for a vocal, kinesic or event which is contained within an utterance, if this differs from that of the enclosing utterance. The attribute must be supplied for a vocal, kinesic or event which is not contained within an utterance.

The iterated attribute may be used to indicate that the vocal, kinesic or event is repeated, for example laughter as opposed to laugh. If detailed synchronization of these phenomena is required, the alignment mechanism discussed at section should be used.

The desc attribute may be used to supply a conventional representation for the phenomenon, as in the sketch below. Researchers may prefer to regard some semi-lexical phenomena as words within the bounds of the u element; see further the discussion at section below. As for all basic categories, the definition should be made clear in the encoding.decl element of the header.
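A sketch of all three elements (the desc values and the iterated flag shown are illustrative assumptions):

<vocal who=A desc='laughter' iterated=y>
<kinesic who=B desc='nods vigorously'>
<event desc='door slams'>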

It is not envisaged that events or kinesics should be included in any transcription, except where their presence is felt to be of potential significance for the interpretation of the interaction.

The formal definitions of the vocal, kinesic, and event elements are given in teispok2.dtd.

Writing

Written text may also be encountered when speech is transcribed, for example in a television broadcast or cinema performance, or where one participant shows written text to another. The writing element may be used to distinguish such written elements from the spoken text in which they are embedded: a passage of written text revealed to participants in the course of a spoken text. Attributes include:
- who: supplies an identifier for the participant who reveals or creates the writing, if any. Its value is the identifier of a participant or participant.grp element in the TEI header.
- gradual: indicates whether the writing is revealed all at once or gradually. Values: y (the writing is revealed gradually), n (the writing is revealed all at once), u (unknown or unmarked).
- type: categorizes the kind of writing in some way, for example as a subtitle, noticeboard, etc.
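A sketch of usage (the content and attribute values shown are illustrative assumptions):

<u who=A>look at this</u>
<writing who=A type=newspaper gradual=n>Government confident on economy, says spokesman</writing>
<u who=A>what nonsense</u>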

The formal definition of the writing element is given in teispok2.dtd.

Segmentation and Alignment

As mentioned above, an utterance may contain simply prose, possibly mixed with the other elements already discussed. It may also be further subdivided into segments of various kinds, and may contain pointers, indicating points of synchronization within it, or shifts, indicating locations at which changes are noted in various para-linguistic phenomena. These are discussed in the following sections.

Segments

For some analytic purposes it may be desirable to subdivide the divisions of a spoken text into units smaller than the individual utterance or turn. Segmentation may be performed for a number of different purposes and in terms of a variety of speech phenomena. Common examples include units defined both prosodically (by intonation, pausing, etc.) and syntactically (clauses, phrases, etc.). The term macrosyntagm has been used by a number of researchers to define units peculiar to speech transcripts. The term was apparently first proposed by Bengt Loman and Nils Jørgensen, in Manual for analys och beskrivning av makrosyntagmer (Lund: Studentlitteratur, 1971), where it is defined as follows: A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm (trans. S. Johansson). These Guidelines propose that such analyses be performed in terms of neutrally-named segments, represented by an s element. This element may take a type attribute to specify the kind of segmentation applicable to a particular segment, if more than one is possible in a text. A full definition of the segmentation scheme or schemes used should be provided in the segmentation element of the editorial.decl element in the TEI header (see ).
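For example, a syntactically segmented utterance might be sketched as follows (the type value is illustrative):

<u who=A>
<s type=syntactic>we went to the cinema</s>
<s type=syntactic>it was crowded</s>
</u>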

The formal definition of the s element is given in teispok2.dtd.

It is often the case that the desired segmentation does not respect utterance boundaries; for example, syntactic units may cross utterance boundaries. This may be handled in a number of different ways, most of which are discussed in more general terms elsewhere in these Guidelines:
- a concurrent DTD may be defined (see )
- milestone tags may be used (see ); the special-purpose shift tag discussed in section below is an extension of this method
- where the conflict is between utterance and segment boundaries only, it may be regarded as an instance of overlap (see section and )
- where several discontinuous segments are to be grouped together to form a syntactic unit (e.g. a phrasal verb with interposed complement), the general purpose alignment mechanism should be used (see sections and )

Shifts

Paralinguistic features which characterize stretches of speech not co-extensive with utterances or any of the other units discussed so far may be encoded to a limited extent by marking simply their boundaries. The shift empty element is provided for this purpose. It may appear within an utterance or a segment to mark a significant change in the particular feature defined by its attributes, which is then understood to apply to all subsequent utterances for the same speaker, unless changed by a new shift for the same feature in the same speaker. Intervening utterances by other speakers do not normally carry the same feature.

The feature attribute is used to identify the particular feature which noticeably changes at this point. Its value is taken from a closed list of paralinguistic features, based on those used by the Survey of English Usage; For details see S. Boase, London-Lund Corpus: Example Text and Transcription Guide (London: Survey of English Usage, University College London, 1990). this list may be revised or supplemented using the methods outlined in section .

The new attribute specifies the new state of the feature following the shift. If no value is specified, it is implied that the feature concerned ceases to be remarkable at this point: the special value normal may be specified to have the same effect.

A shift itself can be aligned with other elements only by being pointed at from an alignment. If therefore alignment is expressed by pointing from the text, rather than into it, a ptr element following the shift should be used. See further . In summary, the shift element marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes. Attributes include:
- feature: a paralinguistic feature. Values: tempo (speed of utterance), loud (loudness), pitch (pitch range), tension (tension or stress pattern), rhythm (rhythmic qualities), voice (voice quality).
- new: specifies the new state of the paralinguistic feature specified.

Suggested values exist for each of the features listed above: tempo, loud (for loudness), pitch (for pitch range), tension, rhythm, and voice (for voice quality).
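A sketch of usage (the value f, for forte, is an illustrative assumption; the value normal is described above):

<u who=A>listen to this <shift feature=loud new=f>it's amazing</u>
<u who=A><shift feature=loud new=normal>as I was saying</u>

Here the speaker's loudness becomes noticeable at the first shift and ceases to be remarkable at the second.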

A full definition of the sense of the values used for each feature should be provided in the encoding description section of the text header (see section ).

The formal definition of the shift element is given in teispok2.dtd.

Pointers and Alignment

A major difference between spoken and written texts is the importance of the temporal dimension to the former. As a very simple example, consider the following, first as it might be represented in a playscript:

Jane: have you read Vanity Fair?
Stig: yes
Lou: [gestures]

Let us assume that Stig and Lou respond to Jane's question before she has finished asking it --- a fairly normal situation in spontaneous speech. Three things are therefore synchronous: the end of Jane's utterance, Stig's whole utterance, and Lou's kinesic. To represent such situations, these Guidelines recommend the use of a mechanism known as an alignment map, which is linked with other parts of the text by the usual SGML id/idref pointer mechanism. One way to represent the simple example above might be as follows:

<align>
 <loc id=P1>
</align>
<u who=Jane>have you read Vanity Fair<ptr target=P1></u>
<u who=Stig start=P1>yes</u>
<kinesic who=Lou start=P1 desc='gesture'>

The rest of this section, which should be read in conjunction with the more general discussion of alignment at section , explains how this mechanism works.

A specialized form of alignment map or timeline is used to coordinate simultaneous phenomena, whether they are utterances by different speakers, utterances and gestures, or gestures and events of other kinds. The map represents a series of points in time, which are then linked or aligned with other elements in the text in one of three ways:
- by pointing from the text to the alignment map
- by pointing from the alignment map into the text
- by linking in both directions
Examples of each method are given below and in section .

An alignment map is represented by an align element, which consists of a series of loc elements. Each loc represents a synchronization point, and may bear attributes indicating its exact temporal position relative to other points in the alignment in addition to the sequencing implied by its position within the alignment itself.

For example, an alignment map of the kind sketched below represents four points in time, named P1, P2, P6 and P3 (as with all attributes named id in the TEI scheme, the names must be unique within the document but have no other significance). P1 is located absolutely, at 12:20:01:01 BST. P2 is 4.5 seconds later than P1 (i.e. at 12:20:05:51). P6 is at some unspecified time later than P2 and previous to P3 (this is implied by its position within the timeline, as no attribute values have been specified for it). The fourth point, P3, is 1.5 seconds later than P6.
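One plausible encoding of such a map (the attribute names abs, for an absolute time, and rel, for an offset from the preceding point, are illustrative assumptions; the units attribute is described below):

<align units=seconds>
 <loc id=P1 abs='12:20:01:01 BST'>
 <loc id=P2 rel='4.5'>
 <loc id=P6>
 <loc id=P3 rel='1.5'>
</align>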

One or more alignment maps may be specified within a spoken text, to suit the encoder's convenience. If more than one is supplied, the origin attribute may be used on each to specify which other align element it follows. The units attribute indicates the units used for timings given on loc elements contained by the alignment map, except where otherwise specified. To avoid the need to specify times explicitly, the interval attribute may be used to indicate that any contained loc elements are a fixed distance apart from each other, or to indicate that only their sequence is specified.

Elements within a spoken text may be aligned with the times specified by a loc element in three ways, as stated above. To point from the text into the alignment, the identifier of the required alignment point should be supplied as the value of one of the following:
- the start attribute of a timed element
- the end attribute of a timed element
- the target attribute of a ptr element within the text

For example:

<u who=A start=P2 end=P3>this is my <ptr target=P6>turn</u>

The start of this utterance is aligned with P2 and its end with P3. The transition between the words my and turn occurs at point P6.

To point from the alignment into the text, ptr elements are used within the loc elements of the alignment. The alignment represented by the preceding examples could equally well be represented as follows:

<align units=seconds>
 <loc id=P2><ptr target=U1></loc>
 <loc id=P6><ptr target=A6></loc>
 <loc id=P3><ptr target=U1></loc>
</align>
...
<u id=U1 who=A>this is my <anchor id=A6>turn</u>

Here, the whole of the object with identifier U1 (the utterance) has been aligned with two different points, P2 and P3. This is interpreted to mean that the utterance spans at least those two points. Note that since we are now pointing from the alignment map into the text and not vice versa, the ptr element within the utterance has been replaced by an anchor element, the function of which is to provide a name for a location in the text so that we can point at it.

The two methods can of course be combined, which eliminates the need for anchors in the text:

<align units=seconds>
 <loc id=P2><ptr target=U1></loc>
 <loc id=P6>
 <loc id=P3><ptr target=U1></loc>
</align>
...
<u id=U1 who=A>this is my <ptr target=P6>turn</u>

Temporal alignment at this level of precision is generally appropriate for handling the common case of speaker overlap, and examples are given of its application for that purpose below. The mechanism outlined here may however be used for a variety of other purposes, most notably the alignment of discontinuous segments in syntactic analysis, for examples of which see section . Because any element in the TEI scheme may bear an id attribute, and because the ptr and anchor elements may appear anywhere within a TEI-conformant text, it is possible to align any or all parts of a spoken text using the same basic mechanism outlined here.

The formal definitions of the align, loc, ptr, and anchor elements are given in teispok2.dtd.

Recommended Transcription Practice

Speaker Overlap

Speaker overlap should be handled using the alignment mechanism discussed in the previous section. This allows any element in the transcription to be aligned with any other, either singly (from the alignment map to the element, or vice versa) or doubly (in both directions). Alignment from the align element into the text is appropriate where the text is already fully marked up and is not to be altered; alignment from the text to the align element is appropriate where the minimum of additional tagging is desired. Double alignment considerably simplifies the task of writing software to handle the alignment, at the expense of much denser tagging.

As an example of the three possibilities, consider the following dialogue, represented first as it might appear in a conventional playscript:

Tom: I used to smoke a lot more than this but I never inhaled the smoke
Bob: You used to smoke

A commonly used convention might be to transcribe such a passage as follows:

<1> I used to smoke [ a lot more than this ]
<2>                 [ you used to smoke ]
<1> but I never inhaled the smoke

Such conventions have the drawback that they are hard to generalize or to extend beyond the very simple case presented here. Their reliance on the accidentals of physical layout also makes them difficult to transport and very difficult to process. These Guidelines recommend one of the following courses:

Single linkage, text to alignment:

....
<u who=Tom>I used to smoke <ptr target=X1>a lot more than this<ptr target=X2> but I never inhaled the smoke</u>
<u who=Bob start=X1 end=X2>You used to smoke</u>

Note that the second utterance above could equally well be encoded as follows, with exactly the same effect:

<u who=Bob><ptr target=X1>You used to smoke<ptr target=X2></u>

Single linkage, alignment to text:

....
<align>
 <loc id=X1><ptr target=A1><ptr target=U2></loc>
 <loc id=X2><ptr target=A2><ptr target=U2></loc>
</align>
<u who=Tom>I used to smoke <anchor id=A1>a lot more than this<anchor id=A2> but I never inhaled the smoke</u>
<u id=U2 who=Bob>You used to smoke</u>

Double linkage:

....
<align>
 <loc id=X1><ptr target=A1><ptr target=U2></loc>
 <loc id=X2><ptr target=A2><ptr target=U2></loc>
</align>
<u who=Tom>I used to smoke <anchor id=A1>a lot more than this<anchor id=A2> but I never inhaled the smoke</u>
<u id=U2 who=Bob start=X1 end=X2>You used to smoke</u>

Note that in each case, although Bob's utterance follows Tom's sequentially in the text, it is aligned temporally with its middle, without any need to disrupt the normal syntax of the text.

As a further example, consider the following exchange, first as it might be represented using a musical score like notation, in which points of synchronization are represented by vertical alignment of the text:

A: this is my        turn
B:         balderdash
C:         no        it's mine

All three speakers are simultaneous at the words my, balderdash and no; speakers A and C are simultaneous at the words turn and it's. This could be encoded as follows, using pointers from the alignment map into the text:

<align>
 <loc id=P1><ptr target=A1><ptr target=U2><ptr target=U3></loc>
 <loc id=P2><ptr target=A2><ptr target=A3></loc>
</align>
...
<u who=A>this is <anchor id=A1>my <anchor id=A2>turn</u>
<u id=U2 who=B>balderdash</u>
<u id=U3 who=C>no <anchor id=A3>it's mine</u>

Word Form

When speech is transcribed into writing, it is customary to use ordinary orthographic notation. This necessarily implies some compromise between the sounds produced and conventional orthography. Particularly when dealing with informal, dialectal or other varieties of language, the transcriber will frequently have to decide whether a particular sound is to be treated as a distinct vocabulary item or not. For example, while kinda is probably not worth distinguishing as a vocabulary item from kind of, isn't is clearly worth distinguishing from is not; for some purposes, the regional variant isnae might also be worth distinguishing in the same way. One rule of thumb might be to allow such variation only where a generally accepted orthographic form exists, for example, in published dictionaries of the language register being encoded; this has the disadvantage that such dictionaries may not exist. Another is to maintain a controlled (but extensible) set of normalized forms for all such words; this has the advantage of enforcing some degree of consistency amongst different transcribers. Occasionally, as for example when transcribing abbreviations or acronyms, it may be felt necessary to depart from conventional spelling to distinguish between cases where the abbreviation is spelled out letter by letter (for example, B B C or V A T) and where it is pronounced as a single word (for example VAT or RADA). Similar considerations might apply to pronunciation of foreign words (for example Monsewer vs Monsieur).

In general, the use of punctuation, capitalization, etc. in spoken transcripts should be carefully controlled. It is important to distinguish the transcriber's intuition as to what the punctuation should be from the marking of prosodic features such as pausing, intonation, etc.

Whatever practice is adopted, it is essential that it be clearly and fully documented in the editorial declarations section of the header. It may also be found helpful to include normalized forms of non-conventional spellings within the text, using the regularization element reg as described in section .

Prosody

In the absence of conventional punctuation, the marking of prosodic features becomes of paramount importance, since these structure and organize the spoken message. Pauses have already been dealt with in section ; while tone units (or intonational phrases) can be indicated by the segmentation tag discussed in section . The shift tag discussed in section may also be used to encode some prosodic features, for example where all that is required is the ability to record shifts in voice quality.

For more detailed work, involving a full phonological transcription with representation of stress and pitch patterns, it is probably best to maintain the prosodic description in parallel with the conventional written transcription, rather than attempt to embed detailed prosodic information within it. The two parallel streams may be aligned with each other and with other streams, for example an acoustic encoding, using the general alignment mechanisms discussed in section . For representation of phonemic information, we recommend the use of the International Phonetic Alphabet, as further discussed in section .

Speech Management

By speech management we mean disfluencies such as filled and unfilled pauses, interrupted or repeated words, corrections and reformulations, as well as interactional devices for asking for or providing feedback. Depending on the importance attached to such features, transcribers may choose to adopt conventionalized representations for them (as discussed in section above), or transcribe them using IPA or some other phonemic transcription system. To simplify analysis of the lexical features of a speech transcript, it may be felt useful to tidy away, in some sense, many of these disfluencies. Where this policy has been adopted, these Guidelines recommend the use of the tags for simple editorial intervention discussed in section to make explicit the extent of regularization or normalization performed by the transcriber.

For example, false starts, repetition, and truncated words might all be included within a transcription, but marked as editorially deleted, in the following way:

<u who=A><del type=truncation>s</del>see you <del type=repetition>you</del>you know <del type=falseStart>it's</del>he's crazy</u>

Similarly, where a transcriber is believed to have incorrectly identified a word, the regularization tags sic and reg may be used to indicate both the original and a regularization of it:

<sic corr='SCSI'>skuzzy</sic>
<reg orig='skuzzy'>SCSI</reg>

As discussed in section , the first of these would be appropriate where faithfulness to the transcriber's intuition is paramount, and the second where the editorial interpretation is felt more significant. In either case, the user of the text can perceive the basis of the choice being offered.

Analytic Coding

The recommendations made here only concern the establishment of a basic text. Where a more sophisticated analysis is needed, more sophisticated methods of markup will also be necessary, requiring the use of concurrent markup streams for multiple segmentation of the stream of discourse, or complex alignment of several segments within it. The general purpose analytic tools discussed in section should be used for such purposes, as they should for representation of structures larger than the individual utterance. As yet, however, we have been unable to find any consensus in the field as to what kinds of units should be identified at this level, and cannot therefore make more specific recommendations. It is hoped however that the basic building blocks offered here will be serviceable in the expression of such a consensus when it is achieved.