1 Introduction
Every step is a first step if it's a step in the right direction.
TERRY PRATCHETT: I Shall Wear Midnight
1.1 Overview
THIS THESIS INTRODUCES and describes FiESTA, a new data model and library that assists researchers in creating, managing, and analyzing multimodal data collections. In this introductory chapter, we clarify the motivation for this project and, in parallel, give a commented overview of how each chapter contributes to the big picture. The visual roadmap in Figure 1 on the following page accompanies and illustrates this outline.
SECTION 1.2 describes our motivation and contains pointers to the respective chapters of the thesis.
SECTION 1.3 connects this thesis to other publications and projects from the wider context of multimodal corpora and data sets.
Figure 1: Visual roadmap for this thesis. Large, dashed boxes indicate parts, nested solid boxes stand for chapters. The narrative flow is shown as arrow connections between the chapters. Italic text next to a chapter outlines the goals or accomplishments of that chapter. "GEF" (due to the restricted space in the diagram) stands for "generic exchange format".
SECTION 1.4 introduces some conventions used in this thesis, along with some remarks about mathematical notations.
1.2 Motivation
Multi-modal records allow us not only to approach old research problems in new ways, but also open up entirely new avenues of research.
Wittenburg, Levinson, et al., 2002 : 176
THIS STATEMENT DESCRIBES a central development in linguistics and its neighbouring disciplines in the last decades: The focus of research is no longer exclusively on the purely linguistic component of communicative interaction. Instead, interaction is understood as a complex interplay between linguistic events (typically, spoken utterances) and events in other modalities, such as gesture, gaze, or facial expressions (cf. Kress and Leeuwen, 2001; Knapp, Hall, and Horgan, 2013).
A couple of decades ago, technology could only provide limited support to this branch of research. Microlevel video analysis, for instance, originated in the last century: Back then, researchers used purpose-built film projectors that could play film reels "at a variety of speeds, from very slow to extremely fast, effectively achieving slow motion vision and sound" (Condon and Ogston, 1967 : 227). This served as the basis for detailed, yet hand-written, analyses of interaction on the level of single video frames.
Since then, researchers have benefitted from various developments and technological shifts, such as easily available computing facilities and the digitization of video and audio recordings: Once media recordings are digitized, copies can be produced without any loss of quality. This is an improvement over analog media, where copies often were expensive and lossy at the same time, which also limited the number of generations of copies that could be produced (cf. Draxler, 2010 : 11 f.). In addition, year by year, computational power and storage devices (such as working memory or hard disks) become more affordable (Gray and Graefe, 1997).
In addition, the advent of high-level programming languages (in general, and especially in the scientific context) and the ever-growing supply of modular, reusable programming libraries containing solutions to many problems enabled the community to create annotation tools. These are specialized pieces of software tailored to the needs of researchers in the field of multimodal interaction, such as the EUDICO Linguistic Annotator ELAN (Wittenburg, Brugman, et al., 2006), or Anvil (Kipp, 2001). Both tools support the playback and navigation of video and audio recordings as a basis for the creation and temporal localisation of additional data. Similarly, for detailed phonological and phonetic analyses of sound files, the tool Praat (Boersma and Weenink, 2013, 2001) was developed.
With these tools and their wide range of possible operations, scientists work on a diverse range of research questions, investigating phenomena such as
- the synchronicity and cross-modal construction of meaning in speech and gesture signals (Lücking et al., 2010; Bergmann and Kopp, 2012; Bergmann, 2012; Lücking et al., 2013),
- the use of speech-accompanying behaviour signalling emotion, and its possible differences in patients and healthy subjects (Jaecks et al., 2012),
- the interaction of speech and actions in object arrangement games, with a focus on the positioning of critical objects in a two-dimensional target space (Schaffranietz et al., 2007; Lücking, Abramov, et al., 2011; Menke, McCrae, and Cimiano, 2013),
- or the multimodal behaviour in negotiation processes concerning object configurations in miniature models of buildings or landscapes (Pitsch et al., 2010; Dierker, Pitsch, and Hermann, 2011; Schnier et al., 2011).
IN ALL DIALOGICAL situations investigated in these experiments, interlocutors produced several series of interaction signals over time, such as speech, gestures, facial expressions or manipulations of objects located in the shared space between interlocutors. These streams of interactive patterns are sometimes independent of each other. Often, however, multiple streams are coupled within a single interlocutor (e. g., in speech-accompanying gestures), and in other cases, the streams of different interlocutors are (at least locally) coupled (e. g., in co-constructions of speech, where a fragmentary segment of a linguistic construction is continued or completed by another interlocutor).
Figure 2: Schema of the data generation workflow in the research of multimodal interaction. Left: The different levels of data, and information about how subsequent layers are generated out of prior ones. Right: An example of primary and secondary annotations based on the segment of a recording (containing a speech transcription, an annotation of gesture, a syntactic analysis of the speech, and a secondary annotation expressing a hypothesis about how items from both the speech and the gesture layer form a joint semantic unit).
A detailed and thorough analysis of such dialogues typically pursues the following course (cf. Figure 2; the following description indicates numbers in the diagram for easier reference):
- First, video and audio recordings of the interaction are created (1).
- To simplify work, further references to these recordings (and, indirectly, to the events of the original situation) refer to an abstraction in the shape of a so-called timeline (2). Points and intervals on this timeline are the only link to the underlying media files, since (under the assumption that all media have been synchronised) every segment in the media files can unambiguously be referenced with such a time stamp.
- Then, researchers create primary annotations on the basis of these media recordings (3). This is done by identifying points or intervals on the timeline and associating them with a coded representation (typically, by using text) of the observed phenomenon.
- In addition, it can be necessary to generate annotations on an additional level, so-called secondary annotations (4). These do not refer to temporal information directly. Instead, they point to one or multiple annotations (cf. Brugman and Russel, 2004 : 2065). They typically assign a property or category to an annotation, or model a certain kind of relation between two or more annotations. A minimal sketch of this layered data model is given below.
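To make the relation between the timeline, primary annotations and secondary annotations more concrete, the following sketch models the three layers as plain data structures in Python. The class names (Interval, PrimaryAnnotation, SecondaryAnnotation) and the example values are illustrative assumptions only; they do not reflect the actual data model or programming interface of FiESTA, which is introduced later in this thesis.

from dataclasses import dataclass
from typing import List

@dataclass
class Interval:
    """A span on the shared timeline, in seconds relative to the synchronised recordings."""
    start: float
    end: float

@dataclass
class PrimaryAnnotation:
    """A coded (textual) description of an observed phenomenon, anchored to the timeline."""
    interval: Interval
    value: str   # e.g. a transcribed word or a gesture category
    tier: str    # e.g. "speech" or "gesture"

@dataclass
class SecondaryAnnotation:
    """An annotation that refers to other annotations instead of the timeline."""
    targets: List[PrimaryAnnotation]
    value: str   # e.g. a relation label

# Hypothetical example: a spoken demonstrative and a pointing gesture
# that are hypothesised to form one joint referring act.
speech = PrimaryAnnotation(Interval(12.3, 12.6), "this one", tier="speech")
gesture = PrimaryAnnotation(Interval(12.2, 12.8), "pointing", tier="gesture")
link = SecondaryAnnotation([speech, gesture], "joint semantic unit")

The point the sketch is meant to mirror is that a secondary annotation carries no time stamps of its own: its temporal extension can only be recovered indirectly, through the primary annotations it references.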
"DATA" AND "MODALITY" are two terms which, although researchers have an intuitive understanding, are often deficiently defined. Therefore, we prepend two chapters to this thesis that attempt to clarify the exact definitions of terms from the two fields of data (Chapter 2) and of modalities (Chapter 3).
While most of the investigations concerning multimodal interaction follow the basic schema described above, its concrete realisation can diverge substantially from project to project. This is mostly due to the fact that different research questions often require idiosyncratic data structures and different descriptive categories (as, for instance, for the description of non-linguistic behaviour).
IN ORDER TO give a more detailed overview of how these data structures can be designed, and how they diverge against the background of varying research questions, descriptions of a sample of multimodal data collections, along with the underlying research questions, are presented in Chapter 4. This is accompanied by an introduction to the graphical user interfaces and the file formats of two annotation tools that were repeatedly used for creating the example data collections: Praat and ELAN (Chapter 5).
FIRST AND FOREMOST, the annotation tools mentioned above provide a solid basis for the research of multimodal interaction. And yet, as will be shown in the following chapters, there are still areas and specific tasks where these general-purpose tools fail, and where creative but ad-hoc solutions have to be implemented. Examples of such problematic tasks are
- the creation of a certain connection inside the annotation structure for which the developers of the tool did not provide a...