The Handbook of Multimodal-Multisensor Interfaces, Volume 1. Sharon Oviatt
[Figure 1.3 caption, beginning truncated] ... knowledge with multisensory sources of information. (From Ernst and Bulthoff [2004])
In research on multisensory integration, Embodied Cognition theory also has provided a foundation for understanding human interaction with the environment from a systems perspective. Figure 1.3 illustrates how multisensory signals from the environment are combined with prior knowledge to form more accurate percepts [Ernst and Bulthoff 2004]. Ernst and colleagues describe multisensory integration using a Maximum Likelihood Estimation (MLE) model based on Bayes’ rule. As introduced earlier, MLE integrates sensory signal input to minimize variance in the final estimate under different circumstances. It determines the degree to which information from one modality will dominate over another [Ernst and Banks 2002, Ernst and Bulthoff 2004]. For example, the MLE rule predicts that visual capture will occur whenever the visual stimulus is relatively noise-free and its estimate of a property has less variance than the haptic estimate. Conversely, haptic capture will prevail when the visual stimulus is noisier and the haptic estimate has the lower variance.
Empirical research has shown that the human nervous system’s multisensory perceptual integration process is very similar to the MLE integrator model. Ernst and Banks [2002] demonstrated this in a visual and haptic task. The net effect of this integration is that the final estimate has lower variance than either the visual or the haptic estimator alone. To support decision-making, prior knowledge is incorporated into the sensory integration model to further disambiguate sensory information. As depicted in Figure 1.3, this embodied perception-action process provides a basis for deciding what goal-oriented action to pursue. Selective action may in turn recruit further sensory information, alter the environment that is experienced, or change people’s understanding of their multisensory experience. See James and colleagues’ Chapter 2 in this volume [James et al. 2017] for an extensive discussion and empirical evidence supporting Embodied Cognition theory.
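The MLE integration rule described above can be illustrated with a minimal sketch. It assumes the standard textbook form of MLE cue combination for two independent, unbiased cues with Gaussian noise, in which each cue is weighted by its reliability (the inverse of its variance). The function name `mle_combine` and the numeric values are hypothetical and chosen only for illustration; they are not taken from Ernst and Banks [2002].

```python
def mle_combine(est_visual, var_visual, est_haptic, var_haptic):
    """Combine two independent, unbiased estimates by inverse-variance weighting.

    Assumption: each cue has Gaussian noise with a known variance.
    Returns (combined estimate, combined variance, visual weight, haptic weight).
    """
    if var_visual <= 0 or var_haptic <= 0:
        raise ValueError("variances must be positive")

    # Reliability of a cue is the inverse of its variance.
    reliability_visual = 1.0 / var_visual
    reliability_haptic = 1.0 / var_haptic
    total_reliability = reliability_visual + reliability_haptic

    # Each weight is that cue's share of the total reliability.
    w_visual = reliability_visual / total_reliability
    w_haptic = reliability_haptic / total_reliability

    combined = w_visual * est_visual + w_haptic * est_haptic
    # The combined variance is never larger than either single-cue variance.
    combined_var = 1.0 / total_reliability
    return combined, combined_var, w_visual, w_haptic


# Hypothetical size judgment (in mm): vision is the less noisy cue here,
# so it receives the larger weight ("visual capture").
combined, combined_var, w_visual, w_haptic = mle_combine(55.0, 1.0, 60.0, 4.0)
print(combined, combined_var, w_visual, w_haptic)
```

Raising the visual variance above the haptic variance in this sketch reverses the weights, which corresponds to haptic capture. In both cases the combined variance stays below that of either single-cue estimate.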
Communication Accommodation theory presents a socially situated perspective on embodied cognition. It has shown that interactive human dialogue involves extensive co-adaptation of communication patterns between interlocutors. Interpersonal conversation is a dynamic adaptive exchange in which speakers’ lexical, syntactic, and speech signal features all are tailored in a moment-by-moment manner to their conversational partner. In most cases, children and adults adapt all aspects of their communicative behavior to converge with those of their partner, including speech amplitude, pitch, rate of articulation, pause structure, response latency, phonological features, gesturing, drawing, body posture, and other aspects [Burgoon et al. 1995, Fay et al. 2010, Giles et al. 1987, Welkowitz et al. 1976]. The impact of these communicative adaptations is to enhance the intelligibility, predictability, and efficiency of interpersonal communication [Burgoon et al. 1995, Giles et al. 1987, Welkowitz et al. 1976]. For example, if one speaker uses a particular lexical term, then their partner has a higher likelihood of adopting it as well. This mutual shaping of lexical choice facilitates language learning, and also the comprehension of newly introduced ideas between people.
Communication accommodation occurs not only in interpersonal dialogue, but also during human-computer interaction [Oviatt et al. 2004b, Zolton-Ford 1991]. These mutual adaptations also occur across different modalities (e.g., handwriting, manual signing), not just speech. For example, when drawing, interlocutors typically shift from initially sketching a careful likeness of an object to converging with their partner’s simpler drawing [Fay et al. 2010]. A similar convergence of signed gestures has been documented between deaf communicators. Within a community of previously isolated deaf Nicaraguans who were brought together in a school for the deaf, a novel sign language became established rapidly and spontaneously. This new sign language and its lexicon most likely emerged through convergence of the signed gestures, which then became widely produced among community members as they formed a new language [Kegl et al. 1999, Goldin-Meadow 2003].
At the level of neurological processing, convergent communication patterns are controlled by the mirror and echo neuron systems [Kohler et al. 2002, Rizzolatti and Craighero 2004]. Mirror and echo neurons provide the multimodal neurological substrate for action understanding, at the level of both physical and communicative actions. Observation of an action in another person primes an individual to prepare for action, and also to comprehend the observed action. For example, when participating in a dialogue during a cooking class, one student may observe another’s facial expressions and pointing gesture when she says, “I cut my finger.” In this context, the listener is primed multimodally to act, comprehend, and perhaps reply verbally. The listener experiences neurological priming, or activation of their own brain region and musculature associated with fingers. This prepares the listener to act, which may involve imitating with their own fingers the retraction that they observe. The same neurological priming enables the listener to comprehend the speaker’s physical experience and emotional state. This socially situated perception-action loop provides the evolutionary basis for imitation learning, language learning, and mutual comprehension of ideas.
This theory and related literature on convergence of multimodal communication patterns have been applied to designing more effective conversational software personas and social robots. One direct implication of this work is that the design of a system’s multimodal output can be used to transparently guide users to provide input that is more compatible with a system’s processing repertoire, which improves system reliability and performance [Oviatt et al. 2004b]. As examples, users interacting with a computer have been shown to adopt a more processable volume, rate, and lexicon [Oviatt et al. 2004b, Zolton-Ford 1991].
Affordance theory presents a systems-theoretic view closely related to Gestalt theory. It also is a complement to Activity theory, because it specifies the type of activity that users are most likely to engage in when using different types of computer interface. It states that people have perceptually based expectations about objects, including computer interfaces, which involve different constraints on how one can act on them to achieve goals. These affordances of objects establish behavioral attunements that transparently but powerfully prime the likelihood that people will act in specific ways [Gibson 1977, 1979]. Affordance theory has been widely applied to human interface design, especially the design of input devices [Gaver 1991, Norman 1988].
Since object perception is multisensory, people are influenced by an array of object affordances (e.g., auditory, tactile), not just their visual properties [Gaver 1991, Norman 1988]. For example, the acoustic qualities of an animated computer persona’s voice can influence a user’s engagement and the content of their dialogue contributions. In one study, when an animated persona sounded like a master teacher by speaking with higher amplitude and wider pitch excursions, children asked more questions about science [Oviatt et al. 2004b]. This example not only illustrates that affordances can be auditory, but also that they affect the nature of communicative actions as well as physical ones [Greeno 1994, Oviatt et al. 2012]. Furthermore, this impact on communication patterns involves all modalities, not just spoken language [Oviatt et al. 2012].
Recent interpretations of Affordance theory, especially as applied to computer interface design, specify that it is human perception of interface affordances that elicits specific types of activity, not just the presence of specific physical attributes. Affordances can be described at different levels, including biological, physical, perceptual, and symbolic/cognitive [Zhang and Patel 2006]. They are distributed representations that are the by-product of external representations of an object (e.g., streetlight color) and internal mental representations that a person maintains about their action potential (e.g., cultural knowledge that “red” means stop), which together determine the person’s physical response. This example of an internal representation involves a cognitive affordance, which originates in cultural conventions mediated by symbolic language (i.e., “red”) that are specific to a person and her cultural/linguistic group.
Affordance theory emphasizes that interfaces should be designed to facilitate easy discoverability of the actions they are intended to support. It is important to note that the behavioral attunements that arise from object affordances depend on perceived action possibilities that are distinct from specific learned patterns. As such, they are potentially capable of stimulating human activity in a way that facilitates learning in contexts never encountered before.