helped usher in a new understanding of the perceptual brain.
In what has been termed the “multisensory revolution” (e.g. Rosenblum, 2013), research has shown that brain areas and perceptual behaviors long thought to be tied to a single sense are in fact modulated by multiple senses (e.g. Pascual‐Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Rosenblum, Dias, & Dorsi, 2016; Striem‐Amit et al., 2011). This research suggests a degree of neurophysiological and behavioral flexibility across perceptual modalities that was not previously appreciated. That work has been extensively reviewed elsewhere and will not be rehashed here (e.g. Pascual‐Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Rosenblum, Dias, & Dorsi, 2016; Striem‐Amit et al., 2011). It is relevant, however, that research on audiovisual speech perception has spearheaded this revolution. Certainly, the phenomenological power of the McGurk effect has motivated research into the apparent automaticity with which the senses integrate. Speech also provided the first example of a stimulus that could modulate an area of the human brain thought to be solely responsible for another sense. In that original report, Calvert and her colleagues (1997) showed that lip‐reading of a silent face could induce activity in the auditory cortex. Since the publication of that seminal study, hundreds of other studies have shown that visible speech can induce cross‐sensory modulation of the human auditory cortex. More generally, thousands of studies have now demonstrated crossmodal modulation of primary and secondary sensory cortices in humans (for a review, see Rosenblum, Dias, & Dorsi, 2016). These studies have led to a new conception of the brain as a multisensory processing organ, rather than as a collection of separate sensory processing units.
This chapter will readdress important issues in multisensory speech perception in light of the enormous amount of relevant research conducted since publication of the first version of this chapter (Rosenblum, 2005). Many of the same topics addressed in that chapter will be addressed here, including: (1) the ubiquity and automaticity of multisensory speech in human behavior; (2) the stage at which the speech streams integrate; and (3) the possibility that perception involves detection of a modality‐neutral – or supramodal – form of information that is available in multiple streams.
Ubiquity and automaticity of multisensory speech
Since 2005, evidence has continued to grow that speech is an inherently multisensory function. It has long been known that visual speech is used to enhance challenging auditory speech, whether that speech is degraded by noise or accent, or simply contains complicated material (e.g. Arnold & Hill, 2001; Bernstein, Auer, & Takayanagi, 2004; Reisberg, McLean, & Goldfield, 1987; Sumby & Pollack, 1954; Zheng & Samuel, 2019). Visual speech information helps us acquire our first language (e.g. Teinonen et al., 2008; for a review, see Danielson et al., 2017) and our second languages (Hardison, 2005; Hazan et al., 2005; Navarra & Soto‐Faraco, 2007). The importance of visual speech in language acquisition is also evidenced in research on congenitally blind individuals. Blind children show small delays in learning to perceive and produce segments that are acoustically more ambiguous but visually distinct (e.g. the /m/–/n/ distinction). Recent research shows that these idiosyncratic differences carry through to congenitally blind adults, who show subtle differences in speech perception and production (e.g. Delvaux et al., 2018; Ménard, Leclerc, & Tiede, 2014; Ménard et al., 2009, 2013, 2015).
The inherently multimodal nature of speech is also demonstrated by perceivers using and integrating information from a modality that they rarely, if ever, use for speech: touch. It has long been known that deaf‐blind individuals can learn to perceive speech by touching the lips, jaw, and neck of a speaker (the Tadoma technique). However, recent research shows just how automatic this process can be even for novice users (e.g. Treille et al., 2014). Novice perceivers (with normal sight and hearing) can readily use felt speech to (1) enhance comprehension of noisy auditory speech (Gick et al., 2008; Sato, Cavé, et al., 2010); (2) enhance lip‐reading (Gick et al., 2008); and (3) influence perception of discrepant auditory speech (Fowler & Dekle, 1991, in a McGurk effect). Consistent with these findings, neurophysiological research shows that touching an articulating face can speed auditory cortex reactions to congruent auditory speech in the same way as is known to occur with visual speech (Treille et al., 2014; Treille, Vilain, & Sato, 2014; and see Auer et al., 2007). Other research shows that the speech function can work effectively with very sparse haptic information. Receiving light puffs of air on the skin in synchrony with hearing voiced consonants (e.g. b) can make those consonants sound voiceless (p; Derrick & Gick, 2013; Gick & Derrick, 2009). In a related example, if a listener’s cheeks are gently pulled down in synchrony with hearing a word that they had previously identified as “head,” they will be more likely to hear that word as “had” (Ito, Tiede, & Ostry, 2009). The opposite effect occurs if a listener’s cheeks are instead pulled to the side.
These haptic speech demonstrations are important for multiple reasons. First, they demonstrate how readily the speech system can make use of – and integrate – even the most novel type of articulatory information. Very few normally sighted and hearing individuals have intentionally used touch information for purposes of speech perception. Despite the odd and often limited nature of haptic speech information, it is readily usable, showing that the speech brain is sensitive to articulation regardless of the modality through which it is conveyed. Second, the fact that this information can be used spontaneously despite its novelty may be problematic for integration accounts based on associative learning between the modalities. Both classic auditory accounts of speech perception (Diehl & Kluender, 1989; Hickok, 2009; Magnotti & Beauchamp, 2017) and Bayesian accounts of multisensory integration (Altieri, Pisoni, & Townsend, 2011; Ma et al., 2009; Shams et al., 2011; van Wassenhove, 2013) assume that the senses are effectively bound and integrated on the basis of the associations gained through a lifetime of experience simultaneously seeing and hearing speech utterances. However, if multisensory speech perception were based only on associative experience, it is unclear how haptic speech could be so readily used and integrated by the speech function. In this sense, the haptic speech findings pose an important challenge to associative accounts (see also Rosenblum, Dorsi, & Dias, 2016).
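To make the target of this challenge concrete, the following is a minimal sketch of the computation that such Bayesian cue‐integration accounts typically posit (a generic formulation offered only for illustration, not the specific model of any of the authors cited above). Given an auditory estimate $x_A$ and a visual estimate $x_V$ of the same phonetic property $s$, with learned Gaussian likelihoods of variance $\sigma_A^2$ and $\sigma_V^2$, the integrated estimate is a reliability‐weighted average:

$$
\hat{s} \;=\; w_A x_A + w_V x_V,
\qquad
w_A = \frac{1/\sigma_A^2}{1/\sigma_A^2 + 1/\sigma_V^2},
\quad
w_V = \frac{1/\sigma_V^2}{1/\sigma_A^2 + 1/\sigma_V^2}.
$$

Because the likelihoods – and the prior belief that the two signals share a common cause – are assumed to be learned from the statistics of past audiovisual experience, it is not obvious how such a scheme would assign a reliability to, or even bind, a haptic stream the perceiver has never before encountered; this is the sense in which the haptic findings strain purely associative accounts.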
Certainly, the most well‐known and studied demonstration of multisensory speech is the McGurk effect (McGurk & MacDonald, 1976; for recent reviews, see Alsius, Paré, & Munhall, 2017; Rosenblum, 2019; Tiippana, 2014). The effect typically involves a video of one type of syllable (e.g. ga) being synchronously dubbed onto an audio recording of a different syllable (ba) to induce a “heard” percept (da) that is strongly influenced by the visual component. The McGurk effect is considered to occur whenever the heard percept differs from the auditory component, whether a subject hears a compromise between the audio and visual components (auditory ba + visual ga = heard da) or hears a syllable dominated by the visual component (auditory ba + visual va = heard va). The effect has been demonstrated in multiple contexts, including with segments and speakers of different languages (e.g. Fuster‐Duran, 1996; Massaro et al., 1993; Sams et al., 1998; Sekiyama & Tohkura, 1991, 1993); across development (e.g. Burnham & Dodd, 2004; Desjardins & Werker, 2004; Jerger et al., 2014; Rosenblum, Schmuckler, & Johnson, 1997); with degraded audio and visual signals (Andersen et al., 2009; Rosenblum & Saldaña, 1996; Thomas & Jordan, 2002); and regardless of awareness of the audiovisual discrepancy (Bertelson & De Gelder, 2004; Bertelson et al., 1994; Colin et al., 2002; Green et al., 1991; Massaro, 1987; Soto‐Faraco & Alsius, 2007, 2009; Summerfield & McGrath, 1984). These characteristics have been interpreted as evidence that multisensory speech integration is automatic and impenetrable to outside influences (Rosenblum, 2005).
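In schematic form, integration models of the kind cited above capture this pattern with a simple combination rule. The fuzzy logical model of perception (Massaro, 1987), for example, assumes that the auditory and visual streams each provide independent, continuous degrees of support, $a_i$ and $v_i$, for each candidate syllable $r_i$, which are combined multiplicatively and normalized (the notation here is illustrative rather than taken from the original):

$$
P(r_i \mid A, V) \;=\; \frac{a_i\, v_i}{\sum_k a_k\, v_k}.
$$

For an auditory /ba/ dubbed onto a visual /ga/, the acoustic signal only weakly distinguishes /ba/ from /da/, while the visible absence of lip closure strongly disfavors /ba/; the normalized product therefore peaks at /da/, the compromise percept described above.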
However, some recent research has challenged this interpretation of integration (for a review, see Rosenblum, 2019). For example, a number of studies have been construed as showing that attention can influence whether integration occurs in the McGurk effect (for reviews, see Mitterer & Reinisch, 2017; Rosenblum, 2019). Adding a distractor to the visual, auditory, or even tactile channels seems to significantly reduce the strength of the effect (e.g. Alsius et al., 2005; Alsius, Navarra,