EXPLORING THE INFORMATION SUPPORT FOR SPEECH*
J. A. Scott Kelso+ and Betty Tuller++
Abstract. A well-established feature of speech production is that
talkers, faced with both anticipated and unanticipated perturbations, can spontaneously adjust the movement patterns of articulators such that the acoustic output remains relatively undistorted.
Less clear is the nature of the underlying process(es) involved. In
this study we examined five subjects' production of the point vowels /i, a, u/ in isolation and the same vowels embedded in a dynamic speech context under normal conditions and a combined condition in which (a) the mandible was fixed by means of a bite block, (b) proprioceptive information was reduced through bilateral anaesthetization of the temporomandibular joint, (c) tactile information from oral mucosa was reduced by extensive application of topical anaesthetic, and (d) auditory information was masked by white noise.
Analysis of formant patterns revealed minimal distortion of the speech signal under the combined condition. These findings are unfavorable for central (e.g., predictive simulation) or peripheral closed-loop models, both of which require reliable peripheral information; they are more in line with recent work suggesting that movement goals may be achieved by muscle collectives that behave in a way that is qualitatively similar to a nonlinear vibratory system.
The remarkable generativity of human movement is a mystery that continues to resist explanation. Within limits, people (and animals) can achieve the same 'goal' through a variety of kinematic trajectories, with different muscle groups and in the face of ever-changing postural and biomechanical requirements. This phenomenon--variously referred to as motor equivalence (Hebb, 1949) or equifinality (von Bertalanffy, 1973)--has been demonstrated again quite recently by Raibert (1978), who showed writing patterns to be characteristic of the same individual even when produced by structures (such as the foot or mouth) that had never previously been used for the act of writing.
*A preliminary version of this paper was presented at the 101st meeting of the Acoustical Society of America, May 18-22, 1981.
+Also Departments of Psychology and Biobehavioral Sciences, The University of Connecticut.
++Also Department of Neurology, New York University Medical Center.
Acknowledgment. This research was supported by NIH Grants NS13617 and Biomedical Research Support Grant RR05596 to Haskins Laboratories and NIH Post-Doctoral Fellowship 1-F32-NS-06718-01 to the second author. We thank Thomas Gay and the University of Connecticut Health Center at Farmington for use of facilities. We are especially grateful to Dr. Robert Gross of Louisiana State University Medical Center for performing anaesthetization procedures.
[HASKINS LABORATORIES: Status Report on Speech Research SR-69 (1982)]
Human language is generative in a qualitatively similar way: We seem to have a potentially infinite number of ways of constructing sentences. Nor is it trivial that language, even when stripped of its symbolic component, is a creative or generative activity. Articulatory maneuvers for producing speech sounds can be effected in spite of continuously varying initial conditions.
Often the same phonetic segment in different environments can be achieved by very different movement trajectories and end-states. One commonly used experimental paradigm for examining equifinality in speech takes the form of placing a bite block between the teeth, thus fixing the position of the mandible. Under such conditions, so-called "steady state" vowels can be produced apparently without the need for on-line acoustic feedback. Normal range formant patterns are obtained even at the first glottal pitch pulse (Gay, Lindblom, & Lubker, 1981; Lindblom, Lubker, & Gay, 1979; Lindblom & Sundberg, 1971). Moreover, speakers are capable of such "compensatory articulation" with little (if any) articulatory experimentation. Recent work on bite-block speech has shown that response times to produce vowels of the same acoustic quality under normal and bite-block conditions are nearly identical.
In addition, the degree of "compensation" (as indexed by deviations from normal formant frequencies) remained unchanged as a function of practice (Fowler & Turvey, 1980; Lubker, 1979). The evidence, then, favors an interpretation that articulatory adjustments to novel contextual conditions are relatively immediate.
What kind of control process could account for the adaptive, generative nature of speech production? An open-loop control system in which commands for producing a given vowel prescribe in detail the activities of relevant muscles can be dispensed with because, by definition, such systems are insensitive to changing contextual conditions. On the other hand, closed-loop control does offer the advantage of adjustment to initial conditions. In peripheral closed-loop feedback systems, a sensory goal in the form of a spatial (MacNeilage, 1970) or auditory target (Ladefoged, DeClerk, Lindau, & Papçun, 1972; MacNeilage, 1980) is paired with an appropriate set of commands for accomplishing the goal. Resulting sensory consequences are then compared with the sensory goal so that corrections can be made. A potential problem with peripheral closed-loop control is that the corrective process requires time (at least one cycle around the corrective loop). However, if the adjustment to novel conditions is indeed immediate--thus excluding the need for trial and error methods--then a closed-loop mechanism tied to the peripheral motor system fails to capture the phenomenon of interest.
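The timing argument against a peripheral loop can be illustrated with a minimal sketch. It assumes a one-dimensional "articulator", a perturbation modeled as a constant additive offset, and a simple proportional correction; the function name, the gain, and the tolerance are hypothetical and are not taken from any of the models cited above.

```python
def peripheral_closed_loop(target, offset, gain=0.5, tol=0.01, max_cycles=100):
    """Return the output produced on each cycle until it is within tol of target.

    Assumption: the perturbation (e.g., a bite block) adds a fixed offset to
    whatever the current command would normally achieve.
    """
    command = target          # command appropriate for the unperturbed system
    outputs = []
    for _ in range(max_cycles):
        output = command + offset        # what actually happens at the periphery
        outputs.append(output)
        error = target - output          # sensed consequence vs. sensory goal
        if abs(error) <= tol:
            break
        command += gain * error          # correction only affects the NEXT cycle
    return outputs


perturbed = peripheral_closed_loop(target=1.0, offset=0.4)
unperturbed = peripheral_closed_loop(target=1.0, offset=0.0)
print(len(perturbed), len(unperturbed))
```

In this sketch the first perturbed output is necessarily off target, and the goal is reached only after one or more further cycles. That is the property at odds with normal formant values at the first glottal pulse.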
An alternative account favored by Lindblom and colleagues (e.g., Lindblom et al., 1979) replaces the peripheral feedback loop by a central simulation process that derives the expected sensory consequences from a simulated set of motor commands before the actual efferent signals are sent to the periphery.
An internal comparison between the simulated and 'target' sensory consequences yields an error signal on the basis of which new (and correct) commands can be emitted. In this manner, adjustments to changes in context can be made in the internal simulation without incurring erroneous effects at the periphery.
It is important to note that the models discussed thus far make the explicit assumption that reliable peripheral information about the articulators' initial conditions is available before motor commands (simulated or actual) are generated. In the peripheral closed-loop model, for example, sensory input must be compared to the internal referent before the output of command signals. In the internal loop model, simulated motor commands are generated for the initial conditions that currently exist (Lindblom et al., 1979). It is not clear in the latter formulation what would happen if contextual conditions changed between the time that simulated and actual motor commands were generated. A more efficient system would be continuously sensitive to, and be capable of modulation by, contextual conditions. For the sake of argument, however, let us assume with Lindblom et al. that one benefit of the internal loop is its speed of correction; possibly the loop is so fast that appropriate output can be generated before contextual conditions have changed.
In any case, for both closed-loop models, elimination or reduction of peripheral information about initial conditions should drastically affect the system's ability to adjust to the novel situation created by a bite block.
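The dependence of the central simulation account on peripheral information can be sketched in the same spirit. The code below assumes that the internal forward model is simply "command plus sensed offset" and that corrections are proportional; these choices, and all names and parameter values, are illustrative assumptions rather than Lindblom et al.'s formulation.

```python
def simulate_then_emit(target, true_offset, sensed_offset,
                       gain=0.5, tol=1e-3, max_iter=1000):
    """Correct the command internally, then return the first actual output.

    Assumption: the internal model predicts consequences from the SENSED
    initial conditions, which may differ from the true ones.
    """
    command = target
    for _ in range(max_iter):
        predicted = command + sensed_offset   # simulated sensory consequence
        error = target - predicted            # internal comparison with the goal
        if abs(error) <= tol:
            break
        command += gain * error               # revise the simulated command
    return command + true_offset              # first output at the periphery


informed = simulate_then_emit(1.0, true_offset=0.4, sensed_offset=0.4)
deprived = simulate_then_emit(1.0, true_offset=0.4, sensed_offset=0.0)
print(informed, deprived)
```

With accurate information about initial conditions the first emitted output is already close to the target; with that information removed (here, `sensed_offset=0.0`) the first output carries the full perturbation. This is the prediction the present experiment tests.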
There are very limited data on this point. Gay and Turvey (1979) found that a single subject (a phonetician) made several attempts before producing 'normal' formant frequencies for the vowel /i/ under conditions in which a bite block was combined with topical anaesthesia of the oral mucosa and bilateral nerve blockage of the temporomandibular joint. Although this result has suggested to some (cf. Perkell, 1979) that joint and tactile information is used to establish an "orosensory frame of reference," we believe there are grounds for caution. One problem is that it is unclear how--given the considerable reduction of peripheral information--Gay and Turvey's subject was capable of adaptive adjustment at all. One possibility, which we consider here, is that auditory information may have played a potentiating role. Although acoustic information does not appear to be a necessary condition for compensatory articulation (e.g., Lindblom et al., 1979), the Gay-Turvey experiment does not preclude an auditory contribution in "recalibrating" the speech system when information from motor structures is rendered unreliable.
The present experiment was designed to examine the role of peripheral information (auditory and somesthetic) in accounts of "immediate adjustment" by asking naive subjects to produce vowels under normal conditions and under bite-block conditions in which somatosensory information was drastically reduced (if not eliminated) and audition was masked by white noise. In addition we address the question of whether the so-called "steady-state" paradigm for bite-block vowels reflects normal dynamic speech motor processes.
By examining the production of vowels embedded in a dynamic speech context as well as in isolation, we can discover what differences there are, if any, in observed acoustic patterns. As we shall see, the availability of peripheral information from neither auditory nor peripheral motor structures appears to be crucial to immediate adjustment. We take this result as non-supportive for extant models of the phenomenon. In their place, we offer a class of model--emerging in other areas of motor control (Bizzi, 1980; Fel'dman, 1966, 1980; Kelso, 1977; Kelso & Holt, 1980; Kelso, Holt, Kugler, & Turvey, 1980; Polit & Bizzi, 1978) as well as in the recent speech production literature (cf. Fowler, 1977; Fowler, Rubin, Remez, & Turvey, 1980)--that identifies functional groupings of muscles as exhibiting properties qualitatively similar to a nonlinear oscillatory system. The bottom line of this model and of the present paper is that the equifinality characteristic of vowel production may not be prescribed by closed-loop servomechanisms of the peripheral or central kind. Rather, we argue that it may be a consequence of the parameterization of a dynamical system whose design is intrinsically self-equilibrating, that is, a design in which equilibrium points are a natural by-product of the stiffness and damping specifications for the vowel-producing system.
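The notion of an equilibrium point arising from stiffness and damping settings can be made concrete with a minimal sketch. For simplicity it assumes a linear damped mass-spring, whereas the text refers to a nonlinear system; the function name, parameter values, and integration scheme are all illustrative assumptions and not a model of any articulator.

```python
def settle(x0, v0=0.0, stiffness=40.0, damping=12.0, rest=1.0,
           mass=1.0, dt=0.001, steps=5000):
    """Integrate m*x'' = -k*(x - rest) - b*x' and return the final position.

    Uses semi-implicit Euler: velocity is updated first, then position.
    """
    x, v = x0, v0
    for _ in range(steps):
        accel = (-stiffness * (x - rest) - damping * v) / mass
        v += accel * dt
        x += v * dt
    return x


# Same parameters, very different initial conditions.
finals = [settle(x0) for x0 in (-2.0, 0.0, 3.0)]
# Changing the parameterization (here the rest length) moves the end-state.
shifted = settle(0.0, rest=2.0)
print(finals, shifted)
```

In this sketch no comparison with a reference signal is made and the initial position is never sensed: the same end-state is reached from each starting point because it is fixed by the parameters alone. This is the sense of equifinality intended above.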
Subjects. Four female volunteers were paid to participate in this experiment. All were naive to the purpose of the experiment. A fifth subject (male) who was phonetically trained and had prior experience in a similar experiment (see Gay & Turvey, 1979) was also included.
Stimuli. The subjects' task was to say the point vowels /i, a, u/ in isolation and in a /p/-vowel-/p/ context. The /pVp/ syllables were spoken in the carrier phrase "A ___ again." Utterances were produced in three groups of three tokens of a particular vowel or phrase. The subjects were instructed to produce all tokens of a given utterance in exactly the same fashion, with a clear pause after each token. They were also told not to talk between experimental conditions or to practice the production task.
Conditions. The bite block used was a small acrylic cylinder with wedges carved out of each end so that it could fit snugly between the teeth. A 5 mm bite block was used to restrict the normally low jaw position for production of /a/ and /pap/. Either a 17 mm or a 23 mm bite block was used (depending on the individual subject's oral dimensions) for production of /i, u, pip, pup/, which normally involve a high jaw position.
All anaesthetic procedures were performed by Dr. Robert Gross, a specialist in oral and maxillofacial surgery who had collaborated with us in earlier work (Tuller, Harris, & Gross, 1981). Tactile information from the oral mucosa was reduced by spraying the surface of the tongue and oral cavity with a 2% Xylocaine solution. The effectiveness of the topical anaesthesia was tested by pricking the surfaces with a needle until the subject no longer reported sensation. A few catch trials were also included in an attempt to ensure honest reporting on the part of the subject. Information from mechanoreceptors in the jaw was reduced by injecting percutaneously a 2% Xylocaine solution directly into left and right temporomandibular joint capsules to achieve auriculotemporal nerve blockage. Chemical blockage of this nerve drastically impairs perception of joint position and movement (cf. Thilander, 1961). This condition will be referred to as the TMJ block.
In order to restrict the availability of auditory information, white noise was presented to the subject over headphones at approximately 90 dB.
The subject was told to monitor the amplitude of her or his productions by watching a VU meter and to restrict the excursion of the needle to approximately 55 dB or under.
All subjects spoke with and without the bite block prior to the application of anaesthesia and under all experimental conditions. Two of the four naive subjects received the TMJ block before the topical anaesthesia, and the other two subjects underwent topical anaesthesia first. In each of these pairs, one subject spoke under conditions of auditory masking and the other subject was allowed normal auditory information. The phonetically trained subject received topical anaesthesia before the TMJ block and spoke with masking noise in combination with these two procedures.
Measurement procedure. Individual utterances were input through a Ubiquitous spectrum analyzer to a Honeywell DDP-224 computer, using a 12.8 msec window and 40 Hz frequency resolution. The first and second formants of each utterance were measured from a spectral section display. As in previous experiments (e.g., Lindblom et al., 1979; Fowler & Turvey, 1980), acoustic measures of the isolated vowels were made at the first glottal pulse. For many English speakers the isolated vowels may not be truly static, that is, they may show some articulatory movement and thus some shifting of formant frequencies; nevertheless, the adopted procedure was to measure formant frequencies at the first glottal pulse. For the /p/-vowel-/p/ syllables, F1 and F2 values were taken from the point within the vowel at which F2 was most extreme. This point was chosen as the closest approximation to the "target" vowel formants.
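The selection rule for the /pVp/ syllables can be sketched as follows. The text does not define "most extreme" operationally, so the code assumes it means the frame at which F2 departs furthest from the average of its values at the vowel's edges; that criterion, the function name, and the example formant tracks are all hypothetical.

```python
def target_frame(f1_track, f2_track):
    """Return (index, F1, F2) at the frame where F2 is furthest from its edge values.

    Assumption: 'most extreme' is read as the largest absolute departure of F2
    from the mean of its first and last values within the vowel.
    """
    if len(f1_track) != len(f2_track) or not f2_track:
        raise ValueError("formant tracks must be non-empty and of equal length")
    baseline = (f2_track[0] + f2_track[-1]) / 2.0
    idx = max(range(len(f2_track)), key=lambda i: abs(f2_track[i] - baseline))
    return idx, f1_track[idx], f2_track[idx]


# Made-up tracks in Hz: F2 rises to a peak (as for a front vowel) ...
front = target_frame([400, 330, 300, 290, 310, 380],
                     [1500, 1900, 2250, 2300, 2100, 1600])
# ... or falls to a trough (as for a back vowel).
back = target_frame([420, 350, 320, 330, 400],
                    [1300, 1000, 850, 900, 1250])
print(front, back)
```

Measuring departure from the edge values lets one rule cover both a rising and a falling F2, which is why it was chosen for this illustration.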