Scribe for Monday 1/10/05

A new set of slides was handed out (slides for Lecture 3, "Lecture 3/4") but not used much today. The derivation of the one-dimensional wave equation, which describes the propagation of a wave with speed v [1], is listed in Lecture 3's outline but won't be covered. We finished the previous lecture (from the Lecture 2 slides) and looked at and listened to an acoustic tube example. This demonstrated that the human vocal tract can be modeled as an acoustic tube [2], which is the focus of the lecture following this one.

Transcription in Speech Recognition

Here we note the difference between phonemic and phonetic transcriptions. According to Wikipedia [3], "In spoken language, a phoneme is a basic, theoretical unit of sound that can distinguish words ... a succinct way to describe the idea of a phoneme is 'the smallest difference that makes a difference.'" Phonetic refers to how a vocal system forms the sounds: "Phonetics (from the Greek word phone = sound/voice) is the study of speech sounds (voice). It is concerned with the actual nature of the sounds and their production." [3]

This also relates to baseforms and surface forms of speech, where the baseform is the ideal text which is to be spoken while the surface form is the actual expression of the speech. Speech recognition is typically phonetic. IPA [4] and ARPAbet [5] are two phonetic alphabets commonly used in speech recognition. Here are a few examples from ARPAbet's phonetic alphabet:

Symbol   Description                       Example Word   Transcription
p        voiceless bilabial stop           put            [ p uh t ]
ng       voiced velar nasal                sing           [ s ih ng ]
n        voiced alveolar nasal             night          [ n ay t ]
f        voiceless labiodental fricative   find           [ f ay n d ]

Table 1: ARPAbet Examples

There are numerous properties of speech used to categorize phones. Place of articulation describes where in the vocal system the sounds are generated (where the tongue is, where the teeth are, etc.); a few examples are listed below.
Location      Description                              Sample Word
Bilabial      Lips Closed Together                     mmm
Labiodental   Lower Lip and Upper Teeth                fit
Alveolar      Tongue Behind Upper Teeth                dune
Palatal       Tongue Behind Ridge Behind Upper Teeth   nog

Table 2: Some Places of Articulation

There is a good interactive webpage demonstrating place and manner of articulation at http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html [6]. This applet allows selection of numerous articulation settings, then shows the corresponding modification to a graphical cross section of the vocal tract.

Manner of articulation is another one of these properties. Continuant (steady-state) and non-continuant (transient) sounds are distinguished by whether an articulator is moving during the sound production. With continuant sounds, the passage of air is restricted, but not completely stopped. Articulators do not move in the production of continuant sounds. Continuants are sometimes loosely called fricatives [7], although fricatives are strictly only one class of continuant. Non-continuant sounds are those in which a change in the vocal tract configuration is required during the production of the sound [8].

More than one sound can be made with the same articulator placement! These different sounds are determined by whether the sound is voiced or unvoiced. For example, "foo" and "voo" sound different, but if you whisper the two, the "v" sound is no longer voiced and sounds just like the "f".

Vowels are distinguished by several characteristics:
- Large amplitude
- Long duration (40-400 ms)
- Distinguished by tongue hump and degree of constriction

Every language has a schwa vowel sound. It's just that popular. Some languages have vowel sounds that others don't. Pronunciation (intuitively) varies with the speaker. Vowel degrees of constriction range from high to low, depending on (you guessed it) constriction of the airway. Tongue hump position ranges from front to back. As a general rule, going from low to high constriction (e.g. "a" to "ee") tends to lower the F1 formant.

Formants and Frequency

Singers and musicians sometimes use software which shows them the formants of their voice in real time in order to help them keep their sounds more steady [9]. Formants are the resonance frequencies of the vocal tract. Formants are not harmonics of the fundamental frequency. If you have a system G(S) as shown in Figure 1 below, the frequency content of the output signal Y(S) depends on the frequency content of the input signal U(S). If the system is designed
to model the frequency response of the vocal tract, then it can be treated as a linear time-invariant (LTI) system only over brief intervals of time in which the formants are not changing. The frequency response of the vocal tract is determined by the formants, not by the frequency of the output. As a reminder, the output of an LTI system is the convolution of the input with the system's impulse response g, which in the transform domain is a product:

$$Y(S) = U(S)\,G(S)$$

or

$$y(t) = \int_{-\infty}^{\infty} u(\tau)\, g(t - \tau)\, d\tau$$

or

$$y(n) = \sum_{k} u(k)\, g(n - k)$$

Figure 1: System Block Diagram

If the first two formants of several vowels are plotted (F1 vs. F2), a rough triangle [10] is formed (see Figure 2). This usually doesn't work very well for determining vowels. On top of that difficulty, it's pretty tough to find formants automatically. To see this more clearly, we could record 5-10 seconds of a vowel without changing the articulators, view the spectrogram and FFT, and try to guess the formants from looking at these analyses. Formants with a higher center frequency occupy a greater bandwidth.

Figure 2: Vowel Triangle for some Common Vowel Sounds

The sound of a diphthong is defined by movement of the articulators. Diphthongs are therefore non-continuants.
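As a small illustration of the discrete convolution sum y(n) = sum over k of u(k) g(n - k) from the LTI reminder in the Formants and Frequency section, here is a minimal Python sketch. It is not from the lecture: the function name convolve and the three-sample impulse response g are made up for illustration, and g stands in for a vocal tract "frozen" over a short interval. The sketch checks the two properties that the LTI assumption gives us: a delayed input produces an equally delayed output, and a weighted sum of inputs produces the same weighted sum of outputs.

```python
def convolve(u, g):
    """Discrete convolution y(n) = sum_k u(k) * g(n - k) for finite sequences."""
    y = [0.0] * (len(u) + len(g) - 1)
    for k, u_k in enumerate(u):
        for m, g_m in enumerate(g):
            # The term u(k) * g(m) contributes to output sample n = k + m.
            y[k + m] += u_k * g_m
    return y


# Hypothetical impulse response (an assumption, not a measured vocal tract).
g = [1.0, 0.5, 0.25]

impulse = [1.0, 0.0, 0.0, 0.0]   # unit impulse at n = 0
delayed = [0.0, 1.0, 0.0, 0.0]   # the same impulse delayed by one sample
mixed = [2.0, 1.0, 0.0, 0.0]     # 2 * impulse + 1 * delayed

y_impulse = convolve(impulse, g)  # reproduces g, padded with zeros
y_delayed = convolve(delayed, g)  # g shifted by one sample (time invariance)
y_mixed = convolve(mixed, g)      # 2 * y_impulse + y_delayed (linearity)

print(y_impulse)
print(y_delayed)
print(y_mixed)
```

If the formants change during the interval, g itself changes with time and a single fixed g no longer describes the system, which is why the LTI model only holds over brief intervals.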
Stops/Plosives involve the buildup and release of air pressure behind an articulator.

The McGurk effect is a phenomenon where a person's perception of what is being said may depend on what the person is seeing. We watched a video clip of a person's lips moving while listening to a repetitive audio clip. The results of this experiment are in Table 3.

Class (perceived)   Visual   Audio
ba                  ba       ba
va                  va       ba
za                  tha      ba
ah                  ga       ba

Table 3: McGurk Effect

Prosody is like the musical part of speech. It refers to the rhythmic and pitch changes in a word which can affect its meaning. Sarcasm is one way to use prosody to alter the meaning of a word or phrase. Musical notation isn't really precise enough to properly represent the pitch contours of prosody. In some languages, pitch affects meaning. Usually absolute pitch does not carry much information, because humans with perfect pitch (people who can listen to a tone and accurately identify its absolute pitch) are rare.

As a preview for next lecture, we listened to a duck whistle with a variety of acoustic tube attachments. The duck whistle mimicked the glottal sounds, while the tubes altered the duck call to sound like certain vowel sounds by mimicking the constrictions of the vocal tract.

REFERENCES

[1] Wave Equation Description, http://mathworld.wolfram.com/waveequation.html
[2] Human Vocal Tract as an Acoustic Tube, http://ccrma.stanford.edu/~bilbao/master/node5.html
[3] Wikipedia, http://en.wikipedia.org/wiki/
[4] IPA Alphabet Chart, http://www.arts.gla.ac.uk/ipa/ipachart.html
[5] ARPAbet Alphabet Chart, http://www.billnet.org/phon/arpabet.html
[6] Articulation Applet, http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
[7] Continuant Sounds, http://www.inthebeginning.org/ntgreek/phonics/continuant.htm
[8] Speech Terms, http://www.research.ibm.com/people/l/lvsubram/teaching/speech/speechterms.htm
[9] Video Voice Software, http://www.videovoice.com/
[10] Vowel Triangle Image, http://isl.ira.uka.de/speechcourse/slides/nature/acoustics/formants/formants.gif