Anatomical Structures for Speech Production



Acoustic Properties of Speech Sounds

CLSP Workshop 2 Acoustic Properties of Speech Sounds

Outline:
- Speech production
- Signal processing
- Properties of speech sounds of American English
- Microphone variations
- Spectrographic examples

Anatomical Structures for Speech Production

[Figure: anatomy of the vocal tract. Labeled structures: nasal cavity, hard palate, soft palate (velum), tongue, jaw, hyoid bone, epiglottis, thyroid cartilage, cricoid cartilage, vocal folds (vocal cords), trachea, esophagus, lung, sternum.]

Sub-Word Linguistic Units

- The phoneme is one of the most basic linguistic units used to represent the pronunciations of words.
- ASR systems typically represent words as phoneme sequences.
- English contains approximately 40 phonemes, which can be grouped by manner and place of articulation.

Manner Class    Number
Vowels          16
Fricatives      8
Stops           6
Semivowels      4
Nasals          3
Affricates      2
Aspirant        1

Phonemes in American English

(IPA = phonemic symbol, AB = ARPAbet label, Word = example word.)

IPA   AB  Word      IPA  AB  Word      IPA   AB  Word
/i/   iy  beat      /s/  s   see       /w/   w   wet
/ɪ/   ih  bit       /ʃ/  sh  she       /r/   r   red
/e/   ey  bait      /f/  f   fee       /l/   l   let
/ɛ/   eh  bet       /θ/  th  thief     /y/   y   yet
/æ/   ae  bat       /z/  z   z         /m/   m   meet
/a/   aa  bob       /ʒ/  zh  Gigi      /n/   n   neat
/ɔ/   ao  bought    /v/  v   v         /ŋ/   ng  sing
/ʌ/   ah  but       /ð/  dh  thee      /tʃ/  ch  church
/o/   ow  boat      /p/  p   pea       /dʒ/  jh  judge
/ʊ/   uh  book      /t/  t   tea       /h/   hh  heat
/u/   uw  boot      /k/  k   key
/ɝ/   er  bird      /b/  b   bay
/aɪ/  ay  bite      /d/  d   day
/ɔɪ/  oy  Boyd      /g/  g   geese
/aʊ/  aw  bout
/ə/   ax  about
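As a quick cross-check of the manner-class counts against the phoneme table, the inventory can be written down as a lookup from ARPAbet label to manner class. This is only a sketch: the dictionary name `ARPABET_MANNER` and the class strings are illustrative choices, not part of the slides.

```python
from collections import Counter

# ARPAbet label -> manner class, transcribed from the phoneme table.
# The dictionary name and the class strings are illustrative assumptions.
ARPABET_MANNER = {}
ARPABET_MANNER.update(dict.fromkeys(
    ["iy", "ih", "ey", "eh", "ae", "aa", "ao", "ah",
     "ow", "uh", "uw", "er", "ay", "oy", "aw", "ax"], "vowel"))
ARPABET_MANNER.update(dict.fromkeys(
    ["s", "sh", "f", "th", "z", "zh", "v", "dh"], "fricative"))
ARPABET_MANNER.update(dict.fromkeys(["p", "t", "k", "b", "d", "g"], "stop"))
ARPABET_MANNER.update(dict.fromkeys(["w", "r", "l", "y"], "semivowel"))
ARPABET_MANNER.update(dict.fromkeys(["m", "n", "ng"], "nasal"))
ARPABET_MANNER.update(dict.fromkeys(["ch", "jh"], "affricate"))
ARPABET_MANNER["hh"] = "aspirant"

# Tally the phonemes per manner class.
manner_counts = Counter(ARPABET_MANNER.values())

for manner, count in manner_counts.most_common():
    print(f"{manner:10s} {count}")
print("total", sum(manner_counts.values()))
```

The per-class tallies should reproduce the Manner Class table, and the total gives the "approximately 40" figure quoted for English.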

Places of Articulation for Speech Production

[Figure: vocal tract with the places of articulation labeled: labial, dental, alveolar, alveopalatal, palatal, velar, uvular.]

A Speech Waveform

[Figure: waveform of the utterance "Two plus seven is less than ten".]

Spectral Representations

- Speech waveforms are usually sampled at rates varying from 8K (telephone) to 20K (wide-band) samples/sec.
- ASR systems typically transform the waveform into a spectrum: a sequence of frequency-based analyses, usually performed at regular intervals (e.g., 10 ms).
- A short-time Fourier transform (STFT) performs a spectral analysis on waveform segments small enough that the speech signal can be assumed quasi-stationary.
- The waveform segment is created by a moving window, whose type (e.g., Hamming) and duration (e.g., 5-25 ms) have a significant impact on the resulting spectrum.
- A spectrogram is an image computed from the resulting spectrum, and is often used to examine the waveform.

Short-Time Fourier Transform

[Figure: the window w[n - m] sliding along the signal x[m], shown at three values of n.]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

If n is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

The above equation is meaningful only if we assume that $X(e^{j\omega})$ represents the Fourier transform of a signal whose properties continue outside the window, or simply that the signal is zero outside the window. In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an impulse with respect to $X(e^{j\omega})$.
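The short-time Fourier transform described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names `hamming` and `stft`, the frame length, and the hop size are all hypothetical, and each frame's phase is measured from the start of that frame rather than from absolute time m, so only the magnitudes match the definition exactly.

```python
import cmath
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft(x, frame_len, hop):
    """Return a list of frames; each frame holds the DFT bins 0..frame_len//2.

    Each frame is windowed with a Hamming window and transformed with a
    direct (slow) DFT, which keeps the example dependency-free.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        segment = [x[start + k] * window[k] for k in range(frame_len)]
        spectrum = [
            sum(segment[k] * cmath.exp(-2j * math.pi * b * k / frame_len)
                for k in range(frame_len))
            for b in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# Example: a 1 kHz tone sampled at 8 kHz, analysed with 8 ms frames (64 samples).
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
frames = stft(tone, frame_len=64, hop=32)
magnitudes = [abs(v) for v in frames[0]]
peak_bin = magnitudes.index(max(magnitudes))
print("peak at", peak_bin * fs / 64, "Hz")
```

Taking the log magnitude of every frame and stacking the frames side by side gives the image referred to as a spectrogram; a real system would use an FFT rather than this direct sum.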

Comparison of Windows

[Figures: window comparison plots, spread over two slides.]
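The window comparison plots themselves do not survive in this copy, so the following is only a sketch of the kind of comparison such slides usually make. It assumes the two windows of interest are the rectangular and Hamming windows (the slides mention Hamming only as an example) and measures how closely each window's transform resembles an impulse by its peak sidelobe level. The function names and the 64-sample length are hypothetical.

```python
import cmath
import math

def rectangular(n):
    """Rectangular window of length n."""
    return [1.0] * n

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def peak_sidelobe_db(window, n_freqs=1024):
    """Peak sidelobe level in dB relative to the main-lobe peak at omega = 0."""
    mags = []
    for i in range(n_freqs):
        omega = math.pi * i / n_freqs  # sample the range 0..pi
        value = sum(w * cmath.exp(-1j * omega * k) for k, w in enumerate(window))
        mags.append(abs(value))
    # Walk down the main lobe until the magnitude stops decreasing.
    edge = 1
    while edge < n_freqs - 1 and mags[edge + 1] < mags[edge]:
        edge += 1
    # Everything beyond the main-lobe edge is sidelobe.
    return 20 * math.log10(max(mags[edge:]) / mags[0])

rect_db = peak_sidelobe_db(rectangular(64))
hamm_db = peak_sidelobe_db(hamming(64))
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamm_db:.1f} dB")
```

The trade-off is the usual one: the Hamming window leaks far less energy into distant frequencies, at the cost of a main lobe roughly twice as wide as the rectangular window's.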

A Wide-Band Speech Spectrogram

[Figure: wide-band spectrogram of the utterance "Two plus seven is less than ten".]

A Narrow-Band Speech Spectrogram

[Figure: narrow-band spectrogram of the utterance "Two plus seven is less than ten".]

Spectral Averages: Corpus and Representation

TIMIT acoustic-phonetic corpus:
- Phonetic transcription aligned with the waveform
- Native speakers of American English (8 dialects)
- 8 sentences/speaker (dialect sentences excluded)
- 136 female, 326 male speakers (NIST train set)
- 3,696 utterances, 142,91 tokens

Mel-Frequency Spectral Coefficients (MFSCs):
- Mel-frequency scale (linear < 1 kHz, log > 1 kHz)
- 40 channels (2 Hz khz)
- 25 ms Hamming window, 5 ms frame-rate
- Average computed over the entire phonetic token (for stops, the spectral slice at release was used)

Happy Little Vowel Chart

"So inaccurate, yet so useful."

[Figure: vowel chart. F2 increases toward FRONT; F1 increases toward LOW.]

        FRONT                  BACK
HIGH    i  ɪ     ü      ʊ  u
MID     e  ɛ    ʌ, ə    ɔ  o
LOW              a

- TENSE = Towards Edges; tends to be longer.
- LAX = Towards Center; tends to be shorter.
- Think F3 is mighty low? Your pal ɝ is the way to go!

SCHWAS:
- Plain (ə): about /əbaʊt/
- Front (ɨ): roses /roʊzɨz/
- Retroflex (ɚ): forever /fɚevɝ/
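The mel-frequency scale mentioned above (roughly linear below 1 kHz, logarithmic above) is often approximated by a single closed-form warping. The sketch below uses one common formula, 2595 log10(1 + f/700); the slides do not say which variant was used, so treat this as an assumption. The function names, the 40-channel count, and the 100 Hz to 8 kHz range are likewise illustrative values rather than figures taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    """Common mel approximation; about 1000 mel at 1000 Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_channels, f_low_hz, f_high_hz):
    """Center frequencies (Hz) of n_channels filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_channels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_channels)]

# Hypothetical filterbank layout: 40 channels between 100 Hz and 8 kHz.
centers = mel_center_frequencies(40, 100.0, 8000.0)
print(round(centers[0]), round(centers[-1]))
```

Equal spacing in mel gives narrow, closely packed channels at low frequencies and progressively wider spacing at high frequencies, which is the behaviour the "linear below, log above" description summarizes.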

Friendly Little Consonant Chart

"Somewhat more accurate, yet somewhat less useful."

Place of articulation runs across, manner of articulation runs down. In each pair the unvoiced consonant is listed first and the voiced one second.

Manner      Labial   Dental   Alveolar   Palatal   Velar
Stop        p b               t d                  k g
Fricative   f v      θ ð      s z        ʃ ʒ
Nasal       m                 n                    ŋ

Fricatives: f, v, θ, ð are weak (non-strident); s, z, ʃ, ʒ are strong (strident).

The Semi-vowels:
- y is like an extreme i
- w is like an extreme u
- l is like an extreme o
- r is like an extreme ɝ

The Affricates:
- tʃ is like t+ʃ
- dʒ is like d+ʒ

The Odds and Ends:
- h (unvoiced h)
- ɦ (voiced h)
- ɾ (flap)
- ɾ̃ (nasalized flap)
- ʔ (glottal stop)

Vowel Production

- No significant constriction in the vocal tract
- Usually produced with periodic excitation
- Acoustic characteristics depend on the position of the jaw, tongue, and lips

[Figure: vocal tract configurations for [i], [æ], [a], [u].]

Vowels of American English

- There are approximately 18 vowels in American English, made up of monophthongs, diphthongs, and reduced vowels (schwas).
- They are often described by the articulatory features High/Low, Front/Back, Retroflexed, Rounded, and Tense/Lax.

IPA   AB  Word      IPA   AB  Word      IPA   AB   Word
/i/   iy  beat      /ɔ/   ao  bought    /aɪ/  ay   bite
/ɪ/   ih  bit       /ʌ/   ah  but       /ɔɪ/  oy   Boyd
/e/   ey  bait      /o/   ow  boat      /aʊ/  aw   bout
/ɛ/   eh  bet       /ʊ/   uh  book      [ə]   ax   about
/æ/   ae  bat       /u/   uw  boot      [ɨ]   ix   roses
/a/   aa  Bob       /ɝ/   er  Bert      [ɚ]   axr  butter

Vowel Formant Averages

- Vowels are often characterized by F1, F2, and F3.
- High/Low is correlated with F1.
- Front/Back is correlated with F2.
- Retroflexion is marked by a low F3.

[Figure: average F1, F2, and F3 frequency (Hz) for each vowel, in separate panels for female and male speakers.]

Vowel Formant Trajectories

- Diphthongs can have significant formant motion.
- Most vowels in American English are somewhat diphthongized.

[Figure: vowel formant trajectories, in separate panels for female and male speakers.]

Vowel Durations

- Each vowel has a different intrinsic duration.
- Schwas have distinctly shorter durations (50 ms).
- /ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs.
- Context can greatly influence vowel duration.

[Figure: average duration (ms) for each vowel, in separate panels for female and male speakers.]

Fricative Production

- Turbulence produced at a narrow constriction
- Constriction position determines acoustic characteristics
- Can be produced with periodic excitation

[Figure: vocal tract configurations for [f], [θ], [s], [ʃ].]

Fricatives of American English

- There are 8 fricatives in American English.
- They are often described by the features Strident/Non-Strident (Strong/Weak) and Voiced/Unvoiced.
- Four places of articulation: Labial, Dental, Alveolar, and Palatal.

Type       Unvoiced          Voiced
Labial     /f/  f   fee      /v/  v   v
Dental     /θ/  th  thief    /ð/  dh  thee
Alveolar   /s/  s   see      /z/  z   z
Palatal    /ʃ/  sh  she      /ʒ/  zh  Gigi

Fricative Energy

[Figure: probability density (unadjusted for frequency) of average total energy, for non-strident and strident fricatives.]

Strident fricatives tend to be stronger than non-strident fricatives.

Fricative Durations

[Figure: probability density (unadjusted for frequency) of duration, for unvoiced and voiced fricatives.]

Voiced fricatives tend to be shorter than unvoiced fricatives.

Nasal Production

- Velum lowering results in airflow through the nasal cavity.
- Consonants are produced with closure in the oral cavity.
- Nasalized vowels have output through the oral and nasal cavities.
- Nasal murmurs have similar spectral characteristics.

[Figure: vocal tract configurations for [m], [n], [ŋ].]

Nasal Consonants of American English

- Three places of articulation: Labial, Alveolar, and Velar.
- Always attached to a vowel, though a nasal can form an entire syllable in unstressed environments ([n̩], [m̩], [ŋ̩]).
- /ŋ/ is always post-vocalic.
- Place is identified by neighboring formant transitions.

Type       Nasal
Labial     /m/  m   me
Alveolar   /n/  n   knee
Velar      /ŋ/  ng  sing

Nasal Durations

[Figure: nasal duration (ms) for singleton nasals, nasals in unvoiced clusters, and nasals in voiced clusters.]

Nasal consonants tend to be shorter in clusters with unvoiced consonants, and longer in clusters with voiced consonants.

Semivowel Production

- Constriction in the vocal tract, but no turbulence
- Slower articulatory motion than other consonants
- Laterals form a complete closure with the tongue tip; airflow passes via the sides of the constriction

[Figure: vocal tract configurations for [w], [y], [r], [l].]

Semivowels of American English

- There are 4 semivowels in American English.
- Always attached to a vowel, though /l/ can form an entire syllable in unstressed environments ([l̩]).
- Each is an extreme articulation of a corresponding vowel, with similar formant positions.
- Generally weaker than the vowel because of the constriction.

Type      Semivowel        Nearest Vowel
Glides    /w/  w  wet      /u/
          /y/  y  yet      /i/
Liquids   /r/  r  red      /ɝ/
          /l/  l  let      /o/

Acoustic Properties of Semivowels

- /w/ is characterized by a very low F1 and F2, typically with a rapid spectral falloff above F2.
- /y/ is characterized by a very low F1 and a very high F2.
- /r/ is characterized by a very low F3. Prevocalic F3 < medial F3 < postvocalic F3.
- /l/ is characterized by a low F1 and F2, often with high-frequency energy present. Postvocalic /l/ is characterized by minimal spectral discontinuity and gradual motion of the formants.

Aspirant Production (/h/ in American English)

- Turbulence excitation at the glottis
- No constriction in the vocal tract, normal formant excitation
- Coupling with the subglottal system results in little energy in the F1 region
- Periodic excitation can be present in medial position

Stop Production

- Complete closure in the vocal tract, with pressure build-up
- Sudden release of the constriction, producing turbulence noise
- Can have periodic excitation during closure

[Figure: vocal tract configurations for [b], [d], [g].]

Stops of American English

- There are 6 stop consonants in American English.
- Same places of articulation as the nasal consonants.
- Unvoiced stops are typically aspirated.
- Voiced stops usually exhibit a voice-bar during closure.
- Information about formant transitions and the release is useful for classification.

Type       Voiced             Unvoiced
Labial     /b/  b  bee        /p/  p  pea
Alveolar   /d/  d  Dee        /t/  t  tea
Velar      /g/  g  geese      /k/  k  key

Singleton Stop Durations

[Figure: VOT duration (ms) for the singleton stops b, d, g, p, t, k.]

The voice onset time (VOT) of unvoiced stops is longer than that of voiced stops.

/s/-Stop Durations

[Figure: VOT duration (ms) for p, t, k in /s/-stop sequences.]

Unvoiced stops are unaspirated in /s/-stop sequences.

Stop-Semivowel Durations

[Figure: VOT duration (ms) for b, d, g, p, t, k, comparing singletons with [stop][semivowel] clusters.]

Semivowels are partially devoiced in stop-semivowel sequences.

Voicing Cues for Stops

There are many voicing cues for a stop.

Affricate Production

- Alveolar-stop, palatal-fricative pairs
- Sudden release of the constriction, producing turbulence noise
- Can have periodic excitation during closure

Affricates of American English

There are two affricates in American English.

Voiced               Unvoiced
/dʒ/  jh  judge      /tʃ/  ch  church


Speech from a Close-Talking Microphone

[Figure: analysis display of the utterance "The Thinker is a famous sculpture" against time (seconds). Panels: zero crossing rate (kHz), total energy, band-limited energy, wide-band spectrogram (kHz), waveform. File: /server/users/jwc/latex/sum97/sennheiser.wav]

Speech from an Omni-Directional Microphone

[Figure: the same display and utterance, recorded with an omni-directional microphone. File: /server/users/jwc/latex/sum97/bk.wav]
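Two of the panels in these displays, zero crossing rate and total energy, are simple frame-level measurements. The sketch below shows one plausible way to compute them; the function name `frame_features`, the 25 ms frame, and the 10 ms hop are assumptions for illustration and are not taken from the plotting tool that produced the figures.

```python
import math

def frame_features(x, fs, frame_len, hop):
    """Return (zero_crossing_rate_hz, energy_db) for each frame of x.

    The zero crossing rate counts sign changes per second; the energy is
    the mean squared amplitude of the frame, expressed in dB.
    """
    features = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        zcr_hz = crossings * fs / (frame_len - 1)
        energy = sum(s * s for s in frame) / frame_len
        energy_db = 10 * math.log10(energy) if energy > 0 else float("-inf")
        features.append((zcr_hz, energy_db))
    return features

# Example: a 1 kHz tone at 16 kHz sampling, 25 ms frames, 10 ms hop.
fs = 16000
tone = [math.sin(2 * math.pi * 1000 * n / fs + 0.1) for n in range(1600)]
features = frame_features(tone, fs, frame_len=400, hop=160)
print(features[0])
```

A pure tone crosses zero twice per period, so the rate sits near twice the tone frequency. In speech, a high zero crossing rate with low energy is typical of fricatives, while voiced sounds show the opposite pattern.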


Speech over a Telephone Channel

[Figure: analysis display of the utterance "The Thinker is a famous sculpture" against time (seconds), recorded over a telephone channel. Panels: zero crossing rate (kHz), total energy, band-limited energy, wide-band spectrogram (kHz), waveform. File: /server/users/jwc/latex/sum97/telephone.wav]
