Anatomical Structures for Speech Production


Acoustic Properties of Speech Sounds
CLSP Workshop 2

  Speech production
  Signal processing
  Properties of speech sounds of American English
  Microphone variations
  Spectrographic examples

Anatomical Structures for Speech Production
[Figure: diagram of the vocal tract. Labeled structures: nasal cavity, hard palate, soft palate (velum), tongue, jaw, hyoid bone, epiglottis, thyroid cartilage, cricoid cartilage, vocal cords (vocal folds), trachea, esophagus, lung, sternum.]

Sub-Word Linguistic Units
  The phoneme is one of the most basic linguistic units used to represent pronunciations of words.
  ASR systems typically represent words as phoneme sequences.
  English contains approximately 40 phonemes, which can be grouped by manner and place of articulation.

  Manner Class   Number
  Vowels         16
  Fricatives      8
  Stops           6
  Semivowels      4
  Nasals          3
  Affricates      2
  Aspirant        1

Phonemes in American English
(AB = ARPAbet symbol)

  IPA    AB   Word     IPA    AB   Word     IPA    AB   Word
  /i/    iy   beat     /s/    s    see      /w/    w    wet
  /ɪ/    ih   bit      /ʃ/    sh   she      /r/    r    red
  /e/    ey   bait     /f/    f    fee      /l/    l    let
  /ɛ/    eh   bet      /θ/    th   thief    /y/    y    yet
  /æ/    ae   bat      /z/    z    z        /m/    m    meet
  /ɑ/    aa   Bob      /ʒ/    zh   Gigi     /n/    n    neat
  /ɔ/    ao   bought   /v/    v    v        /ŋ/    ng   sing
  /ʌ/    ah   but      /ð/    dh   thee     /tʃ/   ch   church
  /o/    ow   boat     /p/    p    pea      /dʒ/   jh   judge
  /ʊ/    uh   book     /t/    t    tea      /h/    hh   heat
  /u/    uw   boot     /k/    k    key
  /ɝ/    er   bird     /b/    b    bay
  /aɪ/   ay   bite     /d/    d    day
  /ɔɪ/   oy   Boyd     /g/    g    geese
  /aʊ/   aw   bout
  /ə/    ax   about
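As a rough illustration of representing words as phoneme sequences, the sketch below stores a tiny pronunciation lexicon using the ARPAbet symbols from the table and groups the symbols by manner class. The names `MANNER_CLASSES`, `LEXICON`, and `manner_sequence` are hypothetical, and the pronunciations are illustrative transcriptions written for this example rather than entries from any particular dictionary.

```python
# Manner classes and ARPAbet symbols as listed in the slide tables.
MANNER_CLASSES = {
    "vowel": "iy ih ey eh ae aa ao ah ow uh uw er ay oy aw ax".split(),
    "fricative": "s sh f th z zh v dh".split(),
    "stop": "p t k b d g".split(),
    "semivowel": "w r l y".split(),
    "nasal": "m n ng".split(),
    "affricate": "ch jh".split(),
    "aspirant": ["hh"],
}

# Reverse lookup: ARPAbet symbol -> manner class.
PHONE_TO_MANNER = {
    phone: manner
    for manner, phones in MANNER_CLASSES.items()
    for phone in phones
}

# Hypothetical mini-lexicon (illustrative transcriptions only).
LEXICON = {
    "beat": ["b", "iy", "t"],
    "she": ["sh", "iy"],
    "sing": ["s", "ih", "ng"],
    "church": ["ch", "er", "ch"],
    "judge": ["jh", "ah", "jh"],
}

def manner_sequence(word):
    """Return the manner class of each phoneme in a lexicon entry."""
    return [PHONE_TO_MANNER[phone] for phone in LEXICON[word]]

if __name__ == "__main__":
    print(len(PHONE_TO_MANNER), "phonemes in the inventory")
    for word in LEXICON:
        print(word, LEXICON[word], manner_sequence(word))
```

Counting the entries in `PHONE_TO_MANNER` is a quick consistency check on the manner-class table, since the per-class counts should add up to the size of the inventory.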

Places of Articulation for Speech Production
[Figure: vocal tract diagram marking the places of articulation: labial, dental, alveolar, alveopalatal, palatal, velar, uvular.]

A Speech Waveform
[Figure: waveform of the utterance "Two plus seven is less than ten".]

Spectral Representations
  Speech waveforms are usually sampled at rates varying from 8K (telephone) to 20K (wide-band) samples/sec.
  ASR systems typically transform the waveform into a spectrum: a sequence of frequency-based analyses, usually performed at regular intervals (e.g., 10 ms).
  A short-time Fourier transform (STFT) performs a spectral analysis on waveform segments short enough that the speech signal can be assumed to be quasi-stationary.
  The waveform segment is created by a moving window, whose type (e.g., Hamming) and duration (e.g., 5-25 ms) have a significant impact on the resulting spectrum.
  A spectrogram is an image computed from the resulting spectrum, and is often used to examine the waveform.

Short-Time Fourier Transform
[Figure: the window w[n - m] shown at three positions n along the signal x[m].]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

If n is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

The above equation is meaningful only if we assume that X(e^{jω}) represents the Fourier transform of a signal whose properties continue outside the window, or simply that the signal is zero outside the window.
In order for X_n(e^{jω}) to correspond to X(e^{jω}), W(e^{jω}) must resemble an impulse with respect to X(e^{jω}).
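The short-time analysis described above can be sketched with NumPy as follows. This is a minimal illustration, not the slides' own implementation: the function names `stft` and `log_spectrogram` and the default window and hop durations are assumptions chosen for the example. Each frame's phase is referenced to the start of the frame rather than to m = 0, which differs from the definition only by a linear phase term and leaves the magnitude unchanged.

```python
import numpy as np

def stft(x, fs, win_ms=25.0, hop_ms=10.0):
    """Short-time Fourier transform using a Hamming window.

    x is a 1-D NumPy array with at least one window of samples.
    win_ms and hop_ms are illustrative defaults (assumptions).
    Returns a complex array of shape (num_frames, win_len // 2 + 1).
    """
    win_len = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(win_len)
    num_frames = 1 + (len(x) - win_len) // hop
    # Window each segment, then take the one-sided DFT of every frame.
    frames = np.stack([
        x[i * hop : i * hop + win_len] * window
        for i in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def log_spectrogram(x, fs, **kwargs):
    """Log-magnitude (dB) spectrogram; rows are frames, columns are frequency bins."""
    magnitude = np.abs(stft(x, fs, **kwargs))
    return 20.0 * np.log10(magnitude + 1e-10)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 1000.0 * t)  # 1 kHz test tone, 1 second long
    spec = log_spectrogram(tone, fs)
    peak_bin = int(np.argmax(spec.mean(axis=0)))
    print(spec.shape, "peak near", peak_bin * fs / 400, "Hz")
```

Changing `win_ms` is the simplest way to see how the window duration affects the resulting spectrum.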

Comparison of Windows
[Figure: comparison of analysis windows, continued over two slides.]

A Wide-Band Speech Spectrogram
[Figure: wide-band spectrogram of the utterance "Two plus seven is less than ten".]

A Narrow-Band Speech Spectrogram
[Figure: narrow-band spectrogram of the utterance "Two plus seven is less than ten".]
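The difference between the two displays comes from the analysis window: a wide-band spectrogram uses a short window (fine time resolution, coarse frequency resolution), while a narrow-band spectrogram uses a long window (coarse time resolution, fine enough frequency resolution to separate pitch harmonics). The sketch below estimates the frequency resolution from the window duration, using the standard approximation that the null-to-null main-lobe width of a Hamming window of duration T seconds is about 4/T Hz. The function name and the two example durations are assumptions for illustration; the slides do not state which durations were used for these figures.

```python
def hamming_mainlobe_width_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Uses the textbook approximation 4 / T for a window lasting T seconds.
    """
    window_s = window_ms / 1000.0
    return 4.0 / window_s

if __name__ == "__main__":
    # Example durations taken from the 5-25 ms range mentioned in the slides.
    for window_ms in (5.0, 25.0):
        width = hamming_mainlobe_width_hz(window_ms)
        print(f"{window_ms:4.1f} ms window: main lobe about {width:.0f} Hz wide")
```

With a short window the main lobe is several hundred hertz wide, so harmonics spaced at a typical fundamental frequency blur together and the formant structure dominates; with a long window the individual harmonics become visible.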

Spectral Averages: Corpus and Representation
  TIMIT acoustic-phonetic corpus
    phonetic transcription aligned with waveform
    native speakers of American English (8 dialects)
    8 sentences/speaker (dialect sentences excluded)
    136 female, 326 male speakers (NIST train set)
    3,696 utterances, more than 142,000 tokens
  Mel-Frequency Spectral Coefficients (MFSCs)
    Mel-frequency scale (linear below 1 kHz, logarithmic above 1 kHz)
    40 channels, up to 6.4 kHz
    25 ms Hamming window, 5 ms frame rate
  Average computed over the entire phonetic token (for stops, the spectral slice at release was used)

Happy Little Vowel Chart
"So inaccurate, yet so useful."
[Figure: vowel chart. Horizontal axis: front to back, with F2 increasing toward the front. Vertical axis: high to low, with F1 increasing toward low.]

          FRONT              BACK
  HIGH    i  ɪ      ʉ      ʊ  u
  MID     e  ɛ     ʌ, ə    ɔ  o
  LOW     æ                ɑ

  TENSE = Towards Edges; tends to be longer.
  LAX = Towards Center; tends to be shorter.
  Think F3 is mighty low? Your pal ɝ is the way to go!
  Schwas:
    Plain (ə): about /əbaʊt/
    Front (ɨ): roses /rozɨz/
    Retroflex (ɚ): forever /fɚevɝ/
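To make the mel-frequency scale concrete, the sketch below places channel centre frequencies evenly on a mel axis. The slides describe the scale only as roughly linear below 1 kHz and logarithmic above it; the specific warping formula used here (2595 log10(1 + f/700)) is a commonly used approximation and is an assumption, not necessarily the one behind the slides' MFSCs. The function names, the default of 40 channels, and the lower band edge `f_low` are likewise placeholders for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (common approximation; an assumption here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_center_frequencies(num_channels=40, f_low=0.0, f_high=6400.0):
    """Centre frequencies (Hz) of channels spaced evenly on the mel scale.

    f_low is a placeholder value; f_high follows the 6.4 kHz upper edge.
    """
    edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_channels + 2)
    return mel_to_hz(edges[1:-1])  # drop the two outer band edges

if __name__ == "__main__":
    centers = mel_center_frequencies()
    spacing = np.diff(centers)
    print("lowest channels (Hz): ", np.round(centers[:3], 1))
    print("highest channels (Hz):", np.round(centers[-3:], 1))
    print("spacing grows from", round(float(spacing[0]), 1),
          "to", round(float(spacing[-1]), 1), "Hz")
```

The channel spacing in hertz widens with frequency, which is the practical effect of the logarithmic region of the scale.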

Friendly Little Consonant Chart
"Somewhat more accurate, yet somewhat less useful."

Manner of articulation by place of articulation (in each pair, the unvoiced consonant comes first and the voiced one second):

               Labial   Dental   Alveolar   Palatal   Velar
  Stop         p b               t d                  k g
  Fricative    f v      θ ð      s z        ʃ ʒ
  Nasal        m                 n                    ŋ

  Weak (non-strident) fricatives: f v θ ð
  Strong (strident) fricatives: s z ʃ ʒ

  The Semi-vowels:
    y is like an extreme i
    w is like an extreme u
    l is like an extreme o
    r is like an extreme ɝ
  The Affricates:
    tʃ is like t+ʃ
    dʒ is like d+ʒ
  The Odds and Ends:
    h (unvoiced h), ɦ (voiced h), ɾ (flap), ɾ̃ (nasalized flap), ʔ (glottal stop)

Vowel Production
  No significant constriction in the vocal tract
  Usually produced with periodic excitation
  Acoustic characteristics depend on the position of the jaw, tongue, and lips
[Figure: vocal tract configurations for [i], [æ], [ɑ], [u].]

Vowels of American English
  There are approximately 18 vowels in American English, made up of monophthongs, diphthongs, and reduced vowels (schwas).
  They are often described by the articulatory features High/Low, Front/Back, Retroflexed, Rounded, and Tense/Lax.

  /i/   iy   beat     /ɔ/   ao   bought   /aɪ/   ay    bite
  /ɪ/   ih   bit      /ʌ/   ah   but      /ɔɪ/   oy    Boyd
  /e/   ey   bait     /o/   ow   boat     /aʊ/   aw    bout
  /ɛ/   eh   bet      /ʊ/   uh   book     [ə]    ax    about
  /æ/   ae   bat      /u/   uw   boot     [ɨ]    ix    roses
  /ɑ/   aa   Bob      /ɝ/   er   Bert     [ɚ]    axr   butter

Vowel Formant Averages
  Vowels are often characterized by F1, F2, and F3.
  High/Low is correlated with F1.
  Front/Back is correlated with F2.
  Retroflexion is marked by a low F3.
[Figure: average frequency (Hz) of F1, F2, and F3 for each vowel. Panels: Female Speakers, Male Speakers.]

Vowel Formant Trajectories
  Diphthongs can have significant formant motion.
  Most vowels in American English are somewhat diphthongized.
[Figure: vowel trajectories plotted as F2 against F1. Panels: Female Speakers, Male Speakers.]

Vowel Durations
  Each vowel has a different intrinsic duration.
  Schwas have distinctly shorter durations (50 ms).
  /ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs.
  Context can greatly influence vowel duration.
[Figure: average duration (ms) of each vowel. Panels: Female Speakers, Male Speakers.]

Fricative Production
  Turbulence produced at narrow constriction
  Constriction position determines acoustic characteristics
  Can be produced with periodic excitation
[Figure: vocal tract configurations for [f], [θ], [s], [ʃ].]

Fricatives of American English
  There are 8 fricatives in American English.
  They are often described by the features Strident/Non-Strident (Strong/Weak) and Voiced/Unvoiced.
  Four places of articulation: Labial, Dental, Alveolar, and Palatal.

  Type       Unvoiced           Voiced
  Labial     /f/   f    fee     /v/   v    v
  Dental     /θ/   th   thief   /ð/   dh   thee
  Alveolar   /s/   s    see     /z/   z    z
  Palatal    /ʃ/   sh   she     /ʒ/   zh   Gigi

Fricative Energy
[Figure: probability density (unadjusted for frequency) of average total energy for non-strident and strident fricatives.]
  Strident fricatives tend to be stronger than non-strident fricatives.

Fricative Durations
[Figure: probability density (unadjusted for frequency) of duration for unvoiced and voiced fricatives.]
  Voiced fricatives tend to be shorter than unvoiced fricatives.

Nasal Production
  Velum lowering results in airflow through the nasal cavity.
  Consonants are produced with closure in the oral cavity.
  Nasalized vowels have output through both the oral and nasal cavities.
  Nasal murmurs have similar spectral characteristics.
[Figure: vocal tract configurations for [m], [n], [ŋ].]

Nasal Consonants of American English
  Three places of articulation: Labial, Alveolar, and Velar.
  Always attached to a vowel, though they can form an entire syllable in unstressed environments ([n̩], [m̩], [ŋ̍]).
  /ŋ/ is always post-vocalic.
  Place is identified by neighboring formant transitions.

  Type       Nasal
  Labial     /m/   m    me
  Alveolar   /n/   n    knee
  Velar      /ŋ/   ng   sing

Nasal Durations
[Figure: nasal duration (ms) for singletons, unvoiced clusters, and voiced clusters.]
  Nasal consonants tend to be shorter in clusters with unvoiced consonants, and longer in clusters with voiced consonants.

Semivowel Production
  Constriction in vocal tract, no turbulence
  Slower articulatory motion than other consonants
  Laterals form complete closure with tongue tip, airflow via sides of constriction
[Figure: vocal tract configurations for [w], [y], [r], [l].]

Semivowels of American English
  There are 4 semivowels in American English.
  Always attached to a vowel, though /l/ can form an entire syllable in unstressed environments ([l̩]).
  Extreme articulation of a corresponding vowel
    Similar formant positions
    Generally weaker due to constriction

  Type      Semivowel          Nearest Vowel
  Glides    /w/   w   wet      /u/
            /y/   y   yet      /i/
  Liquids   /r/   r   red      /ɝ/
            /l/   l   let      /o/

Acoustic Properties of Semivowels
  /w/ is characterized by a very low F1 and F2.
    Typically a rapid spectral falloff above F2
  /y/ is characterized by a very low F1 and a very high F2.
  /r/ is characterized by a very low F3.
    Prevocalic F3 < medial F3 < postvocalic F3
  /l/ is characterized by a low F1 and F2.
    Often presence of high-frequency energy
    Postvocalic /l/ is characterized by minimal spectral discontinuity and gradual motion of formants.

Aspirant Production: /h/ in American English
  Turbulence excitation at glottis
  No constriction in the vocal tract, normal formant excitation
  Coupling with subglottal system results in little energy in F1 region
  Periodic excitation can be present in medial position

Stop Production
  Complete closure in the vocal tract, pressure build-up
  Sudden release of the constriction, turbulence noise
  Can have periodic excitation during closure
[Figure: vocal tract configurations for [b], [d], [g].]

Stops of American English
  There are 6 stop consonants in American English.
  Same places of articulation as nasal consonants
  Unvoiced stops are typically aspirated.
  Voiced stops usually exhibit a voice-bar during closure.
  Information about formant transitions and release is useful for classification.

  Type       Voiced            Unvoiced
  Labial     /b/   b   bee     /p/   p   pea
  Alveolar   /d/   d   Dee     /t/   t   tea
  Velar      /g/   g   geese   /k/   k   key

Singleton Stop Durations
[Figure: VOT duration (ms) for the singleton stops b, d, g, p, t, k.]
  The voice onset time (VOT) of unvoiced stops is longer than that of voiced stops.

/s/-Stop Durations
[Figure: VOT duration (ms) for p, t, k in /s/-stop sequences.]
  Unvoiced stops are unaspirated in /s/-stop sequences.

Stop-Semivowel Durations
[Figure: VOT duration (ms) for b, d, g, p, t, k as singletons and in stop-semivowel clusters.]
  Semivowels are partially devoiced in stop-semivowel sequences.

Voicing Cues for Stops
  There are many voicing cues for a stop.

Affricate Production
  Alveolar-stop palatal-fricative pairs
  Sudden release of the constriction, turbulence noise
  Can have periodic excitation during closure

Affricates of American English
  There are two affricates in American English.

  Voiced               Unvoiced
  /dʒ/   jh   judge    /tʃ/   ch   church

Speech from a Close-Talking Microphone
[Figure: analysis display of the utterance "The Thinker is a famous sculpture", time axis 0.0 to 1.9 seconds. Panels: zero crossing rate, total energy, energy from 125 Hz to 750 Hz, wide-band spectrogram (0 to 8 kHz), waveform. File: /server/users/jwc/latex/sum97/sennheiser.wav, printed by jwc on Wed Jul 16 11:58:32 1997.]

Speech from an Omni-Directional Microphone
[Figure: the same analysis display for the utterance "The Thinker is a famous sculpture". File: /server/users/jwc/latex/sum97/bk.wav, printed by jwc on Wed Jul 16 11:57:43 1997.]

Speech over a Telephone Channel
[Figure: analysis display of the utterance "The Thinker is a famous sculpture", time axis 0.0 to 1.9 seconds. Panels: zero crossing rate, total energy, energy from 125 Hz to 750 Hz, wide-band spectrogram (0 to 8 kHz), waveform. File: /server/users/jwc/latex/sum97/telephone.wav, printed by jwc on Wed Jul 16 11:59:12 1997.]
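The zero crossing rate and total energy panels in these displays are simple frame-level measurements. The sketch below shows one plausible way to compute them with NumPy; the frame length, hop size, function names, and the synthetic test signals are all assumptions for illustration, since the slides do not say how the displayed contours were computed.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D NumPy array into overlapping frames (one frame per row)."""
    num_frames = 1 + (len(x) - frame_len) // hop
    starts = hop * np.arange(num_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]

def zero_crossing_rate(frames, fs):
    """Zero crossings per second within each frame."""
    negative = np.signbit(frames)
    # A crossing occurs wherever adjacent samples differ in sign.
    crossings = np.sum(negative[:, 1:] != negative[:, :-1], axis=1)
    return crossings * fs / (frames.shape[1] - 1)

def log_energy_db(frames):
    """Total energy of each frame in dB (small floor avoids log of zero)."""
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 1000.0 * t + 0.3)            # periodic, vowel-like
    noise = np.random.default_rng(0).standard_normal(fs)   # noisy, fricative-like
    for name, x in (("tone", tone), ("noise", noise)):
        frames = frame_signal(x, frame_len=400, hop=160)   # 25 ms frames, 10 ms hop
        print(name,
              "mean ZCR:", round(float(zero_crossing_rate(frames, fs).mean())),
              "mean energy (dB):", round(float(log_energy_db(frames).mean()), 1))
```

Noise-like segments produce a much higher zero crossing rate than periodic ones, which is why this contour helps separate fricatives from vowels. A band-limited energy contour such as the low-frequency panel could be obtained in a similar way by summing only the spectral bins inside the band.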