Lecture 9: Speech Recognition

This lecture
Speech production and articulatory phonetics.
Mel-frequency cepstral coefficients (i.e., the input to ASR systems).
Next week: 3 lectures.
Some images from Jim Glass's course 6.345 (MIT), the Jurafsky & Martin textbook, and the Rolling Stones.

The vocal tract
Many physical structures are co-ordinated in the production of speech (figure labels: nasal cavity, velum, tongue, lips, glottis, lungs, jaw). Generally, sound is generated by passing air through the vocal tract. Sound is modified by constricting airflow in particular ways.

Place of articulation
The location of the primary constriction can be:
Alveolar: constriction near the alveolar ridge (e.g., /t/).
Bilabial: touching of the lips together (e.g., /m/, /p/).
Dental: constriction of/at the teeth (e.g., /th/).
Labiodental: constriction between lip and teeth (e.g., /f/).
Velar: constriction at or near the velum (e.g., /k/).

Phonetic features
Phonemes can have several features, e.g.:
Place: location of the primary constriction. {alveolar, bilabial, dental, labiodental, velar, ...}
Voicing: whether the glottis is vibrating (no voicing, as in silence). {+voice, -voice, ...}
Manner: the class of articulation type. {stop, vowel, nasal, fricative, approximant, retroflex, ...}
High/low: vertical position of the tongue body. {high, mid, low, ...}
Front/back: anterior (front-to-back) position of the tongue body. {front, central, back, ...}
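
A hypothetical sketch (in Python) of how such feature bundles might be represented in code; the dictionary layout and the particular phonemes chosen are illustrative assumptions, not part of the lecture:

    # Feature bundles for two phonemes, using the feature names and value
    # sets listed above. /m/ is a voiced bilabial nasal; /iy/ is a voiced
    # high front vowel.
    phoneme_features = {
        "/m/":  {"place": "bilabial", "voicing": "+voice", "manner": "nasal"},
        "/iy/": {"manner": "vowel", "voicing": "+voice",
                 "high_low": "high", "front_back": "front"},
    }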

Spectrograms
Spectrograms are 3D representations of an acoustic signal in which each co-ordinate represents the amplitude of a particular sinusoidal frequency at a particular time.

Formants and phonemes
Formant: n. a large concentration of energy within a band of frequencies (e.g., F1, F2, F3). (Figure: spectrogram with F1, F2, and F3 labelled.)

Phonemic alphabets
There are several alphabets that categorize the sounds of speech. The International Phonetic Alphabet (IPA) is popular, but it uses non-ASCII symbols. The TIMIT phonemic alphabet will be used by default in this course. Other popular alphabets include ARPAbet, Worldbet, and OGIbet, usually adding special cases; e.g., /pcl/ is the period of silence immediately before a /p/.
TIMIT  IPA   e.g.
/iy/   /i/   beat
/ih/   /ɪ/   bit
/eh/   /ɛ/   bet
/ae/   /æ/   bat
/aa/   /ɑ/   Bob
/ah/   /ʌ/   but
/ao/   /ɔ/   bought
/uh/   /ʊ/   book
/uw/   /u/   boot
/ux/   /u/   suit
/ax/   /ə/   about

Phonemes
Now we're going to discuss each of these manners of articulation:
Vowels: open vocal tract, no nasal air.
Fricatives: noisy, with air passing through a tight constriction (e.g., "shift").
Stops/plosives: complete vocal tract constriction and burst of energy (e.g., "papa").
Nasals: involve air passing through the nasal cavity (e.g., "mama").
Semivowels: similar to vowels, but typically with more constriction (e.g., "wall").
Affricates: alveolar stop followed by a fricative.

Vowels (1/6)
There are approximately 19 vowels in Canadian English, including diphthongs, in which the articulators move over time. Vowels are distinguished primarily by their formants.
Monophthongs: /iy/ (beat), /ih/ (bit), /eh/ (bet), /ae/ (bat), /aa/ (Bob), /ao/ (bought), /ah/ (but), /uh/ (book), /uw/ (boot), /ax/ (about), /ix/ (roses).
Diphthongs: /ey/ (bait), /ow/ (boat), /ay/ (bite), /oy/ (boy), /aw/ (bout), /ux/ (suit).
Other: /er/ (Bert), /axr/ (butter).

The uniform tube
(Figure: a uniform tube about 17 cm long, closed at the vibrating end (the glottis) and open at the radiating end (the lips).)
The positions of the tongue, jaw, and lips change the shape and cross-sectional area of the vocal tract.

Uniform tubes in practice
Many musical instruments are based on the idea of uniform (or, in many cases, bent) tubes. Longer tubes produce deeper sounds (lower frequencies). A tube ½ the length of another will be 1 octave higher.
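
A minimal numeric sketch (Python), assuming the standard quarter-wavelength resonance formula for a uniform tube closed at one end and open at the other; the speed of sound used here (343 m/s) is an assumption, not a figure from the slides:

    # Resonances of a lossless uniform tube closed at the glottis and open
    # at the lips: F_n = (2n - 1) * c / (4 * L).
    c = 343.0                 # speed of sound in air (m/s), approximate
    L = 0.17                  # vocal-tract length from the previous slide (m)
    print([(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)])
    # -> roughly [504, 1513, 2522] Hz
    print([(2 * n - 1) * c / (4 * (L / 2)) for n in (1, 2, 3)])
    # -> every resonance doubles: a tube half the length sounds one octave higher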

Vowels as concatenated tubes
The vocal tract can be modelled as the concatenation of dozens, hundreds, or thousands of tubes. (Figure: tube approximations of the vocal-tract shapes for /iy/ and /uw/.)

Aside: waves in concatenated tubes
We model the volume velocity u_k(x, t) and the pressure variation p_k(x, t) at position x in the k-th lossless tube (whose cross-sectional area is A_k) at time t:
u_k(x, t) = u_k^+(t - x/c) - u_k^-(t + x/c)
p_k(x, t) = (ρc / A_k) [ u_k^+(t - x/c) + u_k^-(t + x/c) ]
where c is the speed of sound and ρ is the density of air.

Waves in concatenated tubes
Because of partial wave reflections that occur at tube boundaries, we can generate spectra with particular resonances.

Spectrograms of vowels
Tongue height is correlated with the first formant, F1. Tongue frontness is correlated with the second formant, F2.

The vowel trapezoid
(Figure: the vowel trapezoid, with axes showing the directions in which F1 and F2 increase.)

Formants and front/back, high/low
(Figure: spectrograms of front/low, front/high, and back/high vowels.)

Fricatives (2/6)
Fricatives are caused by acoustic turbulence at a narrow constriction whose position determines the sound. Fricatives can be voiced (i.e., the glottis can be vibrating).
Examples by place: labiodental /f/, dental /th/, alveolar /s/, palatal /sh/.

Fricatives
Fricatives have four places of articulation: labio-dental ("labial"), interdental ("dental"), alveolar, and palato-alveolar ("palatal"). Every place of articulation has both a voiced and an unvoiced version.
Place     Unvoiced      Voiced
Labial    /f/ fee       /v/ vendetta
Dental    /th/ thief    /dh/ thee
Alveolar  /s/ see       /z/ Zardoz
Palatal   /sh/ she      /zh/ Zha-zha

Unvoiced fricatives
Note in these examples that /f/ has more energy in the word-final position, and /sh/ excites more of the spectrum than /s/ when word-initial.

Voiced versus unvoiced fricatives

Voiced versus unvoiced fricatives
Spectra here look similar, but voiced fricatives often include significant energy below 150 Hz.

Plosives (3/6)
Plosives build pressure behind a complete closure in the vocal tract. This closure can be associated with voiced excitation. A sudden release of this constriction results in brief noise.
Examples by place: labial /b/, alveolar /d/, velar /g/.

Plosives
Plosives have three places of articulation:
Place     Unvoiced       Voiced
Labial    /p/ porpoise   /b/ baboon
Alveolar  /t/ tort       /d/ dodo
Velar     /k/ kick       /g/ Google
Voiced stops are usually characterized by a voice bar during closure, indicating the vibrating glottis. Formant transitions are very informative in classification.

Voicing contrasts
(Figure: spectrograms illustrating the voice bar.)

Formant transitions among plosives
Despite a common vowel, the motion of F2 and F3 into (and out of) the vowel helps identify the plosive.

Voice onset time and voicing cues
Voice onset time: n. the time from stop release (i.e., the start of the sound burst) to the start of vocal periodicity. There are at least 6 features that indicate voicing in plosives.

Voice onset time among plosives
Unvoiced plosives have longer voice onset times.

Nasals (4/6)
Nasals involve lowering the velum so that air passes through the nasal cavity. Closures in the oral cavity (at the same positions as plosives) change the resonant characteristics of the nasal sonorant.
Examples by place: labial /m/, alveolar /n/, velar /ng/.

Formant transitions among nasals
Nasals often appear as two formants. Despite a common vowel, the motion of F2 and F3 before and after each nasal helps to identify it.

Semivowels (5/6)
Semivowels act as consonants in syllables and involve constriction in the vocal tract, but there is less turbulence. They also involve slower articulatory motion. Laterals involve airflow around the sides of the tongue.
Examples: /w/, /y/, /r/, /l/.

Semivowels
Semivowels are often sub-classified as glides or liquids:
Glides:  /w/ wow (nearest vowel /uw/), /y/ yoyo (nearest vowel /iy/).
Liquids: /r/ rear (nearest vowel /er/), /l/ Lulu (nearest vowel /ow/).
Semivowels are more constricted versions of corresponding vowels, with similar formants, though generally weaker.

Semivowels
Note the drastic formant transitions, which are more typical of semivowels.

Affricates and aspirants (6/6)
There are two affricates: /jh/ (voiced; e.g., "judge") and /ch/ (unvoiced; e.g., "church"). These involve an alveolar stop followed by a fricative. Voicing in /jh/ is normally indicated by voice bars, as with plosives.
There is only one aspirant in Canadian English: /h/ (e.g., "hat"). This involves turbulence generated at the glottis, with no constriction in the vocal tract.

Affricates and aspirants

Putting it all together
Phonemes are grouped together in syllables, which must contain a non-obstruent nucleus. Consonants can be clustered in particular ways in the onset and coda, but with constraints; e.g., you usually can't start a word with /kt/, nor end one with /tk/.

Alternative pronunciations
Pronunciations of words can vary significantly, but with observable frequencies. The Switchboard corpus is a phonetically annotated database of speech recorded in telephone conversations.

Known effects of pronunciation
Speakers tend to drop or change pronunciations in predictable ways in order to reduce the effort required to co-ordinate the various articulators. Palatalization generally refers to a conflation of phonemes closer to the frontal palate than they should be. Final t/d deletion is simply the omission of alveolar plosives from the ends of words.

Variation at syllable boundaries

Recall our input to ASR
(Figure: the spectrum of one frame, amplitude versus frequency in Hz.)
We want to transform the spectrum into something more useful.

1. The Mel-scale filter bank
To mimic the response of the human ear (and because it empirically improves speech recognition), we often discretize the spectrum using a bank of triangular filters, with uniform spacing before 1 kHz and logarithmic spacing after 1 kHz.
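
A sketch (Python/NumPy) of one common way to build such a triangular filter bank; the function names, the 2595·log10(1 + f/700) mel formula, and the default parameters are assumptions rather than the course's reference implementation:

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
        """Triangular filters whose centres are spaced uniformly on the mel scale."""
        edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(edges) / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, centre, right = bins[m - 1], bins[m], bins[m + 1]
            for k in range(left, centre):        # rising edge of the triangle
                fbank[m - 1, k] = (k - left) / max(centre - left, 1)
            for k in range(centre, right):       # falling edge of the triangle
                fbank[m - 1, k] = (right - k) / max(right - centre, 1)
        return fbank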

2. Source and filter
The acoustics of speech are produced by a glottal pulse waveform (the source) passing through a vocal tract whose shape modifies that wave (the filter). The shape of the vocal tract is more important to phoneme recognition. We want to separate the source from the filter in the acoustics.

2. Source and filter (aside)
Since speech is assumed to be the output of a linear time-invariant system, it can be described as a convolution. Convolution (denoted *) is beyond the scope of this course, but can be conceived of as the modification of one signal by another.
For speech signal x[n], glottal signal g[n], and vocal tract transfer function v[n], with spectra X[k], G[k], and V[k], respectively:
x[n] = g[n] * v[n]
X[k] = G[k] V[k]
log X[k] = log G[k] + log V[k]
We've separated the source and filter into two terms!
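
A small NumPy check of this source-filter identity; the signals below are random stand-ins, not real glottal or vocal-tract responses:

    import numpy as np

    g = np.random.randn(64)          # stand-in "glottal" signal g[n]
    v = np.random.randn(16)          # stand-in "vocal tract" response v[n]
    x = np.convolve(g, v)            # x[n] = g[n] * v[n]  (convolution)

    n = len(x)                       # zero-pad both spectra to the same length
    X, G, V = (np.fft.rfft(s, n) for s in (x, g, v))

    assert np.allclose(X, G * V)                      # X[k] = G[k] V[k]
    assert np.allclose(np.log(np.abs(X)),             # log X = log G + log V
                       np.log(np.abs(G)) + np.log(np.abs(V)))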

2. The cepstrum
We separate the source and the filter by pretending the log of the spectrum is actually a time-domain signal. The log spectrum log X[k] is a sum of the log spectra of the source and filter, i.e., a superposition; finding its spectrum will allow us to isolate these components.
Cepstrum: n. the spectrum of the log of the spectrum. Fun fact: "ceps" is the reverse of "spec". Instead of filters, we have "lifters".
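
A sketch (NumPy) of computing a real cepstrum and "liftering" it; the frame length, the low-quefrency cut-off, and the function name are illustrative assumptions:

    import numpy as np

    def real_cepstrum(frame):
        """Spectrum of the log of the spectrum (here taken as the inverse DFT
        of the log magnitude spectrum of one analysis frame)."""
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)   # avoid log(0)
        return np.fft.irfft(log_mag)

    frame = np.random.randn(400)      # stand-in for a 25 ms frame at 16 kHz
    c = real_cepstrum(frame)
    envelope_part = c[:30]            # low quefrency: slowly varying filter (vocal tract)
    excitation_part = c[30:]          # higher quefrency: source (glottal) detail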

2. The cepstrum
(Figure: a spectrum, its log spectrum, and the resulting cepstrum.)
The domain of the cepstrum is quefrency (a play on the word "frequency").

2. The cepstrum
(Figure: spectrum and cepstrum, annotated to show which part is due to the vocal tract shape and which is due to the glottis. Pictures from John Coleman, 2005.)

Mel-frequency cepstral coefficients
Mel-frequency cepstral coefficients (MFCCs) are the most popular representation of speech used in ASR. They are the spectra of the logarithms of the mel-scaled filtered spectra of the windows of the waveform:
speech signal → window → DFT → Mel filterbank → log → DFT → MFCC
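
A compact sketch of that pipeline for a single frame (Python/NumPy, with scipy's DCT standing in for the final transform, which the next slide notes as the efficient method used in practice). The mel_filterbank function is the hypothetical sketch from earlier, and the frame and FFT sizes are assumptions:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_frame(frame, fbank, n_fft=512, n_ceps=13):
        """window -> DFT -> mel filterbank -> log -> DCT -> MFCCs."""
        windowed = frame * np.hamming(len(frame))             # window
        power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2     # DFT (power spectrum)
        mel_energies = fbank @ power                          # mel filter bank
        log_energies = np.log(mel_energies + 1e-10)           # log
        return dct(log_energies, type=2, norm='ortho')[:n_ceps]

    # e.g., with the filter bank sketched earlier:
    #   coeffs = mfcc_frame(frame, mel_filterbank(), n_fft=512)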

Advantages of MFCCs
The cepstrum produces highly uncorrelated features (every dimension is useful). This includes a separation of the source and filter. In practice, the cepstrum has been easier to learn from than the spectrum for phoneme recognition. There is an efficient method to compute cepstra called the discrete cosine transform.

MFCCs in practice
An observation vector of MFCCs often consists of:
the first 13 cepstral coefficients (i.e., the first 13 dimensions produced by this method),
an additional overall energy measure,
the velocities (Δ) of each of those 14 dimensions, i.e., the rate of change of each coefficient at a given time, and
the accelerations (ΔΔ) of each of the original 14 dimensions.
The result is that at a time frame t we have an observation MFCC vector of (13+1)*3 = 42 dimensions. This vector is what is used by our ASR systems.
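
A hedged sketch of assembling that 42-dimensional vector from a (T, 14) array of per-frame coefficients (13 MFCCs plus energy); real front ends usually compute the deltas with a short regression window rather than the one-step differences used here:

    import numpy as np

    def add_deltas(static):
        """static: (T, 14) array of 13 MFCCs + energy per frame.
        Returns a (T, 42) array: static, velocity (delta), acceleration (delta-delta)."""
        velocity = np.diff(static, axis=0, prepend=static[:1])        # rate of change
        acceleration = np.diff(velocity, axis=0, prepend=velocity[:1])
        return np.concatenate([static, velocity, acceleration], axis=1)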

Next week