Speech Communication, Spring 2006 - Intelligent Multimedia Program


Speech Communication, Spring 2006 - Intelligent Multimedia Program
Lecture 1: Introduction, Speech Production and Phonetics
Zheng-Hua Tan
Speech and Multimedia Communication Division, Department of Communication Technology
Aalborg University, Denmark
zt@kom.aau.dk
Speech Communication, I, Zheng-Hua Tan, 2006

Part I: Introduction
  Introduction
    Problem definition
    State-of-the-art
    Course overview
  Speech production and acoustic phonetics
    The anatomy of speech production
    Articulatory phonetics
    Acoustic phonetics
    Models of speech production

Computer as a dream of human beings
  HAL talks, listens, reads lips and solves problems
  Natural and effortless for humans
  Hard for computers
  Dream of AI scientists and humans
  True in 2001: A Space Odyssey
  (After 2001: A Space Odyssey, 1968)

Computer as a reality: state-of-the-art
  Demo: Microsoft demo video
  Text to speech (TTS)
    Festival TTS @ CSTR, Edinburgh University
    Next generation TTS @ AT&T

Information in Speech
  Speech coding data rates
    Rate axis (bits/sec): 200k, 100k, 64k, 32k, 16k, 12k, 9k, 4.8k, 2k, 1k, 500, 100, 60
    Waveform coding: ADPCM, DPCM, PCM
    Parametric (source) coding: LPC, CELP, MELP, vocoders
  Humans can understand text: 10 char/sec x 6 bits/ASCII char = 60 bits/sec
  Is the content of speech more than 60 bits/sec?

Information in Speech (cont.)
  Examples
    "That's one small step for man; one giant leap for mankind."
      -- Neil Armstrong, Apollo 11 Moon Landing Speech
    "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today!"
      -- Martin Luther King, Jr., I Have a Dream
  Speech contains speaker identity, emotion, meaning, text -> speech techniques
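The rate arithmetic on this slide can be checked in a few lines. The sketch below reproduces the 60 bits/sec figure for text and compares it with 64 kbit/s PCM. The 8 kHz sampling rate and 8 bits per sample are assumptions (the usual telephone-quality PCM settings); the slide only lists the 64k rate. The function names are illustrative.

```python
# Rough information-rate arithmetic for text versus PCM speech.

def text_bit_rate(chars_per_second, bits_per_char):
    """Bit rate needed to transmit text at a given reading rate."""
    return chars_per_second * bits_per_char


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed single-channel PCM audio."""
    return sample_rate_hz * bits_per_sample


# Values taken from the slide: 10 char/sec and 6 bits per character.
text_rate = text_bit_rate(10, 6)

# Assumption: 8 kHz sampling, 8 bits per sample (telephone-quality PCM).
pcm_rate = pcm_bit_rate(8000, 8)

# How many times more bits the waveform uses than the text it carries.
ratio = pcm_rate / text_rate

print(f"text: {text_rate} bits/sec")
print(f"PCM:  {pcm_rate} bits/sec")
print(f"PCM uses about {ratio:.0f} times the bit rate of the text")
```

The large ratio is one way to read the slide's question: the waveform carries far more than the words alone, including speaker identity and emotion, and also a good deal of redundancy that speech coders exploit.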

Speech is a complex process
  [Figure labels: Physiology, Linguistics, Speech, Acoustics]

Human speech communication process
  Rabiner and Levinson, IEEE Trans. Communications, 1981
  (After Rabiner & Levinson, 1981)
  Speech synthesis
  Speech understanding
  Speech coding
  Speech recognition

Study topics and applications
  Introduction
  Speech Production and Acoustic Phonetics
  Speech Analysis and Speech Synthesis
  Speech Coding
  Speech Recognition
  Speech-Related Tools and Applications

Course Outline
  MM1: Speech production, acoustic phonetics and speech modelling
    The anatomy of speech production
    Phonetics
    Models of speech production
  MM2: Speech analysis
    Speech perception and its models
    Short-term processing of speech
    Linear prediction analysis
    Cepstral analysis
  MM3: Speech coding and synthesis
    Speech synthesis
    Speech coding
  MM4: Speech recognition
    Introduction
    DTW based speech recognition
    HMM
  MM5: Speech recognition
    HMM based speech recognition
    HTK, token passing

Literature
  Textbook:
    J. Deller, J. Hansen and J. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, 2000.
  Reading:
    Huang, Acero and Hon, Spoken Language Processing, Prentice-Hall, 2001.
    D. O'Shaughnessy, Speech Communications, IEEE Press, 2000.
    Rabiner and Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
    Rabiner and Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.

Part II: Speech production
  Introduction
  Speech production, acoustic phonetics and speech modelling
    The anatomy of speech production
    Articulatory phonetics
    Acoustic phonetics
    Models of speech production

The speech chain
  [Figure: the speech chain (After Denes & Pinson, 1993)]

Schematic diagram of speech production
  [Figure: schematic diagram of speech production, with the vocal folds labelled]

Block diagram of speech production
  [Figure: block diagram of speech production]

Model of speech production
  [Figure: digital model of speech production]

Cross section of the larynx
  Larynx: the source of most speech sounds
  Vocal cords (folds): the two folds of tissue in the larynx. They can open and shut like a pair of fans.
  Glottis: the gap between the vocal cords. As air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. This process is known as phonation. The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).

Vocal cords
  Vocal cords form a relaxation oscillator (voiced excitation)

Glottal flow
  [Figure: glottal volume velocity (cc/sec) against time (ms), showing the opening phase, closing phase and closure]
  Pitch period = 12.5 ms
  Fundamental frequency = 1/0.0125 s = 80 Hz

Vocal tract modelling
  Source-filter model: Source -> Filter (vocal tract) -> Output
  The vocal tract is a concatenation of tubes with varying cross-sectional areas
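As a minimal sketch of the two ideas on these slides, the code below first converts the 12.5 ms pitch period into a fundamental frequency, then passes an impulse-train source through a single two-pole resonator standing in for the vocal tract filter. Everything beyond the 12.5 ms period is an assumption made for illustration: the 8 kHz sampling rate, the single resonance at 500 Hz with a 100 Hz bandwidth, and all function names. A real vocal tract model would use several resonances and a more realistic glottal pulse shape.

```python
import math


def fundamental_frequency(pitch_period_s):
    """F0 in Hz is the reciprocal of the pitch period in seconds."""
    return 1.0 / pitch_period_s


def impulse_train(f0_hz, sample_rate_hz, duration_s):
    """Idealised voiced source: one unit pulse per pitch period."""
    period = round(sample_rate_hz / f0_hz)      # pitch period in samples
    n = round(sample_rate_hz * duration_s)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(x, centre_hz, bandwidth_hz, sample_rate_hz):
    """Two-pole filter y[n] = x[n] + a1*y[n-1] + a2*y[n-2] (one resonance)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)   # pole radius
    theta = 2.0 * math.pi * centre_hz / sample_rate_hz       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


f0 = fundamental_frequency(0.0125)               # pitch period from the slide
source = impulse_train(f0, 8000, 0.05)           # assumed 8 kHz, 50 ms of signal
output = resonator(source, 500.0, 100.0, 8000)   # assumed resonance values

print(f"F0 = {f0:.1f} Hz, {int(sum(source))} glottal pulses in 50 ms")
```

Keeping the source and the filter as separate functions mirrors the source-filter split: pitch is changed by altering only the impulse train, and the resonance by altering only the filter coefficients.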

Type of excitation
  Voiced: produced by forcing air through the glottis
    Vowels (incl. diphthongs) are voiced
  Unvoiced: generated by forming a constriction at some point along the vocal tract and forcing air through the constriction

Role of the vocal tract
  Vowels: produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords
  Consonants: produced with a significant constriction, and thus weaker in amplitude and noise-like
  Formants: resonances determined by the shape of the vocal tract, which form the overall spectrum and the properties of the filter

The speech signal
  Speech is a sequence of rapidly changing sounds
  When producing sounds, the vocal cords and the various articulators change slowly over time
  There is a need to study speech sounds, their production, and the signs used to represent them -> phonetics

Phonetics
  Phonetics: the study of speech sounds, their production, and the signs used to represent them.
    Articulatory phonetics: how speech sounds are made by moving various organs in the vocal tract.
    Acoustic phonetics: the physical properties of speech sounds and how they are perceived by the human ear.
  The study is conducted by observing and measuring the speech waveform and spectrum.

Speech sounds and waveforms
  [Figure: waveforms of the words "six" and "sixteen", with the phone labels /s/ /i/ /k/ /s/ /t/ /ee/ /n/]
  Observable properties: periodicity, intensity, duration, boundaries, etc.

Observing pitch from waveforms
  [Figure: observing pitch from waveforms]
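Reading pitch off a waveform amounts to finding the interval at which the signal repeats. One common way to automate this, shown here only as an illustrative sketch and not as the method used in the lecture, is to pick the lag that maximises the autocorrelation. The search range of 50 to 500 Hz follows the pitch range quoted on the larynx slide. The 8 kHz sampling rate, the synthetic 100 Hz test tone and the function name are assumptions.

```python
import math


def estimate_pitch_autocorr(signal, sample_rate_hz, f_min=50.0, f_max=500.0):
    """Estimate F0 as sample_rate / lag, where lag maximises the autocorrelation."""
    min_lag = int(sample_rate_hz / f_max)                 # shortest period searched
    max_lag = min(int(sample_rate_hz / f_min), len(signal) - 1)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        value = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag


# Synthetic "voiced" test signal: a 100 Hz tone, 50 ms long, sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate) for n in range(400)]

f0_estimate = estimate_pitch_autocorr(tone, sample_rate)
print(f"estimated F0: {f0_estimate:.1f} Hz")   # close to 100 Hz for this tone
```

On real speech this simple estimator can pick a multiple or fraction of the true period, so practical pitch trackers add voicing decisions and smoothing across frames.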

Spectrogram
  Spectrogram: the two-dimensional waveform (amplitude/time) is converted into a three-dimensional pattern (amplitude/frequency/time)
  Wideband spectrogram: analyzed on 15 ms sections of the waveform with a step of 1 ms
    Voiced regions show vertical striations due to the periodicity of the time waveform (each vertical line represents a pulse of the vocal folds), while unvoiced regions look solid/random, or snowy
  Narrowband spectrogram: analyzed on 50 ms sections
    The pitch of voiced intervals shows as horizontal lines

Sound Spectrogram: an example
  [Figure panels: waveform; wideband spectrogram with formants F1, F2 and F3 marked; narrowband spectrogram]
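The window lengths above translate directly into frame sizes and an approximate frequency resolution. The sketch below works through that arithmetic for the 15 ms and 50 ms windows. The 16 kHz sampling rate, the reuse of the 1 ms step for the narrowband case, and the rule of thumb that resolution is roughly the reciprocal of the window duration are assumptions, not values from the slides.

```python
def frame_params(window_ms, step_ms, sample_rate_hz):
    """Return (window length in samples, step in samples, approx. resolution in Hz)."""
    window_len = round(window_ms * sample_rate_hz / 1000)
    step_len = round(step_ms * sample_rate_hz / 1000)
    resolution_hz = 1000.0 / window_ms      # rule of thumb: about 1 / window duration
    return window_len, step_len, resolution_hz


def frame_count(num_samples, window_len, step_len):
    """Number of full analysis frames that fit in the signal."""
    if num_samples < window_len:
        return 0
    return 1 + (num_samples - window_len) // step_len


SAMPLE_RATE = 16000                               # assumed sampling rate
wide = frame_params(15, 1, SAMPLE_RATE)           # wideband settings from the slide
narrow = frame_params(50, 1, SAMPLE_RATE)         # 1 ms step assumed here

one_second = SAMPLE_RATE                          # samples in one second of speech
for name, (win, step, res) in (("wideband", wide), ("narrowband", narrow)):
    frames = frame_count(one_second, win, step)
    print(f"{name}: {win} samples/window, {frames} frames/sec, ~{res:.0f} Hz resolution")
```

The longer narrowband window gives the finer frequency resolution needed to separate pitch harmonics as horizontal lines, at the cost of smearing events in time.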

Phonemes in American English
  [Table: phonemes in American English (After J. Hansen)]

Phoneme classification chart
  Sound categorization according to the position of the articulators.
  [Figure: phoneme classification chart (After Rabiner and Schafer, 1978)]

Vowel production: examples
  (After Joseph Picone)
  Fixed vocal tract shape
  Voiced
  [Figure labels: cross-sectional area, F_i, tongue position, sound]

The vowel space
  Described by the locations of the first and second formant frequencies (After Peterson & Barney, 1952)
  [Figure labels: F1, F2, F3]
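Because vowels occupy distinct regions of the F1/F2 plane, a measured formant pair can be assigned to the nearest reference vowel. The sketch below does this with a nearest-neighbour rule over three corner vowels. The reference values are approximate adult-male averages commonly quoted from Peterson & Barney (1952) and should be treated as illustrative. The plain Euclidean distance in Hz and the function name are assumptions as well.

```python
import math

# Approximate (F1, F2) reference points in Hz for three corner vowels.
# Assumption: commonly quoted adult-male averages, for illustration only.
REFERENCE_FORMANTS_HZ = {
    "i": (270.0, 2290.0),   # as in "beet"
    "a": (730.0, 1090.0),   # as in "father"
    "u": (300.0, 870.0),    # as in "boot"
}


def nearest_vowel(f1_hz, f2_hz, references=REFERENCE_FORMANTS_HZ):
    """Return the reference vowel closest to (f1_hz, f2_hz) in the F1/F2 plane."""
    return min(
        references,
        key=lambda v: math.hypot(f1_hz - references[v][0], f2_hz - references[v][1]),
    )


# Hypothetical formant measurements for three vowel tokens.
for f1, f2 in ((300.0, 2200.0), (700.0, 1100.0), (320.0, 900.0)):
    print(f"F1={f1:.0f} Hz, F2={f2:.0f} Hz -> /{nearest_vowel(f1, f2)}/")
```

A distance in raw Hz weights F2 differences more heavily than F1 because F2 spans a wider range; perceptual scales such as mel or Bark are often used instead when the comparison matters.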

The vowel triangle
  [Figure: the vowel triangle]

Consonant production: examples
  [Figure: consonant production examples (After Joseph Picone)]

Diphthongs
  A diphthong involves an intentional movement from one vowel toward another vowel
  It differs from two distinct vowels: it represents a transition from one vowel target to another, yet neither vowel is actually reached
  Diphthongs: (Fig. 2.14, p. 129, John3 2000)
    /Y/ hide
    /W/ down
    /O/ boy
    /X/ rose

Semivowels
  Vowel-like, but weaker than most vowels due to their more constricted vocal tract
  Voiced
  Semivowels: (Fig. 2.15, p. 130, John3 2000)
    Liquids: /r/ ran, /l/ liquid
    Glides: /w/ want, /y/ yard

Nasals
  Produced by the glottal waveform exciting an open nasal cavity and a closed oral cavity.
  Similar to vowels but weaker, due to the limited ability of the nasal cavity to radiate sound
  Nasals:
    /m/ moon
    /n/ noon
    /G/ sing

Fricatives
  Produced by exciting the vocal tract with a steady air-stream that becomes turbulent at some point of constriction
  [Figure: fricatives]

Affricates
  Formed by transitions from a stop to a fricative
  Affricates:
    /J/ just
    /C/ channel

Stops (or Plosives)
  Stop consonants are transient, noncontinuant sounds that are produced by building up pressure behind a total constriction somewhere along the vocal tract, and suddenly releasing this pressure
  [Figure: stops]

Speech Tool
  Speech Filing System (SFS): Tools for Speech Research
  It performs standard operations such as recording, replay, waveform editing and labelling, spectrographic and formant analysis, and fundamental frequency estimation.
  http://www.phon.ucl.ac.uk/resource/sfs/

Summary
  Speech technology
  The speech chain
  Anatomy of speech production
  Speech signals: waveform and spectrogram
  Phonetics
  Modelling
  Next lecture: Speech Analysis