University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005


Lecture 6 Slides, January 31st, 2005

Outline of Today's Lecture
Cepstral analysis of speech signals

Books & Sources

Huang, Acero, Hon, Spoken Language Processing, Chapter 6
Rabiner & Juang, Fundamentals of Speech Recognition
Deller et al., Discrete-Time Processing of Speech Signals
Beranek, Acoustics, 1993
Flanagan, Speech Analysis, Synthesis and Perception
Clark & Yallop, An Introduction to Phonetics and Phonology
Ladefoged, A Course in Phonetics
Lieberman & Blumstein, Speech Physiology, Speech Perception, and Acoustic Phonetics
K. Stevens, Acoustic Phonetics
Malmberg, Manual of Phonetics
Rossing, The Science of Sound
Linguistics 001, University of Pennsylvania

Background

Extract the important parts of the speech signal while filtering out the parts that do not matter for intelligibility:
1. Speech compression: encode only the essential information using fewer bits; represent speaker identity separately.
2. Speech recognition: we need a concise, accurate, robust, speaker-normalized representation of the speech signal.

Recall Linear Systems

Suppose we are given a signal s[n] = x1[n] + x2[n], the sum of a low-frequency component x1[n] and high-frequency noise x2[n]. The spectrum combines linearly as well: S(jω) = X1(jω) + X2(jω). We can therefore use a linear low-pass filter to (mostly) recover x1[n].

Production Model

The glottis produces an impulse train that excites the vocal tract:
1. E(jω): periodic glottal pulse train
2. Φ(jω): time-varying vocal tract system function
3. S(jω) = E(jω) Φ(jω): speech signal

The information is in Φ, but because s[n] = e[n] * φ[n] is a convolution, we cannot use a linear filter to separate these two components. Goal: turn the convolution into a linear operation (namely, addition).
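The additive case can be sketched numerically. A minimal example, assuming an illustrative 8 kHz sample rate, tone frequencies, and cutoff (none of these numbers come from the lecture):

```python
import numpy as np

# Two additively combined components: a low-frequency tone standing in
# for the signal of interest and a high-frequency tone standing in for
# noise. Because addition in time is addition in frequency, an ideal
# low-pass filter separates them.
fs = 8000                                     # sample rate (Hz), illustrative
n = np.arange(1024)
x1 = np.sin(2 * np.pi * 250 * n / fs)         # low-frequency component
x2 = 0.5 * np.sin(2 * np.pi * 3000 * n / fs)  # high-frequency "noise"
s = x1 + x2

S = np.fft.rfft(s)
freqs = np.fft.rfftfreq(len(s), d=1 / fs)
S[freqs > 1000] = 0                           # ideal LPF, 1 kHz cutoff
x1_hat = np.fft.irfft(S, n=len(s))

print(np.max(np.abs(x1_hat - x1)))            # residual is tiny
```

The same trick fails when the components are convolved rather than added, which is exactly the problem the cepstrum addresses.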

Cepstral Processing

Complex cepstrum: take the DTFT of the signal, then the complex logarithm:
log S(e^jω) = log |S(e^jω)| + j arg S(e^jω)
Keep the 1st and 2nd terms, though the phase term is often not needed for speech processing (phase is less important).

Real cepstrum: keep only the 1st term (the log magnitude) and assume zero phase. This is used very often in speech processing. Since S(e^jω) = E(e^jω) Φ(e^jω), taking the log turns the product into a linear combination:
log |S(e^jω)| = log |E(e^jω)| + log |Φ(e^jω)|
Goal: find a way to separate the glottal excitation from the vocal tract response.

Real Cepstral Processing

Because log |S(e^jω)| is real and even, the real cepstrum c_s[n] (its inverse DTFT) is real and even in n. For voiced speech analysis, s[n] is (approximately) periodic, so we can use a DFS representation of S(e^jω), with p = pitch period and D(k) the DFS coefficients.
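A minimal numerical sketch of the real cepstrum on a synthetic voiced signal. The impulse-train excitation, the crude decaying vocal-tract stand-in, and the search range are all illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Synthetic "voiced" frame: impulse-train excitation (pitch period 80
# samples, i.e. 100 Hz at 8 kHz) convolved with a short decaying
# impulse response standing in for the vocal tract.
fs = 8000
e = np.zeros(1024)
e[::80] = 1.0                          # glottal impulse train
phi = 0.9 ** np.arange(30)             # crude vocal-tract stand-in
s = np.convolve(e, phi)[: len(e)]

# Real cepstrum: IDFT of the log magnitude spectrum.
log_mag = np.log(np.abs(np.fft.fft(s)) + 1e-12)  # floor avoids log(0)
c = np.fft.ifft(log_mag).real          # real and even in n

# The convolution has become a sum: the vocal tract contributes at low
# quefrency, while the excitation produces a peak at the pitch period.
peak = 40 + np.argmax(c[40:160])       # search a plausible period range
print(peak)                            # near 80
```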

Real Cepstral Processing

Since the glottal excitation is approximated by an impulse train, the DFS coefficients can essentially be seen as sampling the underlying vocal tract response.

Caveats:
1. The glottal pulse is not a true impulse: e(t) = g(t) * i(t), a glottal pulse shape g(t) convolved with an impulse train i(t).
2. We typically window the speech with a length-L window. Multiplication by the window in time is convolution in frequency, which windows the glottal pulse train and weights it by the vocal tract response.

We will make the approximation that, despite the windowing, the spectrum still factors as excitation times vocal tract response.

Real Cepstral Processing

Applying the real cepstrum gives c_s[n] = c_e[n] + c_φ[n]: a linear combination of what originally was a convolution.

Important points:
1. C_s(ω) = log |S(e^jω)| is periodic (because S(e^jω) is periodic with period 2π).
2. C_s(ω) is real.
3. C_s(ω) is even (since s[n] is real, |S(e^jω)| is even).

From the definition of the cepstrum (taking the IDTFT of the DFS line-spectral coefficients), the IDTFT reduces to a DCT, and note that α_n = c_s(n), since c_s(n) is symmetric and even.

Note that C_s(ω) lives in the frequency domain (it is a log spectrum), but when we take the IDTFT it moves to a "funny" time domain; at the same time, we are looking at the spectrum of a spectrum (due to the DFS and DCT interpretation). So, what is it: time, frequency, or both? This motivates new terminology:

1. Frequency domain → Quefrency domain
2. Spectrum → Cepstrum
3. Frequency axis → Quefrency axis
4. Harmonics → Rahmonics
5. Filtering (removing components) → Liftering

The hope is that in the quefrency domain the two summands occupy different parts of the quefrency axis: a quickly varying part (excitation) and a slowly varying part (vocal tract). If so, we can do some liftering.

Real Cepstral Processing

Taking the IDTFT of the log spectrum converts it to the quefrency domain. We can then do liftering to obtain a smoothed spectrum: we use a low-time lifter (analogous to a low-pass filter in frequency) to remove c_e(n), the glottal source contribution, while retaining c_φ(n), the component containing the communicative (vocal tract) information.
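Low-time liftering can be sketched in the same kind of toy setting; the lifter cutoff of 30 and all signal parameters are illustrative choices, not from the lecture:

```python
import numpy as np

# Synthetic voiced frame: impulse train through a short decaying filter.
fs = 8000
e = np.zeros(1024)
e[::80] = 1.0
phi = 0.9 ** np.arange(30)
s = np.convolve(e, phi)[: len(e)]

log_mag = np.log(np.abs(np.fft.fft(s)) + 1e-12)
c = np.fft.ifft(log_mag).real          # real cepstrum (real, even)

# Low-time lifter: keep the first L cepstral coefficients and their
# mirror image (c is even), zero everything else.
L = 30
lifter = np.zeros_like(c)
lifter[:L] = 1.0
lifter[-(L - 1):] = 1.0
smoothed_log_mag = np.fft.fft(c * lifter).real

# The liftered log spectrum follows the vocal-tract envelope; the
# harmonic (pitch) ripple, which lives at high quefrency, is removed.
print(smoothed_log_mag[:3])
```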

Real Cepstral Processing

We cannot recover φ[n] itself, since all we have is log |Φ(ω)|; we can only get |Φ(ω)|, the magnitude response (but this is OK).

Minimum phase assumption: factor the system as H = H_mp H_ap, where H_mp (minimum phase) has all its poles and zeros inside the unit circle, and H_ap is an all-pass system that accounts for any zeros of H outside the unit circle by reflecting them inside in H_mp. So if Φ(ω) is minimum phase, we can recover it from just the magnitude |Φ(ω)|. And even |Φ(ω)| alone contains the information-bearing element of speech.

Note that we do this for every window of speech, obtaining a sequence of speech vectors that will be the input to a speech recognition system for windowed frame-based processing (later lecture). But we can do other things as well, including:

Application: Pitch Estimation

Another application of this procedure: since the excitation produces clear peaks (rahmonics) at the pitch period in the high-quefrency part of the cepstrum, it is possible to find the periodicity in the upper cepstral coefficients and use that as the pitch value. In other words, since both the envelope and the pitch are cleanly separated in this representation, either can be extracted relatively easily.
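The pitch-estimation idea can be sketched as a small function; the 60-400 Hz search band and the synthetic test signal are illustrative assumptions:

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) from the peak of the real cepstrum within a
    plausible pitch-period range."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    c = np.fft.ifft(log_mag).real
    lo = int(fs / fmax)                # shortest plausible period
    hi = int(fs / fmin)                # longest plausible period
    period = lo + np.argmax(c[lo:hi])  # rahmonic peak at pitch period
    return fs / period

# Synthetic 100 Hz voiced frame at 8 kHz: impulse train (period 80
# samples) through a crude decaying vocal-tract stand-in.
fs = 8000
e = np.zeros(1024)
e[::80] = 1.0
s = np.convolve(e, 0.9 ** np.arange(30))[: len(e)]
print(cepstral_pitch(s, fs))           # close to 100 Hz
```

Restricting the search to a plausible period range is what keeps the estimator from locking onto higher rahmonics (octave errors), a standard caveat for cepstral pitch trackers.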