Introduction to Speech Technology

13/Nov/2008 Introduction to Speech Technology Presented by Andriy Temko Department of Electrical and Electronic Engineering

Page 2 of 30 Outline
- Introduction & Applications
- Analysis of Speech
- Speech Recognition Problem

Page 3 of 30 Speech Signal
- The speech signal is converted to an electrical waveform by a microphone
- The possibility of converting speech to an electrical waveform and then back to an acoustic waveform is the basis of Bell's telephone invention

Page 4 of 30 Speech Chain

Page 5 of 30 Applications: Speech Coding
- Speech coding block diagram: encoder and decoder

Page 6 of 30 Applications: Text-to-Speech Synthesis
- Simulation of the entire upper part of the Speech Chain
- A set of linguistic rules determines the appropriate set of sounds
- Not simply looking up words in a pronouncing dictionary: abbreviations, ambiguous words, acronyms, proper names, special terms, intonation, etc.
- Most popular method: Unit Selection & Concatenation

Page 7 of 30 Applications: Speech Recognition
- Feature Analysis: converts a digital speech signal to a set of feature vectors
- Pattern Matching: finds the closest match of the dynamically time-aligned set of feature vectors with a set of stored patterns
- Speech Recognition: extracting a message from a signal
- Command and control of computer software
- Voice dictation
- Dialog with machines: help desks and call centers

Page 8 of 30 Applications: Others
- Speaker Recognition: who is speaking
- Speaker Verification: verify the claimed identity
- Speaker Diarization: who spoke when
- Word Spotting: monitoring the signal for a special word
- Speech/Audio Indexing: identifying the audio class (broadcast news transcription)
- Audio Recognition: identifying acoustic events (audio-based surveillance, smart rooms)
- Speech Enhancement: make speech more intelligible

Page 9 of 30 Interesting Facts: Perception of Loudness
- Greatest sensitivity at around 3 to 4 kHz
- Almost precisely the range of frequencies occupied by most of the sounds of speech!
- Non-uniform filter-bank analysis

Page 10 of 30 Interesting Facts: Auditory Masking
- Critical-band phenomena
- Widely used in speech coding (perceptually lossless coding)

Page 12 of 30 Short-Time Analysis of Speech. Windowing
- Windowing: small portions of the signal are assumed to be pseudo-stationary
- Windowing yields a set of speech samples x(n) weighted by the shape of the window w(n)
- Generally, successive windows overlap, because w(n) tends to have a shape that de-emphasises samples near its edges
- This breaks the speech down into a sequence of frames
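The windowing step above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not name a window, so a Hamming window is assumed here, and the sampling rate, frame length, hop size and the helper names `hamming` and `frame_signal` are all hypothetical choices.

```python
import math

def hamming(n_points):
    """Assumed window w(n): small near the edges, close to 1 in the middle."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def frame_signal(x, frame_len, hop):
    """Split x(n) into overlapping frames, each weighted by w(n)."""
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    w = hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return [[x[i * hop + n] * w[n] for n in range(frame_len)]
            for i in range(n_frames)]

fs = 8000  # assumed sampling rate in Hz
# One second of a 200 Hz tone standing in for a speech signal.
x = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
# 25 ms frames (200 samples) with a 10 ms hop (80 samples), so frames overlap.
frames = frame_signal(x, frame_len=200, hop=80)
print(len(frames), len(frames[0]))
```

Because the window attenuates samples near the frame edges, a hop shorter than the frame length ensures every sample sits near the middle of at least one frame.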

Page 13 of 30 Short-Time Analysis of Speech. FFT
- Figure panels: wide band, narrow band

Page 14 of 30 Short-Time Analysis of Speech. Spectral Envelope
- Figure panels: wide band, narrow band
- You cannot get good time resolution and good frequency resolution from the same spectrogram (uncertainty principle)
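The trade-off can be made concrete with a small calculation. As a rule of thumb (an assumption here, accurate only up to a constant that depends on the window shape), an analysis window of duration T resolves frequencies about 1/T apart. The window lengths below are illustrative values, not taken from the slides.

```python
def approx_resolution(window_ms):
    """Return (time resolution in ms, approximate frequency resolution in Hz).

    Uses the rule of thumb delta_f ~ 1 / T, ignoring the window-shape constant.
    """
    return window_ms, 1000.0 / window_ms

# Assumed window lengths: short for wide band, long for narrow band.
for name, window_ms in [("wide band", 5.0), ("narrow band", 40.0)]:
    dt, df = approx_resolution(window_ms)
    print(f"{name}: {dt:.0f} ms window, about {df:.0f} Hz resolution")
```

Under this approximation the product of the two resolutions is constant, so shortening the window to sharpen timing necessarily blurs frequency, and vice versa.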

Page 15 of 30 Phoneme
- Speakers and listeners divide words into component sounds called phonemes
- Native speakers agree on the phonemes that make up a particular word
- There are about 42 phonemes in English
- The actual sound that corresponds to a particular phoneme depends on:
  - the adjacent phonemes in the word or sentence
  - the accent of the speaker
  - the talking speed
  - whether it is a formal or informal occasion

Page 16 of 30 Voiced / Unvoiced Phoneme
- Vowel/consonant discrimination with Zero Crossing Rate and Short-Time Energy
- Determination of pitch (fundamental frequency) with autocorrelation
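A minimal sketch of the three measurements named above, on synthetic frames. The signals are assumptions made for illustration: a 100 Hz tone stands in for a voiced frame and a weak 3 kHz tone is a crude stand-in for an unvoiced one (real unvoiced speech is noise-like). The function names and the 50 to 400 Hz pitch search range are hypothetical.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def pitch_autocorr(frame, fs, f_min=50.0, f_max=400.0):
    """Estimate F0 as fs divided by the lag with the largest autocorrelation."""
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag

fs = 8000  # assumed sampling rate in Hz
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
unvoiced = [0.1 * math.sin(2 * math.pi * 3000 * n / fs) for n in range(400)]

print("voiced:  ", short_time_energy(voiced), zero_crossing_rate(voiced))
print("unvoiced:", short_time_energy(unvoiced), zero_crossing_rate(unvoiced))
f0 = pitch_autocorr(voiced, fs)
print("estimated pitch of the voiced frame (Hz):", f0)
```

The voiced frame shows high energy and a low zero crossing rate, the unvoiced stand-in the opposite, which is the contrast the discrimination relies on.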

Page 18 of 30 Speech Recognition
- Block diagram: utterance -> acoustic front-end -> recognition algorithm -> understanding algorithm -> meaning
- Knowledge sources: phonologic rules, phonetic models, dictionary and grammar, task model

Page 19 of 30 Speech Recognition
- Block diagram as on page 18, with a plot whose axis is labelled in Hz

Page 21 of 30 Speech Recognition
- Markov Model
- Phonetic models: Phoneme k-1, Phoneme k, Phoneme k+1
- Block diagram as on page 18
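One way to read the phoneme chain above is as a left-to-right Markov model in which each frame either stays in the current phoneme or advances to the next. The sketch below finds the best frame-to-phoneme alignment with the Viterbi algorithm. It is an illustration only: the function name, the three-state chain, the transition probabilities and the observation likelihoods are all invented for the example and do not come from the slides.

```python
import math

def viterbi_left_to_right(log_obs, log_stay, log_next):
    """Best alignment of T frames to N states in a left-to-right chain.

    log_obs[t][j] is the log-likelihood of frame t under state j.
    A state may repeat (log_stay) or advance by one (log_next).
    Requires T >= N; the path starts in state 0 and ends in state N - 1.
    """
    T, N = len(log_obs), len(log_obs[0])
    NEG = float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_obs[0][0]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_next if j > 0 else NEG
            if stay >= move:
                score[t][j], back[t][j] = stay, j
            else:
                score[t][j], back[t][j] = move, j - 1
            score[t][j] += log_obs[t][j]
    # Trace back from the last state at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], score[T - 1][N - 1]

# Invented example: six frames, each most likely under one of three phonemes.
likely_state = [0, 0, 1, 1, 1, 2]
log_obs = [[math.log(0.8 if j == s else 0.1) for j in range(3)]
           for s in likely_state]
path, score = viterbi_left_to_right(log_obs, math.log(0.5), math.log(0.5))
print(path)
```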

Page 23 of 30 Speech Recognition
- Dictionary and grammar: Trigram
- Chain rule (exact):
  Pr{the door was not opened} = Pr{the} Pr{door | the} Pr{was | the door} Pr{not | the door was} Pr{opened | the door was not}
- Trigram approximation (each word is conditioned on at most the two preceding words):
  Pr{the door was not opened} ≈ Pr{the} Pr{door | the} Pr{was | the door} Pr{not | door was} Pr{opened | was not}
- Block diagram as on page 18
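The trigram factorisation can be tried out on a toy corpus. The sketch below estimates each conditional probability by relative frequency, with no smoothing. The three-sentence corpus, the function names and the `<s>` / `</s>` padding symbols are assumptions for illustration. The padding also adds an end-of-sentence factor that the slide's formula does not include.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            ctx[tuple(padded[i - 2:i])] += 1
            tri[tuple(padded[i - 2:i + 1])] += 1
    return tri, ctx

def sentence_prob(words, tri, ctx):
    """Product of Pr{w_i | w_(i-2) w_(i-1)} estimated by relative frequency."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        if ctx[context] == 0:
            return 0.0  # unseen context; a real system would smooth instead
        p *= tri[context + (padded[i],)] / ctx[context]
    return p

# Toy corpus, invented for the example.
corpus = [
    "the door was not opened".split(),
    "the door was opened".split(),
    "the window was not opened".split(),
]
tri, ctx = train_trigrams(corpus)
p = sentence_prob("the door was not opened".split(), tri, ctx)
print(p)
```

Any sentence containing an unseen trigram gets probability zero here, which is why practical language models add smoothing or back-off.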

Page 24 of 30 Speech Recognition
- Database: voice, text
- Training: acoustic front-end, phonetic modeling, language modeling
- Block diagram as on page 18

Page 25 of 30 A Snapshot of Acoustic Front-End
- No standard set of features for speech recognition: acoustic/articulatory/auditory

Page 26 of 30 A Snapshot of Recognition Algorithm (I)
- Viterbi/Baum-Welch alignment
- Dynamic Time Warping
- Weighted Finite State Transducers (WFST)
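Of the techniques listed above, Dynamic Time Warping is the simplest to show in full. The sketch below is a textbook formulation, not necessarily the variant the presenter had in mind. Scalars stand in for feature vectors, and the sequences and the function name are invented for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [0, 1, 2, 3, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]   # same shape, spoken more slowly
other = [3, 2, 1, 0, 1, 2, 3]           # a different pattern

print(dtw_distance(template, slow), dtw_distance(template, other))
```

The slowed-down copy aligns with the template at zero cost, while the different pattern does not, which is the property that makes time-aligned template matching tolerant of speaking rate.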

Page 27 of 30 A Snapshot of Recognition (II)
- A simple example of the whole decoding network

Page 28 of 30 A Snapshot of Recognition (III)

Page 29 of 30 State of the Art

CORPUS                                     STYLE                      VOCABULARY SIZE  % WORD ERRORS
Digit strings                              spontaneous                11               2.0
Digit strings                              conversational             11               5.0
Resource Management                        read                       1,000            2.0
Airline Travel Information System (ATIS)   spontaneous                2,500            2.5
North American Business News (NAB)         read                       64,000           6.6
Call Home                                  conversational telephonic  28,000           40.0
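The slides do not define how the "% word errors" figures are computed. A common definition, assumed here, is the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recogniser output, divided by the number of reference words. The function name and example sentences below are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j]: fewest edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the door was not opened", "the door is not open")
print(wer)
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.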

Page 30 of 30 Literature
- L. R. Rabiner, R. W. Schafer, Introduction to Digital Speech Processing, Foundations and Trends in Signal Processing, Vol. 1, Nos. 1-2, 2007
- X. Huang, A. Acero, H. Hon, R. Reddy, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall, 2001
- D. Jurafsky, J. H. Martin, Speech and Language Processing, Prentice Hall, 2001