Paper Review Seminar: Research Issues in Speech Recognition. Bartosz Ziolko



Computer speech recognition system: an automatic speech recognition system takes an acoustic signal as input and produces a sequence of symbols as output. (1870: Alexander Graham Bell)

Definition & classification. Speech recognition allows computers equipped with a microphone to interpret human speech, e.g. for transcription; it is an alternative method of interacting with a computer. Systems can be classified by whether:
- the system requires the user to "train" it to recognise their speech patterns,
- the system is trained for one user only or is speaker independent,
- the system recognises continuous speech or discrete words only,
- the system is intended for clean speech material only (no distorted speech, background noise or other speakers talking simultaneously),
- the vocabulary is small or large.

Applications. Computer users can create and edit documents and interact with the computer more quickly, because people can speak faster than anyone can type. Poor typists, and especially people with sight disabilities, can greatly increase their productivity. Speaking to a computer is much faster and easier than typing!

Is speech recognition more than 100 years old?
1. 1870: Alexander Graham Bell, phonautograph
2. The Swiss linguist Ferdinand de Saussure, Course in General Linguistics (1916)
3. Radio Rex, 1920

Approaches. Isolated word recognition constrains the recognised phrases to a small set of possible responses. L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989. Dictation transcribes speech word by word; it does not require semantic understanding, as the goal is to identify the exact words. Natural language recognition allows the speaker to use natural, sentence-length patterns. S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, 13(5): 45-57, 1996.

Scheme of the speech recognition system: time-frequency analysis → speech segmentation → segment parameterization → fitting the nearest basis element → transcription and building the words → lexical decoding → syntactic analysis → semantic analysis. L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.

Pronunciation.
English: Afghanistan, agency, heighten
Polish: Afganistan, agencja, wzmagać
German: Afganistan, Agentur, steigen
Many English words sound alike (e.g. night and knight). "I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech". Phonemes are context dependent: phonemes with different left and right contexts have different realizations. A general solution requires human knowledge and experience as well as advanced pattern recognition and artificial intelligence.

Difficulties:
- Co-articulation of phonemes and words makes the task of speech recognition difficult.
- Intonation and sentence stress play an important role in interpretation. The utterances "go!", "go?" and "go." can clearly be distinguished by a human but are difficult for a computer.
- In naturally spoken language there are no pauses between words, so it is difficult for a computer to decide where word boundaries lie.

Speech audibility. [Figure: acoustic pressure [dB] versus frequency [kHz], showing the speech area between the stimulation threshold and the pain threshold.] Tadeusiewicz R., Sygnał mowy (Speech Signal), Wydawnictwa Komunikacji i Łączności, Warszawa, Poland, 1988.

Jean Baptiste Joseph Fourier, On the Propagation of Heat in Solid Bodies, 1807. The Fourier spectrum is

$\hat{s}(f) = \int_{-\infty}^{\infty} s(t) \exp(-2\pi j f t)\, dt$

[Figure: amplitude of an example spectrum over frequency, 0-5.5 kHz.]
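In practice the integral above is approximated on sampled signals with the discrete Fourier transform. A minimal sketch using NumPy's FFT; the signal, sample rate and tone frequency are illustrative choices, not from the slides:

```python
import numpy as np

# Discrete approximation of the Fourier spectrum via the FFT.
fs = 11025                        # sampling frequency [Hz]
t = np.arange(0, 0.1, 1 / fs)     # 100 ms of signal
s = np.sin(2 * np.pi * 1000 * t)  # a 1 kHz test tone

spectrum = np.fft.rfft(s)                # one-sided complex spectrum
freqs = np.fft.rfftfreq(len(s), 1 / fs)  # matching frequency axis [Hz]
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(peak_hz)  # the magnitude peaks near 1 kHz
```

Plotting `np.abs(spectrum)` against `freqs` reproduces the kind of amplitude-vs-frequency picture shown on the slide.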

Nonlinear scale. The mel scale maps frequency in Hz to a perceptually motivated scale:

$f_{mel} = 1000 \log_2 \left(1 + \frac{f_{Hz}}{1000}\right)$

[Figure: spectrograms of the same utterance on a linear frequency scale (0-5000 Hz) and on the mel scale (0-2500 mel), over 0-0.6 s.]
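A minimal sketch of the conversion formula above; the function name is ours, not from the slides:

```python
import math

def hz_to_mel(f_hz):
    # f_mel = 1000 * log2(1 + f_Hz / 1000), as on the slide
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

# By construction 1000 Hz maps to exactly 1000 mel, and the scale
# compresses higher frequencies: 4000 Hz is only about 2322 mel.
print(hz_to_mel(1000.0))         # 1000.0
print(round(hz_to_mel(4000.0)))  # 2322
```

This compression of high frequencies is what makes the mel-scale spectrogram allocate more resolution to the perceptually important low band.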

Cepstrum. The term cepstrum was introduced by Bogert et al. and has come to be the accepted terminology for the (inverse) Fourier transform of the logarithm of the power spectrum of a signal. (L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.) Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first four letters. A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. There is a complex cepstrum and a real cepstrum. The cepstrum was defined in a 1963 paper: Tukey, J. W., B. P. Bogert and M. J. R. Healy, "The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking", Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, ed.), Chapter 15, pp. 209-243, New York: Wiley.

Cepstrum. Verbally: the cepstrum is the FT of the log of the power spectrum:

signal → FFT → frequency spectrum → squaring → power spectrum → logarithm → smoothing → FFT → cepstrum

Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is the "inverse Fourier transform of the log of the spectrum".
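The chain above can be sketched in a few lines. This follows the slide's FFT-of-log-power-spectrum definition (smoothing omitted for brevity); the function name and test signal are illustrative:

```python
import numpy as np

def real_cepstrum(signal):
    # signal -> FFT -> power spectrum -> logarithm -> FFT -> cepstrum
    spectrum = np.fft.fft(signal)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # small offset avoids log(0)
    return np.real(np.fft.fft(log_power))

fs = 8000
t = np.arange(0, 0.05, 1 / fs)       # 50 ms -> 400 samples
s = np.sin(2 * np.pi * 200 * t)
c = real_cepstrum(s)
print(c.shape)  # one cepstral value per input sample
```

The independent variable of the result is "quefrency" (units of time), which is why pitch and vocal-tract contributions separate so cleanly in it.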

Mel-Frequency Cepstrum Coefficients. S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 4, 1980. The coefficients are computed as

$c_n = \sum_{k=1}^{M} X_k \cos\left(n \left(k - \tfrac{1}{2}\right) \frac{\pi}{M}\right), \quad n = 1, 2, \ldots, 12,$

where M is the number of mel filters and $X_k$ represents the log-energy output of the kth filter. S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, 13(5): 45-57, 1996.
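A minimal sketch of the cosine transform above, taking filterbank log-energies as input; the function name and the toy filterbank values are made up for illustration:

```python
import math

def mfcc_from_log_energies(X, num_coeffs):
    # c_n = sum_{k=1..M} X_k * cos(n * (k - 0.5) * pi / M)
    M = len(X)
    return [
        sum(X[k] * math.cos(n * (k + 0.5) * math.pi / M) for k in range(M))
        for n in range(1, num_coeffs + 1)
    ]

X = [1.0] * 20  # a flat (toy) log-energy filterbank of M = 20 filters
coeffs = mfcc_from_log_energies(X, 12)
# The cosine basis is orthogonal to a constant, so a perfectly flat
# spectrum yields coefficients that are all zero (up to float error).
print(max(abs(c) for c in coeffs) < 1e-9)  # True
```

Any real speech frame has a non-flat filterbank output, so its coefficients encode the spectral envelope shape rather than its overall level (which lives in $c_0$, excluded here).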

Other parameters.
D. Zhu, K.K. Paliwal, "Product of Power Spectrum and Group Delay Function for Speech Recognition", Proceedings of ICASSP 2004, pp. I-125-8. Mel-frequency product spectrum cepstral coefficients; exploits phase spectrum information.
K. Ishizuka and N. Miyazaki, "Speech Feature Extraction Method Representing Periodicity and Aperiodicity in Sub Bands for Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-141-4. It focuses on feature extraction that represents the aperiodicity of speech. The method is based on gammatone filter banks, framing, autocorrelation and comb filters.
H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, "Spectral Entropy Based Feature for Robust ASR", Proceedings of ICASSP 2004, pp. I-193-6. Normalizing a spectrum into a function like a probability mass function (PMF) allows one to calculate its entropy.
H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738-52, 1990.
Yoshizawa, N. Hayasaka, N. Wada and Y. Miyanaga, "Cepstral Gain Normalization for Noise Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-209-12.

Hidden Markov Model. A Hidden Markov Model (HMM) is a statistical model in which the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. The extracted model parameters can then be used to perform further analysis, for example for speech recognition applications. Speech recognition systems are generally based on HMMs or hybrid solutions with artificial neural networks. The statistical model gives the probability of a word sequence given observed acoustic data by the application of Bayes' rule:

$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\, P(\text{word})}{p(\text{acoustic})}$

L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989. For example, P(mushroom soup) > P(much rooms hope). The same approach can similarly be applied to phonemes, words, syntax and semantics.
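A minimal sketch of Bayes-rule decoding with the slide's "mushroom soup" example. The hypotheses and all probability values below are invented for illustration; in a real recogniser the likelihood comes from the HMM acoustic model and the prior from a language model:

```python
# P(word | acoustic) is proportional to p(acoustic | word) * P(word).
hypotheses = {
    "mushroom soup":   {"likelihood": 0.20, "prior": 0.010},
    "much rooms hope": {"likelihood": 0.22, "prior": 0.0001},
}

def posterior_score(h):
    # p(acoustic) is identical for every hypothesis, so it can be
    # dropped when only the argmax over hypotheses is needed.
    return hypotheses[h]["likelihood"] * hypotheses[h]["prior"]

best = max(hypotheses, key=posterior_score)
print(best)  # "mushroom soup"
```

Even though "much rooms hope" fits the acoustics slightly better here, the language-model prior overrules it, which is exactly the effect the slide's inequality P(mushroom soup) > P(much rooms hope) illustrates.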

Wavelet spectra. [Figures: Daubechies scaling function (phi) and wavelet (psi) of order 12, their Fourier transforms, and an STFT spectrogram (frequency [Hz] vs. time [s]) compared with the continuous (scale a vs. time b) and discrete (resolution m vs. time 2^-m n) wavelet spectra of the same signal.]
I. Daubechies, "Orthonormal bases of compactly supported wavelets", Commun. Pure Appl. Math., pp. 909-996, 1988.
O. Rioul, M. Vetterli, "Wavelets and signal processing", IEEE Signal Processing Mag., vol. 8, pp. 14-38, October 1991.
O. Farooq, S. Datta, "Wavelet based robust subband features for phoneme recognition", IEE Proceedings: Vision, Image & Signal Processing, vol. 151, no. 3, pp. 187-93, 2004.

[Figure: speech signal of the word "Andrzej" (amplitude vs. time) and its discrete wavelet transform (reverse scale vs. time).]

The frequency band splitting. Sampling frequency f0 = 11025 Hz; Δt ≈ 90.7 μs is the discretization density (sampling interval).

Decomposition level | Frequency band [Hz] | Discretization density
D1 | 2756-5512 | 2 Δt
D2 | 1378-2756 | 4 Δt
D3 | 689-1378 | 8 Δt
D4 | 345-689 | 16 Δt
D5 | 172-345 | 32 Δt
D6 | 86-172 | 64 Δt = 5.805 ms
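The table follows directly from the dyadic structure of the discrete wavelet transform: each decomposition level halves the remaining band and doubles the time step. A small sketch reproducing the band edges (variable names are ours):

```python
# At level m the detail band Dm spans (f0 / 2**(m+1), f0 / 2**m)
# and the effective time step grows to 2**m * dt.
f0 = 11025.0   # sampling frequency [Hz]
dt_us = 90.7   # sampling interval [microseconds]

for m in range(1, 7):
    lo = f0 / 2 ** (m + 1)
    hi = f0 / 2 ** m
    step_ms = 2 ** m * dt_us / 1000
    print(f"D{m}: {lo:.0f}-{hi:.0f} Hz, step {2 ** m} * dt = {step_ms:.3f} ms")
```

Running this reproduces the rows of the table, e.g. D6 covering roughly 86-172 Hz with a 5.805 ms step.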

Other topics in speech recognition.
R. Sarikaya, J.H.L. Hansen, "High Resolution Speech Feature Parametrization for Monophone Based Stressed Speech Recognition", IEEE Signal Processing Letters, vol. 7, no. 7, pp. 182-5, July 2000. Impact of speaker stress (neutral, angry, loud, Lombard) on monophone speech recognition accuracy. The paper compares sets of parameters: MFCC, wavelet packet parameters (continuous time) and SBC (subband-based cepstral).
M. Wester, J. Frankel, S. King, "Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian Networks", Proc. IEICI Beyond HMM Workshop, Kyoto, December 2004. Waveforms are parameterised as 12 MFCCs and energy with 1st and 2nd derivatives appended. The articulatory features here are: manner, place, voicing, rounding, front-back, static.

Other topics in speech recognition.
M. Bacchiani and B. Roark, "Metadata Conditional Language Modeling", Proceedings of ICASSP 2004, pp. I-241-4. It describes an algorithm that uses metadata such as the calling phone number to recognise the speaker and adapt the ASR system to the user.
G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P.C. Woodland, "Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System", Proceedings of ICASSP 2004, pp. I-249-52. HTK is the most recognized academic toolkit for automatic speech recognition, based on HMMs and MFCCs. It has been designed at the University of Cambridge by the Machine Intelligence Laboratory. http://htk.eng.cam.ac.uk/
H. Van hamme, "Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks", Proceedings of ICASSP 2004, pp. I-213-6. It describes Missing Data Techniques and an improved Missing Data Detector (MDD). The MDD can compute missing data masks from the noisy signal using harmonic decomposition, without long-term noise averaging.

Open issues and research topics: large vocabulary, semantic analysis, phoneme segmentation, different languages, dialect support.

Segmentation. [Figure: segmentation of the word "Andrzej" into entire segments.]

Thank you for your attention