LITERATURE SURVEY - AUTOMATIC SPEECH RECOGNITION

Heather Sobey
Department of Computer Science
University of Cape Town
sbyhea001@uct.ac.za

ABSTRACT
One of the problems faced in speech recognition is that the spoken word can be vastly altered by accents, dialects and mannerisms. In South Africa, there is a large variety of languages and dialects. Even the most basic speech recognition systems perform poorly when trying to recognise words spoken by English second language speakers. The motivation behind this survey is to investigate speech recognition and, more specifically, what research has been done on the problem of large variations in dialects.

1. INTRODUCTION
Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Many speech recognition applications, such as voice dialing, simple data entry and speech-to-text, are in existence today. Automatic speech recognition systems involve numerous separate components drawn from many different disciplines such as statistical pattern recognition, communication theory, signal processing, combinatorial mathematics, and linguistics.

Attempts to build automatic speech recognition (ASR) systems were first made in the 1950s. These early speech recognition systems tried to apply a set of grammatical and syntactical rules to identify speech. If the spoken words adhered to a certain rule set, the system could recognise the words. However, human language has numerous exceptions to its own rules. The way words and phrases are spoken can be vastly altered by accents, dialects and mannerisms [3]. Therefore, today's speech recognition systems increasingly rely on statistical methodology, moving away from the approaches that were initially proposed, such as template matching, dynamic time warping, and non-probabilistically motivated distortion measures [5].
The statistical model that dominates the field today is the hidden Markov model (HMM).

2. OVERVIEW
The remainder of this literature survey is structured as follows: Section 3 discusses the different approaches to speech recognition. This is followed in Section 4 by a summary of the speech recognition process. Section 5 briefly discusses hidden Markov models and their application to speech processing. Section 6 looks at some research that has been done on speaker independent systems to handle various dialects. Finally, Section 7 summarises and concludes the paper.

3. VARIOUS APPROACHES TO SPEECH RECOGNITION
The three broad approaches to automatic speech recognition are the acoustic-phonetic, pattern recognition and artificial intelligence (AI) approaches [2]. The acoustic-phonetic approach to speech recognition has not been very successful in practical speech recognition systems. Both the pattern recognition and AI approaches to speech recognition have achieved higher success rates than the acoustic-phonetic approach.

3.1 Acoustic-Phonetic Approach
In this approach, the system tries to decode the speech signal in a sequential manner based on the observed acoustic features of the speech waveform and the known relations between acoustic features and phonetic symbols. Figure 1 shows a block diagram of the acoustic-phonetic approach to speech recognition. The first step is the parameter measurement process, which provides an appropriate spectral representation of the speech signal. The next step is the feature detection stage, where the spectral measurements are converted to a set of features that describe the acoustic properties of the various phonetic units. Finally, the recogniser tries to determine the best matching word or sequence of words.

Fig 1. Acoustic-Phonetic Approach to Speech Recognition [2]

3.2 Pattern Recognition Approach
In this approach, the speech patterns are used directly without explicit feature determination and segmentation. The method has two steps, namely training of speech patterns and recognition of patterns by way of pattern comparison. Figure 2 shows a block diagram of the pattern recognition approach. In the parameter measurement phase, a sequence of measurements is made on the input signal to define the test pattern. The unknown test pattern is then compared with each sound reference pattern, and a measure of similarity between the test pattern and the reference pattern is computed. Finally, the decision rule decides which reference pattern best matches the unknown test pattern based on the similarity scores from the pattern classification phase.

Fig 2. Pattern Recognition Approach to Speech Recognition [2]
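The comparison and decision steps of the pattern recognition approach can be illustrated with a minimal sketch. It is not taken from [2]: the function names, the use of a mean Euclidean distance as the (dis)similarity measure, the toy two-dimensional "feature vectors" and the assumption that test and reference patterns have the same number of frames are all illustrative choices.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (one frame each)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pattern_distance(test, reference):
    """Mean frame-wise distance between two equal-length patterns.

    Assumption for this sketch: both patterns have the same number of
    frames, so no time alignment is needed.
    """
    if len(test) != len(reference):
        raise ValueError("patterns must have the same number of frames")
    total = sum(frame_distance(t, r) for t, r in zip(test, reference))
    return total / len(test)


def classify(test, references):
    """Decision rule: pick the reference pattern with the smallest distance."""
    scores = {label: pattern_distance(test, ref)
              for label, ref in references.items()}
    best_label = min(scores, key=scores.get)
    return best_label, scores


# Hypothetical reference patterns produced by a training phase.
references = {
    "yes": [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]],
    "no": [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1]],
}
# Hypothetical test pattern from the parameter measurement phase.
test_pattern = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]

label, scores = classify(test_pattern, references)
print(label, scores)
```

In a real recogniser the patterns would rarely have equal length, which is why time-alignment methods or statistical models are used instead of a plain frame-by-frame comparison.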

3.3 Artificial Intelligence Approach
This approach is a hybrid of the acoustic-phonetic approach and the pattern recognition approach. In the artificial intelligence (AI) approach, an expert system or a self-organising (learning) system, implemented by neural networks, is used to classify sounds. The basic idea is to compile knowledge from a variety of knowledge sources and bring it to bear on the problem at hand.

4. SPEECH RECOGNITION PROCESS
In essence, the basic task involved in speech recognition is that of going from speech recordings to word labels. As the pattern recognition approach is the most widely used approach to speech recognition, it is discussed in more detail here. There are two main variants of the basic speech recognition task, namely isolated word recognition and fluent (connected word) speech recognition.

4.1 Variants of the Speech Recognition Task

4.1.1 Isolated Word Recognition
Isolated word recognition refers to the task of recognising a single spoken word where the choice of words is not constrained by task syntax or semantics. As described in [4], HMMs can be used to build an isolated word recogniser. Briefly, the HMM approach is a well-known and widely used statistical method of characterising the spectral properties of the frames of a pattern. HMMs are particularly suitable for speech recognition, as the speech signal can be well characterised as a parametric random process and the parameters of the stochastic process can be determined in a precise, well-defined manner [2].

4.1.2 Fluent Speech Recognition
Fluent speech recognition is a more complicated task than isolated word recognition. In this case the task is to recognise a continuous string of words from the vocabulary.

4.2 Feature Extraction and Pattern Recognition
The input into an automatic speech recognition system is the speech signal. The two major tasks involved in speech recognition are feature extraction and pattern recognition.

4.2.1 Feature Extraction
In all speech recognition systems the first step in the process is signal processing. Initially, a spectral and/or temporal analysis of the speech signal is performed to give observation vectors which can be used to train the HMMs [4]. One way to obtain observation vectors from speech samples is to perform spectral analysis. A type of spectral analysis that is often used is linear predictive coding (LPC) [4].

4.2.2 Pattern Recognition
Pattern recognition refers to the matching of features. The pattern recognition process consists of training and testing. During training, a model of each vocabulary word must be created. Each model consists of a set of features extracted from the speech signal. The exact form of the model depends on the type of pattern recognition algorithm used. During testing, a similar model is created for the unknown word. The pattern recognition algorithm compares the model of the unknown word with the models of known words and selects the word whose model scores highest [7]. There are many different pattern matching techniques. These include templates, dynamic time warping and HMMs.
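Of the pattern matching techniques listed above, dynamic time warping (DTW) is the easiest to show in a few lines. The sketch below is a textbook form of DTW and is not drawn from any of the cited works. It assumes one-dimensional features, an absolute-difference local cost and the three basic step directions; practical systems use feature vectors and often add slope or window constraints.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two sequences.

    D[i][j] holds the cost of the best alignment of a[:i] with b[:j].
    Each cell extends the cheapest of three predecessor alignments:
    advance in a only, advance in b only, or advance in both.
    """
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A template and a slower rendition of the same contour align perfectly.
template = [1, 2, 3]
slow_version = [1, 1, 2, 2, 3, 3]
reversed_version = [3, 2, 1]

print(dtw_distance(template, slow_version))
print(dtw_distance(template, reversed_version))
```

The time-stretched sequence matches the template at zero cost, whereas the reversed sequence does not. This tolerance to variation in speaking rate is the reason DTW was used for template-based word matching.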

5. HIDDEN MARKOV MODELS IN SPEECH RECOGNITION
Despite huge amounts of research into creating an intelligent speech recognition machine, we are far from achieving the desired goal of a machine that can understand spoken discourse on any subject by all speakers in all environments [2]. To date, the best results in speech recognition have been achieved by systems based on hidden Markov models. Hence, most current automatic speech recognition systems are based on HMMs.

5.1 Three Fundamental Problems of HMM Design
HMM design is characterised by three fundamental problems [4], namely:
1. The evaluation of the probability of a sequence of observations given a specific HMM.
2. The determination of a best sequence of model states.
3. The adjustment of model parameters to best account for the observed signal.

Various methods for solving these problems are discussed in the literature. The most popular technique used to solve problem 1, the Forward-Backward procedure, is an algorithm for computing the probability of a particular observation sequence [8]. The Viterbi algorithm [9] is a dynamic programming algorithm for finding the most likely sequence of hidden states that results in a sequence of observed events. It is a popular technique for solving problem 2, that of finding the best state sequence for the given observation sequence.

The third and most difficult problem in the design of HMMs is that of determining a method to maximise the probability of the observation sequence given the model. As Rabiner mentions [4], there is no known way to solve this problem analytically, nor is there an optimal way of estimating the model parameters from a finite training sequence. Instead, iterative procedures such as the Baum-Welch method (equivalently, the expectation-maximisation (EM) method [10]) or gradient techniques [11] can be used to choose model parameters that locally maximise this probability. The standard criterion for estimation of HMM parameters is maximum likelihood.
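Problems 1 and 2 can be made concrete with a small discrete HMM. The sketch below is illustrative only. The two-state model, its probabilities and the names `pi`, `A` and `B` (initial, transition and emission probabilities) are assumptions chosen for the example. `forward` implements only the forward half of the Forward-Backward procedure, which is sufficient for evaluating an observation sequence. Real systems work with log probabilities or scaling to avoid numerical underflow on long sequences.

```python
def forward(obs, pi, A, B):
    """Problem 1: probability of the observation sequence given the model."""
    if not obs:
        raise ValueError("observation sequence must be non-empty")
    n = len(pi)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, pi, A, B):
    """Problem 2: most likely state sequence and its joint probability."""
    if not obs:
        raise ValueError("observation sequence must be non-empty")
    n = len(pi)
    # delta[i] = probability of the best path that ends in state i
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev_states, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            prev_states.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backpointers.append(prev_states)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for prev_states in reversed(backpointers):
        state = prev_states[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Hypothetical 2-state model over a 2-symbol alphabet {0, 1}.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
obs = [0, 0, 1]

likelihood = forward(obs, pi, A, B)
path, best_prob = viterbi(obs, pi, A, B)
print(likelihood, path, best_prob)
```

The forward recursion sums over all state paths, while Viterbi keeps only the best one, so the Viterbi probability can never exceed the forward likelihood. Problem 3 (Baum-Welch re-estimation) builds on the same recursions but is omitted here to keep the sketch short.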

5.2 Types of HMMs
There are many different types of hidden Markov models. In the ergodic or fully connected HMM, every state of the model can be reached (in a single step) from every other state of the model [4]. In speech processing, the left-right or Bakis model [12] has been used. The benefit of this model is that it can model signals whose properties change over time [4].

Fig 3. Illustration of 2 types of HMMs. (a) A 4-state ergodic model. (b) A 4-state left-right model.
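The difference between the two model types shows up in the structure of the state transition matrix. The following sketch builds a 4-state matrix of each kind. The uniform probabilities are placeholders, not trained values. The `max_jump` parameter, which limits how many states may be skipped in one step, is an illustrative assumption and not something specified in this survey.

```python
def ergodic_transitions(n):
    """Fully connected model: every state reaches every state in one step."""
    return [[1.0 / n] * n for _ in range(n)]


def left_right_transitions(n, max_jump=2):
    """Left-right model: transitions never go back to a lower-numbered state.

    Allowed targets from state i are i, i+1, ..., i+max_jump (clipped to the
    last state), so the matrix is upper triangular. Probabilities are spread
    uniformly over the allowed targets purely as a placeholder.
    """
    rows = []
    for i in range(n):
        allowed = [j for j in range(n) if i <= j <= i + max_jump]
        p = 1.0 / len(allowed)
        rows.append([p if j in allowed else 0.0 for j in range(n)])
    return rows


ergodic = ergodic_transitions(4)
left_right = left_right_transitions(4)

for row in left_right:
    print(row)
```

In the left-right matrix all entries below the diagonal are zero and the last state can only loop on itself, so the state index can only stay the same or increase as time advances.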

5.3 Limitations of HMMs
HMMs have been successfully applied to problems in both isolated and connected word recognition. There are, however, some limitations of this type of statistical model for speech. One major limitation is the assumption made in the model that successive observations are independent. A second limitation is the assumption that the distributions of individual observation parameters can be well represented as a mixture of normal or autoregressive densities. The final limiting assumption is that the probability of being in a given state at a certain time depends only on the previous state. This assumption is inappropriate, as speech sound dependencies often extend through several states [4].

6. SPEAKER INDEPENDENT SPEECH SYSTEMS
There are many different languages and dialects throughout the world. Even the most basic (isolated word) speech recognition systems perform poorly when trying to recognise words spoken by English second language speakers. This section discusses some research that has been done into ways of handling large variations in dialects.

Improved performance for speaker independent speech recognition systems requires better modelling of different dialects of the target language. Previous work on this topic suggests that separate modelling of dialects is needed to accurately capture the many pronunciation differences that occur. Regardless of how much dialect data is included in training, some speakers will not be covered by the resulting model. These include non-native speakers of the language and speakers whose speech patterns have been affected by surgery [14].

The BBN BYBLOS system [15] is a continuous speech recognition system that has been used to develop a method of speaker adaptation from limited training data. The authors show that the system performs poorly for speakers with strong dialects. They also show how the degradation can be overcome by using speaker adaptation from multiple reference speakers. Their test results showed that their state-of-the-art speaker independent (SI) models perform poorly when a test speaker's characteristics differ markedly from those of the training speakers. Their SI models have difficulty, for instance, with a native speaker of English with an African-American dialect. Moreover, non-native speakers of American English nearly always suffer significantly degraded SI performance. To try to overcome this degradation, they adapted the training models to the new dialects by estimating a probabilistic spectral mapping between each of the training speakers and the test speaker [15]. They found that, for these speakers, the overall average word error rate after speaker adaptation improved by a factor of 5 over SI recognition. These results are evidence of the need for, and usefulness of, speaker adaptation in recognising the speech of speakers whose dialects differ from those found in the training data.

The results achieved by the BYBLOS system [15] agree with the testing done on the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech [16]. Testing on the HTK system showed that adaptation to the test speaker and the acoustic environment greatly improves the performance of automatic speech recognisers [16].
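The results above are reported as word error rates. The survey does not define the metric, so the sketch below assumes the commonly used definition: the minimum number of word substitutions, deletions and insertions needed to turn the recogniser output into the reference transcript, divided by the number of reference words. The function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (one row of the standard DP table).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("call the office now", "call office now"))
```

Because insertions count as errors, the value can exceed 1.0 when the recogniser outputs many more words than the reference contains.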

7. CONCLUSION
There has been much progress in the field of automatic speech recognition since its humble beginnings in the 1950s. Various approaches to ASR have been mentioned. Current speech recognition systems are generally based on hidden Markov models, as these models have led to the best results thus far. Although HMMs have been very successful, they have a few limitations, which were discussed. The need for, and usefulness of, speaker adaptation in speaker independent systems was highlighted. We are a long way from achieving perfect speech recognition, and there is much research still to be done in the field of automatic speech recognition.

8. REFERENCES
[1] Speech Recognition. <http://searchcrm.techtarget.com/sdefinition/0,,sid11_gci213033,00.html> Last accessed 28 April 2009.
[2] L.R. Rabiner, B.H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Inc., Upper Saddle River, NJ. 1993.
[3] E. Grabianowski. How Speech Recognition Works. HowStuffWorks.com. 10 November 2006. <http://electronics.howstuffworks.com/speech-recognition.htm> Last accessed 26 April 2009.
[4] L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286. 1989.
[5] J.A. Bilmes. Graphical Models and Automatic Speech Recognition. The IMA Volumes in Mathematics and Its Applications, pp. 191-245.
[6] Feature Extraction. <http://www.cnel.ufl.edu/~yadu/feature.html> Last accessed 30 April 2009.
[7] R.L. Klevans, R.D. Rodman. Voice Recognition. 1997.
[8] L.E. Baum and J.A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull. Amer. Math. Soc., vol. 73, pp. 360-363. 1967.
[9] A.J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Informat. Theory, vol. IT-13, pp. 260-269. Apr. 1967.
[10] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., vol. 39, no. 1, pp. 1-38. 1977.
[11] S.E. Levinson, L.R. Rabiner and M.M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J., vol. 62, no. 4, pp. 1035-1074. Apr. 1983.
[12] R. Bakis. Continuous speech recognition by statistical methods. Proc. IEEE, vol. 64, pp. 532-536. 1976.
[13] J. Makhoul, S. Roucos and H. Gish. Vector quantization in speech coding. Proc. IEEE, vol. 73, no. 11, pp. 1551-1588. Nov. 1985.
[14] V. Beattie, S. Edmondson, D. Miller, Y. Patel, G. Talvola. An integrated multi-dialect speech recognition system with optional speaker adaptation. In EUROSPEECH-1995, pp. 1123-1126.
[15] F. Kubala, S. Austin, C. Barry, J. Makhoul, P. Placeway, R. Schwartz. Byblos. Speech Recognition Benchmark Results.
[16] T. Hain, P.C. Woodland, G. Evermann, M.J.F. Gales, X. Liu, G.L. Moore, D. Povey, L. Wang. Automatic Transcription of Conversational Telephone Speech. IEEE Transactions on Speech and Audio Processing, vol. 13, no. 6. November 2005.