
Speech Recognition Using Demi-Syllable Neural Prediction Model

Ken-ichi Iso and Takao Watanabe
C & C Information Technology Research Laboratories, NEC Corporation
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 213, JAPAN

Abstract

The Neural Prediction Model is a speech recognition model based on pattern prediction by multilayer perceptrons. Its effectiveness was confirmed by speaker-independent digit recognition experiments. This paper presents an improvement of the model and its application to large vocabulary speech recognition based on subword units. The improvement is the introduction of "backward prediction," which further improves the prediction accuracy of the original model, which used only "forward prediction". In the application of the model to speaker-dependent large vocabulary speech recognition, the demi-syllable is used as the subword recognition unit. Experiments yielded a 95.2% recognition accuracy on a 5000 word test set, confirming the effectiveness of the proposed model improvement and of the demi-syllable subword units.

1 INTRODUCTION

The Neural Prediction Model (NPM) is a speech recognition model based on pattern prediction by multilayer perceptrons (MLPs). Its effectiveness was confirmed by speaker-independent digit recognition experiments (Iso, 1989; Iso, 1990; Levin, 1990). The advantages of the NPM approach are as follows. The underlying process of speech production can be regarded as a nonlinear dynamical system, so a causal relation can be expected among adjacent speech feature vectors. In the NPM, this causality is represented by the nonlinear prediction mapping F_w,

    a_t = F_w(a_{t-1}),    (1)

where a_t is the speech feature vector at frame t and the subscript w represents the mapping parameters. This causality is not explicitly considered in the conventional speech recognition model, where adjacent speech feature vectors are treated as independent variables. Another important characteristic of the model is its applicability to continuous speech recognition: by concatenating the recognition unit models, continuous speech recognition and model training from continuous speech can be implemented without the need for segmentation.

This paper presents an improvement in the NPM and its application to large vocabulary speech recognition based on subword units. The improvement is the introduction of "backward prediction," which further improves the prediction accuracy over the original model with only "forward prediction". In Section 2, the improved predictor configuration and the NPM recognition and training algorithms are described in detail. Section 3 presents the definition of the demi-syllables used as subword recognition units. Experimental results obtained for speaker-dependent large vocabulary speech recognition are described in Section 4.

2 NEURAL PREDICTION MODEL

2.1 MODEL CONFIGURATION

Figure 1 shows the MLP predictor architecture. It is given two groups of feature vectors as input. One is the feature vectors for "forward prediction"; the other is the feature vectors for "backward prediction". The former comprises the input speech feature vectors a_{t-τF}, ..., a_{t-1}, which were used in the original formulation. The latter, a_{t+1}, ..., a_{t+τB}, are introduced in this paper to further improve the prediction accuracy over the original method with only "forward prediction". This is expected, for example, to improve the prediction accuracy for voiceless stop consonants, which are characterized by a closure interval followed by a sudden release. The MLP output, â_t, is used as the prediction of the input speech feature vector a_t.

Figure 1: Multilayer perceptron predictor
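The predictor of Figure 1 can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the concatenation of τF past frames (forward prediction) and τB future frames (backward prediction) and the single hidden layer follow the description above, while the tanh activation, weight initialization, and class/method names are assumptions for illustration.

```python
import numpy as np

class MLPPredictor:
    """One-hidden-layer perceptron that predicts feature vector a_t from
    tau_f past frames (forward prediction) and tau_b future frames
    (backward prediction), as in Figure 1."""

    def __init__(self, dim, tau_f=2, tau_b=1, hidden=20, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = dim * (tau_f + tau_b)
        self.tau_f, self.tau_b = tau_f, tau_b
        self.W1 = rng.normal(0.0, 0.1, (hidden, in_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (dim, hidden))
        self.b2 = np.zeros(dim)

    def predict(self, frames, t):
        # Concatenate a_{t-tau_f} ... a_{t-1} and a_{t+1} ... a_{t+tau_b}.
        past = frames[t - self.tau_f : t].ravel()
        future = frames[t + 1 : t + 1 + self.tau_b].ravel()
        x = np.concatenate([past, future])
        h = np.tanh(self.W1 @ x + self.b1)   # hidden layer
        return self.W2 @ h + self.b2         # predicted feature vector

    def error(self, frames, t):
        # Squared prediction error: the local distance used by the DP match.
        d = frames[t] - self.predict(frames, t)
        return float(d @ d)
```

With (τF, τB) = (3, 0) the same class reduces to the original forward-only predictor, which is how the Type A / Type B comparison in Section 4 is set up.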

The difference between the input speech feature vector a_t and its prediction â_t is the prediction error. It can also be regarded as the error function for MLP training, based on the back-propagation technique. The NPM for a recognition class, such as a subword unit, is constructed as a state transition network in which each state has an MLP predictor as described above (Figure 2). This configuration is similar in form to the Hidden Markov Model (HMM), in which each state has a vector emission probability distribution (Rabiner, 1989). The concatenation of these subword NPMs enables continuous speech recognition.

Figure 2: Neural Prediction Model

2.2 RECOGNITION ALGORITHM

This section presents the continuous speech recognition algorithm based on the NPM. The concatenation of subword NPMs, which is also a state transition network, is used as the reference model for the input speech. Figure 3 shows a diagram of the recognition algorithm. In recognition, the input speech is divided into segments whose number equals the total number of states in the concatenated NPMs (= N). Each state makes a prediction for the corresponding segment. The local prediction error between the input speech at frame t and the n-th state is given by

    d_t(n) = || a_t - â_t(n) ||^2,    (2)

where n is the consecutive number of the state in the concatenated NPM and â_t(n) is the prediction made by that state. The accumulation of local prediction errors defines the global distance between the input speech and the concatenated NPMs,

    D = min_{n_t} Σ_{t=1}^{T} d_t(n_t),    (3)

where n_t denotes the state number used for prediction at frame t, and the sequence {n_1, n_2, ..., n_t, ..., n_T} determines the segmentation of the input speech. The minimization means that the optimal segmentation, which gives the minimum accumulated prediction error, is selected. This optimization problem can be solved by dynamic programming, yielding the DP recursion formula

    g_t(n) = d_t(n) + min { g_{t-1}(n), g_{t-1}(n-1) }.    (4)

At the end of the recursive application of Equation (4), D = g_T(N) is obtained. Backtracking the result provides the input speech segmentation.
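The recursion of Equation (4) and its backtracking can be sketched as follows. This is a minimal illustrative sketch, assuming the local errors d_t(n) have already been computed into a T×N array; the function and variable names are not from the paper.

```python
import numpy as np

def dp_align(d):
    """Minimize the accumulated prediction error of Equation (3) by the
    DP recursion of Equation (4):
        g_t(n) = d_t(n) + min(g_{t-1}(n), g_{t-1}(n-1)),
    and recover the optimal state sequence {n_t} by backtracking.
    d: (T, N) array of local prediction errors d_t(n)."""
    T, N = d.shape
    g = np.full((T, N), np.inf)
    back = np.zeros((T, N), dtype=int)     # best predecessor state at t-1
    g[0, 0] = d[0, 0]                      # alignment starts in state 1
    for t in range(1, T):
        for n in range(N):
            stay = g[t - 1, n]
            step = g[t - 1, n - 1] if n > 0 else np.inf
            if stay <= step:
                g[t, n] = d[t, n] + stay
                back[t, n] = n
            else:
                g[t, n] = d[t, n] + step
                back[t, n] = n - 1
    # D = g_T(N): alignment ends in the last state; backtrack the path.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return g[T - 1, N - 1], path[::-1]
```

The returned path is the segmentation {n_t}; the same routine supplies the optimal segmentation used in step 3 of the training algorithm below (Section 2.3).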

Figure 3: Recognition algorithm based on DP

In this algorithm, temporal distortion of the input speech is efficiently absorbed by DP-based time alignment between the input speech and the MLP sequence. For simplicity, the reference model topology shown above is limited to a sequence of MLPs with no branches; the algorithm is clearly applicable to more general topologies with branches.

2.3 TRAINING ALGORITHM

This section presents a training algorithm for estimating the NPM parameters from continuous utterances. The training goal is to find the set of MLP predictor parameters which minimizes the accumulated prediction error over the training utterances. The objective function for the minimization is defined as the average of the accumulated prediction errors over all training utterances,

    D̄ = (1/M) Σ_{m=1}^{M} D(m),    (5)

where M is the number of training utterances and D(m) is the accumulated prediction error between the m-th training utterance and its concatenated NPM, given by Equation (3). The optimization is carried out by an iterative procedure combining the dynamic-programming (DP) and back-propagation (BP) techniques. The algorithm is as follows:

1. Initialize all MLP predictor parameters.
2. Set m = 1.

3. Compute the accumulated prediction error D(m) by DP (Equation (4)) and determine the optimal segmentation {n_t*} by backtracking.
4. Correct the parameters of each MLP predictor by BP, using the optimal segmentation {n_t*}, which determines the desired output a_t for the actual output â_t(n_t*) of the n_t*-th MLP predictor.
5. Increase m by 1.
6. Repeat 3-5 while m ≤ M.
7. Repeat 2-6 until convergence occurs.

A convergence proof for this iterative procedure was given in (Iso, 1989; Iso, 1990). Intuitively, both DP and BP decrease the accumulated prediction error, and they are applied successively.

3 Demi-Syllable Recognition Units

In applying the model to large vocabulary speech recognition, the demi-syllable is used as the subword recognition unit (Yoshida, 1989). The demi-syllable is a half-syllable unit, divided at the center of the syllable nucleus. It can treat contextual variations caused by the co-articulation effect with a moderate number of units. The units consist of consonant-vowel (CV) and vowel-consonant (VC) segments. Word models are made by concatenating demi-syllable NPMs, as described in the transcription dictionary. Their segmentation boundaries are basically defined as the consonant start point and the vowel center point (Figure 4). In practice, they are determined automatically in the training algorithm, based on the minimum accumulated prediction error criterion (Section 2.3).

Figure 4: Demi-syllable unit boundary definition (example word: [hakata])

4 EXPERIMENTS

4.1 SPEECH DATA AND MODEL CONFIGURATION

In order to examine the validity of the proposed model, speaker-dependent Japanese isolated word recognition experiments were carried out. Phonetically balanced 250, 500 and 750 word sets were selected from a Japanese word lexicon as training vocabularies. For the word recognition experiments, a 250 word test set was prepared.

All the words in the test set were different from those in the training sets. A Japanese male speaker uttered these word sets in a quiet environment. The speech data was sampled at a 16 kHz sampling rate and analyzed with a 10 msec frame period. As the feature vector for each time frame, 10 mel-scaled cepstral parameters, 10 mel-scaled delta cepstral parameters and an amplitude changing-ratio parameter were calculated from the FFT-based spectrum.

NPMs were prepared for the demi-syllable units. Their total number was 241; each demi-syllable NPM consists of a sequence of four MLP predictors, except for the silence and long-vowel NPMs, which have one MLP predictor. Every MLP predictor has 20 hidden units and 21 output units, corresponding to the feature vector dimensions. The numbers of input speech feature vectors, denoted by τF for the forward prediction and τB for the backward prediction in Figure 1, were chosen in two configurations, (τF, τB) = (2,1) and (3,0). The former, Type A, uses both the forward and backward predictions, while the latter, Type B, uses the forward prediction only.

4.2 WORD RECOGNITION EXPERIMENTS

All possible combinations of training data amounts (250, 500, 750 words) and MLP input layer configurations (Type A and Type B) were evaluated in 5000 word recognition experiments. To reduce the computational load of the 5000 word recognition experiments, the similar-word recognition method described below was employed. For every word in the 250 word recognition vocabulary, a 100 similar word set is chosen from the 5000 word recognition vocabulary, using a distance based on a manually defined phoneme confusion matrix. In the experiments, every word in the 250 word utterances is compared with its 100 similar word set. It has been confirmed that this similar-word recognition method gives results approximately equivalent to actual 5000 word recognition (Koga, 1989).
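The similar-word preselection described above can be sketched as a phoneme-string edit distance whose substitution cost is read from a confusion matrix. This is a hypothetical reconstruction: the paper does not specify its distance definition or confusion matrix, so the cost structure, function names, and the dict-of-dicts matrix format here are illustrative assumptions.

```python
def confusion_distance(w1, w2, sub_cost, indel=1.0):
    """Edit distance between two phoneme sequences, with substitution cost
    looked up in a phoneme confusion matrix (dict of dicts; assumed format).
    Unlisted phoneme pairs fall back to the insertion/deletion cost."""
    n, m = len(w1), len(w2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sub_cost.get(w1[i - 1], {}).get(w2[j - 1], indel)
            if w1[i - 1] == w2[j - 1]:
                sub = 0.0
            D[i][j] = min(D[i - 1][j] + indel,      # deletion
                          D[i][j - 1] + indel,      # insertion
                          D[i - 1][j - 1] + sub)    # substitution
    return D[n][m]

def similar_words(word, lexicon, sub_cost, k=100):
    """Return the k lexicon entries closest to `word` -- the shortlist
    (100 words in the paper) compared against each test utterance."""
    ranked = sorted(lexicon, key=lambda w: confusion_distance(
        word.split(), w.split(), sub_cost))
    return ranked[:k]
```

Words are given as space-separated phoneme strings; confusable pairs (low confusion cost) then rank ahead of phonetically distant ones.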
Figure 5: Recognition accuracy (%) vs. training data amount (250, 500, 750 words)

The results of the 5000 word recognition experiments are shown in Figure 5. Consistently higher recognition accuracies were obtained for the input layer configuration with backward prediction (Type A) than for the configuration without it (Type B), and the absolute recognition accuracies become higher as the training data amount increases.

5 DISCUSSION AND CONCLUSION

This paper has presented an improvement in the Neural Prediction Model (NPM), the introduction of backward prediction, and its application to large vocabulary speech recognition based on demi-syllable units. The experiments verified the applicability of the NPM to large vocabulary (5000 word) speech recognition. This suggests the usefulness of the recognition and training algorithms for concatenated subword unit NPMs, which require no segmentation. A related result was reported in (Tebelskis, 1990) (90% for 924 words), where the subword units (phonemes) were limited to a subset of the complete Japanese phoneme set and duration constraints were introduced heuristically. In this paper, the authors used demi-syllable units, which can cover any Japanese utterance, and no duration constraints. The high recognition accuracy (95.2%) obtained for 5000 words indicates the advantages of using demi-syllable units and of introducing backward prediction in the NPM.

Acknowledgements

The authors wish to thank the members of the Media Technology Research Laboratory for their continuous support.

References

K. Iso. (1989), "Speech Recognition Using Neural Prediction Model," IEICE Technical Report, SP89-23, pp. 81-87 (in Japanese).

K. Iso and T. Watanabe. (1990), "Speaker-Independent Word Recognition Using A Neural Prediction Model," Proc. ICASSP-90, S8.8, pp. 441-444.

E. Levin. (1990), "Word Recognition Using Hidden Control Neural Architecture," Proc. ICASSP-90, S8.6, pp. 433-436.

J. Tebelskis and A. Waibel. (1990), "Large Vocabulary Recognition Using Linked Predictive Neural Networks," Proc. ICASSP-90, S8.7, pp. 437-440.

L. R. Rabiner. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, February 1989.

K. Yoshida, T. Watanabe and S. Koga. (1989), "Large Vocabulary Word Recognition Based on Demi-Syllable Hidden Markov Model Using Small Amount of Training Data," Proc. ICASSP-89, S1.1, pp.-4.

S. Koga, K. Yoshida, and T. Watanabe. (1989), "Evaluation of Large Vocabulary Speech Recognition Based on Demi-Syllable HMM," Proc. of ASJ Autumn Meeting (in Japanese).