
Large Vocabulary Natural Language Continuous Speech Recognition*

L. R. Bahl, R. Bakis, J. Bellegarda, P. F. Brown, D. Burshtein, S. K. Das, P. V. de Souza, P. S. Gopalakrishnan, F. Jelinek, D. Kanevsky, R. L. Mercer, A. J. Nadas, D. Nahamoo, M. A. Picheny

The present paper describes our current research on automatic speech recognition of continuously read sentences from a naturally-occurring corpus: office correspondence. The recognition system combines features from our current isolated-word recognition system and from our previously developed continuous speech recognition systems. It consists of an acoustic processor, an acoustic channel model, a language model, and a linguistic decoder. Some new features in the recognizer relative to our isolated-word speech recognition system include the use of a fast match to rapidly prune to a manageable number the candidates considered by the detailed match, multiple pronunciations of all function words, and modelling of interphone coarticulatory behavior. To date, we have recorded training and test data from a set of 10 male talkers. The test data consist of 50 sentences drawn from spontaneously generated memos covered by a 5,000 word vocabulary. The perplexity of the test sentences was found to be 93; none of the sentences were part of the data used to generate the language model. Preliminary (speaker-dependent) recognition results on these talkers yielded an average word error rate of 11.0%.

1. Introduction

The present paper describes our current research on automatic speech recognition of continuously read sentences from a naturally-occurring corpus: office correspondence. In previous work, we have concentrated on recognition of continuously read sentences from a 250 word vocabulary finite state grammar [1], continuously read sentences from a 1,000 word naturally-occurring corpus [1], and sentences from 5,000 and 20,000 word naturally-occurring corpora read with pauses between words [2, 3]. This paper extends the previous work towards the recognition of continuously read sentences from a natural corpus covered by a 5,000 word vocabulary.

* This is a paper originally presented at the Seoul International Conference on Natural Language Processing on November 22, 1990. Language Research, Volume 26, Number 4, December 1990.

2. Task Description

The office correspondence task was developed by taking a large quantity of IBM internal electronic mail, determining the most frequently occurring 5,000 words, and selecting from this database sentences fully covered by the 5,000 word vocabulary for test and training purposes [2]. For our experiments, we recorded a set of 10 male talkers reading training scripts of 2,000 sentences and several test scripts of varying size and recognition difficulty. This paper will report results on one of the test scripts (the "RSX" script) consisting of 50 sentences fully covered by our 5,000 word vocabulary. The training script consisted of sentences fully covered by a 20,000 word vocabulary; the first 500 sentences were the same for each talker, while the other 1,500 were different from talker to talker. The average sentence length for the training sentences was 16.4 words, and for the test sentences, 11.8 words. The talkers were from the local New York area. None were IBM employees. All recordings were made in a quiet office environment using a Crown PZM 6S microphone with a 12 bit analog-to-digital (A/D) converter. The range of the talkers' speaking rates was broad; the fastest talker spoke at 170 words per minute (wpm) and the slowest, at 130 wpm. It took each talker approximately one week to record the necessary speech.
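As a reference point for the perplexity of 93 quoted above, perplexity is conventionally computed as 2 raised to the negative average log2 probability that the language model assigns to the test words. A minimal sketch of that computation; the `lm_prob` callback is a hypothetical stand-in for the paper's trigram model, not code from the paper:

```python
import math

def perplexity(sentences, lm_prob):
    """Perplexity of a test set under a language model.

    `sentences` is a list of word lists; `lm_prob(w, history)` is assumed
    to return P(w | history) and to handle histories shorter than two
    words at sentence starts.
    """
    log_prob, n_words = 0.0, 0
    for words in sentences:
        history = []
        for w in words:
            log_prob += math.log2(lm_prob(w, tuple(history[-2:])))
            history.append(w)
        n_words += len(words)
    return 2.0 ** (-log_prob / n_words)
```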

3. Description of the Base Recognition System

The recognition system is based on features present in our current isolated-word recognition system and in our previously developed continuous speech recognition system. It consists of an acoustic processor, an acoustic channel model, a fast matcher, a language model, and a hypothesis search module. Thus the overall configuration is that described in reference [4]. The acoustic processor extracts a vector of 20 spectral features from the speech signal, and quantizes each feature vector into one of 200 possible prototype classes. The acoustic channel model describes in a probabilistic fashion the way in which words are realized as sequences of prototypes produced by the acoustic processor. The fast matcher produces a short list of words whose uttering could have caused an indicated acoustic processor prototype string. The language model estimates the probability of the next word in the sentence given the previously hypothesized words in the sentence. The hypothesis search module directs the recognition process, maintaining a tree of currently active hypothesized subsentence paths. It evaluates their likelihood and, accordingly, discards some paths and extends others.

There are several modifications that we made to the first two components of our basic recognition system in order to obtain improved continuous speech recognition performance. These will be described in the following sections. This section sketches the basic system.

The spectral feature extraction of our acoustic processor is based on an adaptive auditory ("ear") model described by Cohen [5]. The processor determines the Euclidean distance of each feature vector produced by the ear model to all 200 prototype vectors, and puts out the identifier (label) of the nearest prototype.
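The labeling step just described is ordinary nearest-neighbor vector quantization. A small sketch of that step, assuming 20-dimensional feature vectors and a fixed inventory of 200 prototypes as in the text; this is an illustration, not the system's implementation:

```python
import numpy as np

def label_frames(features, prototypes):
    """Assign each feature vector the identifier (label) of the nearest
    prototype vector, by Euclidean distance.

    features:   (T, 20) array of ear-model feature vectors
    prototypes: (200, 20) array of prototype vectors
    returns:    (T,) array of integer labels in [0, 200)
    """
    # Squared Euclidean distance from every frame to every prototype.
    dists = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```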

In isolated word experiments, the Markov model for a word was determined from 10 utterances of the word from 10 speakers, and did not depend on the context [6]. However, to accommodate the requirements of continuous speech, the model for any particular word depends on the immediate word context. The principles of this dependence are outlined in the next section.

As recognition proceeds, the hypothesis search module endeavors to extend particular hypothesized paths specified by a sequence of words (or rather lexemes; see the next section) starting with the beginning of the current sentence (our search units are words/lexemes). Since in a natural text task every word can be followed by any other one with a varying but non-zero probability, the fit of all the words of the vocabulary to the unaccounted-for portion of the acoustic processor output string should be examined. Since the vocabulary is large, the examination is carried out in two steps. The first step, called the Fast Match, reduces the possibilities to a few candidates (30 on average) whose fit, relative to an acoustic Markov model, is then evaluated by the Detailed Match. One version of the Fast Match is described in reference [7]. The current recognizer organizes its Fast Match around a tree constructed from the phonetic baseforms [4, 6] corresponding to the words of the vocabulary. The branches of the tree are Hidden Markov Models (HMMs) determined by the phone in question. These component HMMs are of a simplified variety (the distributions of all the transitions are identical), allowing fast computation. Thresholding is used to prune this tree by eliminating those paths that do not fit the acoustic label sequence submitted to the Fast Match. The resulting shorter list of acoustically compatible candidate words is further pruned by the language model, which eliminates some of the a priori less likely continuations of the hypothesized path being extended.

The recognizer uses our standard trigram language model [1], which is based on an interpolation of relative frequencies of trigrams, bigrams, and unigrams collected from a 200 million word text data base. The interpolation weights are determined by the method of deleted interpolation [1, Section VIII]. The n-grams used (n = 1, 2, 3) are sequences of n consecutive words in the training corpus that belong to the basic vocabulary.

The hypothesis search is, in principle, that described in Section VI of reference [1], and is based on the stack algorithm of sequential decoding [8]. The acoustic component of the likelihood score is provided by the acoustic model (see the next section), and its linguistic component by the trigram language model. However, path extensions are carried out only for the words specified by the Fast Match component.
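For concreteness, the interpolated trigram probability described above can be sketched as follows, with relative-frequency tables passed in as plain dictionaries; the estimation of the weights by deleted interpolation on held-out data is assumed to have happened elsewhere and is not shown:

```python
def trigram_prob(w, history, unigram, bigram, trigram, lambdas):
    """Interpolated trigram probability in the spirit of the paper's
    language model: a weighted mix of relative frequencies of trigrams,
    bigrams, and unigrams.

    `history` is the pair of the two most recent words; `lambdas` are the
    deleted-interpolation weights (assumed precomputed), summing to 1.
    """
    w1, w2 = history
    l1, l2, l3 = lambdas
    return (l3 * trigram.get((w1, w2, w), 0.0)
            + l2 * bigram.get((w2, w), 0.0)
            + l1 * unigram.get(w, 0.0))
```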

4. A Contextual Allophonic Acoustic Model

The basic principle of our acoustic model is as follows. To each word of the vocabulary there correspond one or more basic pronunciations, called lexemes. There is also a silence lexeme. The pronunciation of a lexeme is specified by a baseform, which is a sequence of symbols from a phonetic alphabet of size 64. For instance, the word "either" corresponds to two lexemes with baseforms ee1 dh er0 and ai1 ix dh er0, respectively. Our recognizer actually decodes sequences of lexemes rather than words.

Each of the 64 phones (phonetic symbols) F can be realized by a variety of allophones F(1), F(2), ..., F(K), and so a baseform B1, B2, ..., Bn is realized by an allophonic sequence B1(i1), B2(i2), ..., Bn(in), where Bk(ik) denotes the ik-th allophone of the k-th phone of the lexeme. The identity of this sequence is determined by the phones of the lexeme and of the hypothesized lexeme string being extended (between the lexeme and the preceding path there is always inserted a word separation phone). The variant ik of phone Bk depends on the class identity of a string of phones centered at Bk. The equivalence classification is determined by use of decision trees [9] and depends on pre-training data, as does the variety of allophones of each phone.

To each allophone F(i) there corresponds a Markov model, and thus the baseform is the concatenation of the Markov models corresponding to the allophones whose string realizes the lexeme. This baseform then determines the acoustic model of the lexeme in the particular context of the neighboring lexeme string. Each transition in any of the Markov models is identified with one of the arcs in an inventory of 200 arcs. Transitions identified with the same arc are restricted to have the same output probability distribution over the 200 acoustic processor labels.
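A schematic rendering of the context-dependent allophone selection may help. Here `classify` is a hypothetical stand-in for the decision-tree equivalence classification of [9], and the two-phone window on each side is our assumption; the paper does not specify the context width:

```python
def allophone_sequence(baseform, left_context, right_context, classify):
    """Choose an allophone for each phone of a lexeme's baseform from
    its phonetic context, in the spirit of Section 4 (a sketch).

    `classify(window)` is assumed to map a phone string centered on the
    phone in question to an allophone index for that phone.
    """
    phones = list(left_context) + list(baseform) + list(right_context)
    offset = len(left_context)
    allophones = []
    for k in range(len(baseform)):
        center = offset + k
        window = tuple(phones[max(0, center - 2):center + 3])  # +/- 2 phones
        allophones.append((baseform[k], classify(window)))
    return allophones
```

The acoustic model of the lexeme in this context would then be the concatenation of the Markov models of the returned allophones.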

5. Supervised Vector Quantization

The 200 prototype vectors used by the acoustic processor are selected in an iterative mode intended to optimize the efficiency of the allophonic acoustic model. The procedure is based on the intuitive notion that the individual prototypes should represent the individual arcs in the arc inventory (there is an equal number, 200, of arcs and prototypes), because the latter are the phonetic means used to describe pronunciation.

We proceed as follows. We obtain original prototypes by "ordinary" vector quantization. Since the training script determines a lexeme string, which in turn determines an allophone string, and each allophone corresponds to a Markov model, the training feature vector string produced inside the acoustic processor (Section 3) corresponds to a particular sequence of Markov models. Using allophonic Markov models (whose statistics are determined by forward/backward training), we can Viterbi-align the feature vectors and the arcs of the allophone models. For each arc in the arc inventory we then assemble a collection of feature vectors aligned with it. The 200 collections then lead to a new set of prototypes. This set is the basis of the next iteration of the process: re-labeling of acoustic processor output; determination of the allophonic varieties of all phones, and of the phone string equivalence classification determining the allophone string realization of lexemes; estimation of acoustic model statistical parameters; alignment of feature vectors and model arcs; and creation of the next generation of prototypes. Iteration continues as long as a perceptible change in the prototypes is observed.
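The iteration can be summarized in code form. In this sketch, `align` is a hypothetical stand-in for forward/backward training followed by Viterbi alignment of feature vectors to model arcs, and random training frames stand in for the "ordinary" vector quantization used to initialize; none of this is the paper's implementation:

```python
import numpy as np

def supervised_vq(features, align, n_protos=200, max_iters=10, tol=1e-4):
    """One view of the iterative procedure of Section 5.

    `align(features, prototypes)` is assumed to return, for each feature
    vector, the index of the arc it Viterbi-aligns to. Each prototype is
    then re-estimated as the mean of the vectors aligned to its arc.
    """
    rng = np.random.default_rng(0)
    prototypes = features[rng.choice(len(features), n_protos, replace=False)]
    for _ in range(max_iters):
        arcs = align(features, prototypes)        # (T,) arc index per frame
        new = prototypes.copy()
        for a in range(n_protos):
            members = features[arcs == a]
            if len(members):
                new[a] = members.mean(axis=0)
        if np.abs(new - prototypes).max() < tol:  # no perceptible change
            return new
        prototypes = new
    return prototypes
```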

6. Some Additional Recognizer Adjustments

Many of the very frequent words (we call them, arbitrarily, function words) are short and are, in continuous speech, carelessly pronounced, so they can benefit from careful treatment [10]. There being only 130 such words, we can afford to model them as individual special phones. This can easily be accommodated in the framework of the contextual allophonic acoustic model of Section 4.

Speakers sometimes pause at appropriate points in a sentence. The hypothesis search module provides for this possibility by allowing the extension of a path by a silence lexeme. The trigram language model skips this lexeme in the path history when predicting the next word.

The hypothesis search determines the match score for a word by dividing the actual probability for the word as computed by the model by the expected value of the match score for the word, given the correct word model. A bias that increases linearly over time is added to force the match score to tend to increase over time on the correct path. The match is terminated when the correct score falls below a preset threshold. In isolated-word speech, the bias term can be set quite high, as the silence that occurs after each word will always allow us to determine when to terminate the match. In continuous speech, a high bias term causes the match to continue over short words, e.g., "do you want us" is recognized as "he was", while a low bias term tends to break long words into short ones. We found that a much smaller bias than used in isolated word speech produced much better performance in continuous speech.

7. Training

It was mentioned in Section 2 that ten (10) talkers read 2,000 sentences. The totality of this data is used to determine (in pre-training) the allophonic variety for the phone set, as well as the equivalence classification determining the desirable allophone from the context of preceding and following phones (see Section 4). This specifies the lexeme-to-allophone correspondence. The statistical parameters of the HMMs are then determined from the speech of the individual speaker to be recognized. The training will result in proper estimation only if based on the correct lexeme (rather than word) script. The speakers are given an ordinary text to read without being instructed where to pause or how to pronounce each word. Their speech must thus first be subjected to a decision process which determines the location of pauses as well as the identity of the speakers' choices in multi-lexemic words.

8. System Performance

For comparison, we will give the results of four (4) experiments dealing with recognition of the natural text covered by a 5,000 word vocabulary: isolated word recognition with context-independent phonetic and fenonic [6] models, and continuous speech recognition with context-independent and allophonic models. The system was trained on all 2,000 sentences from each talker; for isolated word speech, only 100 sentences were available for training. The test script was the "RSX" script (above). Only error rates computed for word deletions and substitutions are reported [10].

Table 1. Test results on the RSX script under various conditions (word error rate, %).

              Isolated Speech            Continuous Speech
Talker     Phonetic     Fenonic       Phonetic     Allophonic
T1            5.6         2.4           25.0           8.8
T2            4.7         3.2           33.1          13.0
T3            5.4         3.7           39.7          18.2
T4            6.3         4.6           28.7          13.0
T5            2.0         1.5           11.7           6.1
T6            7.1         3.0           42.1          13.2
T7            3.5         1.5           24.3          11.5
T8            4.1         1.9           13.4           6.3
T9            5.6         2.5           22.0           8.5
T10          16.2         8.5           28.1          12.2
AVG           6.1         3.3           26.8          11.0
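Since Table 1 counts only deletions and substitutions, the scoring can be illustrated with a standard dynamic-programming word alignment; reporting (subs + dels) / reference length, with insertions aligned but not counted, is our reading of the text rather than a documented detail:

```python
def word_error_counts(ref, hyp):
    """Align a reference and a hypothesis word string by dynamic
    programming and count substitutions, deletions, and insertions.
    A Table 1 style rate would be (subs + dels) / len(ref).
    """
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost aligning ref[:i] to hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # sub/match
                          d[i-1][j] + 1,                          # deletion
                          d[i][j-1] + 1)                          # insertion
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:  # backtrace to separate the error types
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            subs += ref[i-1] != hyp[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins
```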

Approximately 1/3 of the errors for the allophonic models are caused by the fast match not returning the correct lexeme to be processed by the detailed match. Approximately 2.5 CPU hours on a large IBM mainframe are required per talker for continuous speech recognition. This is more than 25 times the CPU time needed for the isolated word task; this figure does not include signal processing, training, and supervision time.

Note that the allophonic models produce a larger performance gain relative to context-independent phone models in continuous speech than the fenonic models do in isolated word speech. Some of the additional performance gain may be attributable to the use of supervised vector quantization; this was not pursued in the isolated word experiment because of a lack of training data. Future work will include exploring new fast match strategies, better labelling methods, and comparisons to other techniques for performing context-dependent modeling [10].

Continuous Speech Recognition Group
Computer Science Department, IBM Research Division
Thomas J. Watson Research Center
P. O. Box 218, Yorktown Heights, NY 10598, U.S.A.