Continuous Speech Recognition by Linked Predictive Neural Networks


Continuous Speech Recognition by Linked Predictive Neural Networks

Joe Tebelskis, Alex Waibel, Bojan Petek, and Otto Schmidbauer
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

We present a large vocabulary, continuous speech recognition system based on Linked Predictive Neural Networks (LPNN's). The system uses neural networks as predictors of speech frames, yielding distortion measures which are used by the One Stage DTW algorithm to perform continuous speech recognition. The system, already deployed in a Speech to Speech Translation system, currently achieves 95%, 58%, and 39% word accuracy on tasks with perplexity 5, 111, and 402 respectively, outperforming several simple HMMs that we tested. We also found that the accuracy and speed of the LPNN can be slightly improved by the judicious use of hidden control inputs. We conclude by discussing the strengths and weaknesses of the predictive approach.

1 INTRODUCTION

Neural networks are proving to be useful for difficult tasks such as speech recognition, because they can easily be trained to compute smooth, nonlinear, nonparametric functions from any input space to output space. In speech recognition, the function most often computed by networks is classification, in which spectral frames are mapped into a finite set of classes, such as phonemes. In theory, classification networks approximate the optimal Bayesian discriminant function [1], and in practice they have yielded very high accuracy [2, 3, 4]. However, integrating a phoneme classifier into a speech recognition system is nontrivial, since classification decisions tend to be binary, and binary phoneme-level errors tend to confound word-level hypotheses. To circumvent this problem, neural network training must be carefully integrated into word-level training [1, 5].

An alternative function which can be computed by networks is prediction, where spectral frames are mapped into predicted spectral frames. This provides a simple way to get non-binary distortion measures, with straightforward integration into a speech recognition system. Predictive networks have been used successfully for small vocabulary [6, 7] and large vocabulary [8, 9] speech recognition systems. In this paper we describe our prediction-based LPNN system [9], which performs large vocabulary continuous speech recognition, and which has already been deployed within a Speech to Speech Translation system [10]. We present our experimental results, and discuss the strengths and weaknesses of the predictive approach.

2 LINKED PREDICTIVE NEURAL NETWORKS

The LPNN system is based on canonical phoneme models, which can be logically concatenated in any order (using a "linkage pattern") to create templates for different words; this makes the LPNN suitable for large vocabulary recognition. Each canonical phoneme is modeled by a short sequence of neural networks. The number of nets in the sequence, N >= 1, corresponds to the granularity of the phoneme model. These phone modeling networks are nonlinear, multilayered, feedforward, and "predictive" in the sense that, given a short section of speech, the networks are required to extrapolate the raw speech signal, rather than to classify it. Thus, each predictive network produces a time-varying model of the speech signal which will be accurate in regions corresponding to the phoneme for which that network has been trained, but inaccurate in other regions (which are better modeled by other networks). Phonemes are thus "recognized" indirectly, by virtue of the relative accuracies of the different predictive networks in various sections of speech. Note, however, that phonemes are not classified at the frame level.
Instead, continuous scores (prediction errors) are accumulated for various word candidates, and a decision is made only at the word level, where it is finally appropriate.

2.1 TRAINING AND TESTING ALGORITHMS

The purpose of the training procedure is both (a) to train the networks to become better predictors, and (b) to cause the networks to specialize on different phonemes. Given a known training utterance, the training procedure consists of three steps:

1. Forward Pass: All the networks make their predictions across the speech sample, and we compute the Euclidean distance matrix of prediction errors between predicted and actual speech frames. (See Figure 1.)

2. Alignment Step: We compute the optimal time-alignment path between the input speech and corresponding predictor nets, using Dynamic Time Warping.

3. Backward Pass: Prediction error is backpropagated into the networks according to the segmentation given by the alignment path. (See Figure 2.)

Hence backpropagation causes the nets to become better predictors, and the alignment path induces specialization of the networks for different phonemes. Testing is performed using the One Stage algorithm [11], which is a classical extension of the Dynamic Time Warping algorithm for continuous speech.
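The three training steps can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the predictive networks are reduced to toy linear predictors of the next frame, and all shapes, the learning rate, and the single-word linkage pattern are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, dim, n_nets = 20, 16, 6      # frames, coefficients per frame, predictor nets

speech = rng.standard_normal((n_frames, dim))       # actual spectral frames
W = rng.standard_normal((n_nets, dim, dim)) * 0.1   # one toy linear predictor per net

# 1. Forward pass: every net predicts frame t+1 from frame t; squared Euclidean
#    prediction errors fill the (frame x net) distance matrix.
pred = np.einsum('kij,tj->tki', W, speech[:-1])     # (n_frames-1, n_nets, dim)
err = pred - speech[1:, None, :]
D = np.sum(err ** 2, axis=-1)                       # distance matrix

# 2. Alignment step: DTW through D, constrained to visit the nets in order
#    (a single word's linkage pattern), gives the optimal alignment path.
T, K = D.shape
acc = np.full((T, K), np.inf)
acc[0, 0] = D[0, 0]
for t in range(1, T):
    for k in range(K):
        best = acc[t-1, k] if k == 0 else min(acc[t-1, k], acc[t-1, k-1])
        acc[t, k] = D[t, k] + best
path = [(T-1, K-1)]                                 # backtrace
for t in range(T-1, 0, -1):
    k = path[-1][1]
    k_prev = k if (k == 0 or acc[t-1, k] <= acc[t-1, k-1]) else k - 1
    path.append((t-1, k_prev))
path.reverse()

# 3. Backward pass: only the net aligned to each frame gets a gradient step
#    on that frame's prediction error (delta rule for the linear predictor).
lr = 0.01
for t, k in path:
    W[k] -= lr * 2 * np.outer(err[t, k], speech[t])
```

Because updates flow only along the alignment path, each net is trained only on the frames it was aligned to, which is what drives the specialization described above.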

Figure 1: The forward pass during training. Canonical phonemes are modeled by sequences of N predictive networks, shown as triangles (here N=3). Words are represented by "linkage patterns" over these canonical phoneme models (shown in the area above the triangles), according to the phonetic spelling of the words. Here we are training on the word "ABA". In the forward pass, prediction errors (shown as black circles) are computed for all predictors, for each frame of the input speech. As these prediction errors are routed through the linkage pattern, they fill a distance matrix (upper right).

Figure 2: The backward pass during training. After the DTW alignment path has been computed, error is backpropagated into the various predictors responsible for each point along the alignment path. The backpropagated error signal at each such point is the vector difference between the predicted and actual frame. This teaches the networks to become better predictors, and also causes the networks to specialize on different phonemes.

3 RECOGNITION EXPERIMENTS

We have evaluated the LPNN system on a database of continuous speech recorded at CMU. The database consists of 204 English sentences using a vocabulary of 402 words, comprising 12 dialogs in the domain of conference registration. Training and testing versions of this database were recorded in a quiet office by multiple speakers for speaker-dependent experiments. Recordings were digitized at a sampling rate of 16 kHz. A Hamming window and an FFT were computed, to produce 16 mel-scale spectral coefficients every 10 msec. In our experiments we used 40 context-independent phoneme models (including one for silence), each of which had a 6-state phoneme topology similar to the one used in the SPICOS system [12].

Figure 3: Actual and predicted spectrograms.

Figure 3 shows the result of testing the LPNN system on a typical sentence.
The top portion is the actual spectrogram for this utterance; the bottom portion shows the frame-by-frame predictions made by the networks specified by each point along the optimal alignment path. The similarity of these two spectrograms indicates that the hypothesis forms a good acoustic model of the unknown utterance (in fact the hypothesis was correct in this case).

In our speaker-dependent experiments using two male speakers, our system averaged 95%, 58%, and 39% word accuracy on tasks with perplexity 5, 111, and 402 respectively. In order to confirm that the predictive networks were making a positive contribution to the overall system, we performed a set of comparisons between the LPNN and several pure HMM systems. When we replaced each predictive network by a univariate Gaussian whose mean and variance were determined analytically from the labeled training data, the resulting HMM achieved 44% word accuracy, compared to 60% achieved by the LPNN under the same conditions (single speaker, perplexity 111). When we also provided the HMM with delta coefficients (which were not directly available to the LPNN), it achieved 55%. Thus the LPNN outperformed each of these simple HMMs.
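The front end used in these experiments (16 kHz speech, a Hamming window and FFT every 10 msec, 16 mel-scale coefficients per frame) can be sketched as below. The 25 ms window length, 512-point FFT, and the particular mel-scale formula are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def mel_filterbank(n_filters=16, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale (assumed formula)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def melscale_frames(signal, sr=16000, frame_shift=0.010, frame_len=0.025):
    """One 16-dim log mel-scale vector every 10 ms (frame_len is assumed)."""
    hop, win = int(sr * frame_shift), int(sr * frame_len)
    window = np.hamming(win)
    fb = mel_filterbank(n_fft=512, sr=sr)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + win] * window, n=512))
        frames.append(np.log(fb @ (spec ** 2) + 1e-10))
    return np.array(frames)
```

One second of 16 kHz audio then yields a (frames x 16) matrix, which is the input space the predictive networks operate on.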

4 HIDDEN CONTROL EXPERIMENTS

In another series of experiments, we varied the LPNN architecture by introducing hidden control inputs, as proposed by Levin [7]. The idea, illustrated in Figure 4, is that a sequence of independent networks is replaced by a single network which is modulated by an equivalent number of "hidden control" input bits that distinguish the state.

Figure 4: A sequence of networks corresponds to a single Hidden Control network.

A theoretical advantage of hidden control architectures is that they reduce the number of free parameters in the system. As the number of networks is reduced, each one is exposed to more training data, and - up to a certain point - generalization may improve. The system can also run faster, since partial results of redundant forward pass computations can be saved. (Notice, however, that the total number of forward passes is unchanged.) Finally, the savings in memory can be significant.

In our experiments, we found that by replacing 2-state phoneme models by equivalent Hidden Control networks, recognition accuracy improved slightly and the system ran much faster. On the other hand, when we replaced all of the phonemic networks in the entire system by a single Hidden Control network (whose hidden control inputs represented the phoneme as well as its state), recognition accuracy degraded significantly. Hence, hidden control may be useful, but only if it is used judiciously.

5 CURRENT LIMITATIONS OF PREDICTIVE NETS

While the LPNN system is good at modeling the acoustics of speech, it presently tends to suffer from poor discrimination. In other words, for a given segment of speech, all of the phoneme models tend to make similarly good predictions, rendering all phoneme models fairly confusable.
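The hidden control substitution of Section 4 amounts to appending one-hot state bits to the network input. A minimal sketch, with an assumed toy two-layer tanh predictor (the paper's actual architecture is not detailed here):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_states, hidden = 16, 2, 12   # frame size, states per phoneme, hidden units

# One shared net (hidden*(dim+n_states) + dim*hidden weights) stands in for
# n_states separate nets (n_states * (hidden*dim + dim*hidden) weights).
W1 = rng.standard_normal((hidden, dim + n_states)) * 0.1
W2 = rng.standard_normal((dim, hidden)) * 0.1

def predict(frame, state):
    ctl = np.zeros(n_states)
    ctl[state] = 1.0                 # hidden control bits select the state
    h = np.tanh(W1 @ np.concatenate([frame, ctl]))
    return W2 @ h                    # predicted next frame

# The single modulated network plays the role of n_states predictors:
frame = rng.standard_normal(dim)
preds = [predict(frame, s) for s in range(n_states)]
```

Flipping the control bits changes the prediction, so the one network still yields a distinct distortion measure per state, as the figure indicates.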
For example, Figure 5 shows an actual spectrogram and the frame-by-frame predictions made by the /eh/ model and the /z/ model. Disappointingly, both models are fairly accurate predictors for the entire utterance. This problem arises because each predictor receives training in only a small region of input acoustic space (i.e., those frames corresponding to that phoneme). Consequently, when a predictor is shown any other input frames, it will compute an undefined output, which may overlap with the outputs of other predictors.

Figure 5: Actual spectrogram, and corresponding predictions by the /eh/ and /z/ phoneme models.

In other words, the predictors are currently only trained on positive instances, because it is not obvious what predictive output target is meaningful for negative instances; and this leads to problematic "undefined regions" for the predictors. Clearly some type of discriminatory training technique should be introduced, to yield better performance in prediction-based recognizers.
6 CONCLUSION

We have studied the performance of Linked Predictive Neural Networks for large vocabulary, continuous speech recognition. Using a 6-state phoneme topology, without duration modeling or other optimizations, the LPNN achieved an average of 95%, 58%, and 39% accuracy on tasks with perplexity 5, 111, and 402, respectively. This was better than the performance of several simple HMMs that we tested. Further experiments revealed that the accuracy and speed of the LPNN can be slightly improved by the judicious use of hidden control inputs.

The main advantages of predictive networks are that they produce non-binary distortion measures in a simple and elegant way, and that by virtue of their nonlinearity they can model the dynamic properties of speech (e.g., curvature) better than linear predictive models [13]. Their main current weakness is that they have poor discrimination, since their strictly positive training causes them all to make confusably accurate predictions in any context. Future research should concentrate on improving the discriminatory power of the LPNN, by such techniques as corrective training, explicit context-dependent phoneme modeling, and function word modeling.

Acknowledgements

The authors gratefully acknowledge the support of DARPA, the National Science Foundation, ATR Interpreting Telephony Research Laboratories, and NEC Corporation. B. Petek also acknowledges support from the University of Ljubljana and the Research Council of Slovenia. O. Schmidbauer acknowledges support from his employer, Siemens AG, Germany.

References

[1] H. Bourlard and C. J. Wellekens. Links Between Markov Models and Multilayer Perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:12, December 1990.
[2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, March 1989.
[3] M. Miyatake, H. Sawai, and K. Shikano. Integrated Training for Spotting Japanese Phonemes Using Large Phonemic Time-Delay Neural Networks. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[4] E. McDermott and S. Katagiri. Shift-Invariant, Multi-Category Phoneme Recognition using Kohonen's LVQ2. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1989.
[5] P. Haffner, M. Franzini, and A. Waibel. Integrating Time Alignment and Connectionist Networks for High Performance Continuous Speech Recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.
[6] K. Iso and T. Watanabe. Speaker-Independent Word Recognition Using a Neural Prediction Model. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[7] E. Levin. Speech Recognition Using Hidden Control Neural Network Architecture. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[8] J. Tebelskis and A. Waibel. Large Vocabulary Recognition Using Linked Predictive Neural Networks. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[9] J. Tebelskis, A. Waibel, B. Petek, and O. Schmidbauer. Continuous Speech Recognition Using Linked Predictive Neural Networks. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.
[10] A. Waibel, A. Jain, A. McNair, H. Saito, A. Hauptmann, and J. Tebelskis. A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.
[11] H. Ney. The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32:2, April 1984.
[12] H. Ney and A. Noll. Phoneme Modeling Using Continuous Mixture Densities. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1988.
[13] N. Tishby. A Dynamic Systems Approach to Speech Processing. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.