Evaluation of Adaptive Mixtures of Competing Experts


Steven J. Nowlan and Geoffrey E. Hinton
Computer Science Dept., University of Toronto
Toronto, ONT M5S 1A4

Abstract

We compare the performance of the modular architecture, composed of competing expert networks, suggested by Jacobs, Jordan, Nowlan and Hinton (1991) to the performance of a single back-propagation network on a complex, but low-dimensional, vowel recognition task. Simulations reveal that this system is capable of uncovering interesting decompositions in a complex task. The type of decomposition is strongly influenced by the nature of the input to the gating network that decides which expert to use for each case. The modular architecture also exhibits consistently better generalization on many variations of the task.

1 Introduction

If back-propagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects which lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system (see Fig. 1) composed of several different "expert" networks plus a gating network that decides which of the experts should be used for each training case. Systems of this type have been suggested by a number of authors (Hampshire and Waibel, 1989; Jacobs, Jordan and Barto, 1990; Jacobs et al., 1991) (see also the paper by Jacobs and Jordan in this volume (1991)). Jacobs, Jordan, Nowlan and Hinton (1991) show that this system can be trained by performing gradient descent in the following error function:

E^c = -log Σ_i p_i^c exp( -||d^c - o_i^c||^2 / 2σ^2 )        (1)

where E^c is the error on training case c, p_i^c is the output of the gating network for expert i, d^c is the desired output vector and o_i^c is the output vector of expert i, and σ is a constant. The error defined by Equation 1 is simply the negative log probability of generating the desired output vector under a mixture of gaussians model of the probability distribution of possible output vectors given the current input. The output vector of each expert specifies the mean of a multidimensional gaussian distribution. These means are a function of the inputs to the experts. The outputs of the gating network specify the mixing proportions of the experts, so these too are determined by the current input.

During learning, the gradient descent in E has two effects. It raises the mixing proportion of experts that do better than average in predicting the desired output vector for a particular case, and it also makes each expert better at predicting the desired output for those cases for which it has a high mixing proportion. The result of these two effects is that, after learning, the gating network nearly always assigns a mixing proportion near 1 to one expert on each case. So towards the end of the learning, each expert can focus on modelling the cases it is good at without interference from the cases for which it has a negligible mixing proportion.

Figure 1: A system of expert and gating networks. Each expert is a feedforward network and all experts receive the same input and have the same number of outputs. The gating network is also feedforward and may receive a different input than the expert networks. It has normalized outputs p_j = exp(x_j) / Σ_i exp(x_i), where x_j is the total weighted input received by output unit j of the gating network. p_j can be viewed as the probability of selecting expert j for a particular case.
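To make the training criterion concrete, here is a minimal NumPy sketch of the gating softmax and the mixture-of-Gaussians error of Equation 1. The function and variable names (mixture_error, gate_logits, expert_outputs, target) and the random example values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mixture_error(gate_logits, expert_outputs, target, sigma=0.25):
    """E^c of Equation 1: negative log probability of the desired output
    under a mixture of Gaussians whose means are the expert outputs.
    gate_logits: (n_experts,) total weighted inputs x_j to the gating outputs
    expert_outputs: (n_experts, n_out) expert output vectors o_i^c
    target: (n_out,) desired output vector d^c
    sigma: fixed standard deviation (0.25 in the paper's mixture simulations)"""
    # Gating softmax: p_j = exp(x_j) / sum_i exp(x_i)
    x = gate_logits - gate_logits.max()              # subtract max for numerical stability
    p = np.exp(x) / np.exp(x).sum()
    # Squared distance of each expert's output from the desired output
    sq_dist = ((expert_outputs - target) ** 2).sum(axis=1)
    # E^c = -log sum_i p_i^c exp(-||d^c - o_i^c||^2 / (2 sigma^2))
    return -np.log(np.sum(p * np.exp(-sq_dist / (2.0 * sigma ** 2))))

# Example: 3 experts, 10 output units (one per vowel class)
rng = np.random.default_rng(0)
gate_logits = rng.normal(size=3)
expert_outputs = rng.normal(size=(3, 10))
target = np.eye(10)[2]                               # one-hot desired vowel
print(mixture_error(gate_logits, expert_outputs, target))
```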

In this paper, we compare mixtures of experts to single back-propagation networks on a vowel recognition task. We demonstrate that the mixtures are better at fitting the training data and better at generalizing than comparable single back-propagation networks.

2 Data and Experimental Procedures

The data used in these experiments consisted of the frequencies of the first and second formants for 10 vowels from 75 speakers (32 Males, 28 Females, and 15 Children) (Peterson and Barney, 1952).¹ The vowels, which were uttered in an hVd context, were {heed, hid, head, had, hud, hod, hawed, hood, who'd, heard}. The word list was repeated twice by each speaker, with the words in a different random order for each presentation. The resulting spectrograms were hand segmented and the frequencies of the formants extracted from the middle portion of the vowel.

The simulations were performed using a conjugate gradient technique, with one weight change after each pass through the training set. For the back-propagation experiments, each simulation was initialized randomly with weight values in the range [-0.5, 0.5]. For the mixture systems, the last layer of weights in the gating network was always initialized to 0 so that all experts initially had equal a priori selection probabilities, p_i, while all other weights in the gating and expert networks were initialized randomly with values in the range [-0.5, 0.5] to break symmetry. The value of σ used was 0.25 for all of the mixture simulations. In all cases, the input formant values were linearly scaled by dividing them by 1000, so the first formant was in the range (0, 1.5) and the second was in the range (0, 4).

Two sets of experiments were performed: one in which the performance of different systems on the training data was compared and a second in which the ability of different systems to generalize was compared. Five different types of input were used in each set of experiments (a sketch of this input construction follows below):

1. Frequencies of first and second formants only (Form.).
2. Form. plus a localist encoding of the speaker identity (Form. + Speaker ID).
3. Form. plus a localist encoding of whether the speaker was a male, female, or child (Form. + MFC).
4. Form. plus the minimum and maximum frequency for the first and second formant (as real values) over all samples from the speaker (Form. + Range).
5. Form. + MFC + Range.

For the simulations in which a single back-propagation network was used the network received the entire set of input values. However, for the mixture systems the expert networks saw only the formant frequencies, while the gating network saw everything but the formant frequencies (except of course when the input consisted only of the formant frequencies).

¹ Obtained, with thanks, from Ray Watrous, who originally obtained the data from Ann Syrdal at AT&T Bell Labs.
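A minimal sketch, assuming a simple in-memory representation of the Peterson-Barney tokens, of how the formant scaling and the five input encodings listed above might be constructed. The function build_inputs, its arguments, and the decision to scale the range features like the formants are illustrative assumptions rather than details taken from the original simulations.

```python
import numpy as np

def build_inputs(f1_hz, f2_hz, speaker_idx, mfc_idx, speaker_range_hz=None,
                 n_speakers=75, input_type="Form. + MFC"):
    """Build one input vector for one vowel token.
    f1_hz, f2_hz: first and second formant frequencies in Hz
    speaker_idx: 0..n_speakers-1; mfc_idx: 0 (male), 1 (female), 2 (child)
    speaker_range_hz: (min F1, max F1, min F2, max F2) over this speaker's samples"""
    parts = [np.array([f1_hz, f2_hz]) / 1000.0]      # formants scaled by 1/1000
    if "Speaker ID" in input_type:
        one_hot = np.zeros(n_speakers)               # localist speaker encoding
        one_hot[speaker_idx] = 1.0
        parts.append(one_hot)
    if "MFC" in input_type:
        mfc = np.zeros(3)                            # localist male/female/child encoding
        mfc[mfc_idx] = 1.0
        parts.append(mfc)
    if "Range" in input_type:
        # Assumed to be scaled in the same way as the formants themselves
        parts.append(np.array(speaker_range_hz) / 1000.0)
    return np.concatenate(parts)

# Example: the same token encoded two ways
print(build_inputs(310, 2790, speaker_idx=40, mfc_idx=1, input_type="Form."))
print(build_inputs(310, 2790, speaker_idx=40, mfc_idx=1, input_type="Form. + MFC"))
```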

Type of Input          # Experts   # Hid per Expert   # Hid Gating
Form.                  20          3-5                10
Form. + Speaker ID     10          25                 0
Form. + MFC            10          25                 0
Form. + MFC + Range    10          25                 5
Form. + Range          10          25                 5

Table 1: Summary of mixture architecture used with each type of input.

Type of Input          Mixture Error %   BP Error %   Sig.(p)
Formants only          13.9 ± 0.9        21.8 ± 0.6   >> 0.9999
Form. + Speaker ID      4.6 ± 0.7         6.2 ± 0.6   > 0.97
Form. + MFC            13.0 ± 0.4        15.4 ± 0.3   >> 0.9999
Form. + MFC + Range     5.6 ± 0.6        13.1 ± 1.0   ~ 0.9999
Form. + Range          11.6 ± 0.9        13.5 ± 0.4   > 0.998

Table 2: Performance comparison of associative mixture systems and single back-propagation networks on the vowel classification task. Results reported are based on an average over 25 simulations for each back-propagation network or mixture system.

The BP networks used in the single network simulations contained one layer of hidden units.² In the mixture systems, the expert networks also contained one layer of hidden units although the number of hidden units in each expert varied. The gating network in some cases contained hidden units, while in other cases it did not (see Table 1). Further details of the simulations may be found in (Nowlan, 1991).

3 Results of Performance Studies

In the set of performance experiments, each system was trained with the entire set of 1494 tokens until the magnitude of the gradient vector was < 10^-8. The error rate (as a percent of total cases) was evaluated on the training data (generalization studies are described in the next section). The very high degree of class overlap in this task makes it extremely difficult to find good solutions with a gradient descent procedure and this is reflected by the far from optimal average performance of all systems on the training data (see Table 2). For purposes of comparison, the best performance ever obtained on this vowel data using speaker-dependent classification methods is about 2.5% (Gerstman, 1968; Watrous, 1990).

Table 2 reveals that in every case the mixture system performs significantly better³ than a single network given the same input.

² The number of hidden units was selected by performing a number of initial simulations with different numbers of hidden units for each network and choosing the smallest number which gave near optimal performance. These numbers were 50, 150, 60, 150, and 80 respectively for the five types of input listed above.
³ Based on a t-test with 48 degrees of freedom.
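Footnote 3 indicates the significance values in Table 2 come from a t-test with 48 degrees of freedom, i.e. a comparison of two independent samples of 25 simulations each. A minimal sketch of such a comparison is shown below; the per-simulation error arrays are synthetic stand-ins generated only to make the example run, since the individual run results are not published.

```python
import numpy as np
from scipy import stats

# Hypothetical per-simulation error rates (%) for 25 runs of each system;
# these synthetic values merely stand in for the unpublished individual results.
rng = np.random.default_rng(1)
mixture_errors = rng.normal(13.9, 4.5, size=25)
bp_errors      = rng.normal(21.8, 3.0, size=25)

# Two-sample t-test: 25 + 25 - 2 = 48 degrees of freedom
t, p_two_sided = stats.ttest_ind(mixture_errors, bp_errors)
# One-sided confidence that the mixture's error rate is lower than the BP network's
confidence = 1.0 - (p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2)
print(f"t = {t:.2f}, confidence mixture < BP = {confidence:.4f}")
```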

The most striking, and interesting, result in Table 2 is contained in the fourth row of the table. While the associative mixture architecture is able to combine the two separate cues of MFC categories and speaker formant range quite effectively, the single back-propagation network fails to do so. The combination of these two different cues in the associative mixture system was obtained by a hierarchical training procedure in which three different experts were first created using the MFC cue alone, and copies of these networks were further specialized when the formant range cue was added to the input received by the gating network (see (Nowlan, 1990; Nowlan, 1991) for details). Since the single back-propagation network is much less modular than the associative mixture system, it is difficult to implement such a hierarchical training procedure in the single network case. (A variety of techniques were explored and details may again be found in (Nowlan, 1991).)

Another interesting aspect of the mixture systems, not revealed in Table 2, is the manner in which the training cases were divided among the different expert networks. Once the network was trained, the training cases were clustered by assigning each case to the expert that was selected most strongly by the gating network. The mixture which used only the formant frequencies as input to both the gating and expert networks tended to cluster training cases according to the position of the tongue hump when the vowel is uttered. In all simulations, the four front vowels were always clustered together and handled by a single expert. The low back and high back vowels also tended to be grouped together, but each of these groups was divided among several experts and not always in exactly the same way. The mixture which received speaker identity as well as formant frequencies as input tended to group speakers roughly according to the categories male, female, and child. A typical grouping of speakers by the mixture is shown in Table 3.

Expert #   % Male   % Female   % Child   % Total
0           0.0       0.0        6.7       1.3
4           3.1       3.6        0.0       2.7
5          84.4      17.8        0.0      42.7
7           9.4       7.1        6.7       8.0
8           3.1      42.9        0.0      17.3
9           0.0      28.6       86.7      28.0

Table 3: Speaker decomposition in terms of Male, Female and Child categories for a mixture with speaker identity as input to the gating network.
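A minimal sketch of the clustering step described above, assuming a single linear gating layer purely for illustration (the function names and weights are hypothetical, not the trained networks from the paper): after training, each case is assigned to the expert with the largest gating probability.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cluster_by_winning_expert(gate_inputs, W_gate, b_gate):
    """Assign each training case to the expert the gating network selects
    most strongly. gate_inputs: (n_cases, n_gate_in); W_gate, b_gate define
    a hypothetical linear gating layer with one output logit per expert."""
    p = softmax(gate_inputs @ W_gate + b_gate)   # (n_cases, n_experts) mixing proportions
    return p.argmax(axis=1)                      # index of the winning expert per case

# Example: 6 cases, 4-dimensional gating input, 3 experts
rng = np.random.default_rng(2)
gate_inputs = rng.normal(size=(6, 4))
W_gate, b_gate = rng.normal(size=(4, 3)), np.zeros(3)
print(cluster_by_winning_expert(gate_inputs, W_gate, b_gate))
```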

4 Results of Generalization Studies

In the set of generalization experiments, for all but the input which contained the speaker identity, each system was trained on data from 65 speakers until the magnitude of the gradient vector was < 10^-4. The performance was then tested on the data from the 10 speakers not in the training set. Twenty different test sets were created by leaving out different speakers for each, and results are an average over one simulation with each of the test sets. Each test set consisted of 4 male, 3 female and 3 child speakers.

The generalization tests for the mixture in which speaker identity was part of the input used a different testing strategy. In this case, the training set consisted of 70 speakers and the testing set contained the remaining 5 speakers (2 male, 2 female, 1 child). Again, results are averaged over 20 different testing sets. After the mixture was trained, an expert was selected for each test speaker using one utterance of each of the first 3 vowels, and the performance of the selected expert was tested on the remaining 17 utterances of that speaker. No generalization results are reported for the single back-propagation network which received the speaker identity as well as the first and second formant values, since there is no straightforward way to perform rapid speaker adaptation with this architecture. (See Watrous (1990) for some approaches to speaker adaptation in single networks.)

Type of Input          Mixture Error %   BP Error %   Sig.(p)
Formants only          15.1 ± 0.9        23.3 ± 1.2   ~ 0.9999
Form. + Speaker ID      6.4 ± 1.3        -            ~ 0.9999
Form. + MFC            13.5 ± 0.6        18.4 ± 1.1   >> 0.9999
Form. + MFC + Range     6.2 ± 0.9        16.1 ± 1.0   ~ 0.9999
Form. + Range          12.8 ± 0.9        16.2 ± 0.8   > 0.9999

Table 4: Generalization comparison of associative mixture systems and single back-propagation networks on the vowel classification task. Results reported are based on an average over 20 simulations for each back-propagation network or mixture system.

The percentage of misclassifications on the test set for the mixture systems and the corresponding single back-propagation networks is summarized in Table 4, and in all cases the mixture system generalizes significantly better⁴ than a single network. The relatively poor generalization performance of the single back-propagation networks is not due to overfitting on the training data, because the single back-propagation networks perform worse on the training data than the mixture systems do on the test data. Also, the associative mixture systems initially contained even more parameters than the corresponding back-propagation networks. (The associative mixture which received formant range data as gating input initially contained almost 3600 parameters, while the corresponding single back-propagation network contained only slightly more than 1200 parameters.) Part of the explanation for the good generalization performance of the mixtures is the pruning of excess parameters as the system is trained. The number of effective parameters in the final mixture is very often less than half the number in the original system, because a large number of experts have negligible mixing proportions in the final mixture.

⁴ Based on a t-test with 38 degrees of freedom.
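The rapid speaker-adaptation procedure used for the Speaker ID mixture above (select an expert from a few labelled utterances of a new speaker, then classify that speaker's remaining utterances with the selected expert alone) might be sketched as follows. The paper does not spell out the selection criterion, so this sketch uses the smallest total squared error on the adaptation utterances as an assumed criterion, and the toy linear experts are purely illustrative.

```python
import numpy as np

def select_expert(experts, adapt_inputs, adapt_targets):
    """Pick the expert whose outputs best match a few labelled utterances
    from a new speaker (assumed criterion: smallest total squared error)."""
    errors = [sum(((f(x) - t) ** 2).sum() for x, t in zip(adapt_inputs, adapt_targets))
              for f in experts]
    return int(np.argmin(errors))

def classify_with_expert(expert, inputs):
    """Classify the remaining utterances with the selected expert only;
    the predicted vowel is the output unit with the largest activation."""
    return [int(np.argmax(expert(x))) for x in inputs]

# Toy example: 5 linear "experts" mapping 2 scaled formants to 10 vowel scores
rng = np.random.default_rng(3)
experts = [lambda x, W=rng.normal(size=(2, 10)): x @ W for _ in range(5)]
adapt_inputs  = [rng.normal(size=2) for _ in range(3)]    # one utterance of each of 3 vowels
adapt_targets = [np.eye(10)[i] for i in (0, 1, 2)]         # their one-hot vowel labels
best = select_expert(experts, adapt_inputs, adapt_targets)
test_inputs = [rng.normal(size=2) for _ in range(17)]      # the speaker's other utterances
print(best, classify_with_expert(experts[best], test_inputs))
```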

5 Discussion

The mixture systems outperform single back-propagation networks which receive the same input, and show much better generalization properties when forced to deal with relatively small training sets. In addition, the mixtures can easily be refined hierarchically by learning a few experts and then making several copies of each and adding additional contextual input to the gating network.

The best performance for either single networks or mixture systems is obtained by including the speaker identity as part of the input. When given such input, the mixture systems are capable of discovering speaker categories which give levels of classification performance close to those obtained by speaker-dependent classification schemes. Good performance can also be obtained on novel speakers by determining which existing speaker category the new speaker is most similar to (using a small number of labelled utterances). If, instead, the speaker is represented in terms of features such as male, female, child, and formant range, the mixtures also exhibit good generalization to novel speakers described in terms of these features.

Acknowledgements

This research was supported by grants from the Natural Sciences and Engineering Research Council, the Ontario Information Technology Research Center, and Apple Computer Inc. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research.

References

Gerstman, L. J. (1968). Classification of self-normalized vowels. IEEE Trans. on Audio and Electroacoustics, AU-16(1):78-80.

Hampshire, J. and Waibel, A. (1989). The Meta-Pi network: Building distributed knowledge representations for robust pattern recognition. Technical Report CMU-CS-89-166, Carnegie-Mellon, Pittsburgh, PA.

Jacobs, R. A. and Jordan, M. I. (1991). A competitive modular connectionist architecture. In Touretzky, D. S., editor, Neural Information Processing Systems 3. Morgan Kaufmann, San Mateo, CA.

Jacobs, R. A., Jordan, M. I., and Barto, A. G. (1990). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science. In press.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1).

Nowlan, S. J. (1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5, Department of Computer Science, University of Toronto.

Nowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Peterson, G. E. and Barney, H. L. (1952). Control methods used in a study of vowels. The Journal of the Acoustical Society of America, 24:175-184.

Watrous, R. L. (1990). Speaker normalization and adaptation using second order connectionist networks. Technical Report CRG-TR-90-6, University of Toronto.