Deep Semantic Encodings for Language Modeling

Ali Orkan Bayer and Giuseppe Riccardi
Signals and Interactive Systems Lab, University of Trento, Italy
{bayer, riccardi}@disi.unitn.it

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 610916 (SENSEI).

Abstract

Word error rate (WER) is not an appropriate metric for spoken language systems (SLS) because lower WER does not necessarily yield better understanding performance. Therefore, language models (LMs) that are used in SLS should be trained to jointly optimize transcription and understanding performance. Semantic LMs (SELMs) are based on the theory of frame semantics and incorporate features of frames and meaning-bearing words (target words) as semantic context when training LMs. The performance of SELMs is affected by the errors on the ASR and the semantic parser output. In this paper we address the problem of coping with such noise in the training phase of the neural network-based architecture of LMs. We propose the use of deep autoencoders for the encoding of semantic context while accounting for ASR errors. We investigate the optimization of SELMs both for transcription and understanding by using deep semantic encodings. Deep semantic encodings suppress the noise introduced by the ASR module and enable SELMs to be optimized adequately. We assess the understanding performance by measuring the errors made on target words, and we achieve 3.7% relative improvement over recurrent neural network LMs.

Index Terms: Language Modeling, Semantic Language Models, Recurrent Neural Networks, Deep Autoencoders

1. Introduction

The performance of automatic speech recognition (ASR) systems is measured by word error rate (WER). However, in the literature the use of WER has been criticized because it poorly captures understanding performance [1, 2]. Therefore, a joint optimization over transcription and understanding must be employed that accounts for the semantic constraints. The most notable LMs that consider semantic constraints are the latent semantic analysis (LSA) work in [3] and the recognition-for-understanding LM training in [1].

Deep autoencoders can be used to reduce the dimensionality of data with higher precision than principal component analysis [4]. In addition, it has been observed that deep autoencoders outperform LSA for document similarity tasks. Semantic hashing [5] is a method for document retrieval that maps documents to binary vectors such that the Hamming distance between two vectors represents the similarity between those documents. Deep denoising autoencoders have also been shown to learn high-level representations of the input, which improves the performance of digit recognition systems [6].

The Semantic LMs (SELMs) we present in this paper are neural network LMs (NNLMs) [7] that learn distributed representations for words. The architecture of SELMs is similar to the context-dependent recurrent NNLMs (RNNLMs), which use recurrent connections as a short-term memory and embody a feature layer [8]. SELMs are based on the theory of frame semantics and model the linguistic scene based on either the target-word or the frame features that are evoked in the utterance [9]. The linguistic scene is obtained from the ASR hypothesis and is therefore affected by the ASR noise. The noise can be reduced by pruning the erroneous frames [9].
However, this prevents the model from capturing the whole linguistic scene, and the pruning may not perform well on unseen data. In this paper, we propose to use deep autoencoders to encode frames and targets in a noisy representation in order to handle the ASR noise and to optimize SELMs for the whole linguistic scene. We show that SELMs can be utilized for optimizing spoken language systems both for transcription and for understanding performance.

2. Semantic LMs

Traditional LMs model words as sequences of symbols and do not consider any linguistic information related to them [10]. Hence, they fail to capture semantic relationships between the words and the semantic context of utterances. SELMs [9] overcome this problem by incorporating the semantic context of utterances into the LM.

SELMs are based on the theory of frame semantics developed in the FrameNet project [11]. In FrameNet, word meanings are defined in the context of semantic frames, which are evoked by linguistic forms called target words or targets [11]. The other words that complete the meaning in frames are called frame elements. The following shows an example of a semantic frame:

  Lee sold a textbook to Abby.

In this example, the target word "sold" evokes the frame COMMERCE-SELL, and the buyer frame element is filled with the phrase "to Abby". SELMs use frames and targets for semantic information. For the automatic extraction of frames and targets from utterances, we have used the open-source frame-semantic parser SEMAFOR [12]. SEMAFOR performs semantic parsing by first recognizing targets with a rule-based system and then identifying frames with a statistical model. At the final step, frame elements are filled by using another statistical model. SEMAFOR relies on the output of a statistical dependency parser. The reader may refer to [12] for a detailed description of SEMAFOR.

The performance of ASR can be improved by re-scoring an n-best list of hypotheses with a more advanced LM than the one used for decoding. There may be various ways to select the hypotheses during re-scoring. Figure 1 shows the transcription versus the understanding performance for different possible selections of hypotheses. We measure the understanding performance by target error rate (TER), which is calculated from the errors made on target words, the main meaning-bearing elements of semantic frames. If the sole purpose is to optimize with respect to the transcription performance (WER), one may not improve the understanding performance (TER). Hence, LMs for re-scoring must be built to jointly optimize the transcription and the understanding performance.
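The paper does not give a formula for TER; the following is a minimal sketch of one plausible way to compute it, counting only the errors on the target words identified by the semantic parser on the reference transcription and aligning reference and hypothesis with a standard word-level matcher. The function name, the use of difflib, and the error definition (a reference target that is not exactly recovered counts as one error) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: TER as the fraction of reference target words that
# the hypothesis fails to reproduce, under a standard word-level alignment.
from difflib import SequenceMatcher

def target_error_rate(ref_words, hyp_words, target_positions):
    """target_positions: indices into ref_words that the semantic parser
    marked as target words on the reference transcription."""
    recovered = set()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            recovered.update(p for p in range(i1, i2) if p in target_positions)
    errors = len(target_positions) - len(recovered)
    return errors / max(len(target_positions), 1)

# Toy example: the target word "sold" (position 1) is misrecognized as "told".
ref = "lee sold a textbook to abby".split()
hyp = "lee told a textbook to abby".split()
print(target_error_rate(ref, hyp, target_positions={1}))  # -> 1.0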

Figure 1: Scatter plot of transcription performance (WER) versus understanding performance (TER) for random selections of hypotheses from the 100-best list of the development set of the Wall Street Journal corpus.

SELMs incorporate semantic information over the frames evoked and the targets that occur in an utterance. In this respect, they are well suited for jointly optimizing both the recognition and the understanding performance. SELMs are based on the context-dependent RNNLM architecture given in [8]. The connection between the feature layer and the hidden layer is removed, because semantic encodings are high-level representations. In this paper, we introduce SELMs that use deep semantic encodings of frames and targets as the semantic context.

The structure of the SELMs is given in Figure 2. The SELMs we have used have a class-based implementation that estimates the probability of the next word by factorizing it into class and class-membership probabilities. The current word is fed into the input layer with 1-of-n encoding. The semantic layer uses the semantic encoding for the current utterance. SELMs are trained with the backpropagation-through-time algorithm, which unfolds the network for N time steps back for the recurrent layer and updates the weights with standard backpropagation [13]. SELMs also use n-gram maximum-entropy features, which are implemented as direct connections between n-gram histories and the output layer. The implementation applies hashing on the n-gram histories as given in [14].

Figure 2: The class-based SELM structure. The network takes the current word w_t and the semantic encoding (sc) of the current utterance as input. A sigmoid hidden layer s_t is fed by w_t, the recurrent state s_{t-1}, and the semantic layer. The softmax output layer estimates the probability of the next word w_{t+1} factorized into class probabilities P(cl_{t+1} | w_t, s_{t-1}, sc) and class-membership probabilities P(w_{t+1} | cl_{t+1}, w_t, s_{t-1}, sc), where cl_{t+1} denotes the recognized class of the next word. The direct connections for the n-gram maximum-entropy features are not shown.
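As a concrete reading of Figure 2, the sketch below implements one forward step of the class-based factorization with a recurrent sigmoid hidden layer and a fixed utterance-level semantic encoding. The layer sizes, the toy word-to-class mapping, and the omission of the maximum-entropy direct connections are simplifying assumptions; this is not the authors' implementation.

```python
# Minimal numpy sketch of one SELM forward step (Figure 2): sigmoid hidden
# state over the current word, the previous state, and the semantic encoding;
# class-factorized softmax output. All weights and sizes are illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SELMSketch:
    def __init__(self, vocab, n_classes=2, hidden=200, sem_dim=12, seed=0):
        rng = np.random.default_rng(seed)
        V = len(vocab)
        self.vocab = vocab
        self.word2cls = {w: i % n_classes for i, w in enumerate(vocab)}  # toy class map
        self.W_in = rng.normal(0, 0.1, (hidden, V))           # 1-of-n input -> hidden
        self.W_rec = rng.normal(0, 0.1, (hidden, hidden))     # recurrent connections
        self.W_sem = rng.normal(0, 0.1, (hidden, sem_dim))    # semantic layer -> hidden
        self.W_cls = rng.normal(0, 0.1, (n_classes, hidden))  # hidden -> class scores
        self.W_out = rng.normal(0, 0.1, (V, hidden))          # hidden -> word scores

    def next_word_prob(self, w_next, w_t, s_prev, sem_enc):
        """P(w_next | w_t, history, sem_enc) = P(class) * P(w_next | class)."""
        x = np.zeros(len(self.vocab))
        x[self.vocab.index(w_t)] = 1.0                                  # 1-of-n encoding
        s_t = 1.0 / (1.0 + np.exp(-(self.W_in @ x + self.W_rec @ s_prev
                                    + self.W_sem @ sem_enc)))           # sigmoid hidden layer
        p_cls = softmax(self.W_cls @ s_t)                               # class probabilities
        c = self.word2cls[w_next]
        members = [i for i, w in enumerate(self.vocab) if self.word2cls[w] == c]
        p_member = softmax((self.W_out @ s_t)[members])                 # class-membership probs
        return p_cls[c] * p_member[members.index(self.vocab.index(w_next))], s_t

# toy usage: score a next word given a 12-dimensional semantic encoding
lm = SELMSketch(vocab=["lee", "sold", "a", "textbook", "to", "abby"])
p, state = lm.next_word_prob("sold", "lee", np.zeros(200), np.zeros(12))
print(p)
```

In the actual model the returned hidden state would be carried to the next time step, and training would unfold this step over time as described above.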
3. Deep semantic encodings

A binary vector, as used in semantic hashing [5], introduces noise to the high-level representation of a document compared to a continuous vector. For that reason, it is suitable as a noisy representation of the semantic information of utterances. This section describes how deep autoencoders are trained to obtain deep semantic encodings for utterances.

The training of the deep autoencoder is done in two phases, as given in [5]. The phases of training are depicted in Figure 3. The input is represented with normalized bag-of-words (BoW) vectors of frames and targets in both phases. The first phase is the unsupervised pretraining phase for finding a good initialization of the weights. For this purpose, greedy layer-by-layer training [15] is performed. In this approach, each pair of layers is modeled by a Restricted Boltzmann Machine (RBM), and each RBM is trained from bottom to top. During the pretraining phase the bottom RBM (RBM 1) is modeled by a Constrained Poisson Model as given in [5]. Therefore, unnormalized BoW vectors are used only when computing the activations of the hidden layer, and the softmax activation function is used for the reconstruction of the input as the normalized BoW vector. The other RBMs use the sigmoid activation function. The network is pretrained by using single-step contrastive divergence [16].

In the second phase, the network is unrolled as shown in Figure 3, so that it reconstructs the input at the output layer. The output layer uses the softmax function and reconstructs the normalized BoW input vector; the other layers use the sigmoid activation function. The backpropagation algorithm is used to fine-tune the weights by using the reconstruction error at the output layer. The codes at the code layer are made binary by using stochastic binary units at that layer, i.e. the state of each node is set to 1 if its activation value is greater than a random value generated at run time, and to 0 otherwise. This state value is used for the forward pass; however, when backpropagating the errors the actual activation values are used. After training the autoencoder, deep semantic encodings are obtained by using only the bottom part of the network (the part inside the dashed box in Figure 3).

Figure 3: The training phases of the autoencoder for deep semantic encodings: unsupervised pretraining of the stacked RBMs (RBM 1 to RBM 3), followed by fine-tuning of the unrolled network. The bottom part of the fine-tuned network (dashed box), up to the code layer, is used to obtain the semantic encodings.
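To make the code-layer behaviour concrete, here is a small sketch of the stochastic binary unit described above: binarize against a run-time random threshold on the forward pass, but keep the real-valued sigmoid activation for backpropagation. The function name and standalone form are illustrative assumptions.

```python
# Sketch of the stochastic binary code layer: the forward pass uses a random
# threshold per unit, while the backward pass would use the real activations.
import numpy as np

def stochastic_binary_code(pre_activation, rng):
    act = 1.0 / (1.0 + np.exp(-pre_activation))                # sigmoid activations
    code = (act > rng.uniform(size=act.shape)).astype(float)   # 1 if activation > random value
    return code, act   # `code` drives the forward pass; `act` is what backprop uses

rng = np.random.default_rng(0)
code, act = stochastic_binary_code(np.array([-2.0, 0.0, 3.0]), rng)
print(code, act)   # prints the sampled binary code and the underlying activations
```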

4. Wall Street Journal (WSJ) experiments

We present the performance of SELMs on N-best re-scoring experiments on the WSJ speech recognition task. The re-scored hypotheses are evaluated both on recognition performance (WER) and on the target error rate (TER), a proxy for understanding performance. All of the experiments presented in this section are performed on the publicly available WSJ0/WSJ1 (DARPA November 92 and November 93 Benchmark) sets. All the development data under WSJ1 for the speaker-independent 20k vocabulary is used as the development set (Dev 93, 503 utterances). The evaluation is done on the November 92 CSR speaker-independent 20k NVP test set (Test 92, 333 utterances) and on the November 93 CSR HUB 1 test set (Test 93, 213 utterances).

4.1. ASR baseline

The baseline ASR system is trained by using the Kaldi speech recognition toolkit [17]. The vocabulary is the 20K open-vocabulary word list for non-verbalized punctuation that is available in the WSJ0/WSJ1 corpus. The language model of the baseline system is the baseline tri-gram backoff model for the 20K open vocabulary for non-verbalized punctuation, which is also available in the WSJ0/WSJ1 corpus. The acoustic models are trained on the SI-284 set, by using the Kaldi recipe with the following settings. MFCC features are extracted and spliced in time with a context window of [-3, +3]. Linear discriminant analysis (LDA) and maximum likelihood linear transform (MLLT) are applied. Triphone Gaussian mixture models are trained over these features. The system performs weighted finite-state decoding. We have extracted 100-best lists for both the development set and the evaluation sets. The performance of the ASR baseline is given in Table 1.

Table 1: The ASR baseline recognition performance (WER) on the Dev 93, Test 92, and Test 93 sets.

                        Dev 93    Test 92    Test 93
  ASR 1-best            15.3%     10.2%      14.0%
  Oracle on 100-best     8.3%      5.1%       7.3%
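The oracle row of Table 1 is the WER obtained when, for every utterance, the hypothesis in the 100-best list closest to the reference is selected. The sketch below shows one standard way such an oracle is computed, using plain word-level Levenshtein distance; it is illustrative and unrelated to the Kaldi scoring tools actually used.

```python
# Illustrative oracle WER over n-best lists: per utterance, keep the hypothesis
# with the fewest word errors against the reference, then pool the errors.
def word_errors(ref, hyp):
    # standard Levenshtein distance over words (substitutions, insertions, deletions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution/match
                          d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1)                               # insertion
    return d[-1][-1]

def oracle_wer(references, nbest_lists):
    errors = sum(min(word_errors(ref, hyp) for hyp in nbest)
                 for ref, nbest in zip(references, nbest_lists))
    return errors / sum(len(ref) for ref in references)

refs = ["the cat sat".split()]
nbest = [["the cat sad".split(), "the cat sat".split()]]
print(oracle_wer(refs, nbest))  # 0.0, because the correct hypothesis is in the list
```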
4.2. Re-scoring Experiments

The re-scoring experiments are performed on the 100-best lists that are obtained from the ASR baseline system. We have re-scored these 100-best lists by using the SELMs that are trained on frames and targets separately. In addition, we have trained an RNNLM model and a 5-gram model with modified Kneser-Ney smoothing with singleton cut-offs. All models are trained on the whole WSJ 87, 88, and 89 data, with the vocabulary limited to the 20K open vocabulary for non-verbalized punctuation. Therefore, the LMs used for re-scoring include a 5-gram modified Kneser-Ney model with singleton cut-offs (KN5) and an RNNLM that has 200 nodes in the hidden layer and uses a maximum-entropy model with 4-gram features and 10^9 connections (RNNME). RNNME uses 200 word classes that are constructed based on the frequencies of words, whereas KN5 does not contain any classes.

The SELMs use semantic encodings of frames and targets. The frames and targets for the LM training data are obtained using the SEMAFOR semantic parser. We use the most frequent frames and targets that cover 80% of the training corpus, i.e. 184 distinct frames and 1184 distinct targets. For obtaining deep semantic encodings, we have trained autoencoders of size 184-200-200-12 for frames and of size 1184-400-400-12 for targets. Pretraining is performed for 20 iterations with a mini-batch size of 100 over the frames and targets. Fine-tuning is performed by using stochastic gradient descent, with the reconstruction error on the development set (Dev 93) used to adjust the learning rate and for early stopping, in order to avoid overfitting.

The SELMs are trained by using either the frame encodings or the target encodings obtained with the autoencoders. The SELMs have the same configuration as the RNNME model, i.e. they have 200 nodes in the hidden layer and use a maximum-entropy model with 4-gram features and 10^9 connections. They also use the same word classes. All NNLMs (RNNME and SELMs) are initialized with the same random weights to make the experiments more controlled. In addition, the training of all NNLMs uses the same randomization of the training data. Since the training data is randomized, we have built independent sentence models, i.e. the state of the network is reset after each sentence. Dev 93 is used to adjust the learning rate and for early stopping.

The flow of the re-scoring experiments for SELMs is shown in Figure 4. The ASR 1st-best hypothesis is passed through SEMAFOR to extract frames and targets; deep semantic encodings are then obtained by feeding these into the relevant autoencoder. Therefore, when re-scoring an utterance, the semantic encoding of the whole utterance, computed from the 1st-best ASR hypothesis, is used.

Figure 4: The SELM re-scoring diagram. The test utterance is fed into the ASR. The 1st-best ASR hypothesis is passed through the semantic parser, and the bag-of-words frame or target features are given to the deep autoencoder to extract the semantic encoding for the test utterance. The n-best list is re-scored by the SELM, which uses the semantic encoding as the semantic context for the test utterance, and the best hypothesis is selected.
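The sketch below mirrors the flow in Figure 4 with every component stubbed out: the parser, the encoder, and the scorer stand in for SEMAFOR, the Section 3 autoencoder, and the trained SELM. The interfaces and the way ASR and LM scores are combined (a weighted sum of log-scores) are assumptions; the paper does not specify the combination.

```python
# Sketch of the Figure 4 re-scoring flow; all components are toy stand-ins.
def rescore_utterance(nbest, semantic_parser, encoder, selm_logprob, lm_weight=1.0):
    """nbest: list of (asr_log_score, hypothesis_words), best-first."""
    frames_or_targets = semantic_parser(nbest[0][1])       # parse the 1st-best hypothesis
    sem_enc = encoder(frames_or_targets)                    # utterance-level semantic encoding
    rescored = [(asr_score + lm_weight * selm_logprob(words, sem_enc), words)
                for asr_score, words in nbest]               # combine ASR and SELM scores
    return max(rescored, key=lambda pair: pair[0])[1]        # best hypothesis after re-scoring

# toy stand-ins so the sketch runs end to end
nbest = [(-12.0, "lee told a textbook to abby".split()),
         (-12.5, "lee sold a textbook to abby".split())]
parser = lambda words: ["COMMERCE-SELL"] if "sold" in words else []
encoder = lambda frames: [1.0] if frames else [0.0]
selm = lambda words, enc: -0.2 if "sold" in words else -1.0
print(rescore_utterance(nbest, parser, encoder, selm))  # picks the "sold" hypothesis
```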

To see how much the ASR noise degrades the performance, we have also performed re-scoring experiments by using the semantic encodings of the reference transcriptions. Naturally, the actual performance is the one obtained when the ASR hypothesis is used. Hence, we present two results for SELMs: 1) ASR encodings, which refers to the actual performance, where the ASR 1st-best hypotheses are used for the semantic encodings; and 2) reference encodings, where the reference transcriptions are used for the semantic encodings. In addition, we present the linear interpolation of the two SELMs built on frame encodings and target encodings, with equal weights.

The WER performance of all the models is given in Table 2. The SELMs have a better WER performance than RNNME on the test sets. We observe that target encodings are more robust to noise than frame encodings. In addition, the linear interpolation of SELMs achieves 4.9% relative improvement in WER over RNNME on the combination of the Test 92 and Test 93 sets.

Table 2: The WER performance for the frame encoding models (SELM - Frame Enc.) and the target encoding models (SELM - Target Enc.), using ASR encodings (ASR Enc.) and reference encodings (Ref Enc.). The rows with ASR encodings give the actual performance.

  Language Model                            Dev 93    Test 92    Test 93
  KN5                                       14.6%      9.7%      13.3%
  RNNME                                     13.4%      8.8%      12.7%
  (1) SELM - Frame Enc.,  ASR Enc.          13.6%      8.4%      12.6%
  (1) SELM - Frame Enc.,  Ref Enc.          13.6%      8.4%      12.3%
  (2) SELM - Target Enc., ASR Enc.          13.4%      8.7%      12.0%
  (2) SELM - Target Enc., Ref Enc.          13.2%      8.6%      11.9%
  (1)+(2) Lin. Interpolation, ASR Enc.      13.3%      8.5%      12.0%
  (1)+(2) Lin. Interpolation, Ref Enc.      13.2%      8.4%      11.8%

4.3. Target Recognition Performance

The WSJ corpus is designed for the speech recognition task and does not have any gold standard for measuring understanding performance. Therefore, we evaluate our models on the targets recognized by the automatic semantic parser on the reference transcriptions of the development and evaluation sets. The target error rates (TER) of all models are given in Table 3. We also analyze the error rate on the most frequent targets that cover 60%, 80%, and 100% of the training corpus; results on the combination of the Test 92 and Test 93 evaluation sets are presented in Figure 5.

Table 3: The TER performance for the frame encoding models (SELM - Frame Enc.) and the target encoding models (SELM - Target Enc.), using ASR encodings (ASR Enc.) and reference encodings (Ref Enc.). The rows with ASR encodings give the actual performance of the SELMs.

  Model                                     Dev 93    Test 92    Test 93
  KN5                                       13.4%     10.4%      13.2%
  RNNME                                     12.7%      9.6%      12.6%
  (1) SELM - Frame Enc.,  ASR Enc.          12.4%      9.1%      13.3%
  (1) SELM - Frame Enc.,  Ref Enc.          12.1%      9.1%      12.6%
  (2) SELM - Target Enc., ASR Enc.          12.5%      9.3%      12.5%
  (2) SELM - Target Enc., Ref Enc.          12.1%      9.1%      12.3%
  (1)+(2) Lin. Interpolation, ASR Enc.      12.1%      9.1%      12.3%
  (1)+(2) Lin. Interpolation, Ref Enc.      11.9%      9.1%      11.9%

Figure 5: TER of the LMs at various coverages of target words (0.6 to 1.0): (a) SELMs with reference encodings (overall WER: RNNME 10.3%, SELM frame enc. 9.9%, SELM target enc. 9.9%, linear interpolation 9.7%), (b) SELMs with ASR encodings, i.e. the actual performance (overall WER: RNNME 10.3%, SELM frame enc. 10.2%, SELM target enc. 10.1%, linear interpolation 9.8%). SELMs with reference encodings consistently perform better than RNNME. The target encodings suppress the ASR noise more robustly than the frame encodings. The linear interpolation of the SELMs performs best.

Both results show that, if accurate semantic context (reference encodings) is used, SELMs are consistently good at optimizing the performance both in terms of WER and TER. When ASR encodings are used, the ASR noise affects the TER performance slightly, especially for the SELMs with frame encodings. The target encodings, on the other hand, are more robust to noise. The linear interpolation of the SELMs achieves 3.7% relative improvement in TER over RNNME.
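The "(1)+(2)" rows above combine the frame-encoding and target-encoding SELMs by linear interpolation with equal weights. The paper does not detail whether the interpolation is applied to per-word probabilities or to hypothesis-level scores; the sketch below assumes per-word probability interpolation, which is the common reading.

```python
# Equal-weight linear interpolation of two LMs at the word-probability level
# (an assumption about where the interpolation is applied).
import math

def interpolated_logprob(word_probs_frame_selm, word_probs_target_selm, lam=0.5):
    """Each argument lists the per-word probabilities one model assigns to a hypothesis."""
    return sum(math.log(lam * p1 + (1.0 - lam) * p2)
               for p1, p2 in zip(word_probs_frame_selm, word_probs_target_selm))

# toy example: two models scoring the same three-word hypothesis
print(interpolated_logprob([0.10, 0.30, 0.05], [0.20, 0.10, 0.05]))
```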
5. Conclusion

In this paper, we have presented the use of deep semantic encodings for training SELMs, which exploit the semantic constraints in the language. Deep semantic encodings enable SELMs to be optimized both for transcription and for understanding performance by suppressing the ASR noise. We observe that the target encodings are more robust to ASR noise than the frame encodings. With the linear interpolation of SELMs that use frame and target encodings with equal weights, we achieve 4.9% relative improvement in WER and 3.7% relative improvement in TER over the RNNME model on the whole evaluation set.

6. References

[1] G. Riccardi and A. L. Gorin, "Stochastic language models for speech recognition and understanding," in Proceedings of ICSLP, Sydney, Nov. 1998.
[2] Y.-Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," in Proceedings of ASRU. IEEE, Nov. 2003, pp. 577-582.
[3] J. Bellegarda, "Exploiting latent semantic information in statistical language modeling," Proceedings of the IEEE, vol. 88, no. 8, pp. 1279-1296, Aug. 2000.
[4] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, Jul. 2006.
[5] R. Salakhutdinov and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969-978, Jul. 2009.
[6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, Dec. 2010.
[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[8] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in Proceedings of SLT. IEEE, 2012, pp. 234-239.
[9] A. O. Bayer and G. Riccardi, "Semantic language models for automatic speech recognition," in Proceedings of SLT. IEEE, Dec. 2014, pp. 7-12.
[10] R. Rosenfeld, "Two decades of statistical language modeling: where do we go from here?" Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, Aug. 2000.
[11] C. J. Fillmore, C. R. Johnson, and M. R. L. Petruck, "Background to FrameNet," International Journal of Lexicography, vol. 16, no. 3, pp. 235-250, Sep. 2003.
[12] D. Das, D. Chen, A. F. T. Martins, N. Schneider, and N. A. Smith, "Frame-semantic parsing," Computational Linguistics, vol. 40, no. 1, pp. 9-56, 2014.
[13] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proceedings of Interspeech. ISCA, 2010, pp. 1045-1048.
[14] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in Proceedings of ASRU. IEEE, 2011, pp. 196-201.
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, Jul. 2006.
[16] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002.
[17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proceedings of ASRU. IEEE, 2011.