Structured OUtput Layer (SOUL) Neural Network Language Model


Structured OUtput Layer (SOUL) Neural Network Language Model
Le Hai Son, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, François Yvon
LIMSI-CNRS, 25/5/2011

Outline
1 Neural Network Language Models
2 Hierarchical Models
3 SOUL Neural Network Language Model

Part 1: Neural Network Language Models

N-gram models
Very successful, but they suffer from sparsity issues and a lack of generalization.
Flat vocabulary: each word is only a possible outcome of a discrete random variable, an index in the vocabulary.

Neural Network Language Models
Estimate n-gram probabilities in a continuous space. NNLMs were introduced in [Bengio et al., 2001] and applied to speech recognition in [Schwenk and Gauvain, 2002].
Why should it work?
Similar words are expected to have similar feature vectors.
The probability function is a smooth function of the feature values: a small change in the features induces a small change in the probability.

Project a word sequence into a continuous space
Represent words as 1-of-|V| vectors (V: vocabulary size); a neuron layer represents a vector of values, one neuron per value.
Project each word into the continuous space by adding a second, fully connected layer: the connection between two layers is a matrix operation, the matrix contains all the connection weights, and the result v is a continuous vector.
For a 4-gram, the history is a sequence of 3 words; merge these three vectors to derive a single vector for the history. The projection space is shared across the three positions (see the sketch below).
[Figure: w_{i-1}, w_{i-2}, w_{i-3} encoded as 1-of-|V| vectors and mapped to continuous vectors v_{i-1}, v_{i-2}, v_{i-3} in the shared projection space.]
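
To make the projection step concrete, here is a minimal NumPy sketch of the lookup-and-concatenate operation described above; the vocabulary size, projection dimension, and word indices are illustrative placeholders, not values taken from the slides.

```python
import numpy as np

# Illustrative sizes only (not taken from the slides).
V, d = 10000, 200                        # vocabulary size, projection dimension
rng = np.random.default_rng(0)
R = rng.normal(scale=0.01, size=(V, d))  # shared projection matrix, one row per word

def project_history(word_ids):
    """Map a history of word indices (1-of-|V| vectors) to one continuous vector.

    Multiplying a 1-of-|V| vector by R simply selects a row of R, so the
    projection is a table lookup; the projected vectors of the three history
    words are then concatenated into a single history vector.
    """
    return np.concatenate([R[w] for w in word_ids])

h = project_history([42, 7, 1315])       # w_{i-1}, w_{i-2}, w_{i-3}
print(h.shape)                           # (600,) = 3 * d
```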

Estimate the n-gram probability
Given the history expressed as a feature vector (the context layer over the shared projection space):
Create a feature vector for the word to be predicted in the prediction space (a hidden layer with tanh activation).
Estimate the probabilities of all words given the history (an output layer with a softmax).
All the parameters must be learned: the projection matrix and the input-to-hidden and hidden-to-output weight matrices W_ih and W_ho (a sketch of this forward pass follows).
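
Continuing the sketch above, the forward pass through the tanh hidden layer and the softmax output can be written as follows; the dimensions and weights are again made-up placeholders, and biases and training are omitted.

```python
import numpy as np

# Illustrative dimensions, continuing the projection sketch above.
d_hist, H, V = 600, 200, 10000           # history vector size, hidden size, vocabulary size
rng = np.random.default_rng(1)
W_ih = rng.normal(scale=0.01, size=(H, d_hist))  # context layer -> hidden layer
W_ho = rng.normal(scale=0.01, size=(V, H))       # hidden layer -> output layer

def nnlm_forward(h):
    """Feed-forward NNLM pass: tanh hidden layer, then a softmax over the vocabulary."""
    hidden = np.tanh(W_ih @ h)           # feature vector in the prediction space
    scores = W_ho @ hidden               # one score per vocabulary word
    scores -= scores.max()               # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()           # P(w | history) for every word w

p = nnlm_forward(rng.normal(size=d_hist))
print(p.shape, round(p.sum(), 6))        # (10000,) 1.0
```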

Early assessment
Key points:
The projection into continuous spaces reduces the sparsity issues.
The projection and the prediction are learned simultaneously.
Probability estimation is based on the similarity among the feature vectors.
In practice: significant and systematic improvements in machine translation and speech recognition tasks. Everybody should use it!
With a small training set this works well; with a large training set, learning and inference time become the bottleneck.

Why does inference take so long?
Forward propagation of the history:
The projection selects a row of the projection matrix for each history word (a cheap matrix row selection).
Compute a feature vector for the predicted word (a matrix multiplication into the hidden layer).
Estimate the probability for all the words (a matrix multiplication whose size grows with |V|).
Complexity issues:
The input vocabulary can be as large as we want.
Increasing the n-gram order does not drastically increase the complexity.
The problem is the output vocabulary size (a rough operation count follows below).
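
A rough, back-of-the-envelope operation count makes the point; the projection and hidden dimensions below are invented for illustration, while the 56k vocabulary matches the Mandarin system described later.

```python
# Rough operation counts for predicting one word, with illustrative sizes.
d, H, V, history = 200, 200, 56000, 3

projection_ops = history              # row selections: essentially free table lookups
hidden_ops = (history * d) * H        # context layer -> hidden layer multiplication
output_ops = H * V                    # hidden layer -> softmax over the whole vocabulary

print(hidden_ops, output_ops)         # 120000 vs 11200000: the output layer dominates
```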

Usual tricks to speed up training (and inference)
Re-sampling and batch training:
For each epoch, down-sample the training data.
Forward- and back-propagate for a group of n-grams at a time.
Reduce the output vocabulary (shortlist):
Use the neural network to predict only the K most frequent words (for a tractable model, K is typically several thousand words, e.g. the 8k and 12k shortlists used below).
Requires renormalizing the distribution over the whole vocabulary, using the standard n-gram LM (one common scheme is sketched below).
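
The slide only states that the shortlist distribution must be renormalized over the full vocabulary with the help of the standard n-gram LM; the sketch below shows one common way this is done in shortlist NNLMs, and the function names and the exact combination scheme are assumptions for illustration.

```python
def shortlist_prob(word, history, p_nn, p_ngram, shortlist):
    """Combine a shortlist NNLM with a back-off n-gram LM (one common scheme).

    p_nn(word, history):    NNLM probability, defined only for shortlist words
                            (its softmax is taken over the shortlist).
    p_ngram(word, history): standard n-gram LM probability over the full vocabulary.
    The NNLM mass is rescaled by the n-gram probability mass of the shortlist, so
    the combined model still sums to one over the whole vocabulary.
    """
    if word in shortlist:
        shortlist_mass = sum(p_ngram(v, history) for v in shortlist)
        return p_nn(word, history) * shortlist_mass
    return p_ngram(word, history)
```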

Part 2: Hierarchical Models

Speeding up MaxEnt models
Main ideas, as proposed in [Goodman, 2001]:
Instead of computing P(w | h) directly, make use of a clustering of words into classes:
P(w | h) = P(w | c(w), h) * P(c(w) | h)
Any classes can be used, but generalization may be better for classes for which it is easier to learn P(c(w) | h).
Example of the reduction: with a 10,000-word vocabulary and 100 classes, each prediction needs 2 normalizations over about 100 outcomes (roughly 200 terms) instead of one normalization over 10,000 outcomes, a reduction by a factor of 50 (see the sketch below).
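
A minimal sketch of the class factorization above; the helper names and the assumption that the class model is normalized over a few hundred classes are illustrative, not from the slide.

```python
def class_factored_logprob(word, history, log_p_class, log_p_word_in_class, word_to_class):
    """log P(w | h) = log P(c(w) | h) + log P(w | c(w), h).

    log_p_class(c, history):            log-probability of class c given the history,
                                        normalized over the (few hundred) classes.
    log_p_word_in_class(w, c, history): log-probability of w among the words of class c.
    Both model callables and the word_to_class mapping are placeholders here.
    """
    c = word_to_class[word]
    return log_p_class(c, history) + log_p_word_in_class(word, c, history)
```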

Hierarchical Probabilistic NNLM
Main ideas, as proposed in [Morin and Bengio, 2005]:
Perform a binary hierarchical clustering of the vocabulary.
Predict words as paths in this clustering tree.
Details:
The clustering is constrained by the WordNet semantic hierarchy.
The next bit in the hierarchy is predicted as P(b | node, w_{t-1}, ..., w_{t-n+1}).
Results:
Brown corpus, 1M words, 10,000-word vocabulary.
Speed-up, but a loss in perplexity compared to a standard NNLM.

Scalable Hierarchical Distributed LM
Main ideas, as proposed in [Mnih and Hinton, 2008]:
Use automatic clustering instead of WordNet.
Implement the model as a log-bilinear model.
One-to-many word-to-class mapping.
Results:
APNews dataset, 14M words, 18k vocabulary.
Perplexity improvements over an n-gram model, and performance similar to a non-hierarchical LBL.
No comparison with the non-linear NNLMs used in STT.

Part 3: SOUL Neural Network Language Model

Structured OUtput Layer NNLM
Main ideas:
Trees are not binary: multiple output layers, with a softmax in each.
No clustering for frequent words: a compromise between speed and complexity.
Efficient clustering scheme: word vectors in the projection space are used for clustering.
Task:
Improve a state-of-the-art STT system that makes use of shortlist NNLMs.
Large vocabulary, and a baseline n-gram LM trained on billions of words.

Word clustering
Associate each frequent word with a single class c_1(w); split the other words into sub-classes (c_2(w)) and so on.
[Figure: clustering tree with successive levels c_1(w), c_2(w), c_3(w).]

Word probability
P(w_i | h) = P(c_1(w_i) | h) * prod_{d=2}^{D} P(c_d(w_i) | h, c_{1:d-1})
where c_{1:D}(w_i) = c_1, ..., c_D is the path for the word w_i in the clustering tree, D is the depth of the tree, c_d(w_i) is a (sub-)class, and c_D(w_i) is the leaf.
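
The product over the path can be computed one softmax at a time; the sketch below assumes one output weight matrix per visited tree node, which is an illustrative data layout rather than the exact organization given in the slides.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def soul_prob(path, hidden, output_layers):
    """P(w | h) as a product of softmax decisions along the word's path c_1, ..., c_D.

    path:          child index chosen at each level of the clustering tree
                   (the first decision covers frequent words and top-level classes).
    hidden:        hidden-layer vector computed from the history, as in a standard NNLM.
    output_layers: one weight matrix per node visited on the path; each maps the hidden
                   vector to scores over that node's children (placeholder layout).
    """
    prob = 1.0
    for choice, W in zip(path, output_layers):
        prob *= softmax(W @ hidden)[choice]
    return prob
```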

The SOUL language model
[Figure: the NNLM architecture with the shared projection of w_{i-1}, w_{i-2}, w_{i-3} and the weight matrices W_ih and W_ho, where the single softmax output is replaced by a structured output layer: a first softmax covers the frequent words and the top-level classes c_1(w), and further softmaxes handle the sub-classes.]

Training algorithm
Step 1: Train a standard NNLM with the shortlist as output (3 epochs and a shortlist of 8k words).
Step 2: Reduce the dimension of the context space with PCA.
Step 3: Perform a recursive K-means word clustering based on the distributed representation induced by the continuous space (except for the words in the shortlist); a toy sketch of Steps 2 and 3 follows.
Step 4: Train the whole model.
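
A toy scikit-learn sketch of Steps 2 and 3, assuming the word representations are the rows of the trained projection matrix; the PCA dimension, branching factor, tree depth, and the random matrix standing in for the learned projection are all placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def recursive_kmeans(vectors, word_ids, branching=4, max_leaf=3, depth=0, max_depth=3):
    """Recursively cluster word vectors into a tree (Step 3).

    Returns a nested dict: cluster index -> sub-tree, or a list of word ids at a leaf.
    branching, max_leaf and max_depth are illustrative knobs; the slides do not give them.
    """
    if len(word_ids) <= max_leaf or depth == max_depth:
        return list(word_ids)
    labels = KMeans(n_clusters=branching, n_init=10, random_state=0).fit_predict(vectors)
    tree = {}
    for c in range(branching):
        mask = labels == c
        tree[c] = recursive_kmeans(vectors[mask], word_ids[mask],
                                   branching, max_leaf, depth + 1, max_depth)
    return tree

# Toy usage: random vectors stand in for the projection rows of the non-shortlist words.
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 50))                    # 200 words, 50-dimensional representation
reduced = PCA(n_components=10).fit_transform(R)   # Step 2: PCA dimension reduction
tree = recursive_kmeans(reduced, np.arange(200))  # Step 3: recursive K-means clustering
```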

STT results with SOUL NNLMs
Mandarin GALE task. LIMSI Mandarin STT system: 56k vocabulary, baseline LM trained on 3.2 billion words, 4 NNLMs trained on 25M words after resampling. The "+" rows add the NNLM to the baseline LM (a simple combination scheme is sketched below).

model                  ppx (dev09)   CER dev09s   CER eval09
Baseline 4-gram        211           9.8%         8.9%
 + 4-gram NNLM 8k      187           9.5%         8.6%
 + 4-gram NNLM 12k     185           9.4%         8.6%
 + 4-gram SOUL NNLM    180           9.3%         8.5%
 + 6-gram NNLM 8k      177           9.4%         8.5%
 + 6-gram NNLM 12k     172           9.3%         8.5%
 + 6-gram SOUL NNLM    162           9.1%         8.3%
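
The table's "+" rows imply that the NNLM scores are combined with the baseline n-gram LM; the slides do not say how, so the following linear interpolation is only one common, assumed scheme.

```python
def interpolate_lm(p_ngram, p_nnlm, lam=0.5):
    """Build an LM that linearly interpolates a baseline n-gram LM with an NNLM.

    p_ngram, p_nnlm: callables mapping (word, history) -> probability.
    lam:             interpolation weight, normally tuned on a development set.
    This is an assumed combination scheme for illustration; the slides only show
    the combined results, not the combination method.
    """
    def p(word, history):
        return lam * p_nnlm(word, history) + (1.0 - lam) * p_ngram(word, history)
    return p
```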

Conclusion
Neural-network and class-based language models are combined.
The SOUL LM can deal with vocabularies of arbitrary size.
Speech recognition improvements are achieved on a large-scale task and over challenging baselines.
The SOUL LM yields larger improvements for longer contexts.

References
Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. In Advances in Neural Information Processing Systems, 13:933-938.
Goodman, J. (2001). Classes for fast maximum entropy training. In Proc. of ICASSP 2001.
Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, volume 21, pages 1081-1088.
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In Proc. of AISTATS 2005, pages 246-252.
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. In Proc. of ICASSP 2002, pages 765-768.