Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization

Yijun Wang 1, Yingce Xia 2, Li Zhao 3, Jiang Bian 3, Tao Qin 3, Guiquan Liu 1 and Tie-Yan Liu 3
1 Anhui Province Key Lab. of Big Data Analysis and Application, University of Science and Technology of China
2 University of Science and Technology of China
3 Microsoft Research Asia
wyjun@mail.ustc.edu.cn, yingce.xia@gmail.com, {lizo, jiabia, taoqin, tyliu}@microsoft.com, gqliu@ustc.edu.cn

This work was conducted at Microsoft Research Asia. Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Neural machine translation (NMT) heavily relies on parallel bilingual data for training. Since large-scale, high-quality parallel corpora are usually costly to collect, it is appealing to exploit monolingual corpora to improve NMT. Inspired by the law of total probability, which connects the probability of a given target-side monolingual sentence to the conditional probabilities of translating from each possible source sentence into it, we propose to explicitly exploit this connection to learn from and regularize the training of NMT models using monolingual data. The key technical challenge of this approach is that there are exponentially many source sentences for a target monolingual sentence when computing the sum of the conditional probabilities over all possible source sentences. We address this challenge by leveraging the dual translation model (target-to-source translation) to sample several most likely source-side sentences, which avoids enumerating all possible candidates. That is, we transfer the knowledge contained in the dual model to boost the training of the primal model (source-to-target translation), and we call such an approach dual transfer learning. Experimental results on English→French and German→English tasks demonstrate that dual transfer learning achieves significant improvements over several strong baselines and obtains new state-of-the-art results.

Introduction

Machine translation aims at mapping a sentence from the source language space X into the target language space Y. The recent development of neural networks has witnessed the success of Neural Machine Translation (NMT), which has achieved state-of-the-art performance (Bahdanau, Cho, and Bengio 2015; Britz et al. 2017; Gehring et al. 2017) through end-to-end learning. In particular, given a parallel sentence pair (x, y), where x ∈ X and y ∈ Y, the learning objective of most NMT algorithms is to maximize the conditional probability P(y|x; θ) parameterized by θ.

While neural networks have led to better performance, the huge number of parameters in an NMT model, usually tens of millions, raises a major challenge: such models heavily rely on large-scale parallel bilingual corpora for training. Unfortunately, it is usually quite difficult to collect adequate high-quality parallel corpora. To address this challenge, increasing attention has been paid to leveraging other, more easily obtained information, especially the huge amount of monolingual corpora on the web, to improve NMT. (Gulcehre et al. 2015) proposed to train language models (Mikolov et al.
2010; Sundermeyer, Schlüter, and Ney 2012) independently on target-side monolingual sentences and incorporate them into NMT models during decoding, either by re-scoring candidate words according to a weighted sum of the scores from the translation model and the language model, or by concatenating the hidden states of the translation and language models for further processing. While such an approach can achieve some improvement, it overlooks the potential of using monolingual data to enhance NMT training, since the data is only used to obtain a language model. Other studies attempt to enlarge the parallel bilingual training dataset by translating monolingual data with a model trained on the given parallel corpora. Such an idea has been used both in NMT (Sennrich, Haddow, and Birch 2016) and in statistical machine translation (Bertoldi and Federico 2009; Lambert et al. 2011; Ueffing, Haffari, and Sarkar 2007). Although this approach increases the volume of parallel training data, it may at the same time introduce low-quality pseudo sentence pairs into NMT training. (He et al. 2016a) proposed the concept of dual learning, in which two translation models teach each other through a reinforcement learning process by minimizing the reconstruction error of monolingual sentences in either the source or the target language. One potential issue of their approach is that it requires back-propagating through sequences of discrete predictions using reinforcement learning, which is notoriously inefficient. Adopting the same idea of reconstruction error minimization, (Cheng et al. 2016) proposed to append a reconstruction term to the training objective.

In this work, motivated by the law of total probability, we propose a principled way to exploit monolingual data for NMT based on transfer learning. We transfer the knowledge learned from the dual translation task (target-to-source translation) (He et al. 2016a; Xia et al. 2017b; 2017a) to our primary translation task (source-to-target translation), and we name our method dual transfer learning.

According to the law of total probability, the marginal probability P(y) can be computed from the conditional probability P(y|x) as P(y) = Σ_{x∈X} P(y|x)P(x). As a result, ideally the learned conditional probability P(y|x; θ) should satisfy the following equation:

    P(y) = \sum_{x \in \mathcal{X}} P(y|x; \theta) P(x).    (1)

However, if P(y|x; θ) is learned from bilingual corpora using maximum likelihood estimation, there is no guarantee that the above equation will hold. Inspired by the law of total probability, we propose to learn the translation model θ by maximizing the likelihood of the parallel corpora, subject to the constraint of Eqn.(1) for any target-language sentence y in a monolingual corpus M. In this way, the learning objective explicitly encodes the probabilistic connection and regularizes the learning process towards the right direction.

To compute Σ_{x∈X} P(y|x; θ)P(x), a technical challenge is that this value is usually intractable due to the exponentially large search space X. Traditionally, this problem is resolved by sampling from the search space and using the sample average to approximate the expectation:

    \sum_{x \in \mathcal{X}} P(y|x; \theta) P(x) = \mathbb{E}_{x \sim P(x)} P(y|x; \theta) \approx \frac{1}{K} \sum_{i=1}^{K} P(y|x^{(i)}; \theta), \quad x^{(i)} \sim P(x).    (2)

That is, given a target-language sentence y ∈ Y, one samples K source sentences x^{(i)} according to the distribution P(x) and then computes the average conditional probability over the K samples. However, since the values of P(y|x; θ) are very sparse and most x drawn from P(x) yield a nearly zero P(y|x; θ), a plain Monte Carlo sample from P(x) is not capable of regularizing the training of NMT models. To deal with this problem, we adopt importance sampling and sample from the distribution P(x|y), which guarantees the quality of the sampled sentences so that the corresponding constraint is valid empirically. Note that P(x|y) is exactly the dual translation model that translates a target sentence into a source sentence. Thus, by doing so, we transfer the knowledge learned from the dual translation task to our primary translation task.

The main contributions of this paper can be summarized as follows:
- We propose a principled way to leverage monolingual data to enhance the training of NMT, which adopts a probabilistic view and is a kind of transfer learning.
- When estimating Σ_{x∈X} P(y|x; θ)P(x), we leverage the dual translation model for importance sampling to guarantee the quality of the sampled sentences and to ensure that the probabilistic constraint is valid empirically.
- Experiments on the IWSLT and WMT datasets show that our approach achieves significant improvements in translation quality over baseline methods on both German→English and English→French translation tasks.

Background: Neural Machine Translation

Neural machine translation systems are typically implemented within an encoder-decoder neural network framework, which learns a conditional probability P(y|x) from a source language sentence x to a target language sentence y. In this framework, the encoder network projects the source sentence into a distributed representation, based on which the decoder generates the target sentence word by word. The encoder and the decoder are learned jointly in an end-to-end way. The standard training objective of existing NMT models is to maximize the likelihood of the training data.
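To make this objective concrete, the following minimal sketch (plain Python, with a hypothetical toy_token_distribution standing in for a real encoder-decoder softmax; it is an illustration, not the paper's implementation) shows how log P(y|x; θ) factorizes over target tokens and how the maximum-likelihood loss is accumulated over a parallel corpus.

    import math

    VOCAB = ["<eos>", "the", "cat", "sat", "le", "chat"]

    def toy_token_distribution(prev_tokens, source_sentence):
        # Hypothetical stand-in for the decoder softmax P(y_t | y_<t, x; theta).
        # A real NMT model conditions on the encoded source and the previously
        # generated target tokens; here we return a uniform distribution over a
        # tiny vocabulary so the example is runnable.
        return {w: 1.0 / len(VOCAB) for w in VOCAB}

    def sentence_log_prob(source, target):
        # log P(y | x; theta) = sum_t log P(y_t | y_<t, x; theta)
        log_p = 0.0
        for t, token in enumerate(target):
            dist = toy_token_distribution(target[:t], source)
            log_p += math.log(dist[token])
        return log_p

    def mle_loss(parallel_corpus):
        # Negative log-likelihood of the bilingual corpus B = {(x_n, y_n)}.
        return -sum(sentence_log_prob(x, y) for x, y in parallel_corpus)

    corpus = [(["the", "cat", "sat"], ["le", "chat", "<eos>"])]
    print(mle_loss(corpus))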
With the fast development of deep learning, a variety of encoder-decoder architectures have been introduced to enhance NMT performance, such as recurrent neural networks (RNN) with attention mechanisms (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015; Wu et al. 2016), convolutional neural network (CNN) based frameworks (Gehring et al. 2017; Kalchbrenner et al. 2016), and, most recently, all-attention mechanisms (Vaswani et al. 2017). Beyond the standard encoder-decoder architecture, more elaborate decoder architectures have been proposed to promote the performance of NMT systems (Xia et al. 2017c; He et al. 2017). Meanwhile, a trend in recent work is to improve NMT by increasing the model depth, since deeper neural networks usually imply stronger modeling capability (Britz et al. 2017; Zhou et al. 2016). However, even a single-layer NMT model has a huge number of parameters to optimize, which requires large-scale data for effective model training, not to mention deep models. Unfortunately, parallel bilingual corpora are usually quite limited in either quantity or coverage, making it appealing to exploit large-scale monolingual corpora to improve NMT.

Framework

In this section, we present our new approach, dual transfer learning, which is inspired by the law of total probability and leverages the dual translation model to learn from monolingual data. We first introduce our new training objective with a marginal distribution regularizer. Given the difficulty of estimating the regularization term caused by the exponentially large search space, we then address this challenge by using the dual model for importance sampling. After that, we present the whole dual transfer learning algorithm for NMT in detail.

Training Objective

We first define some notation and present the maximum likelihood training objective used in most NMT algorithms. Then, we introduce our marginal distribution regularizer inspired by the law of total probability.

Given the source language space X and the target language space Y, a translation model takes a sample from X as input and maps it into the space Y. The translation model is usually represented by a conditional distribution P(y|x; θ) parameterized by θ, where x ∈ X and y ∈ Y.

In standard supervised learning, given a parallel corpus B = {(x^{(n)}, y^{(n)})}_{n=1}^{N}, the translation model is learned by maximizing the likelihood of the training data:

    L(\theta) = \sum_{n=1}^{N} \log P(y^{(n)} | x^{(n)}; \theta).    (3)

According to the law of total probability, we should have P(y) = Σ_{x∈X} P(y|x)P(x). Therefore, for any y ∈ Y, if the learned translation model θ were perfect, we would have

    P(y) = \sum_{x \in \mathcal{X}} P(y|x; \theta) P(x) = \mathbb{E}_{x \sim P(x)} P(y|x; \theta).    (4)

Assume that we have a monolingual corpus M which contains S sentences i.i.d. sampled from the space Y, i.e., M = {y^{(s)}}_{s=1}^{S}. Since the model P(y|x; θ) is empirically learned via maximum likelihood training on parallel data, there is no guarantee that Eqn.(4) will hold for the sentences in M. Therefore, we can regularize the learning process on monolingual data by forcing all sentences in M to satisfy the probabilistic relation in Eqn.(4). Mathematically, we have the following constrained optimization problem:

    \max_{\theta} \sum_{n=1}^{N} \log P(y^{(n)} | x^{(n)}; \theta), \quad \text{s.t.} \;\; P(y) = \mathbb{E}_{x \sim P(x)} P(y|x; \theta), \;\; \forall y \in M.    (5)

Since the ground-truth marginal distributions P(x) and P(y) are usually not available, we use the empirical distributions P̂(x) and P̂(y) as their surrogates, which we obtain from well-trained language models. Following the common practice in constrained optimization, we convert the constraint into the following regularization term:

    S(\theta) = [\log \hat{P}(y) - \log \mathbb{E}_{x \sim \hat{P}(x)} P(y|x; \theta)]^2,    (6)

and add it to the training objective. Formally, we introduce our training objective as minimizing the following function:

    L(\theta) = -\sum_{n=1}^{N} \log P(y^{(n)} | x^{(n)}; \theta) + \lambda \sum_{s=1}^{S} [\log \hat{P}(y^{(s)}) - \log \mathbb{E}_{x \sim \hat{P}(x)} P(y^{(s)} | x; \theta)]^2,    (7)

where λ is a hyperparameter controlling the tradeoff between the likelihood and the regularization term. We call this new learning scheme maximum likelihood training with marginal distribution regularization, since it adds a data-dependent regularization term to the original maximum likelihood training objective.

Importance Sampling with the Dual Model

To compute the expectation term E_{x∼P̂(x)} P(y|x; θ) in our regularizer, a technical challenge arises: this expectation is usually intractable due to the exponential search space of x. A straightforward way to address such a large search space problem is to build an approximate estimator by sampling from the full search space. That is, if we sample K sentences from the distribution P̂(x), an empirical estimate of E_{x∼P̂(x)} P(y|x; θ) can be computed as (1/K) Σ_{i=1}^{K} P(y|x_i; θ). However, since P(y|x; θ) is very sparse with respect to x, most of the samples drawn from P̂(x) yield a P(y|x; θ) very close to zero. Intuitively, given a certain y in the target language, it is almost impossible to sample an x from the empirical distribution P̂(x), even one that conforms well to a good source language model, such that x is exactly or close to the translation of y. In other words, most sentences sampled from P̂(x) are irrelevant to the sentence y. Consequently, the regularization term would be dominated by nearly zero-valued P(y|x; θ), which makes the constraint empirically ineffective at regularizing the translation model P(y|x; θ). Therefore, in order to make the constraint effective, we should obtain samples that achieve relatively large P(y|x), i.e., sampled sentences x that are relevant to the given sentence y. Inspired by the ideas of dual learning (He et al. 2016a) and back-translation (Cheng et al. 2016), we propose to obtain relevant source sentences x for a given target sentence y by sampling from a dual translation model P(x|y). In this way, we obtain constraints on P(y|x; θ) with large probability, making our constraint valid empirically.
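Before introducing the importance-sampling correction, the following minimal sketch (plain Python; the function names and numeric values are illustrative assumptions, not taken from the paper) shows how the objective in Eqn.(7) combines the parallel-data negative log-likelihood with the per-sentence marginal-distribution penalty. The estimate of E_{x∼P̂(x)} P(y|x; θ) is passed in as a plain number, independently of how it is obtained.

    import math

    def marginal_regularizer(log_p_y_lm, marginal_estimate):
        # One summand of the regularizer in Eqn. (6)/(7): the squared difference
        # between the language-model marginal log P^(y) and the estimated
        # log E_{x ~ P^(x)}[P(y | x; theta)].
        return (log_p_y_lm - math.log(marginal_estimate)) ** 2

    def regularized_loss(bilingual_nll, mono_terms, lam=0.05):
        # Eqn. (7): negative log-likelihood on the parallel corpus plus the
        # weighted sum of regularization terms over the monolingual corpus.
        # `mono_terms` is a list of (log P^(y), estimated expectation) pairs.
        reg = sum(marginal_regularizer(lp, est) for lp, est in mono_terms)
        return bilingual_nll + lam * reg

    # Toy usage with made-up numbers, only to show the shapes of the quantities
    # (lambda = 0.05 is the value the paper settles on later).
    print(regularized_loss(bilingual_nll=123.4,
                           mono_terms=[(-35.2, 4.1e-16), (-28.7, 9.3e-13)]))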
Since we sample from the distribution P(x|y) instead of P̂(x) when estimating E_{x∼P̂(x)} P(y|x; θ), we need to adjust our estimate accordingly:

    \mathbb{E}_{x \sim \hat{P}(x)} P(y|x; \theta) = \sum_{x \in \mathcal{X}} P(y|x; \theta) \hat{P}(x) = \sum_{x \in \mathcal{X}} \frac{P(y|x; \theta) \hat{P}(x)}{P(x|y)} P(x|y) = \mathbb{E}_{x \sim P(x|y)} \frac{P(y|x; \theta) \hat{P}(x)}{P(x|y)}.    (8)

That is, by making a multiplicative adjustment to P(y|x; θ), we compensate for sampling from P(x|y) instead of P̂(x). This procedure is exactly the technique of importance sampling (Cochran 1977; Hesterberg 1988; 1995). The importance sampling estimate of E_{x∼P̂(x)} P(y|x; θ) is then

    \frac{1}{K} \sum_{i=1}^{K} \frac{P(y|x_i; \theta) \hat{P}(x_i)}{P(x_i|y)}, \quad x_i \sim P(x|y),    (9)

where K is the sample size. Therefore, the regularization term can be calculated approximately as follows:

    S(\theta) \approx \sum_{s=1}^{S} \Big[ \log \hat{P}(y^{(s)}) - \log \frac{1}{K} \sum_{i=1}^{K} \frac{\hat{P}(x_i^{(s)}) P(y^{(s)} | x_i^{(s)}; \theta)}{P(x_i^{(s)} | y^{(s)})} \Big]^2.    (10)
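As a concrete illustration of Eqns.(9) and (10), the sketch below (plain Python with hypothetical probability values; not the paper's code) computes the importance-sampled estimate of E_{x∼P̂(x)} P(y|x; θ) from K samples drawn from the dual model, and then one summand of the regularizer.

    import math

    def importance_sampling_estimate(samples):
        # Eqn. (9): each sample x_i drawn from the dual model P(x | y) carries
        #   p_lm   = P^(x_i)          (source-side language model)
        #   p_fwd  = P(y | x_i; th)   (primal translation model being trained)
        #   p_dual = P(x_i | y)       (dual model, used as the proposal)
        k = len(samples)
        return sum(p_lm * p_fwd / p_dual for p_lm, p_fwd, p_dual in samples) / k

    # Two hypothetical samples for one monolingual target sentence y.
    samples = [
        (2.0e-9, 3.0e-4, 5.0e-3),   # (P^(x_1), P(y|x_1; theta), P(x_1|y))
        (7.0e-10, 1.0e-4, 2.0e-3),  # (P^(x_2), P(y|x_2; theta), P(x_2|y))
    ]
    estimate = importance_sampling_estimate(samples)

    log_p_y_lm = -21.0  # hypothetical log P^(y) from the target-side language model
    reg_term = (log_p_y_lm - math.log(estimate)) ** 2  # one summand of Eqn. (10)
    print(estimate, reg_term)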

Empirically, our training objective becomes:

    L(\theta) = -\sum_{n=1}^{N} \log P(y^{(n)} | x^{(n)}; \theta) + \lambda \sum_{s=1}^{S} \Big[ \log \hat{P}(y^{(s)}) - \log \frac{1}{K} \sum_{i=1}^{K} \frac{\hat{P}(x_i^{(s)}) P(y^{(s)} | x_i^{(s)}; \theta)}{P(x_i^{(s)} | y^{(s)})} \Big]^2.    (11)

Algorithm

We learn the model P(y|x; θ) by minimizing the weighted combination of the original loss function and the marginal distribution regularization term, as shown in Eqn.(11). The details of our proposed algorithm are shown in Algorithm 1. The input of this algorithm consists of a monolingual corpus M containing sentences from the target language B, a bilingual corpus B containing sentence pairs from language A and language B, the marginal distributions P̂(x) and P̂(y), and a pretrained dual model that can translate sentences from language B to language A. We denote by P(y|x; θ), parameterized by θ, the translation model we want to learn, and by P(x|y) the dual translation model used for sampling. During training, in each mini-batch we take m sentences from M and b sentence pairs from B. Then, for each sentence y from the monolingual corpus, we sample K sentences according to the dual translation model P(x|y). Next, we compute the gradient of the objective function with respect to the parameter θ and finally update θ.

Algorithm 1: Dual transfer learning with marginal distribution regularization
Require: Monolingual corpus M, bilingual corpus B, a dual translation model P(x|y), marginal distributions P̂(x) and P̂(y), hyperparameter λ, sample size K.
1: repeat
2:   Get a mini-batch of m monolingual sentences from M, and a mini-batch B_AB of b bilingual sentence pairs from B;
3:   For each sentence y in the monolingual mini-batch, sample K sentences x̂_1, ..., x̂_K according to the dual translation model P(x|y);
4:   Calculate the training objective L according to Eqn.(11) based on B_AB, the monolingual mini-batch and the corresponding sampled sentences;
5:   Update the parameter θ:  \theta \leftarrow \theta - \gamma \nabla_{\theta} L(\theta)    (12)
6: until model converged

Experiments

We conducted a set of experiments on two translation tasks to test the proposed method.

Settings

Datasets  We evaluated our approach on two translation tasks: English→French (En→Fr) and German→English (De→En). For the English→French task, we used a subset of the bilingual corpus from WMT'14 for training, which contains 12M sentence pairs. We concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the test set. The validation and test sets for English→French contain 6k and 3k sentence pairs respectively. We used the News Crawl: articles from 2012, provided by WMT'14, as monolingual data. For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training and 7k/6.5k sentence pairs for validation/test. The monolingual data for German→English was collected from the web.

Baseline Methods  We compared our approach with several strong baselines, including the well-known attention-based NMT system RNNSearch (Bahdanau, Cho, and Bengio 2015), a deep LSTM structure, and several semi-supervised NMT models:

Shallow fusion-NMT. This method incorporates a target-side language model, trained on monolingual corpora, into the translation model during decoding by rescoring the candidate sentences obtained through beam search (Gulcehre et al. 2015).

Pseudo-NMT. This method generates pseudo bilingual sentence pairs from monolingual corpora to assist training (Sennrich, Haddow, and Birch 2016). We used the same model to generate the pseudo bilingual sentence pairs as the dual model used for sampling in our method.

Dual-NMT.
This method reconstructs the monolingual data with both source-to-target and target-to-source translation models and jointly trains the two models with a dual learning objective (He et al. 2016a).

Marginal Distributions P̂(x) and P̂(y)  We used an LSTM-based language modeling approach to characterize the marginal distribution of a given sentence. For En→Fr, we used a single-layer LSTM with word embeddings of 512 dimensions and hidden states of 1024 dimensions. For De→En, we trained a language model with 512 dimensions for both word embeddings and hidden states. The language models were fixed during training. Both models were trained using Adam (Kingma and Ba 2014).

Implementation Details  For En→Fr translation, we implemented a basic single-layer RNNSearch model (Bahdanau, Cho, and Bengio 2015) to ensure a fair comparison with related work, and a deep LSTM model to see the improvement brought by our algorithm when combined with more recent techniques. For the basic RNNSearch model, we followed the same settings as in (Bahdanau, Cho, and Bengio 2015). Specifically, GRUs were used as the recurrent units, and the dimensions of the word embeddings and hidden states were 620 and 1000 respectively. We constructed the vocabulary with the most common 30K words in the parallel corpora; out-of-vocabulary words were replaced with a special token UNK. For the monolingual corpora, we removed sentences containing out-of-vocabulary words. In order to prevent over-fitting, we applied dropout during training (Zaremba, Sutskever, and Vinyals 2014), with a dropout probability of 0.1. For the deep LSTM model, the dimensions of the embeddings and hidden states were 512 and 1024 respectively, and both the encoder and the decoder had four stacked layers with residual connections (He et al. 2016b). We adopted byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2015) to split words into subwords, which efficiently handles rare words. For De→En translation, we implemented a two-layer LSTM model with both the word embedding dimension and the hidden state dimension set to 256. We applied dropout with probability 0.1, and also adopted BPE to split the words. Note that our algorithm needs a dual translation model: we trained an Fr→En NMT model and an En→De NMT model for this purpose.

Training Procedure  Following (Tu et al. 2017; He et al. 2016a), to speed up training, for each task we first trained NMT models on their own parallel corpora and then used them to initialize our algorithm. To obtain these initialization models, (1) for the single-layer RNNSearch model in English→French translation, we followed the same training procedure as proposed by (Jean et al. 2015); (2) for the deep LSTM architectures, we trained the model with mini-batch size 128 for En→Fr translation and 32 for De→En translation. Gradient clipping was used, with clipping values 1.0 and 2.5 for English→French and German→English respectively. Models were optimized by AdaDelta (Zeiler 2012) on M40 GPUs until convergence. For our algorithm, we used AdaDelta with mini-batches of 32 bilingual sentence pairs and 32 monolingual sentences for both tasks. The sample size K and the hyperparameter λ in our method were set to 2 and 0.05 respectively, according to the trade-off between validation performance and training time.

Evaluation Metrics  Translation quality was measured by case-insensitive BLEU (Papineni et al. 2002) as calculated by the multi-bleu.perl script (scripts/generic/multi-bleu.perl). A larger BLEU score indicates better translation quality. During testing, for the single-layer model in En→Fr translation we used beam search (Sutskever, Vinyals, and Le 2014) with beam size 12, as in many previous works; for the deep LSTM models, the beam size was set to 5.

Main Results

We report the experimental results in this subsection. Table 1 shows the results of our method and three semi-supervised baselines with the aligned network structure.

Table 1: BLEU scores on the En→Fr and De→En translation tasks, together with the improvement over the basic NMT model, which only uses bilingual data for training. The basic model for En→Fr is the RNNSearch model (Bahdanau, Cho, and Bengio 2015), and for De→En a two-layer LSTM model. Note that all methods for the same task share the same model structure.

    System                                             En→Fr    De→En
    Basic model
    Representative semi-supervised NMT systems
      Shallow fusion-NMT (Gulcehre et al. 2015)
      Pseudo-NMT (Sennrich, Haddow, and Birch 2016)
      Dual-NMT (He et al. 2016a)
    Our dual transfer learning system
      This work

Table 2: Performance of deep NMT systems on En→Fr translation.

    System                               System Configurations                                                       BLEU
    Representative deep NMT systems
      (Gehring et al. 2017)              layers CNN + BPE + 12M parallel data
      (Britz et al. 2017)                8-8 layers * 1024 size + BPE + 36M parallel data
      (Zhou et al. 2016)                 9-7 layers + PosUNK + 36M parallel data                                     39.2
    Our dual transfer learning systems
      this work                          4-4 layers LSTM + 512*1024 size + BPE + 12M parallel data
      this work                          layers LSTM + 512*1024 size + BPE + 12M parallel data + Monolingual Data
We can see that our dual transfer learning method outperforms all the baseline algorithms on both language pairs. For the translation from English to French, our method outperforms the RNNSearch model trained with the MLE objective by 2.93 points, and outperforms the strongest baseline, Dual-NMT, by 0.79 points. For the translation from German to English, our method outperforms the basic NMT model by 1.36 points, and outperforms Dual-NMT by 0.3 points. The improvements brought by our algorithm over the basic NMT model are significant. These results demonstrate the effectiveness of our algorithm.

Table 2 shows the comparison between our proposed algorithm and several deep NMT systems on the En→Fr translation task. We can see that, given a strong baseline, our algorithm still brings a significant improvement. This sets a new record on En→Fr translation with 12M bilingual data. We leave leveraging more bilingual/monolingual data as future work.

Given a parallel corpus, one may be curious about how many unlabeled sentences are most beneficial for improving translation quality. To answer this question, we investigated the impact of the unlabeled data ratio on translation quality, defined as the number of unlabeled sentences divided by the number of labeled sentence pairs in each mini-batch. Figure 1 shows the BLEU scores on the German→English validation set with different unlabeled data ratios. We constructed monolingual corpora with unlabeled data ratios from 0.2 to 1.2. We find that when the unlabeled data ratio is no more than 0.8, increasing it leads to apparent improvements in translation quality, while the improvement tends to be marginal as the ratio is increased further. Therefore, considering the balance between model performance and training time, we set the ratio to 1 in all other experiments.

Figure 1: Impact of the unlabeled data ratio on the German→English validation set (BLEU vs. unlabeled data ratio).

Impact of Hyperparameters

There are several hyperparameters in our marginal distribution regularization algorithm. In this subsection, we conduct experiments to investigate their impact.

Impact of λ  The hyperparameter λ is introduced to balance the MLE training objective and the regularization term in our algorithm. We conducted experiments on German→English translation to study its impact. We plot the validation BLEU scores for different values of λ in Figure 2 with respect to training iterations. From this figure, we can see that λ ∈ [0.005, 0.2] improves translation quality significantly and consistently over the baseline, and that λ = 0.05 gives the best performance; reducing or increasing λ from 0.05 hurts translation quality. Similar findings were also observed on the English→French dataset. Therefore, we set λ = 0.05 for all the experiments.

Figure 2: Impact of λ on the German→English validation set (BLEU vs. training iterations, ×10k).

Impact of sample size K  As the inference in our approach is intractable and a plain Monte Carlo sample is highly ineffective, we propose to use the dual model to sample the top-K list from the distribution P(x|y; θ_{y→x}). We conducted experiments on the IWSLT German→English dataset to study the impact of the sample size K. Intuitively, a larger sample size leads to better translation accuracy while increasing training time. To investigate the balance between translation performance and training efficiency, we trained our model with different sample sizes. Figure 3 shows the BLEU scores for various settings of K on the validation set with respect to training hours. From this figure, we observe that a smaller K leads to a more rapid increase of the BLEU score on the validation set, while limiting the potential to achieve a higher final accuracy. On the contrary, a larger K achieves a higher final accuracy while taking more time to reach good accuracy. Similar findings were also observed on the En→Fr dataset. Due to limited computation resources, we set K = 2 in all experiments.

Figure 3: Impact of the sample size K on the German→English validation set (BLEU vs. training time in hours, for different sample sizes).

Impact of the dual model for sampling  When training the model P(y|x; θ), we adopted the dual translation model P(x|y) to generate samples. We conducted several experiments with dual models of different qualities on German→English translation. We used En→De translation models with different test BLEU scores to sample sentences. As can be seen from Figure 4, using a dual model P(x|y) with a larger BLEU score for sampling generally leads to higher final accuracy. Therefore, we expect that we can further improve the accuracy if we are given a better dual model.

Figure 4: Impact of the dual translation model on the German→English validation set (validation BLEU vs. BLEU of the dual translation model on the En-De test set).

Related Work

Exploring monolingual data for machine translation has attracted intensive attention in recent years. The methods proposed for this purpose can be divided into three categories: (1) integrating a language model trained with monolingual data into the NMT model, (2) generating pseudo sentence pairs from monolingual data, and (3) jointly training both source-to-target and target-to-source translation models by minimizing reconstruction errors on monolingual sentences.

In the first category, a separately trained language model with monolingual data is integrated into the NMT model.

(Gulcehre et al. 2015) trained language models independently on target-side monolingual sentences, and incorporated them into the neural network during decoding, either by rescoring the beam or by adding the recurrent hidden state of the language model to the decoder states. (Jean et al. 2015) also reported experiments on reranking NMT outputs with a 5-gram language model. These methods only use monolingual data to train language models and improve NMT decoding, but do not touch the training of the NMT models.

In the second category, monolingual data is translated using a translation model trained from bilingual sentence pairs, and is paired with its translations to form a pseudo parallel corpus that enlarges the training data. Specifically, (Bertoldi and Federico 2009; Lambert et al. 2011) back-translated target-side monolingual data into source-side sentences to produce synthetic parallel data for phrase-based SMT. A similar approach has also been applied to NMT, and back-translated synthetic parallel data has been found to have a more general use in NMT than in SMT, with positive effects that go beyond domain adaptation (Sennrich, Haddow, and Birch 2016). (Ueffing, Haffari, and Sarkar 2007) iteratively translated source-side monolingual data and added the reliable translations to the training data of an SMT system, thus improving the translation model from its own translations. For these methods, there is no guarantee on the quality of the generated pseudo bilingual sentence pairs, which may limit the performance gain.

In the third category, the monolingual data is reconstructed with both source-to-target and target-to-source translation models, and the two models are jointly trained. (He et al. 2016a) proposed dual learning for NMT, in which two translation models teach each other through a reinforcement learning process, based on the feedback signals generated during this process. (Cheng et al. 2016) proposed to append a reconstruction term to the training objective, which aims to reconstruct the observed monolingual corpora using an autoencoder. To some extent, the reconstruction methods can be seen as an iterative extension of the method of (Sennrich, Haddow, and Birch 2016), since after updating the model parameters on the pseudo parallel corpus, the learned models are used to produce a better pseudo corpus (Cheng et al. 2016). Different from these methods, which focus on the reconstruction of monolingual sentences, our approach focuses on the endogenous probabilistic connection between the marginal distribution of monolingual data and the conditional distribution represented by the translation model, and is in this sense more principled.

Transfer learning is a broad research direction in machine learning. Different from most transfer learning methods (Raina et al. 2007; Long et al. 2015; 2016), our algorithm leverages the dual structure of machine translation and achieves knowledge transfer through data sampling.

Conclusion

In this paper, we have proposed a new method, dual transfer learning, to leverage monolingual corpora from a probabilistic perspective for neural machine translation.
The central idea is to exploit, via the law of total probability, the probabilistic connection between the marginal distribution of monolingual data and the conditional distribution defined by the translation model. A data-dependent regularization term is introduced to guide the training procedure towards satisfying this probabilistic connection, and the key technical challenge is addressed by using the dual translation model for importance sampling. Experiments on English→French and German→English translation tasks show that our approach achieves significant improvements over baseline methods.

For future work, we plan to apply our method to more applications, such as speech recognition and image captioning. Furthermore, we will enrich the theoretical study to better understand dual transfer learning with marginal distribution regularization. We will also investigate the limits of our approach with respect to increasing the size of the monolingual data as well as the sample size K.

Acknowledgements

This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2016YFB) and the National Natural Science Foundation of China.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Bertoldi, N., and Federico, M. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation. Association for Computational Linguistics.
Britz, D.; Goldie, A.; Luong, T.; and Le, Q. 2017. Massive exploration of neural machine translation architectures. In ACL.
Cettolo, M.; Niehues, J.; Stüker, S.; Bentivogli, L.; and Federico, M. 2014. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam.
Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Semi-supervised learning for neural machine translation. arXiv preprint.
Cochran, W. G. 1977. Sampling Techniques. John Wiley.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint.
Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint.
He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016a. Dual learning for machine translation. In Advances in Neural Information Processing Systems.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
He, D.; Lu, H.; Xia, Y.; Qin, T.; Wang, L.; and Liu, T.-Y. 2017. Decoding with value networks for neural machine translation. In Advances in Neural Information Processing Systems.
Hesterberg, T. C. 1988. Advances in Importance Sampling. Ph.D. Dissertation, Stanford University.
Hesterberg, T. 1995. Weighted average importance sampling and defensive mixture distributions. Technometrics 37(2).
Jean, S.; Firat, O.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation.
Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint.
Lambert, P.; Schwenk, H.; Servan, C.; and Abdul-Rauf, S. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics.
Long, M.; Cao, Y.; Wang, J.; and Jordan, M. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning.
Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2016. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Raina, R.; Battle, A.; Lee, H.; Packer, B.; and Ng, A. Y. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning. ACM.
Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Improving neural machine translation models with monolingual data. In Annual Meeting of the Association for Computational Linguistics.
Sundermeyer, M.; Schlüter, R.; and Ney, H. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
Tu, Z.; Liu, Y.; Shang, L.; Liu, X.; and Li, H. 2017. Neural machine translation with reconstruction. In AAAI.
Ueffing, N.; Haffari, G.; and Sarkar, A. 2007. Semi-supervised model adaptation for statistical machine translation. Machine Translation 21(2).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. arXiv preprint.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint.
Xia, Y.; Bian, J.; Qin, T.; Yu, N.; and Liu, T.-Y. 2017a. Dual inference for machine learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
Xia, Y.; Qin, T.; Chen, W.; Bian, J.; Yu, N.; and Liu, T.-Y. 2017b. Dual supervised learning. In International Conference on Machine Learning.
Xia, Y.; Tian, F.; Wu, L.; Lin, J.; Qin, T.; and Liu, T.-Y. 2017c. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems.
Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint.
Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint.
Zhou, J.; Cao, Y.; Wang, X.; Li, P.; and Xu, W. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint.


More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

The Strong Minimalist Thesis and Bounded Optimality

The Strong Minimalist Thesis and Bounded Optimality The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this

More information

A Review: Speech Recognition with Deep Learning Methods

A Review: Speech Recognition with Deep Learning Methods Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 5, May 2015, pg.1017

More information

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL XXX, NO. XXX,

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL XXX, NO. XXX, IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL XXX, NO. XXX, 2017 1 Small-footprint Highway Deep Neural Networks for Speech Recognition Liang Lu Member, IEEE, Steve Renals Fellow,

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen TRANSFER LEARNING OF WEAKLY LABELLED AUDIO Aleksandr Diment, Tuomas Virtanen Tampere University of Technology Laboratory of Signal Processing Korkeakoulunkatu 1, 33720, Tampere, Finland firstname.lastname@tut.fi

More information

Distributed Learning of Multilingual DNN Feature Extractors using GPUs

Distributed Learning of Multilingual DNN Feature Extractors using GPUs Distributed Learning of Multilingual DNN Feature Extractors using GPUs Yajie Miao, Hao Zhang, Florian Metze Language Technologies Institute, School of Computer Science, Carnegie Mellon University Pittsburgh,

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

arxiv: v3 [cs.cl] 24 Apr 2017

arxiv: v3 [cs.cl] 24 Apr 2017 A Network-based End-to-End Trainable Task-oriented Dialogue System Tsung-Hsien Wen 1, David Vandyke 1, Nikola Mrkšić 1, Milica Gašić 1, Lina M. Rojas-Barahona 1, Pei-Hao Su 1, Stefan Ultes 1, and Steve

More information

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING Gábor Gosztolya 1, Tamás Grósz 1, László Tóth 1, David Imseng 2 1 MTA-SZTE Research Group on Artificial

More information

Dialog-based Language Learning

Dialog-based Language Learning Dialog-based Language Learning Jason Weston Facebook AI Research, New York. jase@fb.com arxiv:1604.06045v4 [cs.cl] 20 May 2016 Abstract A long-term goal of machine learning research is to build an intelligent

More information

Learning to Rank with Selection Bias in Personal Search

Learning to Rank with Selection Bias in Personal Search Learning to Rank with Selection Bias in Personal Search Xuanhui Wang, Michael Bendersky, Donald Metzler, Marc Najork Google Inc. Mountain View, CA 94043 {xuanhui, bemike, metzler, najork}@google.com ABSTRACT

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

Knowledge Transfer in Deep Convolutional Neural Nets

Knowledge Transfer in Deep Convolutional Neural Nets Knowledge Transfer in Deep Convolutional Neural Nets Steven Gutstein, Olac Fuentes and Eric Freudenthal Computer Science Department University of Texas at El Paso El Paso, Texas, 79968, U.S.A. Abstract

More information

Speech Emotion Recognition Using Support Vector Machine

Speech Emotion Recognition Using Support Vector Machine Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,

More information

Summarizing Answers in Non-Factoid Community Question-Answering

Summarizing Answers in Non-Factoid Community Question-Answering Summarizing Answers in Non-Factoid Community Question-Answering Hongya Song Zhaochun Ren Shangsong Liang hongya.song.sdu@gmail.com zhaochun.ren@ucl.ac.uk shangsong.liang@ucl.ac.uk Piji Li Jun Ma Maarten

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

Exploration. CS : Deep Reinforcement Learning Sergey Levine

Exploration. CS : Deep Reinforcement Learning Sergey Levine Exploration CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 4 due on Wednesday 2. Project proposal feedback sent Today s Lecture 1. What is exploration? Why is it a problem?

More information

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION Atul Laxman Katole 1, Krishna Prasad Yellapragada 1, Amish Kumar Bedi 1, Sehaj Singh Kalra 1 and Mynepalli Siva Chaitanya 1 1 Samsung

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Semantic Segmentation with Histological Image Data: Cancer Cell vs. Stroma

Semantic Segmentation with Histological Image Data: Cancer Cell vs. Stroma Semantic Segmentation with Histological Image Data: Cancer Cell vs. Stroma Adam Abdulhamid Stanford University 450 Serra Mall, Stanford, CA 94305 adama94@cs.stanford.edu Abstract With the introduction

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information

A Reinforcement Learning Variant for Control Scheduling

A Reinforcement Learning Variant for Control Scheduling A Reinforcement Learning Variant for Control Scheduling Aloke Guha Honeywell Sensor and System Development Center 3660 Technology Drive Minneapolis MN 55417 Abstract We present an algorithm based on reinforcement

More information

Using Deep Convolutional Neural Networks in Monte Carlo Tree Search

Using Deep Convolutional Neural Networks in Monte Carlo Tree Search Using Deep Convolutional Neural Networks in Monte Carlo Tree Search Tobias Graf (B) and Marco Platzner University of Paderborn, Paderborn, Germany tobiasg@mail.upb.de, platzner@upb.de Abstract. Deep Convolutional

More information

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Proceedings of 28 ISFA 28 International Symposium on Flexible Automation Atlanta, GA, USA June 23-26, 28 ISFA28U_12 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Amit Gil, Helman Stern, Yael Edan, and

More information

SORT: Second-Order Response Transform for Visual Recognition

SORT: Second-Order Response Transform for Visual Recognition SORT: Second-Order Response Transform for Visual Recognition Yan Wang 1, Lingxi Xie 2( ), Chenxi Liu 2, Siyuan Qiao 2 Ya Zhang 1( ), Wenjun Zhang 1, Qi Tian 3, Alan Yuille 2 1 Cooperative Medianet Innovation

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Word Embedding Based Correlation Model for Question/Answer Matching

Word Embedding Based Correlation Model for Question/Answer Matching Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Word Embedding Based Correlation Model for Question/Answer Matching Yikang Shen, 1 Wenge Rong, 2 Nan Jiang, 2 Baolin

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

How to read a Paper ISMLL. Dr. Josif Grabocka, Carlotta Schatten

How to read a Paper ISMLL. Dr. Josif Grabocka, Carlotta Schatten How to read a Paper ISMLL Dr. Josif Grabocka, Carlotta Schatten Hildesheim, April 2017 1 / 30 Outline How to read a paper Finding additional material Hildesheim, April 2017 2 / 30 How to read a paper How

More information

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon

More information

arxiv: v4 [cs.cv] 13 Aug 2017

arxiv: v4 [cs.cv] 13 Aug 2017 Ruben Villegas 1 * Jimei Yang 2 Yuliang Zou 1 Sungryull Sohn 1 Xunyu Lin 3 Honglak Lee 1 4 arxiv:1704.05831v4 [cs.cv] 13 Aug 17 Abstract We propose a hierarchical approach for making long-term predictions

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

A deep architecture for non-projective dependency parsing

A deep architecture for non-projective dependency parsing Universidade de São Paulo Biblioteca Digital da Produção Intelectual - BDPI Departamento de Ciências de Computação - ICMC/SCC Comunicações em Eventos - ICMC/SCC 2015-06 A deep architecture for non-projective

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

arxiv: v2 [cs.cl] 26 Mar 2015

arxiv: v2 [cs.cl] 26 Mar 2015 Effective Use of Word Order for Text Categorization with Convolutional Neural Networks Rie Johnson RJ Research Consulting Tarrytown, NY, USA riejohnson@gmail.com Tong Zhang Baidu Inc., Beijing, China Rutgers

More information