A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher
The University of Tokyo {hassy, tsuruoka}@logos.t.u-tokyo.ac.jp
Salesforce Research {cxiong, rsocher}@salesforce.com
(Work was done while the first author was an intern at Salesforce Research. Corresponding author: Kazuma Hashimoto.)

Abstract

Transfer and multi-task learning have traditionally focused on either a single source-target pair or very few, similar tasks. Ideally, the linguistic levels of morphology, syntax and semantics would benefit each other by being trained in a single model. We introduce a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. Higher layers include shortcut connections to lower-level task predictions to reflect linguistic hierarchies. We use a simple regularization term to allow for optimizing all model weights to improve one task's loss without exhibiting catastrophic interference of the other tasks. Our single end-to-end model obtains state-of-the-art or competitive results on five different tasks from tagging, parsing, relatedness, and entailment tasks.

1 Introduction

The potential for leveraging multiple levels of representation has been demonstrated in various ways in the field of Natural Language Processing (NLP). For example, Part-Of-Speech (POS) tags are used for syntactic parsers. The parsers are used to improve higher-level tasks, such as natural language inference (Chen et al., 2016) and machine translation (Eriguchi et al., 2016). These systems are often pipelines and not trained end-to-end. Deep NLP models have not yet shown benefits from predicting many increasingly complex tasks, each at a successively deeper layer. Existing models often ignore linguistic hierarchies by predicting different tasks either entirely separately or at the same depth (Collobert et al., 2011).

[Figure 1: Overview of the joint many-task model predicting different linguistic outputs at successively deeper layers. For each of the two input sentences, the layers from bottom to top are: word representations, POS, CHUNK, and DEP (word and syntactic levels), then the Relatedness encoder and the Entailment encoder (semantic level).]

We introduce a Joint Many-Task (JMT) model, outlined in Figure 1, which predicts increasingly complex NLP tasks at successively deeper layers. Unlike traditional pipeline systems, our single JMT model can be trained end-to-end for POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment, by considering linguistic hierarchies. We propose an adaptive training and regularization strategy to grow this model in its depth. With the help of this strategy we avoid catastrophic interference between the tasks. Our model is motivated by Søgaard and Goldberg (2016), who showed that predicting two different tasks is more accurate when performed in different layers than in the same layer (Collobert et al., 2011). Experimental results show that our single model achieves competitive results for all of the five different tasks, demonstrating that using linguistic hierarchies is more important than handling different tasks in the same layer.
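To make the layered architecture in Figure 1 concrete, the sketch below shows how the first two task layers could be stacked, with the shared word representation fed to every layer through shortcut connections and each layer's weighted label embedding passed upward. This is a minimal PyTorch-style illustration of the data flow only, not the authors' implementation; the ReLU hidden layers before the classifiers are omitted, and all module and variable names are our own.

```python
import torch
import torch.nn as nn

class JMTSketch(nn.Module):
    """Minimal sketch of the joint many-task data flow (not the original code)."""
    def __init__(self, d_word=100, d_hid=100, n_pos=45, n_chunk=23):
        super().__init__()
        # One bi-LSTM per task layer; inputs grow because of the shortcut
        # connection (word vectors x_t) and the label embedding from below.
        self.pos_lstm = nn.LSTM(d_word, d_hid, bidirectional=True, batch_first=True)
        self.chunk_lstm = nn.LSTM(2 * d_hid + d_word + d_hid, d_hid,
                                  bidirectional=True, batch_first=True)
        self.pos_out = nn.Linear(2 * d_hid, n_pos)
        self.chunk_out = nn.Linear(2 * d_hid, n_chunk)
        self.pos_label_emb = nn.Linear(n_pos, d_hid, bias=False)  # label embeddings l(j)

    def forward(self, x):                       # x: (batch, len, d_word) word+char vectors
        h_pos, _ = self.pos_lstm(x)             # layer 1: POS tagging
        p_pos = torch.softmax(self.pos_out(h_pos), dim=-1)
        y_pos = self.pos_label_emb(p_pos)       # weighted POS label embedding
        g2 = torch.cat([h_pos, x, y_pos], dim=-1)   # shortcut connection to x
        h_chunk, _ = self.chunk_lstm(g2)        # layer 2: chunking
        p_chunk = torch.softmax(self.chunk_out(h_chunk), dim=-1)
        # Deeper layers (DEP, relatedness, entailment) follow the same pattern.
        return p_pos, p_chunk
```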

2 The Joint Many-Task Model

This section describes the inference procedure of our model, beginning at the lowest level and working up to higher layers and more complex tasks; our model handles the five different tasks in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment, by considering linguistic hierarchies. The POS tags are used for chunking, and the chunking tags are used for dependency parsing (Attardi and Dell'Orletta, 2008). Tai et al. (2015) have shown that dependencies improve the relatedness task. The relatedness and entailment tasks are closely related to each other: if the semantic relatedness between two sentences is very low, they are unlikely to entail each other. Based on this observation, we make use of the information from the relatedness task for improving the entailment task.

2.1 Word Representations

For each word w_t in the input sentence s of length L, we use two types of embeddings.

Word embeddings: We use Skip-gram (Mikolov et al., 2013) to train word embeddings.

Character embeddings: Character n-gram embeddings are trained by the same Skip-gram objective.¹ We construct the character n-gram vocabulary in the training data and assign an embedding for each entry. The final character embedding is the average of the unique character n-gram embeddings of w_t. For example, the character n-grams (n = 1, 2, 3) of the word "Cats" are {C, a, t, s, #B#C, Ca, at, ts, s#E#, #B#Ca, Cat, ats, ts#E#}, where "#B#" and "#E#" represent the beginning and the end of each word, respectively. Using the character embeddings efficiently provides morphological features. Each word is subsequently represented as x_t, the concatenation of its corresponding word and character embeddings, shared across the tasks.

¹ Bojanowski et al. (2017) previously proposed to train the character n-gram embeddings by the Skip-gram objective.
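As a concrete illustration of the character n-gram construction above, the sketch below enumerates the n-grams of a word with the #B#/#E# boundary markers and averages their embeddings. The embedding table and dimensionality are placeholders of our own, not the released pre-trained vectors.

```python
import numpy as np

def char_ngrams(word, n_max=3):
    """Unique character n-grams (n = 1..n_max) of a word, with #B#/#E# boundary markers."""
    chars = ["#B#"] + list(word) + ["#E#"]
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.add("".join(chars[i:i + n]))
    grams.discard("#B#")  # the markers alone are not used as n-grams
    grams.discard("#E#")
    return grams

def char_embedding(word, emb_table, dim=100):
    """Average the embeddings of the word's unique character n-grams (unknown n-grams are skipped)."""
    vecs = [emb_table[g] for g in char_ngrams(word) if g in emb_table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Example: the n-grams of "Cats" include 'C', 'at', 'ts', '#B#Ca', 'ats', 'ts#E#', ...
print(sorted(char_ngrams("Cats")))
```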
2.2 Word-Level Task: POS Tagging

The first layer of the model is a bi-directional LSTM (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997) whose hidden states are used to predict POS tags. We use the following Long Short-Term Memory (LSTM) units for the forward direction:

i_t = σ(W_i g_t + b_i),
f_t = σ(W_f g_t + b_f),
u_t = tanh(W_u g_t + b_u),
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1},      (1)
o_t = σ(W_o g_t + b_o),
h_t = o_t ⊙ tanh(c_t),

where we define the input g_t as g_t = [h_{t-1}; x_t], i.e. the concatenation of the previous hidden state and the word representation of w_t. The backward pass is expanded in the same way, but a different set of weights is used.

For predicting the POS tag of w_t, we use the concatenation of the forward and backward states in a one-layer bi-LSTM layer corresponding to the t-th word: h_t^(1) = [h→_t^(1); h←_t^(1)]. Then each h_t^(1) (1 ≤ t ≤ L) is fed into a standard softmax classifier with a single ReLU layer which outputs the probability vector y^(1) for each of the POS tags.

2.3 Word-Level Task: Chunking

Chunking is also a word-level classification task which assigns a chunking tag (B-NP, I-VP, etc.) to each word. The tags specify the regions of major phrases (e.g., noun phrases) in the sentence. Chunking is performed in the second bi-LSTM layer on top of the POS layer. When stacking the bi-LSTM layers, we use Eq. (1) with input g_t^(2) = [h_{t-1}^(2); h_t^(1); x_t; y_t^(pos)], where h_t^(1) is the hidden state of the first (POS) layer. We define the weighted label embedding y_t^(pos) as follows:

y_t^(pos) = Σ_{j=1}^{C} p(y_t^(1) = j | h_t^(1)) l(j),      (2)

where C is the number of POS tags, p(y_t^(1) = j | h_t^(1)) is the probability value that the j-th POS tag is assigned to w_t, and l(j) is the corresponding label embedding. The probability values are predicted by the POS layer, and thus no gold POS tags are needed. This output embedding is similar to the K-best POS tag feature which has been shown to be effective in syntactic tasks (Andor et al., 2016; Alberti et al., 2015). For predicting the chunking tags, we employ the same strategy as POS tagging by using the concatenated bidirectional hidden states h_t^(2) = [h→_t^(2); h←_t^(2)] in the chunking layer. We also use a single ReLU hidden layer before the softmax classifier.
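A small sketch of the weighted label embedding in Eq. (2): the predicted label distribution mixes the label embedding vectors, so no gold tags are needed at higher layers. Shapes and names here are our own; the same operation is reused for the chunking and relatedness labels.

```python
import torch

def weighted_label_embedding(probs, label_emb):
    """
    probs:     (batch, len, C)  predicted label distribution from the layer below
    label_emb: (C, d_label)     one embedding l(j) per label
    returns    (batch, len, d_label), the expectation sum_j p(j) * l(j)  -- Eq. (2)
    """
    return probs @ label_emb  # the matrix product sums over the C labels

# Example: 45 POS tags mixed into 100-dimensional label embeddings.
p = torch.softmax(torch.randn(2, 7, 45), dim=-1)
l = torch.randn(45, 100)
y_pos = weighted_label_embedding(p, l)   # shape (2, 7, 100)
```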

2.4 Syntactic Task: Dependency Parsing

Dependency parsing identifies syntactic relations (such as an adjective modifying a noun) between pairs of words in a sentence. We use the third bi-LSTM layer to classify relations between all pairs of words. The input vector for the LSTM includes hidden states, word representations, and the label embeddings for the two previous tasks: g_t^(3) = [h_{t-1}^(3); h_t^(2); x_t; (y_t^(pos) + y_t^(chk))], where we compute the chunking vector in a similar fashion as the POS vector in Eq. (2).

We predict the parent node (head) for each word, and then a dependency label is predicted for each child-parent pair. This approach is related to Dozat and Manning (2017) and Zhang et al. (2017), where the main difference is that our model works in a multi-task framework. To predict the parent node of w_t, we define a matching function between w_t and the candidates of the parent node as m(t, j) = h_t^(3)ᵀ (W_d h_j^(3)), where W_d is a parameter matrix. For the root, we define h_{L+1}^(3) = r as a parameterized vector. To compute the probability that w_j (or the root node) is the parent of w_t, the scores are normalized:

p(j | h_t^(3)) = exp(m(t, j)) / Σ_{k=1, k≠t}^{L+1} exp(m(t, k)).      (3)

The dependency labels are predicted using [h_t^(3); h_j^(3)] as input to a softmax classifier with a single ReLU layer. We greedily select the parent node and the dependency label for each word. When the parsing result is not a well-formed tree, we apply the first-order Eisner algorithm (Eisner, 1996) to obtain a well-formed tree from it.
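The bilinear scoring and normalization in Eq. (3) can be sketched as follows: a learned root vector is appended as an extra candidate, self-attachment is masked out, and heads are chosen greedily (the Eisner repair step is omitted). This is our own illustration under assumed shapes, not the authors' code.

```python
import torch

def head_distribution(h, W_d, root):
    """
    h:    (L, d)  third-layer hidden states for one sentence
    W_d:  (d, d)  bilinear parameter matrix
    root: (d,)    parameterized root vector h_{L+1}
    returns (L, L+1): p(j | h_t) over candidate heads (last column = root) -- Eq. (3)
    """
    cand = torch.cat([h, root.unsqueeze(0)], dim=0)   # (L+1, d) candidate heads
    scores = h @ W_d @ cand.T                         # m(t, j) = h_t^T W_d h_j
    L = h.size(0)
    scores[torch.arange(L), torch.arange(L)] = float("-inf")  # a word cannot head itself
    return torch.softmax(scores, dim=-1)

# Greedy head selection for a 6-word sentence with 200-dimensional states.
h, W_d, r = torch.randn(6, 200), torch.randn(200, 200), torch.randn(200)
heads = head_distribution(h, W_d, r).argmax(dim=-1)   # index 6 means "root"
```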
2.5 Semantic Task: Semantic Relatedness

The next two tasks model the semantic relationships between two input sentences. The first task measures the semantic relatedness between two sentences; the output is a real-valued relatedness score for the input sentence pair. The second task is textual entailment, which requires one to determine whether a premise sentence entails a hypothesis sentence. There are typically three classes: entailment, contradiction, and neutral.

We use the fourth and fifth bi-LSTM layers for the relatedness and entailment tasks, respectively. Now it is required to obtain the sentence-level representation rather than the word-level representations h_t^(4) used in the first three tasks. We compute the sentence-level representation h_s^(4) as the element-wise maximum value across all of the word-level representations in the fourth layer:

h_s^(4) = max(h_1^(4), h_2^(4), ..., h_L^(4)).      (4)

This max-pooling technique has proven effective in text classification tasks (Lai et al., 2015).

To model the semantic relatedness between sentences s and s', we follow Tai et al. (2015). The feature vector representing the semantic relatedness is computed as follows:

d_1(s, s') = [ |h_s^(4) − h_s'^(4)| ; h_s^(4) ⊙ h_s'^(4) ],      (5)

where |h_s^(4) − h_s'^(4)| is the absolute value of the element-wise subtraction, and h_s^(4) ⊙ h_s'^(4) is the element-wise multiplication. Then d_1(s, s') is fed into a softmax classifier with a single Maxout hidden layer (Goodfellow et al., 2013) to output a relatedness score (from 1 to 5 in our case).

2.6 Semantic Task: Textual Entailment

For entailment classification, we also use the max-pooling technique as in the semantic relatedness task. To classify the premise-hypothesis pair (s, s') into one of the three classes, we compute the feature vector d_2(s, s') as in Eq. (5), except that we do not use the absolute value of the element-wise subtraction, because we need to identify which sentence is the premise (or hypothesis). Then d_2(s, s') is fed into a softmax classifier.

To use the output from the relatedness layer directly, we use the label embeddings for the relatedness task. More concretely, we compute the class label embeddings for the semantic relatedness task, similar to Eq. (2). The final feature vectors that are concatenated and fed into the entailment classifier are the weighted relatedness label embedding and the feature vector d_2(s, s'). We use three Maxout hidden layers before the classifier.
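The max pooling of Eq. (4) and the pair features d_1/d_2 of Eq. (5) are compact enough to sketch directly; the Maxout and softmax classifier layers are omitted, and all names are our own.

```python
import torch

def sentence_repr(h):
    """Eq. (4): element-wise max over word-level states h of shape (L, d)."""
    return h.max(dim=0).values

def pair_features(h_s1, h_s2, use_abs_diff=True):
    """
    Eq. (5): relatedness features d_1 use |s1 - s2| and s1 * s2 (symmetric);
    entailment features d_2 keep the signed difference so premise and
    hypothesis remain distinguishable.
    """
    s1, s2 = sentence_repr(h_s1), sentence_repr(h_s2)
    diff = (s1 - s2).abs() if use_abs_diff else s1 - s2
    return torch.cat([diff, s1 * s2], dim=-1)

d1 = pair_features(torch.randn(9, 200), torch.randn(7, 200))                     # relatedness
d2 = pair_features(torch.randn(9, 200), torch.randn(7, 200), use_abs_diff=False)  # entailment
```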

3 Training the JMT Model

The model is trained jointly over all datasets. During each epoch, the optimization iterates over each full training dataset in the same order as the corresponding tasks described in the modeling section.

3.1 Pre-Training Word Representations

We pre-train the word embeddings using the Skip-gram model with negative sampling (Mikolov et al., 2013). We also pre-train the character n-gram embeddings using Skip-gram.² The only difference is that each input word embedding is replaced with its corresponding average character n-gram embedding described in Section 2.1. These embeddings are fine-tuned during the model training. We denote the embedding parameters as θ_e.

² The training code and the pre-trained embeddings are available at charngram2vec.

3.2 Training the POS Layer

Let θ_POS = (W_POS, b_POS, θ_e) denote the set of model parameters associated with the POS layer, where W_POS is the set of the weight matrices in the first bi-LSTM and the classifier, and b_POS is the set of the bias vectors. The objective function to optimize θ_POS is defined as follows:

J_1(θ_POS) = −Σ_t log p(y_t^(1) = α | h_t^(1)) + λ‖W_POS‖² + δ‖θ_e − θ_e'‖²,      (6)

where p(y_t^(1) = α | h_t^(1)) is the probability value that the correct label α is assigned to w_t in the sentence, λ‖W_POS‖² is the L2-norm regularization term, and λ is a hyperparameter.

We call the second regularization term, δ‖θ_e − θ_e'‖², a successive regularization term. The successive regularization is based on the idea that we do not want the model to forget the information learned for the other tasks. In the case of POS tagging, the regularization is applied to θ_e, and θ_e' is the embedding parameter after training the final task in the top-most layer at the previous training epoch. δ is a hyperparameter.

3.3 Training the Chunking Layer

The objective function is defined as follows:

J_2(θ_chk) = −Σ_t log p(y_t^(2) = α | h_t^(2)) + λ‖W_chk‖² + δ‖θ_POS − θ_POS'‖²,      (7)

which is similar to that of POS tagging, and θ_chk is (W_chk, b_chk, E_POS, θ_e), where W_chk and b_chk are the weight and bias parameters including those in θ_POS, and E_POS is the set of the POS label embeddings. θ_POS' is the one after training the POS layer at the current training epoch.

3.4 Training the Dependency Layer

The objective function is defined as follows:

J_3(θ_dep) = −Σ_t log p(α | h_t^(3)) p(β | h_t^(3), h_α^(3)) + λ(‖W_dep‖² + ‖W_d‖²) + δ‖θ_chk − θ_chk'‖²,      (8)

where p(α | h_t^(3)) is the probability value assigned to the correct parent node α for w_t, and p(β | h_t^(3), h_α^(3)) is the probability value assigned to the correct dependency label β for the child-parent pair (w_t, α). θ_dep is defined as (W_dep, b_dep, W_d, r, E_POS, E_chk, θ_e), where W_dep and b_dep are the weight and bias parameters including those in θ_chk, and E_chk is the set of the chunking label embeddings.

3.5 Training the Relatedness Layer

Following Tai et al. (2015), the objective function is defined as follows:

J_4(θ_rel) = Σ_{(s,s')} KL( p̂(s, s') ‖ p(h_s^(4), h_s'^(4)) ) + λ‖W_rel‖² + δ‖θ_dep − θ_dep'‖²,      (9)

where p̂(s, s') is the gold distribution over the defined relatedness scores, p(h_s^(4), h_s'^(4)) is the predicted distribution given the sentence representations, and KL(p̂(s, s') ‖ p(h_s^(4), h_s'^(4))) is the KL-divergence between the two distributions. θ_rel is defined as (W_rel, b_rel, E_POS, E_chk, θ_e).

3.6 Training the Entailment Layer

The objective function is defined as follows:

J_5(θ_ent) = −Σ_{(s,s')} log p(y_{(s,s')}^(5) = α | h_s^(5), h_s'^(5)) + λ‖W_ent‖² + δ‖θ_rel − θ_rel'‖²,      (10)

where p(y_{(s,s')}^(5) = α | h_s^(5), h_s'^(5)) is the probability value that the correct label α is assigned to the premise-hypothesis pair (s, s'). θ_ent is defined as (W_ent, b_ent, E_POS, E_chk, E_rel, θ_e), where E_rel is the set of the relatedness label embeddings.
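The successive regularization term δ‖θ − θ'‖² that appears in Eqs. (6)-(10) is simply a squared distance to a snapshot of the parameters taken before the current task's update. A minimal sketch, assuming PyTorch autograd and our own naming:

```python
import torch

def successive_regularization(named_params, snapshot, delta):
    """
    delta * || theta - theta' ||^2, where `snapshot` holds detached copies of the
    parameters taken after the previous task (or the previous epoch) was trained.
    Added to the task loss so updating one task does not drift far from what the
    earlier tasks learned.
    """
    penalty = 0.0
    for name, p in named_params:
        if name in snapshot:
            penalty = penalty + (p - snapshot[name]).pow(2).sum()
    return delta * penalty

# Usage sketch (model, lower_layer_names, task_nll, lam and l2_term are placeholders):
# snapshot = {n: p.detach().clone() for n, p in model.named_parameters() if n in lower_layer_names}
# loss = task_nll + lam * l2_term + successive_regularization(model.named_parameters(), snapshot, delta=1e-2)
```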

4 Related Work

Many deep learning approaches have proven to be effective in a variety of NLP tasks and are becoming more and more complex. They are typically designed to handle single tasks, or some of them are designed as general-purpose models (Kumar et al., 2016; Sutskever et al., 2014) but applied to different tasks independently.

For handling multiple NLP tasks, multi-task learning models with deep neural networks have been proposed (Collobert et al., 2011; Luong et al., 2016), and more recently Søgaard and Goldberg (2016) have suggested that using different layers for different tasks is more effective than using the same layer when jointly learning closely-related tasks, such as POS tagging and chunking. However, the number of tasks was limited, or the tasks had very similar settings like word-level tagging, and it was not clear how lower-level tasks could also be improved by combining higher-level tasks. More related to our work, Godwin et al. (2016) also followed Søgaard and Goldberg (2016) to jointly learn POS tagging, chunking, and language modeling, and Zhang and Weiss (2016) have shown that it is effective to jointly learn POS tagging and dependency parsing by sharing internal representations. In the field of relation extraction, Miwa and Bansal (2016) proposed a joint learning model for entity detection and relation extraction. All of these suggest the importance of multi-task learning, and we investigate the potential of handling different types of NLP tasks, rather than closely-related ones, in a single hierarchical deep model.

In the field of computer vision, some transfer and multi-task learning approaches have also been proposed (Li and Hoiem, 2016; Misra et al., 2016). For example, Misra et al. (2016) proposed a multi-task learning model to handle different tasks. However, they assume that each data sample has annotations for the different tasks, and do not explicitly consider task hierarchies.

Recently, Rusu et al. (2016) have proposed a progressive neural network model to handle multiple reinforcement learning tasks, such as Atari games. Like our JMT model, their model is also successively trained according to different tasks using different layers, called columns in their paper. In their model, once the first task is completed, the model parameters for the first task are fixed, and then the second task is handled with new model parameters. Therefore, the accuracy of the previously trained tasks is never improved. In NLP tasks, multi-task learning has the potential to improve not only higher-level tasks, but also lower-level tasks. Rather than fixing the pre-trained model parameters, our successive regularization allows our model to continuously train the lower-level tasks without significant accuracy drops.

5 Experimental Settings

5.1 Datasets

POS tagging: To train the POS tagging layer, we used the Wall Street Journal (WSJ) portion of the Penn Treebank, and followed the standard split for the training (Sections 0-18), development (Sections 19-21), and test (Sections 22-24) sets. The evaluation metric is the word-level accuracy.

Chunking: For chunking, we also used the WSJ corpus, and followed the standard split for the training (Sections 15-18) and test (Section 20) sets as in the CoNLL 2000 shared task. We used Section 19 as the development set and employed the IOBES tagging scheme. The evaluation metric is the F1 score defined in the shared task.

Dependency parsing: We also used the WSJ corpus for dependency parsing, and followed the standard split for the training (Sections 2-21), development (Section 22), and test (Section 23) sets. We obtained Stanford-style dependencies using the Stanford converter. The evaluation metrics are the Unlabeled Attachment Score (UAS) and the Labeled Attachment Score (LAS), and punctuation is excluded from the evaluation.
Semantic relatedness: For the semantic relatedness task, we used the SICK dataset (Marelli et al., 2014), and followed the standard split for the training, development, and test sets. The evaluation metric is the Mean Squared Error (MSE) between the gold and predicted scores.

Textual entailment: For textual entailment, we also used the SICK dataset and exactly the same data split as the semantic relatedness dataset. The evaluation metric is the accuracy.

5.2 Training Details

We set the dimensionality of the embeddings and the hidden states in the bi-LSTMs to 100. At each training epoch, we trained our model in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment. We used mini-batch stochastic gradient descent and empirically found it effective to use a gradient clipping method with growing clipping values for the different tasks; concretely, we employed the simple function min(3.0, depth), where depth is the number of bi-LSTM layers involved in each task, and 3.0 is the maximum value. We applied our successive regularization to our model, along with L2-norm regularization and dropout (Srivastava et al., 2014). More details are summarized in the supplemental material.
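A minimal sketch of the growing gradient clipping schedule described above: the clipping value for a task is min(3.0, depth), where depth counts the bi-LSTM layers that task uses. The task depths, the norm-based variant of clipping, and the helper names are our own assumptions.

```python
import torch

# Number of bi-LSTM layers each task involves (assumed from the model description).
TASK_DEPTH = {"pos": 1, "chunk": 2, "dep": 3, "related": 4, "entail": 5}

def clip_value(task):
    """Growing clipping value: min(3.0, depth)."""
    return min(3.0, float(TASK_DEPTH[task]))

# Before each SGD step, clip gradients with the task-specific value, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_value("dep"))
```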

6 Results and Discussion

Table 1 shows our results on the test sets of the five tasks.³ The column "Single" shows the results of handling each task separately using single-layer bi-LSTMs, and the column "JMT all" shows the results of our JMT model. The single task settings only use the annotations of their own tasks. For example, when handling dependency parsing as a single task, the POS and chunking tags are not used. We can see that all results of the five tasks are improved in our JMT model, which shows that our JMT model can handle the five different tasks in a single model. Our JMT model also allows us to access arbitrary information learned from the different tasks: if we want to use the model just as a POS tagger, we can use only the first bi-LSTM layer.

Table 1 also shows the results of five subsets of the different tasks. For example, in the case of "JMT ABC", only the first three layers of the bi-LSTMs are used to handle the three tasks. In the case of "JMT DE", only the top two layers are used as a two-layer bi-LSTM by omitting all information from the first three layers. The results of the closely-related tasks ("AB", "ABC", and "DE") show that our JMT model improves both the high-level and the low-level tasks. The results of "JMT CD" and "JMT CE" show that the parsing task can be improved by the semantic tasks.

It should be noted that in our analysis of the greedy parsing results in the "JMT ABC" setting, we have found that more than 95% are well-formed dependency trees on the development set. In the 1,700 sentences of the development data, 11 results have multiple root nodes, 11 results have no root node, and 61 results have cycles. These 83 parsing results are converted into well-formed trees by the Eisner algorithm, and the accuracy does not significantly change (UAS: 94.52% → 94.53%, LAS: 92.61% → 92.62%).

³ In the chunking evaluation, we only show the results of "Single" and "JMT AB", because the sentences for the chunking evaluation overlap the training data for dependency parsing.

6.1 Comparison with Published Results

POS tagging: Table 2 shows the results of POS tagging, and our JMT model achieves scores close to the state-of-the-art results. The best result to date has been achieved by Ling et al. (2015), which uses character-based LSTMs. Incorporating the character-based encoders into our JMT model would be an interesting direction, but we have shown that the simple pre-trained character n-gram embeddings lead to promising results.

Chunking: Table 3 shows the results of chunking, and our JMT model achieves the state-of-the-art result. Søgaard and Goldberg (2016) proposed to jointly learn POS tagging and chunking in different layers, but they only showed improvements for chunking. By contrast, our results show that the low-level tasks are also improved.

Dependency parsing: Table 4 shows the results of dependency parsing using only the WSJ corpus in terms of the dependency annotations.⁴ It is notable that our simple greedy dependency parser outperforms the model in Andor et al. (2016), which is based on beam search with global information. The results suggest that the bi-LSTMs efficiently capture global information necessary for dependency parsing. Moreover, our single-task result already achieves high accuracy without the POS and chunking information. The best result to date has been achieved by the model proposed in Dozat and Manning (2017), which uses higher-dimensional representations than ours and proposes a more sophisticated attention mechanism called biaffine attention.
It should be promising to incorporate their attention mechanism into our parsing component.

Semantic relatedness: Table 5 shows the results of the semantic relatedness task, and our JMT model achieves the state-of-the-art result. The result of "JMT DE" is already better than the previous state-of-the-art results. Both Zhou et al. (2016) and Tai et al. (2015) explicitly used syntactic trees, and Zhou et al. (2016) relied on attention mechanisms. However, our method uses the simple max-pooling strategy, which suggests that it is worth investigating such simple methods before developing complex methods for simple tasks. Currently, our JMT model does not explicitly use the learned dependency structures, and thus the explicit use of the output from the dependency layer should be an interesting direction of future work.

⁴ Choe and Charniak (2016) employed a tri-training method to expand the training data with 400,000 trees in addition to the WSJ data, and they reported 95.9 UAS and 94.1 LAS by converting their constituency trees into dependency trees. Kuncoro et al. (2017) also reported high accuracy (95.8 UAS and 94.6 LAS) by using a converter.

[Table 1: Test set results for the five tasks, comparing the "Single" baselines, "JMT all", and the subsets "JMT AB", "JMT ABC", "JMT DE", "JMT CD", and "JMT CE" on A POS, B Chunking, C Dependency UAS/LAS, D Relatedness, and E Entailment. In the relatedness task, lower scores are better.]
[Table 2: POS tagging results, comparing JMT all with Ling et al. (2015), Kumar et al. (2016), Ma and Hovy (2016), Søgaard (2011), Collobert et al. (2011), Tsuruoka et al. (2011), and Toutanova et al. (2003).]
[Table 3: Chunking results (F1), comparing JMT AB and Single with Søgaard and Goldberg (2016), Suzuki and Isozaki (2008), Collobert et al. (2011), Kudo and Matsumoto (2001), and Tsuruoka et al. (2011).]
[Table 4: Dependency parsing results (UAS/LAS), comparing JMT all and Single with Dozat and Manning (2017), Andor et al. (2016), Alberti et al. (2015), Zhang et al. (2017), Weiss et al. (2015), Dyer et al. (2015), and Bohnet (2010).]
[Table 5: Semantic relatedness results (MSE), comparing JMT all and JMT DE with Zhou et al. (2016) and Tai et al. (2015).]

Table 6: Textual entailment results.
Method                        Acc.
JMT all                       86.2
JMT DE                        86.8
Yin et al. (2016)             86.2
Lai and Hockenmaier (2014)    84.6

[Table 7: Effectiveness of the Shortcut Connections (SC) and the Label Embeddings (LE): JMT all vs. "w/o SC", "w/o LE", and "w/o SC&LE" on POS, Chunking, Dependency UAS/LAS, Relatedness, and Entailment.]
[Table 8: Effectiveness of using different layers for different tasks: JMT ABC vs. "w/o SC&LE" and "All-3" on POS, Chunking, and Dependency UAS/LAS.]

Textual entailment: Table 6 shows the results of textual entailment, and our JMT model achieves the state-of-the-art result. The previous state-of-the-art result in Yin et al. (2016) relied on attention mechanisms and dataset-specific data pre-processing and features. Again, our simple max-pooling strategy achieves the state-of-the-art result, boosted by the joint training. These results show the importance of jointly handling related tasks.

6.2 Analysis of the Model Architecture

We investigate the effectiveness of our model in detail. All of the results shown in this section are development set results.

Shortcut connections: Our JMT model feeds the word representations into all of the bi-LSTM layers, which we call the shortcut connections. Table 7 shows the results of "JMT all" with and without the shortcut connections. The results without the shortcut connections are shown in the column "w/o SC". These results clearly show the importance of the shortcut connections; in particular, the semantic tasks in the higher layers strongly rely on them. That is, simply stacking the LSTM layers is not sufficient to handle a variety of NLP tasks in a single model. In the supplementary material, it is qualitatively shown how the shortcut connections work in our model.

Output label embeddings: Table 7 also shows the results without using the output labels of the POS, chunking, and relatedness layers, in the column "w/o LE". These results show that the explicit use of the output information from the classifiers of the lower layers is important in our JMT model. The results in the column "w/o SC&LE" are the ones without both the shortcut connections and the label embeddings.

[Table 9: Effectiveness of the Successive Regularization (SR) and the Vertical Connections (VC): JMT all vs. "w/o SR" and "w/o VC" on POS, Chunking, Dependency UAS/LAS, Relatedness, and Entailment.]
[Table 10: Effect of the order of training: JMT all vs. "Random" on POS, Chunking, Dependency UAS/LAS, Relatedness, and Entailment.]
[Table 11: Effect of depth for the single tasks ("Single" vs. "Single+") on POS, Chunking, Dependency UAS/LAS, Relatedness, and Entailment.]
[Table 12: Effect of the character embeddings ("W&C" vs. "Only W") on POS, Chunking, and Dependency UAS/LAS.]

Different layers for different tasks: Table 8 shows the results of our "JMT ABC" setting and of not using the shortcut connections and the label embeddings ("w/o SC&LE"), as in Table 7. In addition, in the column "All-3", we show the results of using the highest (i.e., the third) layer for all of the three tasks without any shortcut connections or label embeddings; thus the two settings "w/o SC&LE" and "All-3" require exactly the same number of model parameters. The "All-3" setting is similar to the multi-task model of Collobert et al. (2011) in that task-specific output layers are used but most of the model parameters are shared. The results show that using the same layer for the three different tasks hampers the effectiveness of our JMT model, and that the design of the model is much more important than the number of model parameters.

Successive regularization: In Table 9, the column "w/o SR" shows the results of omitting the successive regularization terms described in Section 3. We can see that the accuracy of chunking is improved by the successive regularization, while the other results are not affected so much. The chunking dataset used here is relatively small compared with those of the other low-level tasks, POS tagging and dependency parsing. Thus, these results suggest that the successive regularization is effective when dataset sizes are imbalanced.

Vertical connections: We investigated our JMT results without using the vertical connections in the five-layer bi-LSTMs. More concretely, when constructing the input vectors g_t, we do not use the bi-LSTM hidden states of the previous layers. Table 9 also shows the "JMT all" results with and without the vertical connections. As shown in the column "w/o VC", we observed competitive results. Therefore, in the target tasks used in our model, sharing the word representations and the output label embeddings is more effective than just stacking the bi-LSTM layers.

Order of training: Our JMT model iterates the training process in the order described in Section 3. Our hypothesis is that it is important to start from the lower-level tasks and gradually move to the higher-level tasks. Table 10 shows the results of training our model by randomly shuffling the order of the tasks for each epoch, in the column "Random". We see that the scores of the semantic tasks drop with the random strategy. In our preliminary experiments, we also found that constructing the mini-batch samples from different tasks hampers the effectiveness of our model, which further supports our hypothesis.
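A minimal sketch of the per-epoch training schedule discussed above: each epoch makes a full pass over each task's dataset in the fixed lower-to-higher order, while the "Random" ablation in Table 10 would shuffle that order every epoch. The function and dataset names are placeholders of our own.

```python
import random

TASK_ORDER = ["pos", "chunk", "dep", "related", "entail"]  # lower-level tasks first

def train_one_epoch(model, datasets, train_task, shuffle_order=False):
    """One JMT epoch: a full pass over every dataset, task by task."""
    order = list(TASK_ORDER)
    if shuffle_order:                        # the "Random" ablation
        random.shuffle(order)
    for task in order:
        for batch in datasets[task]:
            train_task(model, task, batch)   # task-specific loss + successive regularization
```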
Depth: The single task settings shown in Table 1 are obtained by using single-layer bi-LSTMs, but in our JMT model, the higher-level tasks use successively deeper layers. To investigate the gap between the different numbers of layers for each task, we also show the results of using multi-layer bi-LSTMs for the single task settings, in the column "Single+" in Table 11. More concretely, we use the same number of layers as in our JMT model; for example, three layers are used for dependency parsing, and five layers are used for textual entailment. As shown in these results, deeper layers do not always lead to better results, and the joint learning is more important than making the model complex only for single tasks.

Character n-gram embeddings: Finally, Table 12 shows the results for the three single tasks with and without the pre-trained character n-gram embeddings. The column "W&C" corresponds to using both the word and character n-gram embeddings, and "Only W" corresponds to using only the word embeddings. These results clearly show that jointly using the pre-trained word and character n-gram embeddings is helpful in improving the results. The pre-training of the character n-gram embeddings is also effective; for example, without the pre-training, the POS accuracy drops from 97.52% to 97.38% and the chunking accuracy drops from 95.65% to 95.14%.

6.3 Discussion

Training strategies: In our JMT model, it is not obvious when to stop the training while trying to maximize the scores of all the five tasks. We focused on maximizing the accuracy of dependency parsing on the development data in our experiments. However, the sizes of the training data are different across the different tasks; for example, the semantic tasks include only 4,500 sentence pairs, while the dependency parsing dataset includes 39,832 sentences with word-level annotations. Thus, in general, dependency parsing requires more training epochs than the semantic tasks, but currently, our model trains all of the tasks for the same number of training epochs. The same strategy for decreasing the learning rate is also shared across all the different tasks, although our growing gradient clipping method described in Section 5.2 helps improve the results. Indeed, we observed that better scores for the semantic tasks can be achieved before the accuracy of dependency parsing reaches its best score. Developing a method for achieving the best scores for all of the tasks at the same time is important future work.

More tasks: Our JMT model has the potential of handling more tasks than the five used in our experiments; examples include entity detection and relation extraction as in Miwa and Bansal (2016), as well as language modeling (Godwin et al., 2016). It is also a promising direction to train each task for multiple domains by focusing on domain adaptation (Søgaard and Goldberg, 2016). In particular, incorporating language modeling tasks provides an opportunity to use large text data. Such large text data was used in our experiments to pre-train the word and character n-gram embeddings. However, it would be preferable to use it more efficiently for improving the entire model.

Task-oriented learning of low-level tasks: Each task in our JMT model is supervised by its corresponding dataset. However, it would be possible to learn low-level tasks by optimizing high-level tasks, because the model parameters of the low-level tasks can be directly modified by learning the high-level tasks. One example has already been presented in Hashimoto and Tsuruoka (2017), where our JMT model is extended to learning task-oriented latent graph structures of sentences by training our dependency parsing component according to a neural machine translation objective.

7 Conclusion

We presented a joint many-task model to handle multiple NLP tasks with growing depth in a single end-to-end model. Our model is successively trained by considering linguistic hierarchies, directly feeding word representations into all layers, explicitly using low-level predictions, and applying successive regularization. In experiments on five NLP tasks, our single model achieves state-of-the-art or competitive results on chunking, dependency parsing, semantic relatedness, and textual entailment.

Acknowledgments

We thank the anonymous reviewers and the Salesforce Research team members for their fruitful comments and discussions.
References

Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved Transition-Based Parsing and Tagging with Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally Normalized Transition-Based Neural Networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Giuseppe Attardi and Felice Dell'Orletta. 2008. Chunking and Dependency Parsing. In Proceedings of the LREC 2008 Workshop on Partial Parsing.

Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference. arXiv, cs.CL.

Do Kook Choe and Eugene Charniak. 2016. Parsing as Language Modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the 5th International Conference on Learning Representations.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Jason Eisner. 1996. Efficient Normal-Form Parsing for Combinatory Categorial Grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Jonathan Godwin, Pontus Stenetorp, and Sebastian Riedel. 2016. Deep Semi-Supervised Learning with Linguistically Motivated Sequence Labeling Task Hierarchies. arXiv, cs.CL.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks, 18(5).

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural Machine Translation with Source-Side Latent Graph Parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. To appear.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).

Taku Kudo and Yuji Matsumoto. 2001. Chunking with Support Vector Machines. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In Proceedings of the 33rd International Conference on Machine Learning.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What Do Recurrent Neural Network Grammars Learn About Syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A Denotational and Distributional Approach to Semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Zhizhong Li and Derek Hoiem. 2016. Learning without Forgetting. CoRR.

Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task Sequence to Sequence Learning. In Proceedings of the 4th International Conference on Learning Representations.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26.

Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch Networks for Multi-task Learning. CoRR.

Makoto Miwa and Mohit Bansal. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. CoRR.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27.

Jun Suzuki and Hideki Isozaki. 2008. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. 2011. Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models? In Proceedings of the Fifteenth Conference on Computational Natural Language Learning.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured Training for Neural Network Transition-Based Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics, 4.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency Parsing as Head Selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved Representation Learning for Syntax. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Yao Zhou, Cong Liu, and Yan Pan. 2016. Modelling Sentence Pairs with Tree-structured Attentive Encoder. In Proceedings of the 26th International Conference on Computational Linguistics.


More information

A Vector Space Approach for Aspect-Based Sentiment Analysis

A Vector Space Approach for Aspect-Based Sentiment Analysis A Vector Space Approach for Aspect-Based Sentiment Analysis by Abdulaziz Alghunaim B.S., Massachusetts Institute of Technology (2015) Submitted to the Department of Electrical Engineering and Computer

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

More information

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION Atul Laxman Katole 1, Krishna Prasad Yellapragada 1, Amish Kumar Bedi 1, Sehaj Singh Kalra 1 and Mynepalli Siva Chaitanya 1 1 Samsung

More information

Indian Institute of Technology, Kanpur

Indian Institute of Technology, Kanpur Indian Institute of Technology, Kanpur Course Project - CS671A POS Tagging of Code Mixed Text Ayushman Sisodiya (12188) {ayushmn@iitk.ac.in} Donthu Vamsi Krishna (15111016) {vamsi@iitk.ac.in} Sandeep Kumar

More information

Word Embedding Based Correlation Model for Question/Answer Matching

Word Embedding Based Correlation Model for Question/Answer Matching Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Word Embedding Based Correlation Model for Question/Answer Matching Yikang Shen, 1 Wenge Rong, 2 Nan Jiang, 2 Baolin

More information

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon

More information

arxiv: v1 [cs.cv] 10 May 2017

arxiv: v1 [cs.cv] 10 May 2017 Inferring and Executing Programs for Visual Reasoning Justin Johnson 1 Bharath Hariharan 2 Laurens van der Maaten 2 Judy Hoffman 1 Li Fei-Fei 1 C. Lawrence Zitnick 2 Ross Girshick 2 1 Stanford University

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors

Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors Sang-Woo Lee,

More information

Chunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.

Chunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence. NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

Parsing of part-of-speech tagged Assamese Texts

Parsing of part-of-speech tagged Assamese Texts IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal

More information

Model Ensemble for Click Prediction in Bing Search Ads

Model Ensemble for Click Prediction in Bing Search Ads Model Ensemble for Click Prediction in Bing Search Ads Xiaoliang Ling Microsoft Bing xiaoling@microsoft.com Hucheng Zhou Microsoft Research huzho@microsoft.com Weiwei Deng Microsoft Bing dedeng@microsoft.com

More information

A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books

A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books Yoav Goldberg Bar Ilan University yoav.goldberg@gmail.com Jon Orwant Google Inc. orwant@google.com Abstract We created

More information

Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models

Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Bibliography Deep Learning Papers

Bibliography Deep Learning Papers Bibliography Deep Learning Papers * May 15, 2017 References [1] Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin,

More information

Building a Semantic Role Labelling System for Vietnamese

Building a Semantic Role Labelling System for Vietnamese Building a emantic Role Labelling ystem for Vietnamese Thai-Hoang Pham FPT University hoangpt@fpt.edu.vn Xuan-Khoai Pham FPT University khoaipxse02933@fpt.edu.vn Phuong Le-Hong Hanoi University of cience

More information

CS 598 Natural Language Processing

CS 598 Natural Language Processing CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@

More information

THE world surrounding us involves multiple modalities

THE world surrounding us involves multiple modalities 1 Multimodal Machine Learning: A Survey and Taxonomy Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency arxiv:1705.09406v2 [cs.lg] 1 Aug 2017 Abstract Our experience of the world is multimodal

More information

Deep search. Enhancing a search bar using machine learning. Ilgün Ilgün & Cedric Reichenbach

Deep search. Enhancing a search bar using machine learning. Ilgün Ilgün & Cedric Reichenbach #BaselOne7 Deep search Enhancing a search bar using machine learning Ilgün Ilgün & Cedric Reichenbach We are not researchers Outline I. Periscope: A search tool II. Goals III. Deep learning IV. Applying

More information

Basic Parsing with Context-Free Grammars. Some slides adapted from Julia Hirschberg and Dan Jurafsky 1

Basic Parsing with Context-Free Grammars. Some slides adapted from Julia Hirschberg and Dan Jurafsky 1 Basic Parsing with Context-Free Grammars Some slides adapted from Julia Hirschberg and Dan Jurafsky 1 Announcements HW 2 to go out today. Next Tuesday most important for background to assignment Sign up

More information

Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels

Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels Jörg Tiedemann Uppsala University Department of Linguistics and Philology firstname.lastname@lingfil.uu.se Abstract

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Developing a TT-MCTAG for German with an RCG-based Parser

Developing a TT-MCTAG for German with an RCG-based Parser Developing a TT-MCTAG for German with an RCG-based Parser Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert University of Tübingen, Germany CNRS-LORIA, France LREC 2008,

More information

FBK-HLT-NLP at SemEval-2016 Task 2: A Multitask, Deep Learning Approach for Interpretable Semantic Textual Similarity

FBK-HLT-NLP at SemEval-2016 Task 2: A Multitask, Deep Learning Approach for Interpretable Semantic Textual Similarity FBK-HLT-NLP at SemEval-2016 Task 2: A Multitask, Deep Learning Approach for Interpretable Semantic Textual Similarity Simone Magnolini Fondazione Bruno Kessler University of Brescia Brescia, Italy magnolini@fbkeu

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen TRANSFER LEARNING OF WEAKLY LABELLED AUDIO Aleksandr Diment, Tuomas Virtanen Tampere University of Technology Laboratory of Signal Processing Korkeakoulunkatu 1, 33720, Tampere, Finland firstname.lastname@tut.fi

More information

Semi-supervised Training for the Averaged Perceptron POS Tagger

Semi-supervised Training for the Averaged Perceptron POS Tagger Semi-supervised Training for the Averaged Perceptron POS Tagger Drahomíra johanka Spoustová Jan Hajič Jan Raab Miroslav Spousta Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics,

More information

Grammars & Parsing, Part 1:

Grammars & Parsing, Part 1: Grammars & Parsing, Part 1: Rules, representations, and transformations- oh my! Sentence VP The teacher Verb gave the lecture 2015-02-12 CS 562/662: Natural Language Processing Game plan for today: Review

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information

Exploiting Wikipedia as External Knowledge for Named Entity Recognition

Exploiting Wikipedia as External Knowledge for Named Entity Recognition Exploiting Wikipedia as External Knowledge for Named Entity Recognition Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292

More information

Semantic and Context-aware Linguistic Model for Bias Detection

Semantic and Context-aware Linguistic Model for Bias Detection Semantic and Context-aware Linguistic Model for Bias Detection Sicong Kuang Brian D. Davison Lehigh University, Bethlehem PA sik211@lehigh.edu, davison@cse.lehigh.edu Abstract Prior work on bias detection

More information

Matching Similarity for Keyword-Based Clustering

Matching Similarity for Keyword-Based Clustering Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web

More information

SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING

SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING Sheng Li 1, Xugang Lu 2, Shinsuke Sakai 1, Masato Mimura 1 and Tatsuya Kawahara 1 1 School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501,

More information

Dropout improves Recurrent Neural Networks for Handwriting Recognition

Dropout improves Recurrent Neural Networks for Handwriting Recognition 2014 14th International Conference on Frontiers in Handwriting Recognition Dropout improves Recurrent Neural Networks for Handwriting Recognition Vu Pham,Théodore Bluche, Christopher Kermorvant, and Jérôme

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

arxiv: v2 [cs.ir] 22 Aug 2016

arxiv: v2 [cs.ir] 22 Aug 2016 Exploring Deep Space: Learning Personalized Ranking in a Semantic Space arxiv:1608.00276v2 [cs.ir] 22 Aug 2016 ABSTRACT Jeroen B. P. Vuurens The Hague University of Applied Science Delft University of

More information

arxiv: v1 [cs.cl] 27 Apr 2016

arxiv: v1 [cs.cl] 27 Apr 2016 The IBM 2016 English Conversational Telephone Speech Recognition System George Saon, Tom Sercu, Steven Rennie and Hong-Kwang J. Kuo IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598 gsaon@us.ibm.com

More information

arxiv: v2 [cs.cl] 26 Mar 2015

arxiv: v2 [cs.cl] 26 Mar 2015 Effective Use of Word Order for Text Categorization with Convolutional Neural Networks Rie Johnson RJ Research Consulting Tarrytown, NY, USA riejohnson@gmail.com Tong Zhang Baidu Inc., Beijing, China Rutgers

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

Deep Facial Action Unit Recognition from Partially Labeled Data

Deep Facial Action Unit Recognition from Partially Labeled Data Deep Facial Action Unit Recognition from Partially Labeled Data Shan Wu 1, Shangfei Wang,1, Bowen Pan 1, and Qiang Ji 2 1 University of Science and Technology of China, Hefei, Anhui, China 2 Rensselaer

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Distant Supervised Relation Extraction with Wikipedia and Freebase

Distant Supervised Relation Extraction with Wikipedia and Freebase Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational

More information

Accurate Unlexicalized Parsing for Modern Hebrew

Accurate Unlexicalized Parsing for Modern Hebrew Accurate Unlexicalized Parsing for Modern Hebrew Reut Tsarfaty and Khalil Sima an Institute for Logic, Language and Computation, University of Amsterdam Plantage Muidergracht 24, 1018TV Amsterdam, The

More information