A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks


A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher
The University of Tokyo {hassy, tsuruoka}@logos.t.u-tokyo.ac.jp
Salesforce Research {cxiong, rsocher}@salesforce.com
(Work was done while the first author was an intern at Salesforce Research. Corresponding author.)

Abstract

Transfer and multi-task learning have traditionally focused on either a single source-target pair or very few, similar tasks. Ideally, the linguistic levels of morphology, syntax and semantics would benefit each other by being trained in a single model. We introduce a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. Higher layers include shortcut connections to lower-level task predictions to reflect linguistic hierarchies. We use a simple regularization term to allow for optimizing all model weights to improve one task's loss without exhibiting catastrophic interference of the other tasks. Our single end-to-end model obtains state-of-the-art or competitive results on five different tasks from tagging, parsing, relatedness, and entailment tasks.

1 Introduction

The potential for leveraging multiple levels of representation has been demonstrated in various ways in the field of Natural Language Processing (NLP). For example, Part-Of-Speech (POS) tags are used for syntactic parsers. The parsers are used to improve higher-level tasks, such as natural language inference (Chen et al., 2016) and machine translation (Eriguchi et al., 2016). These systems are often pipelines and not trained end-to-end. Deep NLP models have yet to show benefits from predicting many increasingly complex tasks, each at a successively deeper layer. Existing models often ignore linguistic hierarchies by predicting different tasks either entirely separately or at the same depth (Collobert et al., 2011).

[Figure 1: Overview of the joint many-task model predicting different linguistic outputs at successively deeper layers. For each of the two input sentences, the word representations feed the POS, CHUNK, and DEP layers (word and syntactic levels), followed by the relatedness and entailment encoders (semantic level).]

We introduce a Joint Many-Task (JMT) model, outlined in Figure 1, which predicts increasingly complex NLP tasks at successively deeper layers. Unlike traditional pipeline systems, our single JMT model can be trained end-to-end for POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment, by considering linguistic hierarchies. We propose an adaptive training and regularization strategy to grow this model in its depth. With the help of this strategy we avoid catastrophic interference between the tasks. Our model is motivated by Søgaard and Goldberg (2016), who showed that predicting two different tasks is more accurate when performed in different layers than in the same layer (Collobert et al., 2011). Experimental results show that our single model achieves competitive results for all of the five different tasks, demonstrating that using linguistic hierarchies is more important than handling different tasks in the same layer.

2 The Joint Many-Task Model

This section describes the inference procedure of our model, beginning at the lowest level and working our way to higher layers and more complex tasks; our model handles the five different tasks in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment, by considering linguistic hierarchies. The POS tags are used for chunking, and the chunking tags are used for dependency parsing (Attardi and Dell'Orletta, 2008). Tai et al. (2015) have shown that dependencies improve the relatedness task. The relatedness and entailment tasks are closely related to each other. If the semantic relatedness between two sentences is very low, they are unlikely to entail each other. Based on this observation, we make use of the information from the relatedness task for improving the entailment task.

2.1 Word Representations

For each word w_t in the input sentence s of length L, we use two types of embeddings.

Word embeddings: We use Skip-gram (Mikolov et al., 2013) to train word embeddings.

Character embeddings: Character n-gram embeddings are trained by the same Skip-gram objective. (Bojanowski et al. (2017) previously proposed to train character n-gram embeddings by the Skip-gram objective.) We construct the character n-gram vocabulary in the training data and assign an embedding for each entry. The final character embedding is the average of the unique character n-gram embeddings of w_t. For example, the character n-grams (n = 1, 2, 3) of the word "Cat" are {C, a, t, #B#C, Ca, at, t#E#, #B#Ca, Cat, at#E#}, where #B# and #E# represent the beginning and the end of each word, respectively. Using the character embeddings efficiently provides morphological features. Each word is subsequently represented as x_t, the concatenation of its corresponding word and character embeddings shared across the tasks.
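As a concrete illustration of this character n-gram construction, the following is a minimal sketch (the function names are ours, not the authors' released code); it reproduces the "Cat" example above and averages the corresponding embedding vectors.

    import numpy as np

    def char_ngrams(word, n_values=(1, 2, 3)):
        """Unique character n-grams of a word, with #B#/#E# boundary markers."""
        chars = ["#B#"] + list(word) + ["#E#"]
        ngrams = set()
        for n in n_values:
            for i in range(len(chars) - n + 1):
                ngrams.add("".join(chars[i:i + n]))
        # A bare boundary marker on its own is not counted as an n-gram,
        # matching the set listed for "Cat" in Section 2.1 (an assumption).
        ngrams.discard("#B#")
        ngrams.discard("#E#")
        return ngrams

    # char_ngrams("Cat") ->
    # {'C', 'a', 't', '#B#C', 'Ca', 'at', 't#E#', '#B#Ca', 'Cat', 'at#E#'}

    def char_embedding(word, ngram_vectors, dim=100):
        """Average the embeddings of the word's known character n-grams."""
        vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)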
2.2 Word-Level Task: POS Tagging

The first layer of the model is a bi-directional LSTM (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997) whose hidden states are used to predict POS tags. We use the following Long Short-Term Memory (LSTM) units for the forward direction:

    i_t = \sigma(W_i g_t + b_i),    f_t = \sigma(W_f g_t + b_f),
    u_t = \tanh(W_u g_t + b_u),
    c_t = i_t \odot u_t + f_t \odot c_{t-1},                                (1)
    o_t = \sigma(W_o g_t + b_o),
    h_t = o_t \odot \tanh(c_t),

where we define the input g_t as g_t = [\overrightarrow{h}_{t-1}; x_t], i.e., the concatenation of the previous hidden state and the word representation of w_t. The backward pass is expanded in the same way, but a different set of weights is used.

For predicting the POS tag of w_t, we use the concatenation of the forward and backward states in a one-layer bi-LSTM layer corresponding to the t-th word: h_t^{(1)} = [\overrightarrow{h}_t^{(1)}; \overleftarrow{h}_t^{(1)}]. Then each h_t^{(1)} (1 <= t <= L) is fed into a standard softmax classifier with a single ReLU layer, which outputs the probability vector y^{(1)} for each of the POS tags.

2.3 Word-Level Task: Chunking

Chunking is also a word-level classification task which assigns a chunking tag (B-NP, I-VP, etc.) for each word. The tag specifies the region of major phrases (e.g., noun phrases) in the sentence. Chunking is performed in the second bi-LSTM layer on top of the POS layer. When stacking the bi-LSTM layers, we use Eq. (1) with input g_t^{(2)} = [h_{t-1}^{(2)}; h_t^{(1)}; x_t; y_t^{(pos)}], where h_t^{(1)} is the hidden state of the first (POS) layer. We define the weighted label embedding y_t^{(pos)} as follows:

    y_t^{(pos)} = \sum_{j=1}^{C} p(y_t^{(1)} = j \mid h_t^{(1)}) \, \ell(j),        (2)

where C is the number of the POS tags, p(y_t^{(1)} = j | h_t^{(1)}) is the probability value that the j-th POS tag is assigned to w_t, and \ell(j) is the corresponding label embedding. The probability values are predicted by the POS layer, and thus no gold POS tags are needed. This output embedding is similar to the K-best POS tag feature which has been shown to be effective in syntactic tasks (Andor et al., 2016; Alberti et al., 2015). For predicting the chunking tags, we employ the same strategy as POS tagging by using the concatenated bidirectional hidden states h_t^{(2)} = [\overrightarrow{h}_t^{(2)}; \overleftarrow{h}_t^{(2)}] in the chunking layer. We also use a single ReLU hidden layer before the softmax classifier.
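A minimal sketch of the weighted label embedding of Eq. (2) (names are ours): the predicted POS distribution mixes the rows of a label-embedding matrix, so no gold tags are required at the chunking layer.

    import numpy as np

    def weighted_label_embedding(pos_probs, label_emb):
        """
        pos_probs: (C,)   softmax output of the POS classifier for one token.
        label_emb: (C, d) one embedding vector l(j) per POS tag.
        Returns y_t^(pos) = sum_j p(y_t = j | h_t) * l(j)  -- Eq. (2).
        """
        return pos_probs @ label_emb

    # The chunking-layer input of Section 2.3 then concatenates this vector with
    # the previous chunking state, the POS-layer hidden state, and the word
    # representation (all placeholders here):
    # g_t = np.concatenate([h_chk_prev, h_pos_t, x_t,
    #                       weighted_label_embedding(p_t, E_pos)])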

2.4 Syntactic Task: Dependency Parsing

Dependency parsing identifies syntactic relations (such as an adjective modifying a noun) between word pairs in a sentence. We use the third bi-LSTM layer to classify relations between all pairs of words. The input vector for the LSTM includes hidden states, word representations, and the label embeddings for the two previous tasks:

    g_t^{(3)} = [h_{t-1}^{(3)}; h_t^{(2)}; x_t; (y_t^{(pos)} + y_t^{(chk)})],

where we computed the chunking vector in a similar fashion as the POS vectors in Eq. (2). We predict the parent node (head) for each word, and then a dependency label is predicted for each child-parent pair. This approach is related to Dozat and Manning (2017) and Zhang et al. (2017), where the main difference is that our model works in a multi-task framework. To predict the parent node of w_t, we define a matching function between w_t and the candidates of the parent node as

    m(t, j) = h_t^{(3)} \cdot (W_d h_j^{(3)}),

where W_d is a parameter matrix. For the root, we define h_{L+1}^{(3)} = r as a parameterized vector. To compute the probability that w_j (or the root node) is the parent of w_t, the scores are normalized:

    p(j \mid h_t^{(3)}) = \frac{\exp(m(t, j))}{\sum_{k=1, k \neq t}^{L+1} \exp(m(t, k))}.        (3)

The dependency labels are predicted using [h_t^{(3)}; h_j^{(3)}] as input to a softmax classifier with a single ReLU layer. We greedily select the parent node and the dependency label for each word. When the parsing result is not a well-formed tree, we apply the first-order Eisner algorithm (Eisner, 1996) to obtain a well-formed tree from it.

2.5 Semantic Task: Semantic Relatedness

The next two tasks model the semantic relationships between two input sentences. The first task measures the semantic relatedness between two sentences. The output is a real-valued relatedness score for the input sentence pair. The second task is textual entailment, which requires one to determine whether a premise sentence entails a hypothesis sentence. There are typically three classes: entailment, contradiction, and neutral. We use the fourth and fifth bi-LSTM layers for the relatedness and entailment tasks, respectively.

Now it is required to obtain the sentence-level representation rather than the word-level representations h_t^{(4)} used in the first three tasks. We compute the sentence-level representation h_s^{(4)} as the element-wise maximum values across all of the word-level representations in the fourth layer:

    h_s^{(4)} = \max( h_1^{(4)}, h_2^{(4)}, \ldots, h_L^{(4)} ).        (4)

This max-pooling technique has proven effective in text classification tasks (Lai et al., 2015).

To model the semantic relatedness between s and s', we follow Tai et al. (2015). The feature vector for representing the semantic relatedness is computed as follows:

    d_1(s, s') = [ \, |h_s^{(4)} - h_{s'}^{(4)}| \, ; \, h_s^{(4)} \odot h_{s'}^{(4)} \, ],        (5)

where |h_s^{(4)} - h_{s'}^{(4)}| is the absolute value of the element-wise subtraction, and h_s^{(4)} \odot h_{s'}^{(4)} is the element-wise multiplication. Then d_1(s, s') is fed into a softmax classifier with a single Maxout hidden layer (Goodfellow et al., 2013) to output a relatedness score (from 1 to 5 in our case).
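A minimal sketch of the sentence representation and relatedness features of Eqs. (4)-(5) (names are ours): an element-wise max over the fourth-layer states, then the concatenation of the absolute difference and the element-wise product.

    import numpy as np

    def sentence_vector(hidden_states):
        """Eq. (4): element-wise max over word-level states, shape (L, d) -> (d,)."""
        return hidden_states.max(axis=0)

    def relatedness_features(h_s, h_s_prime):
        """Eq. (5): d_1(s, s') = [ |h_s - h_s'| ; h_s * h_s' ],
        fed to the Maxout + softmax relatedness classifier."""
        return np.concatenate([np.abs(h_s - h_s_prime), h_s * h_s_prime])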
2.6 Semantic Task: Textual Entailment

For entailment classification, we also use the max-pooling technique as in the semantic relatedness task. To classify the premise-hypothesis pair (s, s') into one of the three classes, we compute the feature vector d_2(s, s') as in Eq. (5), except that we do not use the absolute value of the element-wise subtraction, because we need to identify which sentence is the premise (or hypothesis). Then d_2(s, s') is fed into a softmax classifier.

To use the output from the relatedness layer directly, we use the label embeddings for the relatedness task. More concretely, we compute the class label embeddings for the semantic relatedness task similarly to Eq. (2). The final feature vectors that are concatenated and fed into the entailment classifier are the weighted relatedness label embedding and the feature vector d_2(s, s'). We use three Maxout hidden layers before the classifier.

3 Training the JMT Model

The model is trained jointly over all datasets. During each epoch, the optimization iterates over each full training dataset in the same order as the corresponding tasks described in the modeling section.
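A minimal sketch of this per-epoch schedule (the dataset iterators and per-task update functions below are hypothetical placeholders, not the authors' released code): each epoch sweeps the full training set of every task, lower-level tasks first, in the same order as the model layers.

    # Per-epoch schedule for the JMT model (Section 3); placeholders only.
    TASK_ORDER = ["pos", "chunking", "dependency", "relatedness", "entailment"]

    def train_one_epoch(model, datasets, update_fns):
        for task in TASK_ORDER:                        # lower-level tasks first
            for batch in datasets[task]:               # full training set of that task
                update_fns[task](model, batch)         # task-specific objective (Sections 3.2-3.6)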

3.1 Pre-Training Word Representations

We pre-train word embeddings using the Skip-gram model with negative sampling (Mikolov et al., 2013). We also pre-train the character n-gram embeddings using Skip-gram. (The training code and the pre-trained embeddings are available at http://github.com/hassygo/charngram2vec.) The only difference is that each input word embedding is replaced with its corresponding average character n-gram embedding described in Section 2.1. These embeddings are fine-tuned during the model training. We denote the embedding parameters as \theta_e.

3.2 Training the POS Layer

Let \theta_{POS} = (W_{POS}, b_{POS}, \theta_e) denote the set of model parameters associated with the POS layer, where W_{POS} is the set of the weight matrices in the first bi-LSTM and the classifier, and b_{POS} is the set of the bias vectors. The objective function to optimize \theta_{POS} is defined as follows:

    J_1(\theta_{POS}) = - \sum_s \sum_t \log p(y_t^{(1)} = \alpha \mid h_t^{(1)}) + \lambda \|W_{POS}\|^2 + \delta \|\theta_e - \theta_e'\|^2,        (6)

where p(y_t^{(1)} = \alpha | h_t^{(1)}) is the probability value that the correct label \alpha is assigned to w_t in the sentence s, \lambda \|W_{POS}\|^2 is the L2-norm regularization term, and \lambda is a hyperparameter.

We call the second regularization term \delta \|\theta_e - \theta_e'\|^2 a successive regularization term. The successive regularization is based on the idea that we do not want the model to forget the information learned for the other tasks. In the case of POS tagging, the regularization is applied to \theta_e, and \theta_e' is the embedding parameter after training the final task in the top-most layer at the previous training epoch. \delta is a hyperparameter.

3.3 Training the Chunking Layer

The objective function is defined as follows:

    J_2(\theta_{chk}) = - \sum_s \sum_t \log p(y_t^{(2)} = \alpha \mid h_t^{(2)}) + \lambda \|W_{chk}\|^2 + \delta \|\theta_{POS} - \theta_{POS}'\|^2,        (7)

which is similar to that of POS tagging, and \theta_{chk} is (W_{chk}, b_{chk}, E_{POS}, \theta_e), where W_{chk} and b_{chk} are the weight and bias parameters including those in \theta_{POS}, and E_{POS} is the set of the POS label embeddings. \theta_{POS}' is the one after training the POS layer at the current training epoch.

3.4 Training the Dependency Layer

The objective function is defined as follows:

    J_3(\theta_{dep}) = - \sum_s \sum_t \log p(\alpha \mid h_t^{(3)}) \, p(\beta \mid h_t^{(3)}, h_\alpha^{(3)}) + \lambda (\|W_{dep}\|^2 + \|W_d\|^2) + \delta \|\theta_{chk} - \theta_{chk}'\|^2,        (8)

where p(\alpha | h_t^{(3)}) is the probability value assigned to the correct parent node \alpha for w_t, and p(\beta | h_t^{(3)}, h_\alpha^{(3)}) is the probability value assigned to the correct dependency label \beta for the child-parent pair (w_t, \alpha). \theta_{dep} is defined as (W_{dep}, b_{dep}, W_d, r, E_{POS}, E_{chk}, \theta_e), where W_{dep} and b_{dep} are the weight and bias parameters including those in \theta_{chk}, and E_{chk} is the set of the chunking label embeddings.

3.5 Training the Relatedness Layer

Following Tai et al. (2015), the objective function is defined as follows:

    J_4(\theta_{rel}) = \sum_{(s, s')} KL\big( \hat{p}(s, s') \,\|\, p(h_s^{(4)}, h_{s'}^{(4)}) \big) + \lambda \|W_{rel}\|^2 + \delta \|\theta_{dep} - \theta_{dep}'\|^2,        (9)

where \hat{p}(s, s') is the gold distribution over the defined relatedness scores, p(h_s^{(4)}, h_{s'}^{(4)}) is the predicted distribution given the sentence representations, and KL(\hat{p}(s, s') \| p(h_s^{(4)}, h_{s'}^{(4)})) is the KL-divergence between the two distributions. \theta_{rel} is defined as (W_{rel}, b_{rel}, E_{POS}, E_{chk}, \theta_e).

3.6 Training the Entailment Layer

The objective function is defined as follows:

    J_5(\theta_{ent}) = - \sum_{(s, s')} \log p(y_{(s, s')}^{(5)} = \alpha \mid h_s^{(5)}, h_{s'}^{(5)}) + \lambda \|W_{ent}\|^2 + \delta \|\theta_{rel} - \theta_{rel}'\|^2,        (10)

where p(y_{(s, s')}^{(5)} = \alpha | h_s^{(5)}, h_{s'}^{(5)}) is the probability value that the correct label \alpha is assigned to the premise-hypothesis pair (s, s'). \theta_{ent} is defined as (W_{ent}, b_{ent}, E_{POS}, E_{chk}, E_{rel}, \theta_e), where E_{rel} is the set of the relatedness label embeddings.
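A minimal sketch of one task-level objective with the successive regularization term of Eqs. (6)-(10) (all names are ours): prev_params holds a snapshot of the protected parameters, e.g. the embeddings \theta_e' when training the POS layer.

    import numpy as np

    def task_objective(neg_log_likelihood, weights, params, prev_params, lam, delta):
        """
        J = negative log-likelihood
            + lam   * ||W||^2                     (L2 regularization)
            + delta * ||theta - theta'||^2        (successive regularization)
        `params`/`prev_params` are dicts of the parameters shared with lower tasks.
        """
        l2 = lam * sum(np.sum(w ** 2) for w in weights)
        succ = delta * sum(np.sum((params[k] - prev_params[k]) ** 2) for k in params)
        return neg_log_likelihood + l2 + succ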

4 Related Work

Many deep learning approaches have proven to be effective in a variety of NLP tasks and are becoming more and more complex. They are typically designed to handle single tasks, or some of them are designed as general-purpose models (Kumar et al., 2016; Sutskever et al., 2014) but applied to different tasks independently.

For handling multiple NLP tasks, multi-task learning models with deep neural networks have been proposed (Collobert et al., 2011; Luong et al., 2016), and more recently Søgaard and Goldberg (2016) have suggested that using different layers for different tasks is more effective than using the same layer in jointly learning closely-related tasks, such as POS tagging and chunking. However, the number of tasks was limited or they had very similar task settings like word-level tagging, and it was not clear how lower-level tasks could also be improved by combining higher-level tasks. More related to our work, Godwin et al. (2016) also followed Søgaard and Goldberg (2016) to jointly learn POS tagging, chunking, and language modeling, and Zhang and Weiss (2016) have shown that it is effective to jointly learn POS tagging and dependency parsing by sharing internal representations. In the field of relation extraction, Miwa and Bansal (2016) proposed a joint learning model for entity detection and relation extraction. All of them suggest the importance of multi-task learning, and we investigate the potential of handling different types of NLP tasks, rather than closely-related ones, in a single hierarchical deep model.

In the field of computer vision, some transfer and multi-task learning approaches have also been proposed (Li and Hoiem, 2016; Misra et al., 2016). For example, Misra et al. (2016) proposed a multi-task learning model to handle different tasks. However, they assume that each data sample has annotations for the different tasks, and do not explicitly consider task hierarchies.

Recently, Rusu et al. (2016) have proposed a progressive neural network model to handle multiple reinforcement learning tasks, such as Atari games. Like our JMT model, their model is also successively trained according to different tasks using different layers, called columns in their paper. In their model, once the first task is completed, the model parameters for the first task are fixed, and then the second task is handled with new model parameters. Therefore, accuracy of the previously trained tasks is never improved. In NLP tasks, multi-task learning has the potential to improve not only higher-level tasks, but also lower-level tasks. Rather than fixing the pre-trained model parameters, our successive regularization allows our model to continuously train the lower-level tasks without significant accuracy drops.

5 Experimental Settings

5.1 Datasets

POS tagging: To train the POS tagging layer, we used the Wall Street Journal (WSJ) portion of Penn Treebank, and followed the standard split for the training (Sections 0-18), development (Sections 19-21), and test (Sections 22-24) sets. The evaluation metric is the word-level accuracy.

Chunking: For chunking, we also used the WSJ corpus, and followed the standard split for the training (Sections 15-18) and test (Section 20) sets as in the CoNLL 2000 shared task. We used Section 19 as the development set and employed the IOBES tagging scheme. The evaluation metric is the F1 score defined in the shared task.

Dependency parsing: We also used the WSJ corpus for dependency parsing, and followed the standard split for the training (Sections 2-21), development (Section 22), and test (Section 23) sets. We obtained Stanford style dependencies using version 3.3.0 of the Stanford converter. The evaluation metrics are the Unlabeled Attachment Score (UAS) and the Labeled Attachment Score (LAS), and punctuation is excluded for the evaluation.
Semantic relatedness: For the semantic relatedness task, we used the SICK dataset (Marelli et al., 2014), and followed the standard split for the training, development, and test sets. The evaluation metric is the Mean Squared Error (MSE) between the gold and predicted scores.

Textual entailment: For textual entailment, we also used the SICK dataset and exactly the same data split as the semantic relatedness dataset. The evaluation metric is the accuracy.

5.2 Training Details

We set the dimensionality of the embeddings and the hidden states in the bi-LSTMs to 100. At each training epoch, we trained our model in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment. We used mini-batch stochastic gradient descent and empirically found it effective to use a gradient clipping method with growing clipping values for the different tasks; concretely, we employed the simple function min(3.0, depth), where depth is the number of bi-LSTM layers involved in each task, and 3.0 is the maximum value. We applied our successive regularization to our model, along with L2-norm regularization and dropout (Srivastava et al., 2014). More details are summarized in the supplemental material.
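A minimal sketch of the depth-dependent clipping rule min(3.0, depth) mentioned above; whether the paper clips element-wise values or the gradient norm is not specified here, so the sketch clips the global norm, and the surrounding SGD step is a placeholder rather than the released implementation.

    import numpy as np

    def clip_threshold(depth, max_value=3.0):
        """Growing gradient-clipping threshold: min(3.0, depth), Section 5.2."""
        return min(float(max_value), float(depth))

    def clipped_sgd_step(params, grads, lr, depth):
        """Placeholder SGD update that rescales the gradient norm to at most min(3.0, depth)."""
        threshold = clip_threshold(depth)
        norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
        scale = min(1.0, threshold / (norm + 1e-12))
        for name in params:
            params[name] -= lr * scale * grads[name]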

6 Results and Discussion

Table 1 shows our results on the test sets of the five tasks. (In the chunking evaluation, we only show the results of "Single" and "JMT AB" because the sentences used for the chunking evaluation overlap with the training data for dependency parsing.) The column "Single" shows the results of handling each task separately using single-layer bi-LSTMs, and the column "JMT all" shows the results of our JMT model. The single task settings only use the annotations of their own tasks. For example, when handling dependency parsing as a single task, the POS and chunking tags are not used. We can see that all results of the five tasks are improved in our JMT model, which shows that our JMT model can handle the five different tasks in a single model. Our JMT model also allows us to access arbitrary information learned from the different tasks. If we want to use the model just as a POS tagger, we can use only the first bi-LSTM layer.

Table 1 also shows the results of five subsets of the different tasks. For example, in the case of "JMT ABC", only the first three layers of the bi-LSTMs are used to handle the three tasks. In the case of "JMT DE", only the top two layers are used as a two-layer bi-LSTM by omitting all information from the first three layers. The results of the closely-related tasks ("AB", "ABC", and "DE") show that our JMT model improves both the high-level and low-level tasks. The results of "JMT CD" and "JMT CE" show that the parsing task can be improved by the semantic tasks.

It should be noted that in our analysis of the greedy parsing results of the "JMT ABC" setting, we have found that more than 95% are well-formed dependency trees on the development set. In the 1,700 sentences of the development data, 11 results have multiple root nodes, 11 results have no root node, and 61 results have cycles. These 83 parsing results are converted into well-formed trees by the Eisner algorithm, and the accuracy does not significantly change (UAS: 94.52% -> 94.53%, LAS: 92.61% -> 92.62%).

6.1 Comparison with Published Results

POS tagging: Table 2 shows the results of POS tagging, and our JMT model achieves scores close to the state-of-the-art results. The best result to date has been achieved by Ling et al. (2015), which uses character-based LSTMs. Incorporating the character-based encoders into our JMT model would be an interesting direction, but we have shown that the simple pre-trained character n-gram embeddings lead to promising results.

Chunking: Table 3 shows the results of chunking, and our JMT model achieves the state-of-the-art result. Søgaard and Goldberg (2016) proposed to jointly learn POS tagging and chunking in different layers, but they only showed improvements for chunking. By contrast, our results show that the low-level tasks are also improved.

Dependency parsing: Table 4 shows the results of dependency parsing using only the WSJ corpus in terms of the dependency annotations. (Choe and Charniak (2016) employed a tri-training method to expand the training data with 400,000 trees in addition to the WSJ data, and they reported 95.9 UAS and 94.1 LAS by converting their constituency trees into dependency trees. Kuncoro et al. (2017) also reported high accuracy (95.8 UAS and 94.6 LAS) by using a converter.) It is notable that our simple greedy dependency parser outperforms the model in Andor et al. (2016), which is based on beam search with global information. The results suggest that the bi-LSTMs efficiently capture global information necessary for dependency parsing. Moreover, our single task result already achieves high accuracy without the POS and chunking information. The best result to date has been achieved by the model proposed in Dozat and Manning (2017), which uses higher dimensional representations than ours and proposes a more sophisticated attention mechanism called biaffine attention. It should be promising to incorporate their attention mechanism into our parsing component.
Semantic relatedness: Table 5 shows the results of the semantic relatedness task, and our JMT model achieves the state-of-the-art result. The result of "JMT DE" is already better than the previous state-of-the-art results. Both Zhou et al. (2016) and Tai et al. (2015) explicitly used syntactic trees, and Zhou et al. (2016) relied on attention mechanisms. However, our method uses the simple max-pooling strategy, which suggests that it is worth investigating such simple methods before developing complex methods for simple tasks. Currently, our JMT model does not explicitly use the learned dependency structures, and thus the explicit use of the output from the dependency layer should be an interesting direction of future work.

                     Single   JMT all   JMT AB   JMT ABC   JMT DE   JMT CD   JMT CE
A  POS                97.45     97.55    97.52     97.54      n/a      n/a      n/a
B  Chunking           95.02       n/a    95.77       n/a      n/a      n/a      n/a
C  Dependency UAS     93.35     94.67      n/a     94.71      n/a    93.53    93.57
   Dependency LAS     91.42     92.90      n/a     92.92      n/a    91.62    91.69
D  Relatedness        0.247     0.233      n/a       n/a    0.238    0.251      n/a
E  Entailment          81.8      86.2      n/a       n/a     86.8      n/a     82.4
Table 1: Test set results for the five tasks. In the relatedness task, lower scores are better.

Method                     Acc.
JMT all                   97.55
Ling et al. (2015)        97.78
Kumar et al. (2016)       97.56
Ma and Hovy (2016)        97.55
Søgaard (2011)            97.50
Collobert et al. (2011)   97.29
Tsuruoka et al. (2011)    97.28
Toutanova et al. (2003)   97.27
Table 2: POS tagging results.

Method                           F1
JMT AB                        95.77
Single                        95.02
Søgaard and Goldberg (2016)   95.56
Suzuki and Isozaki (2008)     95.15
Collobert et al. (2011)       94.32
Kudo and Matsumoto (2001)     93.91
Tsuruoka et al. (2011)        93.81
Table 3: Chunking results.

Method                       UAS     LAS
JMT all                    94.67   92.90
Single                     93.35   91.42
Dozat and Manning (2017)   95.74   94.08
Andor et al. (2016)        94.61   92.79
Alberti et al. (2015)      94.23   92.36
Zhang et al. (2017)        94.10   91.90
Weiss et al. (2015)        93.99   92.05
Dyer et al. (2015)         93.10   90.90
Bohnet (2010)              92.88   90.71
Table 4: Dependency results.

Method                 MSE
JMT all              0.233
JMT DE               0.238
Zhou et al. (2016)   0.243
Tai et al. (2015)    0.253
Table 5: Semantic relatedness results.

Method                        Acc.
JMT all                       86.2
JMT DE                        86.8
Yin et al. (2016)             86.2
Lai and Hockenmaier (2014)    84.6
Table 6: Textual entailment results.

                   JMT all   w/o SC   w/o LE   w/o SC&LE
POS                  97.88    97.79    97.85       97.87
Chunking             97.59    97.08    97.40       97.33
Dependency UAS       94.51    94.52    94.09       94.04
Dependency LAS       92.60    92.62    92.14       92.03
Relatedness          0.236    0.698    0.261       0.765
Entailment            84.6     75.0     81.6        71.2
Table 7: Effectiveness of the Shortcut Connections (SC) and the Label Embeddings (LE).

Textual entailment: Table 6 shows the results of textual entailment, and our JMT model achieves the state-of-the-art result. The previous state-of-the-art result in Yin et al. (2016) relied on attention mechanisms and dataset-specific data pre-processing and features. Again, our simple max-pooling strategy achieves the state-of-the-art result, boosted by the joint training. These results show the importance of jointly handling related tasks.

6.2 Analysis on the Model Architectures

We investigate the effectiveness of our model in detail. All of the results shown in this section are development set results.

                   JMT ABC   w/o SC&LE   All-3
POS                  97.90       97.87   97.62
Chunking             97.80       97.41   96.52
Dependency UAS       94.52       94.13   93.59
Dependency LAS       92.61       92.16   91.47
Table 8: Effectiveness of using different layers for different tasks.

Shortcut connections: Our JMT model feeds the word representations into all of the bi-LSTM layers, which we call the shortcut connections. Table 7 shows the results of "JMT all" with and without the shortcut connections. The results without the shortcut connections are shown in the column "w/o SC". These results clearly show the importance of the shortcut connections; in particular, the semantic tasks in the higher layers strongly rely on them. That is, simply stacking the LSTM layers is not sufficient to handle a variety of NLP tasks in a single model. In the supplementary material, it is qualitatively shown how the shortcut connections work in our model.
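A minimal sketch of how the shortcut connections enter the layer inputs (cf. the definitions of g_t in Sections 2.2-2.4); the names are ours, and the "w/o SC" ablation of Table 7 simply drops the word representation x_t.

    import numpy as np

    def layer_input(h_prev_step, h_lower_layer, x_t, label_emb, use_shortcut=True):
        """
        Input vector g_t for a stacked bi-LSTM layer (e.g. the chunking layer):
        [previous hidden state of this layer; hidden state of the layer below;
         word representation (shortcut connection); weighted label embedding].
        """
        parts = [h_prev_step, h_lower_layer]
        if use_shortcut:
            parts.append(x_t)          # shortcut connection: raw word representation
        parts.append(label_emb)
        return np.concatenate(parts)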
Output label embeddings: Table 7 also shows the results without using the output labels of the POS, chunking, and relatedness layers, in the column "w/o LE". These results show that the explicit use of the output information from the classifiers of the lower layers is important in our JMT model. The results in the column "w/o SC&LE" are the ones without both the shortcut connections and the label embeddings.

                   JMT all   w/o SR   w/o VC
POS                  97.88    97.85    97.82
Chunking             97.59    97.13    97.45
Dependency UAS       94.51    94.46    94.38
Dependency LAS       92.60    92.57    92.48
Relatedness          0.236    0.239    0.241
Entailment            84.6     84.2     84.8
Table 9: Effectiveness of the Successive Regularization (SR) and the Vertical Connections (VC).

                   JMT all   Random
POS                  97.88    97.83
Chunking             97.59    97.71
Dependency UAS       94.51    94.66
Dependency LAS       92.60    92.80
Relatedness          0.236    0.298
Entailment            84.6     83.2
Table 10: Effects of the order of training.

                   Single   Single+
POS                 97.52         -
Chunking            95.65     96.08
Dependency UAS      93.38     93.88
Dependency LAS      91.37     91.83
Relatedness         0.239     0.665
Entailment           83.8      66.4
Table 11: Effects of depth for the single tasks.

                   Single (W&C)   Single (Only W)
POS                       97.52             96.26
Chunking                  95.65             94.92
Dependency UAS            93.38             92.90
Dependency LAS            91.37             90.44
Table 12: Effects of the character embeddings.

Different layers for different tasks: Table 8 shows the results of our "JMT ABC" setting and of not using the shortcut connections and the label embeddings ("w/o SC&LE"), as in Table 7. In addition, in the column "All-3", we show the results of using the highest (i.e., the third) layer for all of the three tasks without any shortcut connections and label embeddings; thus the two settings "w/o SC&LE" and "All-3" require exactly the same number of model parameters. The "All-3" setting is similar to the multi-task model of Collobert et al. (2011) in that task-specific output layers are used but most of the model parameters are shared. The results show that using the same layer for the three different tasks hampers the effectiveness of our JMT model, and that the design of the model is much more important than the number of model parameters.

Successive regularization: In Table 9, the column "w/o SR" shows the results of omitting the successive regularization terms described in Section 3. We can see that the accuracy of chunking is improved by the successive regularization, while other results are not affected so much. The chunking dataset used here is relatively small compared with those of the other low-level tasks, POS tagging and dependency parsing. Thus, these results suggest that the successive regularization is effective when dataset sizes are imbalanced.

Vertical connections: We investigated our JMT results without using the vertical connections in the five-layer bi-LSTMs. More concretely, when constructing the input vectors g_t, we do not use the bi-LSTM hidden states of the previous layers. Table 9 also shows the "JMT all" results with and without the vertical connections. As shown in the column "w/o VC", we observed competitive results. Therefore, in the target tasks used in our model, sharing the word representations and the output label embeddings is more effective than just stacking the bi-LSTM layers.

Order of training: Our JMT model iterates the training process in the order described in Section 3. Our hypothesis is that it is important to start from the lower-level tasks and gradually move to the higher-level tasks. Table 10 shows the results of training our model by randomly shuffling the order of the tasks for each epoch, in the column "Random". We see that the scores of the semantic tasks drop under the random strategy. In our preliminary experiments, we have found that constructing the mini-batch samples from different tasks also hampers the effectiveness of our model, which also supports our hypothesis.

Depth: The single task settings shown in Table 1 are obtained by using single-layer bi-LSTMs, but in our JMT model, the higher-level tasks use successively deeper layers.
To investigate the gap between the different numbers of layers for each task, we also show the results of using multi-layer bi-LSTMs for the single task settings, in the column "Single+" in Table 11. More concretely, we use the same number of layers as in our JMT model; for example, three layers are used for dependency parsing, and five layers are used for textual entailment. As shown in these results, deeper layers do not always lead to better results, and the joint learning is more important than making the model complex only for single tasks.

Character n-gram embeddings: Finally, Table 12 shows the results for the three single tasks with and without the pre-trained character n-gram embeddings. The column "W&C" corresponds to using both the word and character n-gram embeddings, and "Only W" corresponds to using only the word embeddings. These results clearly show that jointly using the pre-trained word and character n-gram embeddings is helpful in improving the results. The pre-training of the character n-gram embeddings is also effective; for example, without the pre-training, the POS accuracy drops from 97.52% to 97.38% and the chunking accuracy drops from 95.65% to 95.14%.

6.3 Discussion

Training strategies: In our JMT model, it is not obvious when to stop the training while trying to maximize the scores of all the five tasks. We focused on maximizing the accuracy of dependency parsing on the development data in our experiments. However, the sizes of the training data are different across the different tasks; for example, the semantic tasks include only 4,500 sentence pairs, while the dependency parsing dataset includes 39,832 sentences with word-level annotations. Thus, in general, dependency parsing requires more training epochs than the semantic tasks, but currently, our model trains all of the tasks for the same number of training epochs. The same strategy for decreasing the learning rate is also shared across all the different tasks, although our growing gradient clipping method described in Section 5.2 helps improve the results. Indeed, we observed that better scores of the semantic tasks can be achieved before the accuracy of dependency parsing reaches its best score. Developing a method for achieving the best scores for all of the tasks at the same time is important future work.

More tasks: Our JMT model has the potential of handling more tasks than the five tasks used in our experiments; examples include entity detection and relation extraction as in Miwa and Bansal (2016) as well as language modeling (Godwin et al., 2016). It is also a promising direction to train each task for multiple domains by focusing on domain adaptation (Søgaard and Goldberg, 2016). In particular, incorporating language modeling tasks provides an opportunity to use large text data. Such large text data was used in our experiments to pre-train the word and character n-gram embeddings. However, it would be preferable to efficiently use it for improving the entire model.

Task-oriented learning of low-level tasks: Each task in our JMT model is supervised by its corresponding dataset. However, it would be possible to learn low-level tasks by optimizing high-level tasks, because the model parameters of the low-level tasks can be directly modified by learning the high-level tasks. One example has already been presented in Hashimoto and Tsuruoka (2017), where our JMT model is extended to learning task-oriented latent graph structures of sentences by training our dependency parsing component according to a neural machine translation objective.

7 Conclusion

We presented a joint many-task model to handle multiple NLP tasks with growing depth in a single end-to-end model. Our model is successively trained by considering linguistic hierarchies, directly feeding word representations into all layers, explicitly using low-level predictions, and applying successive regularization. In experiments on five NLP tasks, our single model achieves state-of-the-art or competitive results on chunking, dependency parsing, semantic relatedness, and textual entailment.

Acknowledgments

We thank the anonymous reviewers and the Salesforce Research team members for their fruitful comments and discussions.
References

Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved Transition-Based Parsing and Tagging with Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1354-1359.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally Normalized Transition-Based Neural Networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442-2452.

Giuseppe Attardi and Felice Dell'Orletta. 2008. Chunking and Dependency Parsing. In Proceedings of the LREC 2008 Workshop on Partial Parsing.

Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89-97.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135-146.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference. arXiv, cs.CL 1609.06038.

Do Kook Choe and Eugene Charniak. 2016. Parsing as Language Modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331-2336.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the 5th International Conference on Learning Representations.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334-343.

Jason Eisner. 1996. Efficient Normal-Form Parsing for Combinatory Categorial Grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 79-86.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823-833.

Jonathan Godwin, Pontus Stenetorp, and Sebastian Riedel. 2016. Deep Semi-Supervised Learning with Linguistically Motivated Sequence Labeling Task Hierarchies. arXiv, cs.CL 1612.09113.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout Networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319-1327.

Alex Graves and Jurgen Schmidhuber. 2005. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks, 18(5):602-610.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural Machine Translation with Source-Side Latent Graph Parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. To appear.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Taku Kudo and Yuji Matsumoto. 2001. Chunking with Support Vector Machines. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In Proceedings of The 33rd International Conference on Machine Learning, pages 1378-1387.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What Do Recurrent Neural Network Grammars Learn About Syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 1249-1258.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A Denotational and Distributional Approach to Semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 329-334.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2267-2273.

Zhizhong Li and Derek Hoiem. 2016. Learning without Forgetting. CoRR, abs/1606.09282.

Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520-1530.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task Sequence to Sequence Learning. In Proceedings of the 4th International Conference on Learning Representations.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064-1074.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1-8.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119.

Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch Networks for Multi-task Learning. CoRR, abs/1604.03539.

Makoto Miwa and Mohit Bansal. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105-1116.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. CoRR, abs/1606.04671.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 48-52.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231-235.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929-1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104-3112.

Jun Suzuki and Hideki Isozaki. 2008. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 665-673.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556-1566.

Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 173-180.

Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. 2011. Learning with Lookahead: Can History-Based Models Rival Globally Optimized Models? In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 238-246.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured Training for Neural Network Transition-Based Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323-333.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics, 4:259-272.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency Parsing as Head Selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 665-676.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved Representation Learning for Syntax. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1557-1566.

Yao Zhou, Cong Liu, and Yan Pan. 2016. Modelling Sentence Pairs with Tree-structured Attentive Encoder. In Proceedings of the 26th International Conference on Computational Linguistics, pages 2912-2922.