
Entropy-Guided Feature Induction for Structure Learning

Eraldo R. Fernandes¹ and Ruy L. Milidiú²

¹ Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Brazil. eraldo@facom.ufms.br
² Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Brazil. milidiu@inf.puc-rio.br

Abstract. In many natural language processing tasks, the output comprises several variables with complex interdependencies, such as sequences, trees, alignments and clusterings. Structure learning consists of learning a mapping from inputs to such complex outputs by processing a dataset of input-output pairs. Feature generation is an important subtask of structure learning. It is usually carried out by a domain expert, who builds complex feature templates by conjoining the available basic features. This is a limited and expensive approach, widely recognized as a modeling bottleneck. Here, we extend the structure learning framework by incorporating an automatic feature induction method guided by entropy. We experimentally evaluate our framework on five natural language processing tasks involving four languages: Portuguese, English, Chinese and Arabic. These experiments correspond to nine datasets, four of them on Portuguese. On six datasets, our systems achieve the best known performances, and they are competitive with the state of the art on the others. Furthermore, our coreference resolution systems achieved first place in the CoNLL-2012 Shared Task, in which the competing systems were ranked by their mean score over three languages: Arabic, Chinese and English. We further assess our method by experimentally comparing it with two important alternative feature generation methods, namely manual template generation and the polynomial kernel. The experimental findings indicate that our method is more attractive than both alternatives.

Keywords: Structure Learning, Natural Language Processing, Feature Induction, Entropy

The full thesis is available at http://eraldoluis.pro.br/phd.pdf

1 Introduction

Many important problems involve the prediction of a structure comprising many variables with complex interdependencies. Several structure learning (SL) problems arise from natural language processing (NLP).

For instance, dependency parsing (DP) consists of identifying the tree underlying a sentence, whereas coreference resolution consists of clustering the entity mentions in a document.

The structure learning framework [2, 3, 12] is a general approach to directly learn a mapping function from inputs to complex outputs. The core of the SL framework is the prediction problem: an arbitrary optimization problem whose objective function is linear on some joint input-output representation. Hence, this approach is a kind of linear discriminant model. Linear models are pervasive in machine learning, mainly because there are efficient training algorithms to estimate them [1, 2, 16]. On the other hand, many SL problems require models that are highly non-linear on the available features. Therefore, when training a linear SL model, some feature generation method is needed to provide the required non-linear features.

Feature generation is frequently performed by a domain expert, who builds complex, discriminative templates by conjoining the available features. Manual template generation is a limited and expensive way to obtain feature templates, and is recognized as a modeling bottleneck. Another popular alternative is to employ a kernel function, when the learning algorithm is kernelized. Besides the fact that kernelized training algorithms are computationally expensive, it is difficult to control their generalization performance.

Here, we extend the structure learning framework by incorporating an automatic feature induction method called Entropy-Guided Feature Induction (EFI). The resulting framework is called the Entropy-Guided Structure Learning (ESL) framework. We experimentally assess ESL by evaluating its performance on nine datasets, involving five NLP tasks and four languages, including four Portuguese datasets. ESL is competitive on all datasets and reduces the smallest known error in seven cases, including three Portuguese datasets. We also achieved first place in the prestigious CoNLL-2012 Shared Task using ESL-based systems. This shared task was to resolve coreference in three languages, namely Arabic, Chinese and English. We further assess EFI by experimentally comparing it to manual template generation and the polynomial kernel, two important alternative feature generation methods. EFI outperforms both methods on the considered datasets. Additionally, EFI is much cheaper than manual template generation and computationally faster than kernel methods.

This work has six main contributions: (i) the ESL framework; (ii) a comparison of EFI with manual templates and polynomial kernels; (iii) nine ESL-based systems for fundamental NLP tasks; (iv) state-of-the-art systems for three Portuguese tasks, namely POS tagging, text chunking and quotation extraction; (v) first place in the CoNLL-2012 Shared Task on multilingual coreference resolution; and (vi) state-of-the-art systems for coreference resolution in Arabic, Chinese and English.

Partial results of this work have been published by the authors in two journal papers [6, 11], the most recent in Computational Linguistics. We have also presented preliminary findings at eight international conferences: WWW 2009 [7], STIL 2009 [9], PROPOR 2010 [10], the CoNLL-2010 Shared Task [4], CICLing 2010 [15], ECML-PKDD 2011 [3], the EMNLP-CoNLL 2012 Shared Task [5] and PROPOR 2012 [8].

2 Entropy-Guided Structure Learning Framework

ESL naturally extends the SL framework by automatically inducing feature templates, thus removing a relevant modeling bottleneck. In this section, we describe the SL framework and how EFI extends it to derive the ESL framework.

2.1 Prediction via Optimization

The core of the SL framework is the prediction problem, which, for a given input x, is formulated as the following w-parameterized optimization problem:

    F(x; w) = \arg\max_{y \in \mathcal{Y}(x)} \langle w, \Phi(x, y) \rangle,

where Y(x) is the set of feasible output structures for the input x; ⟨·,·⟩ is the scalar product operator; w is the vector of parameters, which comprises the learned model; and Φ(x, y) is a vector of arbitrary real-valued feature functions, or simply features, that jointly represent an input-output pair. Thus, the prediction problem is to find the output with the highest score, where the score is given by a discriminant function that is linear on some joint feature representation.

The feature vector Φ(x, y) and the output domain Y(x) are arbitrary and task dependent. In dependency parsing, for instance, the output domain comprises all rooted trees whose nodes are the sentence tokens, and the features are functions of a dependency arc. Hence, DP prediction reduces to a maximum branching problem.

2.2 Entropy-Guided Feature Induction

EFI is based on the same strategy as Entropy-Guided Transformation Learning [14]. When modeling a machine learning problem, a set of features is usually available. These features encode basic information about the examples. In dependency parsing, for instance, datasets include features that are either naturally present in sentences, such as words, or automatically generated by external systems, such as POS tags and lemmas. We call this available information basic features. EFI automatically derives a set of basic feature conjunctions, which we call feature templates. These templates are later used to generate the derived features, which comprise the feature vectors Φ(x, y) used in the structure learning model.

In SL problems, the basic features are functions of the local decision variables that compose the output structure. In DP, features are functions of a dependency arc, because arcs are the local decision variables that compose a dependency tree. The first step in EFI is to build the basic dataset, which contains one example for each local decision variable. The basic dataset features are the basic features of the original dataset.

Many decision tree (DT) learning algorithms use the gain ratio to iteratively select the most informative feature. The gain ratio is a normalized version of the information gain measure, which is based on entropy. Hence, these algorithms provide a quick way to perform entropy-guided feature selection. EFI uses Quinlan's well-known C4.5 algorithm to train a DT on the basic dataset. This DT predicts the local decision variable from the basic features that most reduce its entropy.
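To make this first step concrete, the sketch below builds a toy basic dataset of dependency-arc decisions and fits an entropy-based decision tree to it. Two assumptions are ours, not the paper's: the arc examples and their values are invented for illustration, and scikit-learn's DecisionTreeClassifier with criterion="entropy" (a CART learner driven by information gain rather than C4.5's gain ratio) stands in for Quinlan's C4.5.

    # Sketch of the first EFI step: build a "basic dataset" with one example
    # per local decision variable and fit an entropy-based decision tree.
    # scikit-learn's entropy-criterion CART stands in for Quinlan's C4.5;
    # all feature values below are invented for illustration.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # Each example describes one candidate dependency arc (a local decision
    # variable); the label says whether the arc belongs to the correct tree.
    basic_dataset = [
        ({"dist": 1, "mod-pos": "NOUN", "head-pos": "VERB", "side": "left"}, 1),
        ({"dist": 4, "mod-pos": "DET", "head-pos": "NOUN", "side": "left"}, 0),
        ({"dist": 1, "mod-pos": "DET", "head-pos": "NOUN", "side": "left"}, 1),
        ({"dist": 2, "mod-pos": "ADJ", "head-pos": "NOUN", "side": "right"}, 0),
        ({"dist": 3, "mod-pos": "VERB", "head-pos": "VERB", "side": "right"}, 1),
        ({"dist": 5, "mod-pos": "NOUN", "head-pos": "VERB", "side": "right"}, 0),
    ]

    vectorizer = DictVectorizer(sparse=False)  # one-hot encodes categorical basic features
    X = vectorizer.fit_transform([feats for feats, _ in basic_dataset])
    y = [label for _, label in basic_dataset]

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)

Note that DictVectorizer expands a categorical basic feature into columns such as mod-pos=DET; the extraction sketch given after Figure 1 below maps such columns back to the underlying basic feature name.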

In Figure 1 (a), we present a sample DT for dependency parsing. EFI uses a very simple scheme to extract feature templates from a DT: it considers the paths from the root node to every DT node. In Figure 1 (b), we present the induced templates. For each path, a template is created by conjoining the features of its nodes. Because feature templates are conjunctions of basic features and must not include feature values, we ignore the feature values (arc labels) and the decision variable values (leaves) in the DT, considering only the features in the internal nodes.

Fig. 1. Feature template induction from a decision tree: (a) a sample decision tree for dependency parsing (the diagram is not recoverable from this version); (b) the induced feature templates, one conjunction of basic features per line:

    dist
    dist, mod-pos
    dist, mod-pos, head-pos
    dist, mod-pos, side
    dist, side

2.3 Learning Algorithm

Several algorithms are commonly used to train SL models. Mainly for simplicity and extensibility, our learning algorithm [3] is a large-margin extension of Collins' structured perceptron [1]. A simplified version of this algorithm is presented in Figure 2. Its first step is to employ EFI to derive non-linear feature templates from the basic dataset D. These templates are then used to generate the derived features Φ(x, y) of the SL dataset Φ(D). Like the binary perceptron, the ESL training algorithm is an iterative procedure. On each iteration, the algorithm draws an example (x, y, Φ(x, y)) from the training dataset Φ(D), makes a prediction using the current model w, and updates the model parameters according to the difference between the predicted output ŷ and the correct output y.

    Φ(D) ← EFI(D)
    w ← 0
    while not converged do
        draw (x, y, Φ(x, y)) from Φ(D)
        ŷ ← arg max_{y′ ∈ Y(x)} ⟨w, Φ(x, y′)⟩
        w ← w + Φ(x, y) − Φ(x, ŷ)
    return w

Fig. 2. ESL training algorithm.
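Continuing the sketch above, the hypothetical extract_templates below implements this extraction scheme: it walks the fitted tree and emits one template per root-to-internal-node path, keeping only the names of the basic features tested along the path and ignoring split thresholds, arc labels and leaf values, as described for Figure 1. It relies only on scikit-learn's public tree_ arrays.

    def extract_templates(tree, vectorizer):
        """One feature template (a tuple of basic-feature names) per path
        from the root to each internal node of the fitted decision tree."""
        t = tree.tree_
        names = vectorizer.get_feature_names_out()
        templates = set()

        def walk(node, path):
            if t.children_left[node] == t.children_right[node]:
                return  # leaf: no feature test, nothing to conjoin
            # map a one-hot column such as "mod-pos=DET" back to "mod-pos"
            feature = names[t.feature[node]].split("=")[0]
            new_path = path if feature in path else path + (feature,)
            templates.add(new_path)
            walk(t.children_left[node], new_path)
            walk(t.children_right[node], new_path)

        walk(0, ())
        return templates

    print(extract_templates(tree, vectorizer))
    # e.g. {("dist",), ("dist", "mod-pos"), ("dist", "mod-pos", "side"), ...}

The training loop of Figure 2 can be sketched end to end as well. The toy below is our own simplification, not the paper's system: the task is two-tag sequence labeling, Y(x) is small enough to enumerate exhaustively (a real SL predictor instead solves a task-specific optimization, such as maximum branching for DP), and each derived feature is a binary indicator pairing an instantiated template with the value of a local decision variable (here, a token's tag).

    import itertools
    from collections import defaultdict

    TAGS = ("N", "V")

    def derived_features(x, y, templates):
        """Phi(x, y): one indicator per (template, instantiated values, tag),
        summed over the local decisions of the candidate output y."""
        phi = defaultdict(float)
        for feats, tag in zip(x, y):  # one local decision per token
            for tpl in templates:
                phi[(tpl, tuple(feats[f] for f in tpl), tag)] += 1.0
        return phi

    def predict(x, w, templates):
        """F(x; w) = argmax over Y(x) of <w, Phi(x, y)>, by enumeration."""
        def score(y):
            phi = derived_features(x, y, templates)
            return sum(w.get(k, 0.0) * v for k, v in phi.items())
        return max(itertools.product(TAGS, repeat=len(x)), key=score)

    def train(D, templates, epochs=10):
        """Simplified Figure 2: plain structured perceptron updates; the
        large-margin extension of [3] would additionally enforce a margin."""
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y in D:
                y_hat = predict(x, w, templates)
                if y_hat != y:  # w <- w + Phi(x, y) - Phi(x, y_hat)
                    for k, v in derived_features(x, y, templates).items():
                        w[k] += v
                    for k, v in derived_features(x, y_hat, templates).items():
                        w[k] -= v
        return w

    # Invented two-sentence corpus: suffix "s" marks the verb in this toy.
    D = [
        ([{"suffix": "s"}, {"suffix": "e"}], ("V", "N")),
        ([{"suffix": "e"}, {"suffix": "s"}], ("N", "V")),
    ]
    templates = {("suffix",)}  # as if induced by EFI from the basic dataset
    w = train(D, templates)
    print(predict([{"suffix": "s"}, {"suffix": "e"}], w, templates))  # ('V', 'N')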

3 Experimental Evaluation

We apply ESL to nine datasets involving five NLP tasks: POS tagging, text chunking, dependency parsing, quotation extraction and coreference resolution. Four languages are involved in these tasks: Arabic, Chinese, English and Portuguese. Next, we briefly describe our main experimental findings.

3.1 Portuguese Language

We apply ESL to four Portuguese datasets: (i) Mac-Morpho for POS tagging, (ii) Bosque for text chunking, (iii) CoNLL-2006 for dependency parsing, and (iv) GloboQuotes for quotation extraction. In Table 1, we present the performances achieved by the ESL systems along with the best known performance on each dataset. Our systems improve the best known performance on three of the four datasets. Additionally, it is worth noticing that the best performing system for Portuguese DP uses high-order feature functions that could also be incorporated into the ESL model, which would substantially improve ESL performance. When ESL is compared with the best available system that uses the same kind of feature functions, its performance is superior, as reported in the full thesis.

    Task                   State of the Art   ESL     Error Reduction
    POS tagging                 96.94         97.12         5.9%
    Text chunking               87.46         87.72         2.1%
    Dependency parsing          93.03         92.66        -5.3%
    Quotation extraction        71.26         76.80        19.3%

Table 1. State-of-the-art accuracy comparison on the Portuguese datasets (a negative error reduction means that ESL increases the error).

3.2 Other Languages

We apply ESL to five other datasets: the Brown Corpus for English POS tagging, CoNLL-2000 for English text chunking, and CoNLL-2012 for Arabic, Chinese, English and multilingual coreference resolution. In Table 2, we present the performances achieved by the ESL systems along with the best known performance on each dataset. ESL-based systems improve the best known performance in four cases. It is worth noticing that the best performing system for English POS tagging is a committee of models; we could likewise have combined several ESL models to improve our final performance on this task. Moreover, when compared to any single model, ESL performs better.
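For reference, the Error Reduction columns of Tables 1 and 2 are consistent with the usual relative error measure, in which negative values indicate that ESL increases the error:

    \text{error reduction} = \frac{\text{acc}_{\text{ESL}} - \text{acc}_{\text{SOTA}}}{100 - \text{acc}_{\text{SOTA}}}

For the Portuguese POS tagging row of Table 1, for instance, (97.12 - 96.94)/(100 - 96.94) = 0.18/3.06 ≈ 5.9%.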

The multilingual coreference resolution score corresponds to the average performance over the Arabic, Chinese and English datasets, as reported in the CoNLL-2012 Shared Task. Our ESL system obtained the best performance in this competition. At that time, however, the ESL performance on Chinese coreference resolution was not the best known result. After the competition, we fixed a minor issue in the ESL model for this particular dataset and achieved the performance reported in Table 2, which is currently the best known result.

    Task                     Language       State of the Art   ESL     Error Reduction
    POS tagging              English             96.83         96.72        -3.5%
    Text chunking            English             94.21         94.12        -1.6%
    Coreference resolution   Arabic              53.55         54.22         1.4%
    Coreference resolution   Chinese             62.24         62.87         1.7%
    Coreference resolution   English             61.31         63.37         5.3%
    Coreference resolution   Multilingual        58.25         60.15         4.5%

Table 2. State-of-the-art performance comparison on the non-Portuguese datasets.

3.3 EFI Assessment

We further assess EFI by experimentally comparing it to manual template generation and the quadratic kernel on three datasets. Table 3 summarizes the results: EFI outperforms both alternative methods on the evaluated datasets. Additionally, EFI is much cheaper than manual template generation and computationally faster than kernel methods.

    Task                  Alternative Method   Alternative F1   EFI F1   Error Reduction
    Portuguese DP         Manual templates          90.06        90.28        2.2%
    English chunking      Quadratic kernel          93.48        94.12        9.8%
    Portuguese chunking   Quadratic kernel          86.67        87.72        7.9%

Table 3. Comparison of EFI with alternative feature generation methods (F1 scores).

We also report another experiment, in which we apply ESL to the CoNLL-2012 coreference resolution datasets using only the 70 basic features of our model, that is, without EFI templates. It is worth noticing that these 70 basic features include several complex features: some are themselves conjunctions of simpler basic features, and others provide complex, task-dependent information, such as head words and agreement in number and gender. These 70 basic features were manually engineered by domain experts and encode valuable coreference information.

In Table 4, we report the performance of the systems trained on the 70 basic features alone, along with the performance of the systems trained with EFI feature templates. The CoNLL score improves by a remarkable 8.54 points when EFI-based features are used, and EFI consistently outperforms the baseline on all three languages.

    EFI    Arabic   Chinese   English   CoNLL Score
    No      42.01    56.38     54.37       50.92
    Yes     52.52    62.50     63.35       59.46

Table 4. Impact of EFI on the CoNLL-2012 development sets.

4 Conclusions

We propose the entropy-guided structure learning framework, which extends the general SL framework by integrating EFI, an automatic feature generation approach. Our empirical findings indicate that EFI is faster than kernel methods and avoids overfitting. Compared to manual feature templates, the fact that EFI bypasses the domain expert is highly valuable.

We evaluated ESL on nine datasets involving five natural language processing tasks and four different languages. ESL achieves performance comparable to the state of the art on all evaluated datasets. Moreover, it outperforms the previous best performing systems on six datasets, namely the Mac-Morpho dataset for Portuguese POS tagging, the Bosque dataset for Portuguese text chunking, the GloboQuotes dataset for Portuguese quotation extraction, and the CoNLL-2012 Shared Task datasets for Arabic, Chinese and multilingual coreference resolution. Additionally, the ESL coreference systems achieved first place in the CoNLL-2012 Shared Task on multilingual coreference resolution.

As future work, we plan to introduce second- and third-order features in our models for dependency parsing and coreference resolution. For coreference resolution, some authors [13] report that features based on partial clusters bring substantial performance improvements; we plan to extend our latent modeling to include such features. Text chunking and named entity recognition have recurrently been recast as sequence labeling problems. Nevertheless, these tasks require sentence segmentation and, additionally, segment classification. We plan to model them directly as sequence segmentation and classification problems, as we have already done successfully for quotation extraction. With this modeling, we will be able to use more meaningful features for these tasks and, hopefully, improve performance.

References

1. Collins, M.: Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 1–8 (2002)
2. Crammer, K., Singer, Y.: Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research 3, 951–991 (2003)
3. Fernandes, E.R., Brefeld, U.: Learning from partially annotated sequences. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Athens, Greece (2011)
4. Fernandes, E.R., Crestana, C.E.M., Milidiú, R.L.: Hedge detection using the RelHunter approach. In: Proceedings of the Conference on Computational Natural Language Learning: Shared Task, pp. 64–69 (2010)
5. Fernandes, E.R., dos Santos, C.N., Milidiú, R.L.: Latent structure perceptron with feature induction for unrestricted coreference resolution. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning: Shared Task, pp. 41–48. Jeju Island, Korea (2012)
6. Fernandes, E.R., Milidiú, R.L., Rentería, R.P.: RelHunter: a machine learning method for relation extraction from text. Journal of the Brazilian Computer Society 16, 191–199 (2010)
7. Fernandes, E.R., Milidiú, R.L., dos Santos, C.N.: Portuguese language processing service. In: Proceedings of the International World Wide Web Conference, WWW in Ibero-America Alternate Track (2009)
8. Fernandes, E.R., Milidiú, R.L.: Entropy-guided feature generation for structured learning of Portuguese dependency parsing. In: Proceedings of the Conference on Computational Processing of the Portuguese Language, pp. 146–156 (2012)
9. Fernandes, E.R., Pires, B.A., dos Santos, C.N., Milidiú, R.L.: Clause identification using entropy guided transformation learning. In: Proceedings of the Brazilian Symposium in Information and Human Language Technology, pp. 117–124 (2009)
10. Fernandes, E.R., dos Santos, C.N., Milidiú, R.L.: A machine learning approach to Portuguese clause identification. In: Proceedings of the Computational Processing of the Portuguese Language, pp. 55–64 (2010)
11. Fernandes, E.R., dos Santos, C.N., Milidiú, R.L.: Latent trees for coreference resolution. Computational Linguistics 40(4) (2014), to appear
12. Joachims, T.: Learning to align sequences: a maximum-margin approach. Tech. rep., Cornell University (2003)
13. Lee, H., Peirsman, Y., Chang, A., Chambers, N., Surdeanu, M., Jurafsky, D.: Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 Shared Task. In: Proceedings of the Conference on Computational Natural Language Learning: Shared Task, pp. 28–34 (2011)
14. dos Santos, C.N., Milidiú, R.L.: Entropy guided transformation learning. In: Foundations of Computational Intelligence, Volume 1: Learning and Approximation, Studies in Computational Intelligence, vol. 201, pp. 159–184. Springer (2009)
15. dos Santos, C.N., Milidiú, R.L., Crestana, C.E.M., Fernandes, E.R.: ETL ensembles for chunking, NER and SRL. In: Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, pp. 100–112 (2010)
16. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 1453–1484 (2005)