THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

SISOM & ACOUSTICS 2015, Bucharest, 21-22 May

Marilena LAZĂR 1, Diana MILITARU 2
1 Military Equipment and Technologies Research Agency, Bucharest, email: mnvlazar@gmail.com
2 University Politehnica of Bucharest, Bucharest

Speech recognition and understanding systems require various kinds of linguistic knowledge to improve their performance. This means capturing knowledge from different linguistic levels (morphological, syntactic, semantic, etc.) and processing it with different techniques. Decision trees are among the techniques most frequently used in natural language processing because they model the grammatical structure of sentences and phrases very well. In this paper we present the role of decision trees in modeling the possession relation between two semantically related nouns.

Keywords: natural language processing, decision trees, linguistic knowledge, morphosyntactic description, syntactic level, possession relation, Romanian language.

1. INTRODUCTION

Automatic speech recognition and understanding require knowledge from several domains, such as signal processing and pattern recognition, natural language processing, mathematics and linguistics. Recognizing and understanding natural speech requires both acoustic pattern matching and linguistic knowledge. The linguistic knowledge humans use when they communicate is hard to model, and it is almost impossible to obtain a language model that solves all these problems. Different theories and methods have been developed to model the various kinds of linguistic knowledge needed by natural language models, so that these models can be used in automatic speech recognition and understanding systems, machine translation, natural language generation, word sense disambiguation, part-of-speech tagging, etc.
Each of these methods uses various kinds of linguistic knowledge, so that the resulting models improve the accuracy of automatic speech recognition and understanding. Among the methods most used in natural language processing for resolving ambiguities at the phonetic, morphological, syntactic, semantic and pragmatic levels are decision trees, because they model language structure well.

2. LINGUISTIC KNOWLEDGE

Because of the complexity of natural language, explaining natural language behavior is a very hard task. For this reason, knowledge of language has been divided into several levels, each containing certain linguistic features, which allows specific linguistic analysis methods to be created for each level. The division into linguistic levels is made so that information from a lower level helps the analysis at the levels above it. Linguistic knowledge is divided into levels as follows:
- the way linguistic sounds are produced is studied at the phonetic and phonological levels;
- the way words are formed from meaningful components is examined at the morphological level;
- the way words are ordered and grouped together is studied at the syntactic level;
- the meaning of words is examined at the semantic level;

- what kinds of actions speakers intend by the use of certain sentences is studied at the pragmatic or dialogue level.
In this paper we study word relations only at the syntactic level, in order to define the possession relation between two nouns using morphosyntactic descriptions.

3. DECISION TREES

A decision tree is a tool used for supporting the decision making process, helping to make good choices. Decision trees belong to the family of machine learning methods and impose a hierarchical partition on a dataset collection. Given a training dataset, a decision tree algorithm generates knowledge in the form of a hierarchical tree structure that can then be used to classify instances. An example of a tree used in natural language processing is the syntactic tree generated with a grammar (figure 1).

Legend: S - sentence, NP - noun phrase, VP - verb phrase, PP - prepositional phrase, N - noun, V - verb, P - preposition

Figure 1. The decision tree for the Romanian sentence "Maria pune cartea pe masă." (eng.: Mary puts the book on the table.)

An important part of a decision tree algorithm is the method used for selecting an attribute at each node of the tree; each algorithm uses a particular method for splitting the set of items. The C4.5 algorithm [6] splits using the information gain, a notion based on the concept of entropy. Another well-known algorithm, CART [10], uses the Gini impurity, which measures how often a randomly chosen item from the dataset would be incorrectly labelled. Decision trees can classify unseen instances because, given a training dataset, a decision tree can be induced from which rules about the dataset are easily extracted. Another advantage of decision trees is that they handle both categorical and numerical data; they can also classify large datasets in a short time.
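To make the two splitting criteria concrete, here is a small self-contained sketch (our own illustration, not WEKA's or CART's actual implementation) of the entropy-based information gain used by C4.5 and the Gini impurity used by CART:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: probability that a randomly chosen item is mislabelled."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, splits):
    """Entropy reduction achieved by partitioning `labels` into `splits`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

# Toy example with invented class names: a perfect two-way split
labels = ["poss", "poss", "other", "other"]
splits = [["poss", "poss"], ["other", "other"]]
print(information_gain(labels, splits))  # 1.0 bit: all entropy removed
print(gini(labels))                      # 0.5 before the split
```

C4.5 evaluates this gain (or the gain ratio, which normalizes it) for every candidate attribute and splits on the best one; CART does the same with the weighted Gini impurity.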
4. DECISION TREES IN NATURAL LANGUAGE PROCESSING

The challenge in natural language processing is to select the "best" linguistic knowledge to use when trying to solve a problem. In fact, there are many situations where an ambiguous case (e.g. part-of-speech tagging) must be resolved by making a decision, and decision trees are among the best methods for decision making. They can be used for disambiguation problems at every linguistic level, from phonetic ambiguities to understanding a dialogue. Below we present some uses of decision trees in natural language processing, organized by the type of linguistic knowledge involved.

Parts of speech are very important in morphology because they carry a large amount of information about a word, its neighbors, and the way the word is pronounced. The problem of assigning parts of speech to words (part-of-speech tagging) is therefore very important in speech and language processing. A crucial role in part-of-speech tagging for morphologically complex languages is played by morphological parsing [7]. The structures that morphological parsers produce can take many forms: strings, trees, or networks. Algorithms based on decision trees for this task were developed in the 1990s [9], [11].
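As a concrete illustration of how such tree-based algorithms make tagging-style decisions over categorical linguistic features, here is a minimal ID3-style induction sketch; the features, values, and class names are invented for illustration and do not come from any of the cited systems:

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy of the class labels (last field of each row)."""
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def best_attribute(rows, attrs):
    """Pick the attribute whose split maximizes information gain."""
    def gain(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        return entropy(rows) - sum(len(g) / len(rows) * entropy(g)
                                   for g in groups.values())
    return max(attrs, key=gain)

def build_tree(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    a = best_attribute(rows, attrs)
    rest = [x for x in attrs if x != a]
    branches = {}
    for r in rows:
        branches.setdefault(r[a], []).append(r)
    return (a, {v: build_tree(g, rest) for v, g in branches.items()})

def classify(tree, row):
    while isinstance(tree, tuple):          # descend until a leaf is reached
        a, branches = tree
        tree = branches[row[a]]
    return tree

# Toy instances: (case, gender, has_article, class) -- illustrative only
data = [
    ("genitive",   "f", "yes", "possession"),
    ("genitive",   "m", "no",  "possession"),
    ("nominative", "f", "no",  "other"),
    ("nominative", "m", "yes", "other"),
]
tree = build_tree(data, attrs=[0, 1, 2])
print(classify(tree, ("genitive", "f", "yes")))  # -> possession
```

On this toy data the case attribute alone separates the classes, so the induced tree tests only attribute 0; real taggers and parsers apply the same idea to far larger feature sets.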

Initially, morphological parsing (the process of finding the constituent morphemes of a word) used hand-written rules; later the task was addressed with supervised machine learning, including decision trees [8]. Recent research has focused on unsupervised ways to induce the morphological structure automatically, without labeled training data [2], [3].

The next kind of linguistic knowledge needed for understanding a statement comes from the syntactic level. The ways words are arranged together help us understand which sequences of words make sense and which do not. All linguistic knowledge about which words can be grouped together, and how, belongs to the syntactic level. Ambiguity occurs at this level because the grammar sometimes assigns two or more possible parse trees to a sentence. Syntactic parsing resolves this structural ambiguity by searching the space of all possible parse trees to find the correct parse for each given sentence. Today many parsing algorithms employ context-free grammars to produce syntactic trees, which are later used for semantic analysis in applications such as machine translation, question answering, information extraction, and grammar checking. Decision trees, together with other methods, were used to develop probabilistic grammars [9], which were later applied both to disambiguation in syntactic parsing and to word prediction in language modeling. In the 1990s, research was carried out on adding lexical dependencies to probabilistic context-free grammars, so that these grammars became more sensitive to syntactic structure. The lexicalized probabilistic approach has been used to resolve prepositional-phrase attachment with decision trees that employ the semantic distance between heads [12]. Also, because decision trees model language structure well, they have been used to develop statistical parsing models [1].
Analysis at the syntactic level plays an important role in natural language processing because knowledge from this level supports analysis at the level above it (semantic analysis). The meaning of word sequences can be captured in formal structures, called meaning representations, which link word sequences (linguistic knowledge) to non-linguistic knowledge in order to perform tasks involving meaning. Before sentences can be understood, however, the word sense disambiguation problem must be solved. Examining each word in a text and determining which of its senses is being used in that context is not easy, because many words have more than one meaning. One of the methods used for sense disambiguation is supervised learning. In the supervised approach, a corpus hand-labeled with correct word senses is used to extract a set of features that help predict particular senses; these features are then used to train a classifier (naive Bayes, decision list, decision tree, etc.). Considerable research on sense disambiguation has used methods such as semantic networks, naive Bayes and decision list classifiers. With the growing interest in supervised machine learning approaches to sense disambiguation, decision tree learning began to be used for this task as well [13]. Decision tree learning, combined with other methods, has also been used for detecting part-whole relations [14], noun compound relations [15], [16], noun-modifier relations [17], and semantic roles [18], [19]. Another problem solved with decision trees is modeling long-distance dependencies in sentences, which cannot be captured by n-gram models [20]. The resulting decision tree based language model, however, gives only a slight improvement in perplexity over the standard n-gram model.
The author therefore suggested using the two models (the decision tree based language model and the n-gram model) together, to obtain a much lower perplexity. Decision trees can also be used in pragmatics. Interpreting a dialogue act requires the system to decide whether a given input is a statement, a question, a directive, or an acknowledgement; one of the methods used to train the prosodic predictor that solves this task was decision trees [21]. Decision trees have also been used for supervised discourse segmentation [22].

5. EXPERIMENTS AND RESULTS

As mentioned before, decision trees are used in natural language processing for detecting different kinds of syntactic and semantic relations. In this paper we focus only on the possession relation between two nouns in the Romanian language, determined using semantic criteria, and try to detect this relation encoded

into lexico-syntactic patterns. The definition of the possession relation between two nouns used here is: an entity (A) is possessed/owned by an animated entity (B) [23]. In Romanian, for example, the possession relation between two nouns, the possessed object and the possessor, is expressed by the genitive case (figure 2).

Noun-noun example        MSD                      English translation
părul Juliei             Ncmsry Npfsoy            Julia's hair
membrii unei familii     Ncmpry Tifso Ncfson      members of a family
ochii lui Winston        Ncmpry Tf-so Np          Winston's eyes

Figure 2. Some of the lexico-syntactic patterns encoding the possession relation, extracted from the Romanian translation of Orwell's novel "Nineteen Eighty-Four" (from the MULTEXT-East corpus [4])

To discover the possession relations between two nouns we used C4.5 decision tree learning [6] as implemented in WEKA [5]. To detect this relation, we chose for the learning algorithm a set of linguistic factors extracted from the linguistically annotated corpus consisting of the Romanian translation of Orwell's novel "Nineteen Eighty-Four" [4]. We had 179 relation classes and a total of 410 instances. The obtained tree had size 134, with 88 leaves. The selected features were:
- articulation type, number, gender, case and instances for the nouns;
- morphosyntactic descriptions and instances for the link word.
The decision tree obtained for the possession relation encoded into lexico-syntactic patterns extracted from the Romanian translation of Orwell's novel "Nineteen Eighty-Four" is presented in figure 3.

Figure 3. The decision tree obtained for the possession relation encoded into lexico-syntactic patterns extracted from the Romanian translation of Orwell's novel "Nineteen Eighty-Four"

Table 1. Ranked attributes for the possession relation between two nouns, by information gain (a) and gain ratio (b)

Ranked attributes for information gain (a)      Ranked attributes for gain ratio (b)
gender_possessednoun        5.3257              instancepossessednoun       1
exist_possessivearticle     3.999               instancepossessornoun       1
no_possessornoun            1.3679              MSD_PossessiveArticle       1
case_possessednoun          1.3679              instancefirstlinkword       1
gender_possessornoun        1.2081              case_possessednoun          1
no_possessednoun            1.064               case_possessornoun          1
case_possessornoun          0.9983              exist_possessivearticle     1
articulate_possessornoun    0.9931              gender_possessornoun        1
articulate_possessednoun    0.9892              no_possessednoun            0.989
instancepossessornoun       0.9601              gender_possessednoun        0.687
instancepossessednoun       0.8783              no_possessornoun            0.666
MSD_PossessiveArticle       0.8659              articulate_possessornoun    0.476
instancefirstlinkword       0.6941              articulate_possessednoun    0.476

Each instance has 13 attributes, but some of them are more relevant than others. To decide which attributes are the most relevant for our decision trees, we used the information gain and the gain ratio. Table 1 shows the attributes for the possession relation between two nouns, ranked by information gain (a) and by gain ratio (b).

To estimate the performance of the experiment we used the following evaluation measures [5]:
a. The Kappa statistic measures the difference between the expected and observed agreement on a dataset, standardized to lie between 1 (perfect agreement) and -1 (perfect disagreement), 0 being exactly chance expectation:
   K = (P_o - P_e) / (1 - P_e)   (1)
   where P_o is the observed agreement and P_e the agreement expected by chance.
b. The true positive rate (also called sensitivity or, in our case, recall) is the percentage of instances correctly classified as a given class:
   TP rate = TP / (TP + FN)   (2)
c. The false positive rate is the percentage of instances incorrectly classified as a given class:
   FP rate = FP / (FP + TN)   (3)
d. Precision is the positive predictive value, a measure of a classifier's exactness. It is the percentage of instances correctly classified as positive out of all instances the algorithm classified as positive:
   P = TP / (TP + FP)   (4)
e. The F-measure is a combined measure of precision and recall:
   F = 2 * P * R / (P + R)   (5)
f. The ROC area (the area under the receiver operating characteristic curve) is obtained by plotting the true positive rate against the false positive rate. A good classifier has a ROC area close to 1.
g. Accuracy is the percentage of correctly classified instances:
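The measures above can be computed directly from confusion-matrix counts. The following sketch (our own illustration, not WEKA code) implements formulas (1)-(6) for a single class, treating that class as "positive":

```python
def evaluation_measures(tp, fp, fn, tn):
    """Per-class evaluation measures from confusion-matrix counts."""
    n = tp + fp + fn + tn
    tpr = tp / (tp + fn)                                # recall, formula (2)
    fpr = fp / (fp + tn)                                # formula (3)
    precision = tp / (tp + fp)                          # formula (4)
    f_measure = 2 * precision * tpr / (precision + tpr) # formula (5)
    accuracy = (tp + tn) / n                            # formula (6), as a fraction
    # Cohen's kappa, formula (1): observed vs. chance agreement
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"kappa": round(kappa, 3), "recall": tpr, "fpr": fpr,
            "precision": precision, "f": round(f_measure, 3), "acc": accuracy}

# Toy counts, not the paper's data
print(evaluation_measures(tp=45, fp=5, fn=5, tn=45))
```

For the toy counts above (a balanced two-class case with 90 of 100 instances correct), the function returns recall, precision, F-measure and accuracy of 0.9 and a kappa of 0.8.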

   A = (number of correctly classified instances / total number of instances) x 100   (6)

For the selected features described above we obtained an accuracy of 99.756% when using the training set (table 2) and an accuracy of 99.512% when using 10-fold cross-validation (table 3).

Table 2. The weighted average values and accuracy of the two-noun relation for the training set

Kappa statistic  TP Rate (Recall)  FP Rate  Precision  F-Measure  ROC Area  Correctly classified instances (%)
0.898            0.90              0.002    0.845      0.900      0.998     90.00

Table 3. The weighted average values and accuracy of the two-noun relation for cross-validation (10 folds)

Kappa statistic  TP Rate (Recall)  FP Rate  Precision  F-Measure  ROC Area  Correctly classified instances (%)
0.858            0.861             0.002    0.829      0.841      0.997     86.097

6. CONCLUSIONS

As presented above, there are many tasks in natural language processing that can be solved using decision trees. They can be used to model different linguistic data, from phonetic to pragmatic knowledge, and combined with other methods they are a powerful tool in natural language processing. In our case, we extracted 410 possession relations between two nouns, determined using semantic criteria, from the linguistically annotated corpus consisting of the Romanian translation of Orwell's novel "Nineteen Eighty-Four" [4]. To detect these possession relations between nouns we used C4.5 decision tree learning [6] as implemented in WEKA [5] and this set of 13 morphosyntactic factors:
- articulation type, number, gender, case and instances for the nouns;
- morphosyntactic descriptions and instances for the link word.
Using these features, the relations were grouped into 179 classes. The decision tree had 88 leaves and its size was 134.
The results were:
- 90% accuracy and 0.998 ROC area for the training set;
- 86.097% accuracy and 0.997 ROC area using cross-validation with 10 folds.
In conclusion, modeling the possession relation between two nouns using decision trees seems promising and useful for Romanian language modeling. To improve the results we could use more possessive relations between two or more nouns, more or different features, a semantic interpretation of the possession relation, etc.

REFERENCES

1. Magerman, D. M., Statistical decision-tree models for parsing, ACL-95, pp. 66-53, ACL, 1995.
2. Monson, C., ParaMor: From Paradigm Structure to Natural Language Morphology Induction, PhD thesis, Carnegie Mellon University, 2009.
3. Creutz, M., Lagus, K., Unsupervised models for morpheme segmentation and morphology learning, ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 3, January 2007.
4. Erjavec, T., MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), ELRA, Paris, 2010.
5. Witten, I. H., Frank, E., Hall, M. A., Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Morgan Kaufmann Publishers, 2011.
6. Quinlan, J. R., Induction of decision trees, Machine Learning, 1, 81-106, 1986.

7. Jurafsky, D., Martin, J. H., Speech and Language Processing: Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, New Jersey, 2nd ed., 2008.
8. Van den Bosch, A., Learning to Pronounce Written Words: A Study in Inductive Language Learning, PhD thesis, University of Maastricht, Maastricht, The Netherlands, 1997.
9. Jelinek, F., Lafferty, J. D., Magerman, D. M., Mercer, R. L., Ratnaparkhi, A., Roukos, S., Decision tree parsing using a hidden derivation model, ARPA Human Language Technologies Workshop, Plainsboro, pp. 62-67, Morgan Kaufmann, 1994.
10. Breiman, L., et al., Classification and Regression Trees, Wadsworth, Pacific Grove, CA, 1984.
11. Heeman, P. A., POS tags and decision trees for language modeling, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-99), pp. 14-137, 1999.
12. Stetina, J., Nagao, M., Corpus based PP attachment ambiguity resolution with a semantic dictionary, in Zhou, J. and Church, K. W. (eds.), Proceedings of the Fifth Workshop on Very Large Corpora, Beijing, China, pp. 66-80, ACL, 1997.
13. Black, E., An experiment in computational discrimination of English word senses, IBM Journal of Research and Development, 3(2), 185-194, 1988.
14. Girju, R., Badulescu, A., Moldovan, D., Learning semantic constraints for the automatic discovery of part-whole relations, Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, 2003.
15. Rosario, B., Hearst, M., Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, Proceedings of the Conference on EMNLP, 2001.
16. Rosario, B., Hearst, M., Fillmore, C., The Descent of Hierarchy, and Selection in Relational Semantics, Proceedings of ACL, 2002.
17. Nastase, V., Szpakowicz, S., Exploring Noun-Modifier Semantic Relations, International Workshop on Computational Semantics, Tilburg, The Netherlands, January 2003.
18. Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P., Using predicate-argument structures for information extraction, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), pp. 8-15, 2003.
19. Chen, J., Rambow, O., Use of Deep Linguistic Features for the Recognition and Labeling of Semantic Arguments, Proceedings of EMNLP-2003, Sapporo, Japan, 2003.
20. Bahl, L. R., Brown, P. F., de Souza, P. V., Mercer, R. L., A tree-based statistical language model for natural language speech recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001-1008, 1989.
21. Shriberg, E., Bates, R., Taylor, P., Stolcke, A., Jurafsky, D., Ries, K., Coccaro, N., Martin, R., Meteer, M., Van Ess-Dykema, C., Can prosody aid the automatic classification of dialog acts in conversational speech?, Language and Speech (Special Issue on Prosody and Conversation), 41(3-4), 439-487, 1998.
22. McCarthy, J. F., Lehnert, W. G., Using decision trees for coreference resolution, IJCAI-95, Montreal, Canada, pp. 1050-1055, 1995.
23. Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., Girju, R., Models for the Semantic Classification of Noun Phrases, Computational Lexical Semantics Workshop, Human Language Technology Conference (HLT-NAACL), Boston, USA, May 2004.