WDS'06 Proceedings of Contributed Papers, Part I, 191–195, 2006. ISBN 80-86732-84-3 MATFYZPRESS

Morphological Tagging Based on Averaged Perceptron

J. Votrubec
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.

Abstract. Czech (like other Slavic languages) is well known for its complex morphology. Text processing (e.g., automatic translation, syntactic analysis) usually requires an unambiguous selection of grammatical categories (a so-called morphological tag) for every word in a text. Morphological tagging consists of two parts: assigning all possible tags to every word in a text, and selecting the right tag in a given context. The project Morče attempts to solve the second part, usually called disambiguation. Using a statistical method based on the combination of a Hidden Markov Model and the Averaged Perceptron algorithm, a number of experiments were made exploring different parameter settings of the algorithm in order to obtain the best possible success rate. The final accuracy of Morče on data from PDT 2.0 was 95.431% (results of March 2006). So far, this is the best result for a standalone tagger.

Introduction

Czech, like other Slavic languages, is well known for its rich morphology. Frequent homonymy complicates computational processing. For instance, podle can be a preposition in the sentence Šel podle lesa. (He walked along a forest.) or an adverb in the sentence Choval se podle. (He behaved perfidiously.) There is also strong homonymy in word endings. The word form nehty (nails) can represent nominative, accusative, vocative or instrumental plural. Adjectives of the type jarní always have at least 27 different interpretations (3 cases x 2 numbers x 4 genders, plus 3 more cases for feminine singular).
These morphological interpretations are represented by a standardized[1] system of tags, each consisting of 15 positions, one for each morphological category, such as part of speech, number, case, tense, etc. (in fact, 2 positions are unused). Thus, one word form can have several corresponding tags if it is taken as an isolated word; the correct tag can be chosen only using the context. An average Czech word has 4 possible morphological tags (= 4 morphological interpretations). However, if we need to process natural language texts automatically, we require only one (correct) tag for each word. Some examples of applications which require morphological analysis are machine translation, building text corpora, syntactic analysis, etc.

The problem of morphology is divided into two tasks: a) morphological analysis should generate all possible tags for a given word form, ignoring context; b) tagging (or disambiguation) should select only one proper tag for each word. This paper is about a solution to task b); the input is the result of task a).

Morče

Morče comes from Morfologie češtiny, which means Czech morphology. It was originally a student project at MFF, and I later continued work on it in my master thesis. It is a command line application with a set of tools for Czech tagging, written in C for the Linux platform. It consists not only of the tagger itself but also of some development tools used to obtain the best possible accuracy.

Algorithm

Our approach can be labelled as a statistical learning method. It is based on the Hidden Markov Model and the so-called Averaged Perceptron, described by Michael Collins (Collins, 2002).

[1] The standard used in this work is from the Prague Dependency Treebank. For details see the web references.
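Before turning to the details, the decoding side of the approach sketched above (Viterbi search over the tags allowed by morphological analysis) can be illustrated in Python. This is a minimal, first-order sketch under assumed names, not Morče's C implementation: here a transition is scored from the previous tag only, while Morče's features also look further back.

```python
def viterbi(words, tagsets, score):
    """Find the highest-scoring tag sequence for `words`.

    tagsets[i] lists the tags morphological analysis allows for words[i];
    score(prev_tag, tag, words, i) evaluates one transition (in Morče,
    a sum of feature weights; here an arbitrary function). prev_tag is
    None at the first position.
    """
    # best[i][t] = (score of best path ending in tag t at position i, back-pointer)
    best = [{t: (score(None, t, words, 0), None) for t in tagsets[0]}]
    for i in range(1, len(words)):
        layer = {}
        for t in tagsets[i]:
            prev, s = max(
                ((p, best[i - 1][p][0] + score(p, t, words, i))
                 for p in tagsets[i - 1]),
                key=lambda pair: pair[1],
            )
            layer[t] = (s, prev)
        best.append(layer)
    # follow the back-pointers from the best final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```

The scoring function is the only learned part; everything else is the standard dynamic-programming search over tag sequences.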
HMM with Averaged Perceptron

Generally, the method is based on a Hidden Markov Model (HMM), which is used wherever one sequence of information needs to be transformed into another one, presuming the transformation is determined only by a history of limited length. Here we are transforming a sequence of word forms into a sequence of morphological tags (although, theoretically, it is the other way round). To train and use this model we use the Viterbi algorithm, which finds the most probable output sequence given the input sequence.

What is new in this concept is the algorithm for evaluating transitions between HMM states, namely the Averaged Perceptron, which is simple enough (to obtain good running times) and gives very good accuracy. The Averaged Perceptron had not been implemented for Czech morphological tagging before. In order to obtain optimal accuracy it is necessary to estimate appropriate input parameters of the perceptron. These parameters determine the transition weights between HMM states. They necessarily have to describe the word's context, since we already know that without context the proper tag cannot be found. These context-describing parameters are called features.

Features

Features describe the given situation in the text and determine the parameters which will be observed during training and testing. Each feature has a corresponding weight coefficient for the Averaged Perceptron. In general, a feature can be arbitrarily complicated, and it can use any information which is contained in the input text or which can be derived from it. Let us see an example sentence: Na tři hlavní podezřelé byla uvalena vazba. (Three main suspects were put into custody.)

Word       Tag[2]
Na         RR--4----------
tři        ClXP4----------
hlavní     AAMP4----1A----
podezřelé  AAMP4----1A----
byla       VpQW---XR-AA---
uvalena    VsQW---XX-AP---
vazba      NNFS1-----A----

Here are some examples of features valid for the 3rd position (hlavní):

Current tag is AAMP4----1A----.
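The feature examples above can be pictured as simple string-valued predicates over a position in the sentence. The following is an illustrative Python sketch only; the feature-name encodings and the function name `features` are invented here, not Morče's actual C representation.

```python
def features(words, tags, i):
    """Features valid at position i, mirroring the examples in the text.

    The string encodings are illustrative; Morče's internal format differs.
    """
    tag = tags[i]
    prev_tag = tags[i - 1] if i > 0 else "<S>"  # sentence-start marker
    return [
        "tag=" + tag,                                  # current tag itself
        "prevtag=" + prev_tag + "|tag=" + tag,         # previous tag + current tag
        "word=" + words[i] + "|tag=" + tag,            # current word form + tag
        "pos=" + str(i + 1) + "|tag=" + tag,           # position in sentence + tag
        "lower=" + str(words[i][:1].islower()) + "|tag=" + tag,  # letter case + tag
    ]
```

Each returned string is one feature that is valid at position i; at test time the weight coefficients of exactly these strings are looked up.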
Previous tag is ClXP4---------- and current tag is AAMP4----1A----.
Current word form is hlavní and current tag is AAMP4----1A----.
Current word is the third in the sentence and current tag is AAMP4----1A----.
Current word starts with a lower case letter and current tag is AAMP4----1A----.

We say that a feature is either valid at a certain position in the text (if it describes a context which corresponds to the current situation), or invalid. For valid features we then use their weight coefficients. Alternatively, a feature can be understood as a prediction of some tag for the current position in the text.

Learning algorithm: Averaged Perceptron

Transition weights between states of the HMM (which are used by the Viterbi algorithm) are defined as the sum of the weight coefficients of all valid features in the given context. In the beginning all coefficients are set to 0.

[2] The exact meaning of the tags is not so important here. The following list of first-character meanings is given just for a rough understanding of the tags mentioned above. A detailed description of the tag system can be found at http://quest.ms.mff.cuni.cz/pdt/morphology_and_tagging/doc/docc0pos.pdf.

First character of tag    Meaning (part of speech)
R                         preposition
C                         numeral
A                         adjective
V                         verb
N                         noun
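The definition of transition weights as a sum of the weight coefficients of valid features is a one-liner; this minimal sketch (with a hypothetical `transition_weight` name) uses a plain dictionary for the coefficients, where unseen features contribute 0, matching the zero initialisation described above.

```python
def transition_weight(weights, valid_features):
    """Transition weight between HMM states: the sum of the weight
    coefficients of all features valid in the given context.

    `weights` maps feature strings to coefficients; a feature absent
    from the map contributes 0, consistent with zero initialisation.
    """
    return sum(weights.get(f, 0.0) for f in valid_features)
```

For example, with weights {"tag=A": 2.0, "word=x|tag=A": 0.5}, a context where both features are valid scores 2.5.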
Training

In a few iterations the algorithm goes through all the input data. The Viterbi algorithm finds the best path (i.e. the best tags) for each sentence using the current weight coefficients. After each sentence is finished, the weight coefficients are updated. This repeats until the required number of iterations is reached.

Updating weight coefficients

The weight coefficients of all features based on the current sentence and the algorithm's tags are decreased by 1. The weight coefficients of all features based on the current sentence and the correct tags are increased by 1. So, if the algorithm's tags are all correct, the weight coefficients remain unchanged.

Testing and use

Testing is in fact the same as one training iteration: we just find the best tags with the Viterbi algorithm. However, there are no updates. In an improved version we use averaged coefficients. They are resistant to oscillations and increase the accuracy of the algorithm (see Collins, 2002).

Implementation

The project was implemented in C as a command line application for the Linux platform. There are several steps to follow in order to obtain the best possible accuracy:

a) choose a feature set (or, more precisely, the types of features)
b) create a list of valid features from the training data
c) filter out features which do not appear frequently enough in the data
d) create a finite state automaton from the list of features; it handles the corresponding weight coefficients
e) train the weight coefficients in a given number of iterations
f) test (apply) the trained model on the test data (for each iteration)
g) evaluate accuracy on the test data (for each iteration)
h) analyze errors

Tuning the algorithm

There are many factors influencing the final accuracy: the data, the number of iterations, the feature set, and filtering.

Data

The data are in SGML format and are part of the Prague Dependency Treebank 2.0. They consist of three blocks: training data (1.5 mil. words), test data, and evaluation data (130,000 words each).
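The per-sentence update rule and the averaging can be sketched in a few lines of Python. This is a simplified illustration under assumed names, not Morče's implementation: `totals` naively re-accumulates every weight after every update, whereas efficient implementations (including Collins's) do this lazily.

```python
def perceptron_update(weights, totals, gold_feats, pred_feats):
    """One update after a sentence: -1 for all features based on the
    tags the algorithm chose, +1 for all features based on the correct
    tags. If the two tag sequences agree, the changes cancel exactly.

    `totals` accumulates every weight after every update; dividing the
    totals by the number of updates at the end gives the averaged
    coefficients used at test time.
    """
    for f in pred_feats:
        weights[f] = weights.get(f, 0) - 1
    for f in gold_feats:
        weights[f] = weights.get(f, 0) + 1
    for f, w in weights.items():
        totals[f] = totals.get(f, 0) + w
```

Training then just alternates Viterbi decoding of a sentence with one such update, for the chosen number of iterations over the data.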
In all these types of data there are on average 3.8 possible tags per word. The training data are used for training the weight coefficients for different feature sets. These are tested on the test data; the feature sets can then be modified according to the errors and the training process reiterated. This leads to a possible adaptation of the feature set not only to the training data but also to the test data. Therefore we use a third, independent block of data for the final evaluation; this is equivalent to normal use on unknown text. The result figures therefore correspond to the evaluation data.

Number of iterations

For every set of features, 10 iterations were done. The accuracy was tested after each iteration. We observed that the maximum accuracy on the test data is reached around the fifth iteration and decreases in further iterations.

Set of features

There are two aspects to the choice of features. First of all, it is necessary to choose the types of features to be used as templates to generate features from the training data. Then the set of features is filtered according to the number of occurrences in the data.
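The per-iteration testing described above amounts to scoring each iteration's model on the test data and keeping the best one. A minimal sketch, with hypothetical function names:

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag is correct."""
    assert len(gold_tags) == len(predicted_tags)
    hits = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return hits / len(gold_tags)

def best_iteration(accuracies):
    """1-based index of the iteration with the highest test accuracy;
    accuracies[i] is the test-data accuracy after iteration i + 1."""
    return max(range(len(accuracies)), key=accuracies.__getitem__) + 1
```

With the behavior reported above, the maximum would typically land around the fifth of the ten iterations.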
Filtering

Usually it is not suitable to use all the features generated from the training data, but only those that occur frequently enough. This substantially decreases their number and makes the algorithm run faster. It was experimentally found that the minimum number of occurrences should be 3, because this threshold does not decrease accuracy and its effect on running time is significant.

Feature types

To improve the accuracy, more than 120 experimental versions were developed. The final set of features follows. Features 1–16 predict a complete current tag; features 17–19 predict only the current SUBPOS+CASE.

1. current tag itself (unigram)
2. previous tag (bigram)
3. tag two positions back
4. combination 2.+3. (trigram)
5. current word
6. previous word
7. word two positions back
8. following word
9. position number of the word in the sentence (maximum 9)
10. previous verb (tag) up to 30 positions back
11. previous verb (lemma) up to 30 positions back
12. following possible verb (tag) up to 10 positions forward
13. following possible verb (lemma) up to 10 positions forward
14. previous lemma
15. letter case of the current word (lower case, first letter capital, all capital)
16. letter case of the current lemma (lower case, first letter capital, all capital)
17. current SUBPOS+CASE
18. previous SUBPOS+CASE and current SUBPOS+CASE
19. SUBPOS+CASE two positions back, previous SUBPOS+CASE and current SUBPOS+CASE

Averaged Perceptron behavior

During the experiments we arrived at some conclusions concerning the Averaged Perceptron. These conclusions do not necessarily apply only to tagging; we can expect the Averaged Perceptron to behave similarly in other applications.

The algorithm hates too much information

Although the program was implemented for a large number of features, a carefully selected small set of features gave much better results. Tuning the algorithm takes a long time, but it also brought linguistically relevant information.
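The occurrence-count filtering described earlier in this section is straightforward to express; this sketch (with an assumed function name) counts each feature generated from the training data and keeps those at or above the threshold of 3 that the text reports as the experimentally found minimum.

```python
from collections import Counter

def filter_features(generated_features, min_count=3):
    """Keep only features generated from the training data at least
    `min_count` times; 3 is the threshold found experimentally."""
    counts = Counter(generated_features)
    return {f for f, c in counts.items() if c >= min_count}
```

Rare features are discarded before training, so no weight coefficients are ever stored for them.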
The algorithm hates complex features

Complex features contain more information; they are more specialized and describe the given context better. However, they are not general enough, so simple features gave better results.

The algorithm does not need many iterations for training

Maximum accuracy on the test data usually came between the 4th and 8th iteration. It should be noted that the small number of iterations also corresponds to the large volume of the training data.

Results

The final accuracy of Morče on data from PDT 2.0 was 95.431% (March 2006). It is the best result for a standalone tagger so far.

Acknowledgments. The present work was supported by the Czech Grant Agency, grant no. GAČR 201/05/H014.
References

Collins, M., Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP, 2002.
Hajič, J., Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Praha, 2004.
Jelinek, F., Statistical Methods for Speech Recognition. The MIT Press, 1998.
Votrubec, J., Selecting an Optimal Set of Features for the Morphological Tagging of Czech (Master thesis). MFF UK, 2005.

Web links

Prague Dependency Treebank: http://ufal.mff.cuni.cz/pdt.
Summary of Czech morphological tagging: http://ufal.mff.cuni.cz/czech-tagging.