WDS'06 Proceedings of Contributed Papers, Part I, 191-195, 2006. ISBN 80-86732-84-3, MATFYZPRESS

Morphological Tagging Based on Averaged Perceptron

J. Votrubec
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Abstract. Czech (like other Slavic languages) is well known for its complex morphology. Text processing (e.g., automatic translation or syntactic analysis) usually requires an unambiguous selection of grammatical categories (a so-called morphological tag) for every word in a text. Morphological tagging consists of two parts: assigning all possible tags to every word in a text, and selecting the right tag in a given context. The project Morče attempts to solve the second part, usually called disambiguation. Using a statistical method based on the combination of a Hidden Markov Model and the Averaged Perceptron algorithm, a number of experiments were carried out, exploring different parameter settings of the algorithm in order to obtain the best possible success rate. The final accuracy of Morče on data from PDT 2.0 was 95.431% (results of March 2006). So far, it is the best result for a standalone tagger.

Introduction

Czech, like other Slavic languages, is well known for its rich morphology. Frequent homonymy complicates computational processing. For instance, podle can be a preposition, as in Šel podle lesa. (He walked along a forest.), or an adverb, as in Choval se podle. (He behaved perfidiously.). There is also strong homonymy in word endings: the word form nehty (nails) can represent the nominative, accusative, vocative or instrumental plural. Adjectives of the type jarní always have at least 27 different interpretations (3 cases x 2 numbers x 4 genders, plus 3 more cases for the feminine singular).

These morphological interpretations are represented by a standardized system of tags [1], each consisting of 15 positions, one for each morphological category such as part of speech, number, case, tense, etc. (in fact, 2 positions are unused). Thus, one word form taken in isolation can have several corresponding tags; the correct tag can be chosen only from the context. An average Czech word has 4 possible morphological tags (i.e., 4 morphological interpretations). However, if we need to process natural language texts automatically, we require exactly one (correct) tag for each word. Examples of applications which require morphological analysis are machine translation, building text corpora, and syntactic analysis.

The problem of morphology is divided into two tasks: a) morphological analysis, which generates all possible tags for a given word form, ignoring context, and b) tagging (or disambiguation), which selects a single proper tag for each word. This paper describes a solution to task b); its input is the output of task a).

[1] The standard used in this work is from the Prague Dependency Treebank. For details, see the web links at the end of this paper.

Morče

Morče comes from Morfologie češtiny, which means Czech morphology. It was originally a student project at MFF, and I later continued working on it in my master thesis. It is a command line application with a set of tools for Czech tagging, written in C for the Linux platform. It consists not only of the tagger itself but also of development tools used to obtain the best possible accuracy.

Algorithm

Our approach can be characterized as a statistical learning method. It is based on a Hidden Markov Model and the so-called Averaged Perceptron, described by Michael Collins (Collins, 2002).

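As a minimal illustration of the 15-position tag format mentioned in the Introduction, the following sketch (not part of the original paper) unpacks one of the example tags used later in the text into named categories. The position names are taken from the publicly documented PDT positional tagset and are an assumption of this example, not something defined here.

```python
# Position names for the 15 tag slots, following the PDT positional tagset
# documentation (assumed here; two RESERVE slots are unused, as noted above).
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VARIANT",
]

def parse_tag(tag):
    """Map each of the 15 tag characters to its category name ('-' means not applicable)."""
    assert len(tag) == 15, "PDT positional tags have exactly 15 characters"
    return dict(zip(POSITIONS, tag))

if __name__ == "__main__":
    # Example tag used later in the paper (the adjective 'hlavní' in the sample sentence).
    for name, value in parse_tag("AAMP4----1A----").items():
        if value != "-":
            print(name, value)
```
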
HMM with Averaged Perceptron

Generally, the method is based on a Hidden Markov Model (HMM), which can be used wherever one sequence of information has to be transformed into another, provided the transformation is determined only by a history of limited length. Here we transform a sequence of word forms into a sequence of morphological tags (although, theoretically, the model works the other way round). To train and use this model we employ the Viterbi algorithm, which finds the most probable output sequence for a given input sequence.

What is new in this approach is the algorithm for evaluating transitions between HMM states, namely the Averaged Perceptron, which is simple enough to give good running times and at the same time yields very good accuracy. The Averaged Perceptron had not been applied to Czech morphological tagging before. In order to obtain optimal accuracy it is necessary to choose appropriate input parameters of the perceptron. These parameters determine the transition weights between HMM states. They necessarily have to describe the word's context: we already know that without context the proper tag cannot be found. These context-describing parameters are called features.

Features

Features describe the given situation in the text and determine which properties will be observed during training and testing. Each feature has a corresponding weight coefficient in the Averaged Perceptron. In general, a feature can be arbitrarily complex and can use any information that is contained in the input text or can be derived from it.

Consider the example sentence: Na tři hlavní podezřelé byla uvalena vazba. (Three main suspects were put into custody.)

Word        Tag [2]
Na          RR--4----------
tři         ClXP4----------
hlavní      AAMP4----1A----
podezřelé   AAMP4----1A----
byla        VpQW---XR-AA---
uvalena     VsQW---XX-AP---
vazba       NNFS1-----A----

[2] The exact meaning of the tags is not essential here. The following list of first-character meanings is given just for a rough understanding of the tags above; a detailed description of the tag system can be found at http://quest.ms.mff.cuni.cz/pdt/morphology_and_tagging/doc/docc0pos.pdf.

First character of tag    Meaning (part of speech)
R                         preposition
C                         numeral
A                         adjective
V                         verb
N                         noun

Here are some examples of features valid for the 3rd position (hlavní):

- Current tag is AAMP4----1A----.
- Previous tag is ClXP4---------- and current tag is AAMP4----1A----.
- Current word form is hlavní and current tag is AAMP4----1A----.
- Current word is third in the sentence and current tag is AAMP4----1A----.
- Current word starts with a lower case letter and current tag is AAMP4----1A----.

We say that a feature is either valid at a certain position in the text (if it describes a context which corresponds to the current situation), or invalid. For valid features, their weight coefficients are used. Alternatively, a feature can be understood as a prediction of some tag for the current position in the text.

Learning algorithm: Averaged Perceptron

Transition weights between HMM states (which are used by the Viterbi algorithm) are defined as the sum of the weight coefficients of all features valid in the given context. In the beginning, all coefficients are set to 0.

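The following minimal sketch (illustrative only, not the Morče implementation) shows how a few of the feature types listed above can be generated for one position, and how a candidate tag is scored as the sum of the weight coefficients of all features valid in that context, with all coefficients starting at 0. All function and variable names are hypothetical.

```python
from collections import defaultdict

def features(words, position, prev_tag, tag):
    """A small subset of the feature types from the example above, for one text position."""
    word = words[position]
    return [
        ("tag", tag),                                # current tag
        ("prev_tag+tag", prev_tag, tag),             # previous tag and current tag
        ("word+tag", word, tag),                     # current word form and current tag
        ("index+tag", min(position + 1, 9), tag),    # position of the word in the sentence (max 9)
        ("lower+tag", word[:1].islower(), tag),      # letter case of the current word
    ]

weights = defaultdict(float)   # weight coefficients of features; all start at 0

def transition_weight(words, position, prev_tag, tag):
    """Transition weight = sum of weight coefficients of all features valid in this context."""
    return sum(weights[f] for f in features(words, position, prev_tag, tag))

if __name__ == "__main__":
    sentence = ["Na", "tři", "hlavní", "podezřelé", "byla", "uvalena", "vazba", "."]
    # Score the tag of 'hlavní' (3rd word) given the tag of 'tři', as in the example above.
    print(transition_weight(sentence, 2, "ClXP4----------", "AAMP4----1A----"))  # 0.0 before training
```

In Morče the feature set is much richer (see the Feature types section below) and the coefficients are handled through a finite state automaton, but the scoring principle is the same.
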
Training

The algorithm goes through all the input data in several iterations. The Viterbi algorithm finds the best path (i.e., the best tags) for each sentence using the current weight coefficients. After finishing each sentence, the weight coefficients are updated. This is repeated until the required number of iterations is reached.

Updating weight coefficients

The weight coefficients of all features generated from the current sentence with the tags proposed by the algorithm are decreased by 1. The weight coefficients of all features generated from the current sentence with the correct tags are increased by 1. Consequently, if all the algorithm's tags are correct, the weight coefficients remain unchanged.

Testing and use

Testing is in fact the same as one training iteration: we simply find the best tags with the Viterbi algorithm, but no updates are made. In an improved version we use averaged coefficients. They are resistant to oscillations and increase the accuracy of the algorithm (see Collins, 2002).

Implementation

The project was implemented in C as a command line application for the Linux platform. There are several steps to follow in order to obtain the best possible accuracy:

a) choose a feature set (or, more precisely, the types of features)
b) create a list of valid features from the training data
c) filter out features which do not appear frequently enough in the data
d) create a finite state automaton from the list of features; it handles the corresponding weight coefficients
e) train the weight coefficients in a given number of iterations
f) apply the trained model to the test data (for each iteration)
g) evaluate accuracy on the test data (for each iteration)
h) analyze errors

Tuning the algorithm

There are many factors influencing the final accuracy: the data, the number of iterations, the feature set, and the filtering.

Data

The data are in SGML format and are part of the Prague Dependency Treebank 2.0. They consist of three blocks: training data (1.5 million words), test data, and evaluation data (130,000 words each). In all three blocks there are on average 3.8 possible tags per word.

The training data are used for training the weight coefficients for different feature sets. These are then tested on the test data; the feature sets may be modified according to the errors and the training process is repeated. This can lead to the feature set adapting not only to the training data but also to the test data. Therefore, a third independent block of data is used for the final evaluation, which is equivalent to normal use on unknown text. The result figures below therefore refer to the evaluation data.

Number of iterations

For every set of features, 10 iterations were run and the accuracy was tested after each iteration. We observed that the maximum accuracy on the test data is reached around the fifth iteration and decreases with further iterations.

Set of features

There are two aspects to the choice of features. First of all, it is necessary to choose the types of features to be used as templates for generating features from the training data. Then the set of features is filtered according to the number of occurrences in the data.

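A simplified sketch of the training procedure from the Training and Testing subsections above: Viterbi decoding with the current weights, the plus/minus 1 update after each sentence, and averaging of the coefficients. The helpers sentence_features (all features valid for a sentence under a given tag assignment) and viterbi_decode are assumed to exist and are not shown; the names and structure are illustrative, not taken from the Morče code.

```python
from collections import defaultdict

def train(sentences, gold_tags, sentence_features, viterbi_decode, iterations=5):
    """Averaged Perceptron training over tagged sentences (simplified sketch)."""
    weights = defaultdict(float)   # current coefficients, all starting at 0
    totals = defaultdict(float)    # running sums of coefficients, used for averaging
    steps = 0                      # number of sentences processed so far

    for _ in range(iterations):    # accuracy peaks after a few iterations (see below)
        for sentence, correct in zip(sentences, gold_tags):
            predicted = viterbi_decode(sentence, weights)
            if predicted != correct:
                # decrease weights of features valid under the algorithm's tags...
                for f in sentence_features(sentence, predicted):
                    weights[f] -= 1.0
                # ...and increase weights of features valid under the correct tags
                for f in sentence_features(sentence, correct):
                    weights[f] += 1.0
            # accumulate current coefficients for the averaged perceptron
            steps += 1
            for f, w in weights.items():
                totals[f] += w

    # averaged coefficients are used for testing; they damp oscillations (Collins, 2002)
    return {f: total / steps for f, total in totals.items()}
```
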
Filtering

Usually it is not advisable to use all features generated from the training data, but only those that occur more than once. This substantially decreases their number and makes the algorithm run faster. It was found experimentally that the minimum number of occurrences should be 3: this threshold does not decrease accuracy, while its effect on running time is significant.

Feature types

To improve the accuracy, more than 120 experimental versions were developed. The final set of features follows. Features 1-16 predict the complete current tag; features 17-19 predict only the current SUBPOS+CASE.

1. current tag itself (unigram)
2. previous tag (bigram)
3. tag two positions back
4. combination of 2 and 3 (trigram)
5. current word
6. previous word
7. word two positions back
8. following word
9. position number of the word in the sentence (maximum 9)
10. previous verb (tag) up to 30 positions back
11. previous verb (lemma) up to 30 positions back
12. following possible verb (tag) up to 10 positions forward
13. following possible verb (lemma) up to 10 positions forward
14. previous lemma
15. letter case of the current word (lower case, first letter capital, all capitals)
16. letter case of the current lemma (lower case, first letter capital, all capitals)
17. current SUBPOS+CASE
18. previous SUBPOS+CASE and current SUBPOS+CASE
19. SUBPOS+CASE two positions back, previous SUBPOS+CASE and current SUBPOS+CASE

Averaged Perceptron behavior

During the experiments we arrived at some conclusions concerning the Averaged Perceptron. These conclusions do not necessarily apply only to tagging; we can expect the Averaged Perceptron to behave similarly in other applications.

The algorithm hates too much information. Although the program was implemented to handle a large number of features, a carefully selected small set of features gave much better results. Tuning the algorithm takes a long time, but it also yielded linguistically relevant information.

The algorithm hates complex features. Complex features contain more information; they are more specialized and describe the given context better. However, they are not general enough, so simple features gave better results.

The algorithm does not need many iterations for training. Maximum accuracy on the test data was usually reached between the 4th and 8th iteration. It should be noted that the small number of iterations also corresponds to the large volume of the training data.

Results

The final accuracy of Morče on data from PDT 2.0 was 95.431% (March 2006). It is the best result for a standalone tagger so far.

Acknowledgments. The present work was supported by the Czech Grant Agency, grant no. GAČR 201/05/H014.

References

Collins, M., Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP, 2002.
Hajič, J., Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Praha, 2004.
Jelinek, F., Statistical Methods for Speech Recognition. The MIT Press, 1998.
Votrubec, J., Selecting an Optimal Set of Features for the Morphological Tagging of Czech (master thesis). MFF UK, 2005.

Web links

Prague Dependency Treebank: http://ufal.mff.cuni.cz/pdt
Summary of Czech morphological tagging: http://ufal.mff.cuni.cz/czech-tagging