PART-OF-SPEECH TAGGING FROM AN INFORMATION-THEORETIC POINT OF VIEW

P. Vanroose
Katholieke Universiteit Leuven, div. ESAT-PSI
Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium
Peter.Vanroose@esat.kuleuven.ac.be

The goal of part-of-speech tagging is to assign to each word in a sentence its morphosyntactic category. Annotating a text with part-of-speech tags is a standard low-level text preprocessing step before further analysis. An interesting novel approach to the tagging problem is proposed here: a language is modelled as a data source followed by a channel. The Shannon capacity of this simple source/channel model tells us something about the maximally achievable percentage of correctly tagged words for any tagging algorithm on an unseen text.

INTRODUCTION

Automatic natural language processing (NLP) is currently an active research area. Different aspects of NLP have been subdivided into separate topics; in order of increasing complexity these are: sentence boundary detection, lemmatisation, part-of-speech (POS) tagging, parsing, and text understanding. These are auxiliary tools for several language-related applications such as automatic text translation, text-to-speech engines, and intelligent spelling correction.

The goal of POS tagging is to assign to each word in a sentence the most appropriate so-called morphosyntactic category. This presumes first of all a predefined tag set, which can contain from 10 up to 1000 different tags. These could for example be verb, noun, adjective, etc., or they can be more detailed, like auxiliary verb, transitive verb, or verb in present tense, third person singular. A tag set must form a partition, in the sense that a certain word in a certain sentence can be assigned exactly one tag. From here on we will assume that a tag set has been chosen and that each word in the vocabulary has been assigned a subset of the tag set. If a certain word is assigned more than one tag, this means that this word can have different meanings or functions in different contexts.

Such a word may even have different pronunciations, depending on its meaning (and hence its assigned POS tag), like the English word "lives" or the Dutch word "voornaam". A notorious English example of a sentence where POS tags disambiguate the meaning is the following: "Time flies like an arrow but fruit flies like a banana", which has the POS tag assignment noun verb prep art noun conj adj noun verb art noun. This example indicates that a first, necessary preprocessing step performed by humans in order to understand a sentence consists of a kind of POS tagging. This explains the importance of POS tagging for more complex language applications like machine translation or text interpretation.

Another important aspect of POS tagging which can be learned from this example is that contextual information is needed to resolve potential tag ambiguity for a certain word (like "flies"): knowing the tags of the surrounding words helps to disambiguate the POS tag of that word. In order to automatically assign POS tags, it is thus necessary to deduce rules, or at least probabilities for the co-occurrence of POS tags in a sentence. These rules are of course very language-dependent, and can for example be learned from a large, manually tagged text corpus. Classically, either a completely deterministic rule-based system is built [1], or a Markov model is assumed for the tag transitions between consecutive words in the sentence [2], or a richer context model is trained using supervised learning algorithms [3]. Such systems typically achieve up to 97% correctly tagged words on a previously unseen text (even including unknown words), which seems to be an unsurpassable upper bound, partially because of the presence of inconsistencies ("noise") in the manually tagged corpora.
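To make the ambiguity concrete, the following minimal Python sketch (using a hypothetical five-word lexicon, not data from this paper) enumerates every tag sequence that a lexicon licenses for the example sentence; contextual information is what must single out the correct one.

from itertools import product

# Each word is assigned a subset of the tag set; ambiguous words get several tags.
LEXICON = {
    "time":  {"noun", "verb"},
    "flies": {"noun", "verb"},
    "like":  {"prep", "verb"},
    "an":    {"art"},
    "arrow": {"noun"},
}

def candidate_taggings(sentence):
    """Enumerate every tag sequence licensed by the lexicon."""
    tag_sets = [sorted(LEXICON[w]) for w in sentence]
    return list(product(*tag_sets))

for tags in candidate_taggings(["time", "flies", "like", "an", "arrow"]):
    print(tags)
# 2 * 2 * 2 * 1 * 1 = 8 candidate taggings; only "noun verb prep art noun"
# is the intended reading, and only context can select it.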

A SOURCE/CHANNEL CODING MODEL FOR POS TAGGING

In this contribution an information-theoretic approach to the POS tagging problem is investigated. Written (or spoken) language can be seen as an (imperfect) communication channel through which humans try to exchange some meaning. The channel input is the ideal, unambiguous sentence, which is (in a somewhat simplified model) the sequence of words plus the attached POS tags; the channel performs an imperfect mapping, revealing only the words. Hence the task of the decoder is to recover the POS tags.

Channel input symbols are thus disambiguated words, while the channel output symbols are just words (without the POS tag). The fact that different meanings can be mapped onto the same word (like "flies" or "like" in the earlier example) is reminiscent of the bins of a multi-user broadcast channel code, or of the defect cells of a memory with defects: different channel inputs are mapped onto the same output.

The first task is thus to model the data source which generates these channel inputs, i.e., the natural language itself, disambiguated with POS tags for every word. Classical approaches to POS tagging use this idea of an information source whose behaviour has to be modelled, although the term "source modelling" is never used explicitly. Also, there is still the channel to be dealt with: we don't observe the source output directly, but only the channel output. This source/channel separation was never considered before.

[Figure: first source/channel model. The SOURCE, with joint distribution P(W, T), emits pairs (W, T); the CHANNEL reveals only W; the POS tagger produces the tag estimate T̂.]

Of course, the traditional channel coding model does not apply, since we have no control over the channel encoder. But remarkably, natural languages (or at least the languages considered here, viz. Dutch and English) seem to be optimised in the sense that the encoding from POS tags to words is maximally unambiguous (except perhaps in texts that deliberately exploit the ambiguity of the language, like poems or humorous texts). Hence, provided that we are given a reasonably accurate source model, we may assume that the channel capacity tells us what the optimal accuracy of POS tagging is. This is the main added value of this new approach.
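The following minimal Python sketch (with hypothetical source symbols) illustrates this first model: the channel is the deterministic projection (W, T) -> W, and the resulting "bins" are exactly the sets of channel inputs that collide on the same output word.

from collections import defaultdict

# Hypothetical disambiguated source symbols: (word, tag) pairs.
source_symbols = [
    ("flies", "noun"), ("flies", "verb"),
    ("like", "prep"), ("like", "verb"),
    ("arrow", "noun"),
]

def channel(symbol):
    """Deterministic channel: drop the POS tag and reveal only the word."""
    word, _tag = symbol
    return word

bins = defaultdict(set)
for s in source_symbols:
    bins[channel(s)].add(s)

for word, colliding_inputs in sorted(bins.items()):
    print(word, "<-", sorted(colliding_inputs))
# "flies" and "like" each collect two source symbols; the decoder (the POS
# tagger) must fall back on the source statistics P(W, T) to resolve the bin.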

A MODIFIED SOURCE/CHANNEL MODEL

In the source/channel proposal of the previous section, the probabilistic behaviour is completely modelled by the source, whereas the channel acts deterministically: it just drops the POS tag part of the source symbols. The motivation for this is its correspondence to the concept of meaning as part of the source, namely the person who wants to communicate something. The channel part of the model then corresponds to the fact that the speaker must make use of an (intrinsically ambiguous) language.

An alternative split-up of source and channel is perhaps less intuitive, but proves to be more useful: the source produces a sequence of POS tags only, and the channel maps each of them onto words.

[Figure: second source/channel model. The SOURCE, with distribution P(T), emits tags T; the CHANNEL, with transition probabilities P(W_i | T_i), outputs words W; the POS tagger produces the tag estimate T̂.]

Whereas in the first approach only the source has to be modelled, now both the source and the channel must be modelled by observing (a lot of) sample output. Clearly, in both approaches the source is not memoryless: the possible sequences of POS tags generated by a natural language are constrained to satisfy the grammar of that language. For example, in Dutch or English an article ("de", "het", "een"; "the", "a", "an") must be followed by a noun or by a noun group (adjective(s) + noun). This explains why an important family of good POS tagging algorithms, like the Brill tagger [1], is rule-based: natural languages really do satisfy their grammatical rules to a large extent. But this approach has two clear drawbacks: such a deterministic source model does not easily generalise to other languages (since grammar rules are very language-specific), and moreover it is not 100% accurate, which explains the intrinsic limit of around 97% correct POS tagging when using this kind of source model.

Note that the channel may safely be assumed to be memoryless: if the POS tag set is rich enough, all inter-word stochastic dependencies can be explained by the POS tags. Hence the channel randomly replaces tags with words, where the choice is of course limited to those words that have that tag in their tag set.

The simplest non-memoryless generalisation from a deterministic to a probabilistic source model is a (hidden) Markov source: at any time instant the source is in a certain state and moves to a next state (governed by a state transition probability matrix), thereby producing one source output symbol, which is a POS-tagged word in the first model, or a POS tag in the second model. The earliest non-rule-based POS taggers used a hidden Markov model [4].
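With a Markov source over tags and a memoryless channel P(W_i | T_i), the second model is precisely a hidden Markov model, and recovering the tags amounts to Viterbi decoding. The Python sketch below illustrates this; all probabilities are toy values invented for the example, not estimated from any corpus.

import math

TAGS = ["noun", "verb", "prep", "art"]
START = {"noun": 0.5, "verb": 0.1, "prep": 0.1, "art": 0.3}
TRANS = {  # source model: tag transition probabilities P(T_i | T_{i-1})
    "noun": {"noun": 0.2, "verb": 0.5, "prep": 0.2, "art": 0.1},
    "verb": {"noun": 0.3, "verb": 0.1, "prep": 0.3, "art": 0.3},
    "prep": {"noun": 0.4, "verb": 0.1, "prep": 0.1, "art": 0.4},
    "art":  {"noun": 0.9, "verb": 0.05, "prep": 0.03, "art": 0.02},
}
EMIT = {  # channel model: word emission probabilities P(W_i | T_i)
    "noun": {"time": 0.3, "flies": 0.3, "arrow": 0.4},
    "verb": {"time": 0.1, "flies": 0.5, "like": 0.4},
    "prep": {"like": 1.0},
    "art":  {"an": 1.0},
}

def viterbi(words):
    """Return the most probable tag sequence under the HMM."""
    # best[t] = (log-probability, tag path) of the best path ending in tag t
    best = {t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], 1e-12)), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max(((lp + math.log(TRANS[s][t])
                         + math.log(EMIT[t].get(w, 1e-12)), path + [t])
                        for s, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["time", "flies", "like", "an", "arrow"]))
# -> ['noun', 'verb', 'prep', 'art', 'noun']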

In a certain respect, a Markov source is more restricted than what a deterministic source model can describe, since the memory of the source is at most one symbol. Typically, at least for languages like English and Dutch, we need a memory depth of at least two words to accurately describe a source model for a language, cf. language modelling [5, 6]. Of course, for full disambiguation a much longer memory would in some cases be needed, but on average no significant improvement is obtained with a memory of more than two symbols.

The main advantage of a probabilistic model is that the source statistics can be estimated by observing the source output, i.e., from a (manually) POS-tagged corpus, and no expert knowledge about the language is needed. But it turned out that a simple Markov model is not good enough for POS tagging, mainly because of the limited amount of information it can represent. Therefore, other techniques were proposed to enrich the POS tag source model [2, 3].

MODELLING THE CHANNEL

The channel input alphabet consists of tags {T_j, j = 1, ..., M}, and the output alphabet is the set of all words {W_k, k = 1, ..., N} of the language. The channel statistics can be estimated from a sufficiently large POS-annotated text corpus, using a simple empirical distribution for P(W_k | T_j). Since unseen words can also show up, a kind of back-off discounting strategy is needed to reserve probability mass for such words [5]. An alternative is the use of a Krichevsky-Trofimov (KT) estimator, as was done in [6] for language modelling.

Experiments were performed on two tagged corpora: the (American) English Wall Street Journal corpus, tagged with the Penn Treebank tag set [7], and the Dutch CGN (Corpus Gesproken Nederlands) database [8]. The WSJ Penn Treebank consists of 1 037 224 tagged words in 49 203 sentences from WSJ newspaper articles from 1989. It uses 36 different, mutually disjoint POS tags, including e.g. noun, plural-noun, proper-noun, verb, verb-past, and verb-third-person-singular. This tag set is very language-specific; there is e.g. no tag for a verb in the second person singular form.

The channel capacity of this simple channel model can be calculated from the empirical distribution derived from the Penn Treebank for WSJ. Combined with the obtained source model, a theoretical upper bound of about 93% is found. This seems to contradict the achieved 97% correctness of current state-of-the-art POS tagging algorithms.
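The capacity computation itself is not spelled out in the text; the Python sketch below shows one standard way to carry it out. The channel P(W | T) is estimated empirically from a tagged corpus (add-one smoothing stands in here for the back-off discounting or KT estimator mentioned above), and the capacity of the resulting discrete memoryless channel is computed with the Blahut-Arimoto iteration. Corpus and vocabulary are hypothetical placeholders.

import math
from collections import Counter, defaultdict

def estimate_channel(tagged_corpus, vocab):
    """Empirical P(W | T) with add-one smoothing over a fixed vocabulary."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[tag][word] += 1
    channel = {}
    for tag, c in counts.items():
        total = sum(c.values()) + len(vocab)
        channel[tag] = {w: (c[w] + 1) / total for w in vocab}
    return channel

def capacity(channel, tol=1e-9):
    """Capacity in bits of a discrete memoryless channel (Blahut-Arimoto)."""
    tags = list(channel)
    r = {t: 1.0 / len(tags) for t in tags}       # input distribution over tags
    while True:
        q = defaultdict(float)                   # output marginal over words
        for t in tags:
            for w, pw in channel[t].items():
                q[w] += r[t] * pw
        # c[t] = exp( D( P(.|t) || q ) ), the per-input information gain
        c = {t: math.exp(sum(pw * math.log(pw / q[w])
                             for w, pw in channel[t].items() if pw > 0))
             for t in tags}
        z = sum(r[t] * c[t] for t in tags)
        lower, upper = math.log(z), math.log(max(c.values()))
        if upper - lower < tol:                  # capacity bounds have converged
            return lower / math.log(2)           # nats -> bits
        r = {t: r[t] * c[t] / z for t in tags}

# Hypothetical toy corpus of (word, tag) pairs, for illustration only.
corpus = [("the", "art"), ("flies", "noun"), ("flies", "verb"),
          ("like", "prep"), ("like", "verb"), ("arrow", "noun")]
vocab = {w for w, _ in corpus}
print(round(capacity(estimate_channel(corpus, vocab)), 3))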

An explanation for this apparent contradiction is that a memoryless channel was assumed here, based on the assumption that all memory can be modelled by the source, i.e., by the POS tags only. This is only valid when the POS tag set is rich enough, which is clearly not the case for the Penn Treebank: e.g., depending on the gender of a proper-noun, the channel is not free to choose between "his" and "her" further on in the sentence. The CGN corpus uses a set of about 300 tags, so the expectation is that the channel capacity calculated from the CGN corpus will give an upper bound of about 98% on the correctly tagged words.

REFERENCES

[1] E. Brill, "Some advances in transformation-based part of speech tagging", in: Proceedings of the Twelfth National Conference on Artificial Intelligence, vol. 1, pp. 722-727, 1994.
[2] J. Zavrel, W. Daelemans, "Recent advances in memory-based part-of-speech tagging", in: Actas del VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, pp. 590-597, 1999.
[3] A. Ratnaparkhi, "Learning to parse natural language with maximum entropy models", Machine Learning 34, pp. 151-175, 1999.
[4] S. DeRose, "Grammatical category disambiguation by statistical optimization", Computational Linguistics 14, pp. 31-39, 1988.
[5] M. Federico, R. De Mori, "Language modelling", chapter 7 of Spoken Dialogues with Computers, Renato De Mori, ed., Signal Processing and its Applications series, pp. 199-230, Academic Press, 1998.
[6] P. Vanroose, "Stochastic language modelling using context tree weighting", in: Proceedings of the Twentieth Symposium on Information Theory in the Benelux, Haasrode (May 1999), pp. 33-38.
[7] M. Marcus, B. Santorini, M.A. Marcinkiewicz, "Building a large annotated corpus of English: the Penn Treebank", Computational Linguistics 19(2), pp. 313-330, 1993.
[8] F. Van Eynde, J. Zavrel, W. Daelemans, "Part of speech tagging and lemmatisation for the spoken Dutch corpus", in: Proceedings of the Second Int'l Conf. on Language Resources and Evaluation (LREC), Athens (May 2000), vol. III, pp. 1427-1433.