PART-OF-SPEECH TAGGING FROM AN INFORMATION-THEORETIC POINT OF VIEW

P. Vanroose
Katholieke Universiteit Leuven, div. ESAT-PSI
Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium
Peter.Vanroose@esat.kuleuven.ac.be

The goal of part-of-speech tagging is to assign to each word in a sentence its morphosyntactic category. Annotating a text with part-of-speech tags is a standard low-level text preprocessing step before further analysis. An interesting novel approach to the tagging problem is proposed here, by modelling a language as a data source followed by a channel. The Shannon capacity of this simple source/channel model tells us something about the maximally achievable percentage of tagging correctness of any tagging algorithm on an unseen text.

INTRODUCTION

Automatic natural language processing (NLP) is currently an active research area. Different aspects of NLP have been subdivided into separate topics (in order of increasing complexity): sentence boundary detection, lemmatisation, part-of-speech (POS) tagging, parsing, and text understanding. These are auxiliary tools for several language-related applications such as automatic text translation, text-to-speech engines, and intelligent spelling correction.

The goal of POS tagging is to assign to each word in a sentence the most appropriate so-called morphosyntactic category. This presumes first of all a predefined tag set, which can contain from 10 up to 1000 different tags. These could for example be verb, noun, adjective, etc., or they can be more detailed, like auxiliary verb, transitive verb, or verb in present tense, third person singular. A tag set must form a partition, in the sense that a certain word in a certain sentence can be assigned exactly one tag. From here on we will assume that a tag set has been chosen and that each word in the vocabulary has been assigned a subset of the tag set.
If a certain word is assigned more than one tag, this means that this word can have different meanings or functions in different contexts. Such a word may even have different
pronunciations, depending on its meaning (and hence its assigned POS tag), like the English word "lives" or the Dutch word "voornaam". A notorious (English) example of a sentence where POS tags disambiguate the meaning is the following: "Time flies like an arrow but fruit flies like a banana", which has the following POS tag assignment: noun verb prep art noun conj adj noun verb art noun. This example also indicates that a first, necessary preprocessing step performed by humans in order to understand a sentence consists of a kind of POS tagging. This explains the importance of POS tagging for more complex language applications like machine translation or text interpretation.

Another important aspect of POS tagging which can be learned from this example is that contextual information is needed to resolve potential tag ambiguity for a certain word (like "flies"): knowing the tags of the surrounding words helps to disambiguate the POS tag of that word. In order to automatically assign POS tags, it is thus necessary to deduce rules, or at least probabilities, for the co-occurrence of POS tags in a sentence. These rules are of course very language-dependent and can for example be learned from a large, manually tagged text corpus. Classically, either a completely deterministic rule-based system is built [1], or a Markov model is assumed for tag transitions between consecutive words in the sentence [2], or a richer context model is trained using supervised learning algorithms [3]. These systems typically achieve up to 97% correctly tagged words on a previously unseen text (even including unknown words), which seems to be an unsurpassable upper bound, partially because of the presence of inconsistencies (or "noise") in the manually tagged corpora.

A SOURCE/CHANNEL CODING MODEL FOR POS TAGGING

In this contribution an information-theoretic approach to the POS tagging problem is investigated.
Written (or spoken) language can be seen as an (imperfect) communication channel through which humans try to exchange some meaning. So the channel input is the ideal, unambiguous sentence, which is (in a somewhat simplified model) the sequence of words plus the attached POS tags, and the channel performs an imperfect mapping, thereby only revealing the words. Hence the task of the decoder is to recover the POS tags.
Channel input symbols are thus disambiguated words, while the channel output symbols are just words (without the POS tag). The fact that different meanings can be mapped onto the same word (like "flies" or "like" in the earlier example) is reminiscent of the bins of a multi-user broadcast channel code, or the defect cells of a memory with defects: different channel inputs are mapped onto the same output.

The first task is thus to model the data source which is generating these channel inputs, i.e., the natural language itself, disambiguated with POS tags for every word. Classical approaches to POS tagging use this idea of an information source whose behaviour has to be modelled, although the term "source modelling" is never used explicitly. Also, there is still the channel to be dealt with: we don't observe the source output directly but only the channel output. This source/channel separation was never considered before.

[Figure: the first model. SOURCE P(W,T) -> (W,T) -> CHANNEL -> W -> POS tagger -> estimated tags T^]

Of course, the traditional channel coding model does not apply since we do not have control over the channel encoder. But remarkably, natural languages (or at least the languages considered here, viz. Dutch and English) seem to be optimised in the sense that the encoding from POS tags to words is maximally unambiguous (except maybe in texts that want to exploit the ambiguity in the language, like poems or humorous texts). Hence, provided that we are given a reasonably accurate source model, we may assume that the channel capacity tells us what the optimal accuracy of POS tagging is. This is the main added value of this new approach.

A MODIFIED SOURCE/CHANNEL MODEL

In the source/channel proposal of the previous section, the probabilistic behaviour is completely modelled by the source, whereas the channel acts deterministically because it just drops the POS tag part of the source symbols.
The motivation for this is its correspondence to the concept of meaning as part of the source, namely the person who wants to communicate something. The channel
part of the model then corresponds to the fact that the speaker must make use of an (intrinsically ambiguous) language. An alternative split-up of source and channel is maybe less intuitive but proves to be more useful: the source produces a sequence of POS tags only, and the channel maps each of them onto a word.

[Figure: the modified model. SOURCE P(T) -> T -> CHANNEL P(W_i|T_i) -> W -> POS tagger -> estimated tags T^]

Whereas in the first approach only the source has to be modelled, now both the source and the channel must be modelled by observing (a lot of) sample output. Clearly, in both approaches, the source is not memoryless: the possible sequences of POS tags generated by a natural language are constrained to satisfy the grammar of that language. For example, in Dutch or English an article ("de", "het", "een"; "the", "a", "an") must be followed by a noun or by a noun group (adjective(s) + noun). This explains why an important family of good POS tagging algorithms, like the Brill tagger [1], is rule-based: natural languages do really satisfy their grammatical rules to a large extent. But there are two clear drawbacks to this approach: such a deterministic source model does not easily generalise to other languages (since the grammar rules are very language-specific), and moreover it is not 100% accurate, which explains the intrinsic limit of around 97% correct POS tagging when using this kind of source model.

Note that the channel may safely be assumed to be memoryless: if the POS tag set is rich enough, all inter-word stochastic dependencies can be explained by the POS tags. Hence the channel randomly replaces tags with words, where the choice is of course limited to those words that have that tag in their tag set.
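In this modified model, both the source statistics P(T_i|T_{i-1}) and the channel statistics P(W_i|T_i) can be estimated by simple counting on a tagged corpus. A minimal sketch in Python, using an invented three-sentence toy corpus with a three-tag tag set (the data and tag names are illustrative only, not taken from the WSJ or CGN corpora discussed below):

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a manually POS-tagged corpus:
# each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "ART"), ("fruit", "NOUN"), ("flies", "VERB")],
    [("time", "NOUN"), ("flies", "VERB")],
    [("the", "ART"), ("flies", "NOUN"), ("like", "VERB"),
     ("a", "ART"), ("banana", "NOUN")],
]

trans = defaultdict(Counter)  # counts for P(T_i | T_{i-1}); "<s>" marks sentence start
emit = defaultdict(Counter)   # counts for P(W_i | T_i), the channel statistics

for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def p_trans(tag, prev):
    """Empirical source probability P(T_i = tag | T_{i-1} = prev)."""
    total = sum(trans[prev].values())
    return trans[prev][tag] / total if total else 0.0

def p_emit(word, tag):
    """Empirical channel probability P(W_i = word | T_i = tag)."""
    total = sum(emit[tag].values())
    return emit[tag][word] / total if total else 0.0

print(p_trans("NOUN", "ART"))   # → 1.0: in this toy corpus, ART is always followed by NOUN
print(p_emit("flies", "VERB"))  # ≈ 0.667: 2 of the 3 VERB occurrences are "flies"
```

In practice these raw relative frequencies would be smoothed (e.g. with the back-off discounting or KT estimator mentioned later), since an unseen word would otherwise get probability zero.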
The simplest non-memoryless generalisation from a deterministic to a probabilistic source model is a (hidden) Markov source: at any time instant the source is in a certain state and moves to a next state (governed by a state transition probability matrix), thereby producing one source output symbol, which is a POS-tagged word in the first model, or a POS tag in the second model. The earliest non-rule-based POS taggers used a hidden Markov model [4].
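A hidden-Markov POS tagger of this kind recovers the most likely tag sequence given the observed words with the Viterbi algorithm. The following is a minimal sketch with invented transition and emission probabilities (a real tagger would estimate these from a tagged corpus and would need smoothing for unseen words):

```python
import math

# Hypothetical model parameters, for illustration only.
tags = ["NOUN", "VERB"]
start = {"NOUN": 0.7, "VERB": 0.3}               # P(T_1)
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},     # P(T_i | T_{i-1})
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit = {"NOUN": {"time": 0.5, "flies": 0.2, "arrow": 0.3},   # P(W_i | T_i)
        "VERB": {"time": 0.1, "flies": 0.6, "like": 0.3}}

def viterbi(words):
    """Most likely tag sequence under the Markov source / memoryless channel."""
    # delta[t] = best log-probability of any tag path ending in tag t
    delta = {t: math.log(start[t]) + math.log(emit[t].get(words[0], 1e-12))
             for t in tags}
    back = []
    for w in words[1:]:
        new, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: delta[p] + math.log(trans[p][t]))
            new[t] = (delta[best] + math.log(trans[best][t])
                      + math.log(emit[t].get(w, 1e-12)))
            ptr[t] = best
        delta, back = new, back + [ptr]
    # Trace the best path backwards
    path = [max(tags, key=lambda t: delta[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["time", "flies"]))  # → ['NOUN', 'VERB']
```

Note how the transition probabilities resolve the ambiguity of "flies": it is tagged VERB here because a NOUN→VERB transition after "time" outweighs the alternative.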
In a certain respect, this is more restricted than what a deterministic source model can describe, since the memory of the source is at most one symbol. Typically, at least for languages like English and Dutch, we need a memory depth of at least two words to accurately describe a source model for a language, cf. language modelling [5, 6]. Of course, for a full disambiguation, in some cases a much longer memory would be needed. But on average no significant improvement is obtained with a memory of more than two symbols.

The main advantage of a probabilistic model is that the source statistics can be estimated by observing the source output, i.e., from a (manually) POS-tagged corpus, and no expert knowledge about the language is needed. But it turned out that a simple Markov model is not good enough for POS tagging, mainly because of the limited amount of information that it can represent. Therefore, other techniques were proposed to enrich the POS tag source model [2, 3].

MODELLING THE CHANNEL

The channel input alphabet consists of the tags {T_j, j = 1 ... M}, and the output alphabet is the set of all words {W_k, k = 1 ... N} from the language. The channel statistics can be estimated from a sufficiently large POS-annotated text corpus, using a simple empirical distribution for P(W_k|T_j). Since unseen words could also show up, a kind of back-off discounting strategy is needed to reserve probability mass for such words [5]. An alternative is the use of a Krichevsky-Trofimov (KT) estimator, as was done in [6] for language modelling.

Experiments were performed on two tagged corpora: the (American) English Wall Street Journal corpus, tagged with the Penn Treebank tag set [7], and the Dutch CGN (Corpus Gesproken Nederlands) database [8]. The WSJ Penn Treebank consists of 1 037 224 tagged words in 49 203 sentences from WSJ newspaper articles from 1989. It uses 36 different, mutually disjoint POS tags, including e.g. noun, plural-noun, proper-noun, verb, verb-past, and verb-third-person-singular. This tag set is very language-specific, as there is e.g. no tag for a verb in the second person singular form.

The channel capacity of this simple channel model can be calculated based on the empirical distribution derived from the Penn Treebank for the WSJ. Combined with the obtained source model, a theoretical upper bound of about 93% is found. This seems to contradict the achieved 97% correctness of current state-of-the-art POS tagging algorithms.
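For a discrete memoryless channel such as the tag-to-word channel assumed here, the capacity can be computed numerically with the Blahut-Arimoto algorithm. The sketch below is illustrative only: it uses tiny hand-made channels rather than the empirical Penn Treebank distribution behind the 93% figure, but the same routine applies to a full M x N matrix P(W_k|T_j).

```python
import math

def blahut_arimoto(P, iters=500):
    """Capacity (in bits) of a discrete memoryless channel.

    P[j][k] is the transition probability P(W_k | T_j);
    each row of P must sum to one.
    """
    M, N = len(P), len(P[0])
    r = [1.0 / M] * M  # input (tag) distribution, initialised uniform
    for _ in range(iters):
        # q[k]: output (word) distribution induced by r
        q = [sum(r[j] * P[j][k] for j in range(M)) for k in range(N)]
        # c[j] = exp(D(P_j || q)): per-input information gain
        c = [math.exp(sum(p * math.log(p / q[k])
                          for k, p in enumerate(row) if p > 0))
             for row in P]
        z = sum(rj * cj for rj, cj in zip(r, c))
        r = [rj * cj / z for rj, cj in zip(r, c)]  # reweight towards capacity
    # Mutual information I(T; W) at the final input distribution, in bits
    q = [sum(r[j] * P[j][k] for j in range(M)) for k in range(N)]
    return sum(r[j] * p * math.log(p / q[k], 2)
               for j, row in enumerate(P) for k, p in enumerate(row) if p > 0)

# A noiseless channel (each tag has its own word): capacity is 1 bit
print(blahut_arimoto([[1.0, 0.0], [0.0, 1.0]]))            # → 1.0
# Two tags sharing one ambiguous word (like "flies"): capacity drops
print(blahut_arimoto([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]]))  # → 0.5
```

The second toy channel shows the mechanism behind the upper bound: whenever several tags can emit the same word, the capacity falls below log2(M) bits, and with it the achievable tagging accuracy.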
An explanation for this is that we assumed a memoryless channel here. This was based on the assumption that all memory can be modelled by the source, i.e., by the POS tags only. This is only valid when the POS tag set is rich enough, which is clearly not the case for the Penn Treebank: e.g., depending on the gender of a proper-noun, the channel is not allowed to freely choose between "his" and "her" further on in the sentence. The CGN corpus uses a set of about 300 tags, so the expectation is that the channel capacity calculated from the CGN corpus will give an upper bound of about 98% on the correctly tagged words.

REFERENCES

[1] E. Brill, "Some advances in transformation-based part of speech tagging", in: Proceedings of the Twelfth National Conference on Artificial Intelligence, vol. 1, pp. 722-727, 1994.
[2] J. Zavrel, W. Daelemans, "Recent advances in memory-based part-of-speech tagging", in: Actas del VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, pp. 590-597, 1999.
[3] A. Ratnaparkhi, "Learning to parse natural language with maximum entropy models", Machine Learning 34, pp. 151-175, 1999.
[4] S. DeRose, "Grammatical category disambiguation by statistical optimization", Computational Linguistics 14, pp. 31-39, 1988.
[5] M. Federico, R. De Mori, "Language modelling", chapter 7 of Spoken Dialogues with Computers, Renato De Mori, ed., Signal Processing and its Applications series, pp. 199-230, Academic Press, 1998.
[6] P. Vanroose, "Stochastic language modelling using context tree weighting", in: Proceedings of the Twentieth Symposium on Information Theory in the Benelux, Haasrode (May 1999), pp. 33-38.
[7] M. Marcus, B. Santorini, M.A. Marcinkiewicz, "Building a large annotated corpus of English: the Penn Treebank", Computational Linguistics 19(2), pp. 313-330, 1993.
[8] F. Van Eynde, J. Zavrel, W. Daelemans, "Part of speech tagging and lemmatisation for the Spoken Dutch Corpus", in: Proceedings of the Second Int'l Conf. on Language Resources and Evaluation (LREC), Athens (May 2000), vol. III, pp. 1427-1433.