POS Tagging & Disambiguation Goutam Kumar Saha Additional Director CDAC Kolkata
The Significance of the Part of Speech (POS) in Natural Language Processing (NLP) - POS gives a significant amount of information about the word and its neighbors. - POS can be used in stemming for information retrieval, since a word's POS indicates which morphological affixes it can take. (by G.K.Saha)
Morphological Analysis - Analyzing words into their linguistic components or morphemes. Morphemes - The smallest meaningful units of language, e.g. Cars → Car + Plural; Babake → Baba + Ke (in the Bangla language)
You shall know a word by the company it keeps (Firth, 1957). POS Tagging - Each word has a POS tag to describe its category. - The POS tag of a word can be one of the major word groups or its subgroups. POS Tagger - Tries to POS tag the words.
Light the light light. Is the word light a verb, noun, or adjective? -- A morphological analyzer cannot decide the POS of the word light. -- A POS tagger can make that decision by looking at the surrounding words.
Two broad super categories of POS - Closed class types - They have relatively fixed membership, e.g. Prepositions (new prepositions are rarely coined). - Open class types - They have no fixed membership, e.g. Nouns and Verbs (the new verb fax or the borrowed noun lathi). The two other major open classes: Adjectives and Adverbs.
Other closed classes - Determiners (article) : a, an, the - Pronouns: she, who, I - Conjunctions : and, but, or - Auxiliary verbs: can, may, should - Numerals : one, two, three, first, third - Particles (used to form phrasal verb): up, off, on, down, in, out, at, by - Prepositions: on, under, over, near, by, at, from, to, with
- Languages generally have a relatively small set of closed class words (CCWs) - CCWs are used frequently and they act as function words - CCWs can be ambiguous in their POS tags. Function Words - Function words are grammatical words like it, of, and, or you. - Function words tend to be short and play an important role in grammar - They occur frequently.
POS Tagging -- is the process of assigning a part of speech label or other lexical class marker to each of a sequence of words, reflecting their syntactic category. -- Words can belong to different syntactic categories in different contexts, e.g., (a) He reads books <plural noun> (b) He books <3rd-person singular verb> tickets.
POS Tagger Architecture A pipeline of 3 major components: (i) Tokeniser: responsible for segmenting the input text into words and sentences. Advanced tokenisers (also called preprocessors) attempt to recognise phrasal constructions, proper names, etc., as single tokens. (ii) Morphological Classifier: responsible for classifying string-tokens as word-tokens with sets of morpho-syntactic features. It returns a set of possible POS tags (or POS class) and related morpho-syntactic features. Morpho-syntactic features: number, case, gender, etc.
The morphological classifier returns a set of possible POS tags when more than one tag can be assigned (e.g., book). (iii) Morphological Disambiguator : chooses a single POS tag according to the context. Organising the Lexicon: 1. The word list lexicon where each word is declaratively stored together with its morphosyntactic features.
2. The morphological lexicon: the base forms of words (stems) are provided with the rules for the formation of their inflectional and derivational variants. POS Guesser However, no lexicon contains all possible words. When the morphological classifier comes across a word that is not in the lexicon, a POS guesser tries to guess the POS class for the unknown word. Disambiguator Word tokens together with their POS tags are sent to the morphological disambiguator. It chooses a single POS tag according to the context.
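The pipeline above (tokeniser, morphological classifier with a word-list lexicon, and a POS guesser fallback) can be sketched in miniature. The tiny lexicon and the suffix heuristics in the guesser are illustrative assumptions, not data from a real tagger:

```python
# Word-list lexicon: each word is stored with its set of possible tags.
# "books" is ambiguous, so the classifier returns both tags.
LEXICON = {
    "he": {"PRON"},
    "books": {"NOUN", "VERB"},
    "tickets": {"NOUN"},
    "reads": {"VERB"},
}

def tokenise(text):
    """Segment the input text into word tokens (whitespace-based sketch)."""
    return text.lower().replace(".", " ").split()

def classify(token):
    """Return the set of possible POS tags from the lexicon, falling back
    to a crude suffix-based POS guesser for out-of-lexicon words."""
    if token in LEXICON:
        return LEXICON[token]
    # POS guesser: open-class heuristics for unknown words
    if token.endswith("ly"):
        return {"ADV"}
    if token.endswith("ing") or token.endswith("ed"):
        return {"VERB"}
    return {"NOUN"}  # default open-class guess

tokens = tokenise("He books tickets.")
print([(t, classify(t)) for t in tokens])
```

The ambiguous sets returned here are exactly what the morphological disambiguator would then reduce to a single tag using context.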
Automatic POS Tagging In terms of the degree of automation of the training and tagging process, we can have the following two broad approaches to automatic POS tagging: 1. Supervised 2. Unsupervised. Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set.
Unsupervised Tagger Unsupervised taggers do not require a pre-tagged corpus but instead use computational methods to automatically induce word groupings (i.e. tag sets). Based on these automatic groupings, they either calculate the probabilistic information needed by stochastic taggers or induce the context rules needed by rule-based systems. Pros and Cons A fully automated approach to POS tagging is extremely portable.
Automatic POS taggers tend to perform best when they are both trained and tested on the same kind (or genre) of text. The unfortunate reality is that pre-tagged corpora are not readily available for many languages and genres which one might wish to tag. Full automation of the tagging process addresses the need to accurately tag previously untagged genres and languages, given that hand tagging of training data is a costly and time consuming process. Drawbacks: the word clusterings resulting from such unsupervised methods are very coarse; in other words, one loses the fine distinctions found in the carefully designed tag sets used in the supervised methods.
POS Taggers can be characterized as 1. Rule Based 2. Stochastic. Rule based taggers use hand-written rules to resolve tag ambiguity: constraints eliminate tags that are inconsistent with the context.
Stochastic Taggers: Hidden Markov Model (HMM) based - choose a tag sequence for a whole sentence rather than for a single word - choose the tag sequence that maximizes the product of word likelihood and tag sequence probability
Rule based POS Tagging - Use a dictionary to find all possible parts of speech for a word - Use disambiguation rules (e.g., det X n = X/adj: a word X between a determiner and a noun is tagged as an adjective) - Typically hundreds of constraints can be designed manually
Rule Based POS Tagging Typical rule based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules. As an example, a context frame rule might say something like: if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, then tag it as an adjective.
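A minimal sketch of such a context frame rule, assuming an illustrative tag set (DET, NOUN, ADJ) and using None to mark still-ambiguous or unknown words:

```python
def apply_context_frame(words, tags):
    """Context frame rule: if an ambiguous word (tag None) is preceded
    by a determiner and followed by a noun, tag it as an adjective."""
    for i, tag in enumerate(tags):
        if tag is None and 0 < i < len(tags) - 1:
            if tags[i - 1] == "DET" and tags[i + 1] == "NOUN":
                tags[i] = "ADJ"
    return tags

# 'light' is ambiguous, but the det _ noun frame resolves it to ADJ.
print(apply_context_frame(["the", "light", "box"], ["DET", None, "NOUN"]))
```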
Rule Based POS Tagger (RBPT) In addition to contextual information, a RBPT might use morphological information to aid in the disambiguation process. For example, V X (X ends in -ing) = X/verb: a word ending in -ing that follows a verb is tagged as a verb. Going beyond the use of contextual and morphological information, we can also include rules pertaining to such factors as capitalization (possibly identifying a word as a proper noun) and punctuation.
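The morphological and capitalization cues above can be sketched in the same style; the tag names and the rule ordering are illustrative assumptions:

```python
def morph_disambiguate(words, tags):
    """Resolve still-ambiguous (None) tags using morphological and
    orthographic cues, as in the RBPT rules sketched above."""
    for i in range(1, len(words)):
        if tags[i] is None:
            if words[i].endswith("ing") and tags[i - 1] == "VERB":
                tags[i] = "VERB"           # V X(-ing) => X/verb
            elif words[i][0].isupper():
                tags[i] = "PROPER-NOUN"    # capitalization cue
    return tags

print(morph_disambiguate(["kept", "running"], ["VERB", None]))
```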
Rule-Based POS Tagging: Disambiguation Rule for Adverbial "that"
Given input: that
If (+1 ADJ/ADV);      /* next word is an adjective or adverb */
   (+2 S-BND);        /* sentence boundary two words ahead */
   (NOT -1 VAAOC);    /* previous word is not a verb allowing adjs as object complements */
Then eliminate non-ADV tags   /* that is an adverbial intensifier */
Else eliminate the ADV tag    /* that is a complementizer */
{e.g. It isn't that odd. I believe / think / consider that odd.}
STOCHASTIC TAGGING Stochastic tagger (ST) refers to an approach to the problem of POS Tagging that incorporates frequency or probability, i.e. statistics. The simplest ST disambiguates words solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set is the one assigned to an ambiguous instance of that word.
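The simplest stochastic tagger described above reduces to a most-frequent-tag lookup over a training set. The tiny training list here is an illustrative assumption:

```python
from collections import Counter, defaultdict

# Toy pre-tagged training data (illustrative, not a real corpus).
training = [("the", "DET"), ("race", "NOUN"), ("race", "NOUN"),
            ("race", "VERB"), ("to", "TO"), ("light", "ADJ")]

# counts[word] maps each tag to how often the word carried it in training.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def most_frequent_tag(word):
    """Assign the tag this word occurred with most often in training."""
    return counts[word].most_common(1)[0][0]

print(most_frequent_tag("race"))  # NOUN: 2 of its 3 training occurrences
```

This is exactly the baseline whose weakness the next slide points out: each word is tagged in isolation, so the sequence of tags can be inadmissible.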
ST The problem: while it may yield a valid tag for each individual word, it can yield an inadmissible sequence of tags. An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is referred to as the N-gram Approach (NGA). NGA refers to the fact that the best tag for a given word is determined by the probability that it occurs with the N previous tags. The Viterbi Algorithm implements an NGA.
Hidden Markov Model (HMM) A stochastic tagger that uses both tag sequence probabilities and word frequency measurements. Each hidden tag state produces a word in the sentence. Assumptions - each word is: 1. uncorrelated with all the other words and their tags; 2. probabilistically dependent on the N previous tags only.
Limitation: an HMM cannot be used in a fully automated tagging scheme, because it relies upon the calculation of statistics on output sequences (tag states); that is, an HMM cannot be trained automatically. Solution: employ the Baum-Welch Algorithm (also known as the Forward-Backward Algorithm). This algorithm uses word rather than tag information to iteratively construct a sequence that improves the probability of the training data.
Unknown Words How should unknown words be dealt with? Certain rules in rule based taggers are equipped to address this issue. But what happens in the stochastic models? How can one calculate the probability that a given word occurs with a given tag if that word is unknown to the tagger? Solutions 1. Assign a set of default tags (typically the open classes: N, V, Adj, Adv) to unknown words, and disambiguate using the probabilities that those tags occur at the end of the n-gram in question.
2. The tagger calculates the probability that a suffix on an unknown word occurs with a particular tag. If an HMM is being used, the probability that a word containing that suffix occurs with a particular tag in the given sequence is calculated. Steps in STOCHASTIC TAGGING We must make all of the measurements and calculations needed to determine the n-gram based transitional frequency values.
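The suffix-based solution (2) can be sketched as estimating P(tag | suffix) from tagged training words. The training list and the fixed 3-character suffix length are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Toy tagged words (illustrative); suffixes are the last 3 characters.
training = [("running", "VERB"), ("singing", "VERB"), ("morning", "NOUN"),
            ("quickly", "ADV"), ("friendly", "ADJ"), ("slowly", "ADV")]

# suffix_tags[suffix] counts how often each tag occurred with that suffix.
suffix_tags = defaultdict(Counter)
for word, tag in training:
    suffix_tags[word[-3:]][tag] += 1

def p_tag_given_suffix(tag, suffix):
    """Relative frequency estimate of P(tag | suffix); 0 if unseen."""
    seen = suffix_tags[suffix]
    return seen[tag] / sum(seen.values()) if seen else 0.0

print(p_tag_given_suffix("VERB", "ing"))  # 2 of 3 '-ing' words are verbs
```

An unknown word like "zorping" would then be tagged using these suffix probabilities in place of the missing word likelihood.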
In order to create a matrix of transitional probabilities, it is necessary to begin with a tagged corpus upon which to base the estimates of those probabilities. We base our estimates on the immediate context of a word and do not consider any context further than one word away (bigram model). The 1st step in this process is to determine the probability of each category's occurrence. To determine the probability of a noun occurring in a given corpus, we divide the total number of nouns by the total number of words. The next step is to determine transitional probabilities for sequences of words (conditional probabilities).
For example, to determine the probability of a noun following a determiner: P(noun | det) = P(det & noun) / P(det). (1) We read it as: the probability of a noun occurring, given the occurrence of a determiner, is equal to the probability of a determiner and a noun occurring together, divided by the probability of a determiner occurring. Prof. Allen (1995) uses the category frequencies instead of the category probabilities: P(Cat_i = noun | Cat_i-1 = det) = Count(det at i-1 & noun at i) / Count(det at i-1). (2) These are the bigram transitional probabilities.
Flaws in equation (2): the trouble is that words which occur with high frequency, such as nouns, get favoured too heavily during the disambiguation process, which decreases the precision of the system. The problem is that the frequency of the category at position i was never taken into account. The solution is to slightly modify the equation to include it: P(Cat_i = noun | Cat_i-1 = det) = Count(det at i-1 & noun at i) / (Count(det at i-1) * Count(noun at i)). (3) The denominator is now the product of the frequencies of both categories in the bigram, rather than just the frequency of the context category.
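The bigram estimate of equation (2) can be computed directly from counts over a hand-tagged corpus; the tiny tagged list here is an illustrative assumption:

```python
from collections import Counter

# Toy hand-tagged corpus (illustrative).
tagged = [("the", "DET"), ("race", "NOUN"), ("is", "VERB"),
          ("the", "DET"), ("light", "ADJ"), ("a", "DET"), ("car", "NOUN")]

tags = [t for _, t in tagged]
bigrams = Counter(zip(tags, tags[1:]))  # Count(prev at i-1 & tag at i)
unigrams = Counter(tags)                # Count(tag)

def p_transition(tag, prev):
    """P(Cat_i = tag | Cat_i-1 = prev), as in equation (2)."""
    return bigrams[(prev, tag)] / unigrams[prev]

print(p_transition("NOUN", "DET"))  # 2 of the 3 DETs are followed by a NOUN
```

Equation (3) would further divide this value by Count(tag), damping the advantage of high-frequency categories such as nouns.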
The final step in the basic probabilistic disambiguation process is to use the transitional probabilities (Eqn. 3) to determine the optimal path through the search space ranging from one unambiguous tag to the next. In other words, we need to implement some kind of search algorithm which will allow the calculations just made to be of some use in the disambiguation process. In the algorithms we have used the products of the transitional probabilities at each node. The principle which allows this type of formula to be used is known as the Markov assumption. Markov assumption: takes for granted that the probability of a particular category occurring depends solely on the category immediately preceding it.
Markov Models: algorithms which rely on the Markov assumption to determine the optimal path are known as Markov models. Hidden Markov Models: a Markov model is hidden when we cannot determine the state sequence it has passed through on the basis of the outputs we observe. Efficiency of a Markov Model: best exploited when used in conjunction with some form of best-first search algorithm, so as to avoid the polynomial time problem.
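The optimal-path search described above can be sketched as a bigram Viterbi decoder. The transition and emission tables here are toy values, an illustrative assumption; the decoder keeps, for each tag, only the best-scoring path ending in that tag, which is what the Markov assumption licenses:

```python
TRANS = {  # P(tag | previous tag); <s> is the sentence-start state
    ("<s>", "NP"): 0.8, ("<s>", "VB"): 0.2,
    ("NP", "VB"): 0.7, ("NP", "NP"): 0.3,
    ("VB", "NP"): 0.6, ("VB", "VB"): 0.4,
}
EMIT = {  # P(word | tag)
    ("NP", "ram"): 0.5, ("VB", "ram"): 0.01,
    ("NP", "runs"): 0.05, ("VB", "runs"): 0.4,
}
TAGS = ["NP", "VB"]

def viterbi(words):
    """Return the highest-probability tag sequence for the word list."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (TRANS.get(("<s>", t), 0) * EMIT.get((t, words[0]), 0), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # pick the predecessor tag that maximises the path score
            prev = max(TAGS, key=lambda p: best[p][0] * TRANS.get((p, t), 0))
            score = best[prev][0] * TRANS.get((prev, t), 0) * EMIT.get((t, w), 0)
            new[t] = (score, best[prev][1] + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["ram", "runs"]))
```

Because only one path per tag survives each step, the search stays linear in sentence length rather than enumerating every tag sequence.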
HMM Tagger Example
Ram <NP> is <VBZ> expected <VBN> to <TO> race <VB> tomorrow <ADV>
People <NNS> continue <VBP> to <TO> inquire <VB> the <DT> reason <NN> for <IN> the <DT> race <NN> for <IN> outer <JJ> space <NN>
t_i = argmax_j P(t_j | t_i-1) P(w_i | t_j) ; HMM eqn.
where P(t_j | t_i-1) = a tag sequence probability and P(w_i | t_j) = a word likelihood.
Compare P(VB|TO) P(race|VB) with P(NN|TO) P(race|NN). {TO: to+VB (to sleep), to+NN (to school)}
P(NN|TO) = .021 and P(VB|TO) = .34 /* from the combined Brown and Switchboard corpora */
P(race|NN) = .00041
P(race|VB) = .00003
P(VB|TO) P(race|VB) = .00001
P(NN|TO) P(race|NN) = .00000861
P(race|VB) → if we are expecting a verb, how likely is it that this verb would be race?
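The comparison above can be checked directly. The probabilities are the corpus-derived values quoted on this slide; note the slide rounds the verb-path product (exactly .0000102) to .00001:

```python
# Corpus-derived probabilities quoted above (Brown + Switchboard).
p_vb_to, p_nn_to = 0.34, 0.021           # tag sequence probabilities
p_race_vb, p_race_nn = 0.00003, 0.00041  # word likelihoods

verb_path = p_vb_to * p_race_vb   # 0.0000102 (slide rounds to .00001)
noun_path = p_nn_to * p_race_nn   # 0.00000861
print(verb_path > noun_path)      # the verb reading wins: race/VB after to/TO
```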
Transformation Based Tagging (TBT) - also called Brill Tagging - an instance of Transformation Based Learning (TBL) - inspired by both the rule-based (RB) and stochastic (ST) taggers - TBL is based on rules that specify what tags should be assigned to what words - But like the ST taggers, TBL is a machine learning technique (rules are automatically induced from the data) - TBL is a supervised learning technique: it assumes a pre-tagged training corpus
TBL TBL has a set of tagging rules A corpus is first tagged using the broadest rule (i.e., the one that applies to the most cases ) Then a slightly more specific rule is chosen for changing some of the original tags Next an even narrower rule to change a smaller number of tags ( some of which might be previously changed tags )
TBL Sentence 1. Ram is expected to race tomorrow. Sentence 2. The race for outer space is high. The tagger labels every word with its most-likely tag (most-likely tags from a tagged corpus). From the Brown corpus, race is most likely to be a noun: P(NN | race) = 0.98 and P(VB | race) = 0.02. Thus the word race (in sentences 1 & 2) initially gets tagged as NN. After selecting the most-likely tag, Brill's tagger applies its transformation rules. The tagger learned a rule that applies exactly to this mistagging of race (in sentence 1):
TBL Change NN to VB when the previous tag is TO This rule changes race/NN to race/VB in sentence 1, since it is preceded by to/TO. The TBL algorithm has three major stages: It first labels every word with its most-likely tag It then examines every possible transformation and selects the one that results in the most improved tagging Finally, it re-tags the data according to this rule
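Applying one such learned transformation can be sketched directly; the sentence and initial tags follow the slide's example, and the helper name is an illustrative assumption:

```python
def apply_transformation(tags, from_tag, to_tag, prev_tag):
    """One Brill-style contextual transformation:
    change from_tag to to_tag when the previous tag is prev_tag."""
    return [to_tag if t == from_tag and i > 0 and tags[i - 1] == prev_tag
            else t
            for i, t in enumerate(tags)]

words = ["Ram", "is", "expected", "to", "race", "tomorrow"]
tags = ["NP", "VBZ", "VBN", "TO", "NN", "ADV"]   # 'race' mistagged as NN
tags = apply_transformation(tags, "NN", "VB", "TO")
print(tags)  # 'race' is now VB, triggered by the preceding to/TO
```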
TBL These three stages are repeated until some stopping criterion is reached, such as insufficient improvement over the previous pass. Note that stage two requires that TBL knows the correct tag of each word; that is, TBL is a supervised learning algorithm. The output of the TBL process is an ordered list of transformations. These then constitute a tagging procedure that can be applied to a new corpus. TBL needs to consider every possible transformation in order to pick the best one on each pass through the algorithm.
TBL Thus the TBL algorithm needs a way to limit the set of transformations. This is done by designing a small set of templates (abstracted transformations). Example of a set of templates: The preceding (following) word is tagged z. The word two before (after) is tagged z. One of the two preceding (following) words is tagged z. One of the three preceding (following) words is tagged z. The preceding word is tagged z and the following word is tagged w. The preceding (following) word is tagged z and the word two before (after) is tagged w. <The variables a, b, z and w range over POS tags> <Each template begins with: Change tag a to tag b when ...>
Unknown Words
The likelihood of an unknown word: P(w_i | t_i) = P(unknown-word | t_i) * P(capital | t_i) * P(endings/hyph | t_i)
References
Eric Brill, "A Simple Rule-based Part of Speech Tagger", Proceedings of the Third Annual Conference on Applied Natural Language Processing, ACL.
James Allen, Natural Language Understanding, Benjamin Cummings, 1995.
Stephen J. DeRose, "Grammatical Category Disambiguation by Statistical Optimization", Computational Linguistics, 14(1): 31-39, 1988.
References Contd.
Daniel Jurafsky & James H. Martin, Speech and Language Processing, 3rd Ed., Pearson.
R. Weischedel, et al., "Coping with ambiguity and unknown words through probabilistic models", Computational Linguistics, 19(2): 359-382, 1993.
Presented by Goutam Kumar Saha (Scientist F, Sr. Mem. IEEE); can be reached via <goutam.k.saha@cdackolkata.com> or <gksaha@rediffmail.com>