POS Tagging & Disambiguation
Goutam Kumar Saha, Additional Director, CDAC Kolkata


The Significance of the Part of Speech (POS) in Natural Language Processing (NLP)
- POS gives a significant amount of information about a word and its neighbors.
- POS can be used in stemming for information retrieval, since it indicates which morphological affixes a word can take.

Morphological Analysis - analyzing words into their linguistic components, or morphemes.
Morphemes - the smallest meaningful units of language, e.g.
- Cars → Car + Plural
- Babake → Baba + Ke (in the Bangla language)

"You shall know a word by the company it keeps" (Firth, 1957)
POS Tagging - each word has a POS tag describing its category. The POS tag of a word can be one of the major word groups or one of their subgroups.
POS Tagger - tries to assign a POS tag to each word.

"Light the light light." Is each occurrence of the word light a verb, a noun, or an adjective?
-- A morphological analyzer cannot decide the POS of the word light on its own.
-- A POS tagger can make that decision by looking at the surrounding words.

Two broad super-categories of POS:
- Closed class types - relatively fixed membership, e.g. prepositions (new prepositions are rarely coined).
- Open class types - no fixed membership, e.g. nouns and verbs (the new verb fax or the borrowed noun lathi). The other two major open classes are adjectives and adverbs.

Other closed classes:
- Determiners (articles): a, an, the
- Pronouns: she, who, I
- Conjunctions: and, but, or
- Auxiliary verbs: can, may, should
- Numerals: one, two, three, first, third
- Particles (used to form phrasal verbs): up, off, on, down, in, out, at, by
- Prepositions: on, under, over, near, by, at, from, to, with

- Languages generally have a relatively small set of closed class words (CCWs).
- CCWs are used frequently and act as function words.
- CCWs can be ambiguous in their POS tags.
Function Words
- Function words are grammatical words like it, of, and, or you.
- Function words tend to be short, occur frequently, and play an important role in grammar.

POS Tagging -- the process of assigning a part-of-speech label or other lexical class marker to each word in a sequence, reflecting its syntactic category. Words can belong to different syntactic categories in different contexts, e.g.
(a) He reads books <plural noun>
(b) He books <3rd person singular verb> tickets.
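To see this context dependence in practice, here is a minimal sketch using the NLTK library's default English tagger (an illustrative tool choice, not part of the original slides; the exact tags depend on NLTK's model and data files).

```python
# Minimal sketch: tagging "books" in two contexts with NLTK's default tagger.
# May require: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

for sentence in ["He reads books", "He books tickets"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Expected (illustrative): "books" comes out as a plural noun (NNS) in the
# first sentence and as a 3rd-person singular verb (VBZ) in the second.
```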

POS Tagger Architecture - a pipeline of 3 major components:
(i) Tokeniser: responsible for segmenting the input text into words and sentences. Advanced tokenisers (also called preprocessors) attempt to recognise phrasal constructions, proper names, etc. as single tokens.
(ii) Morphological Classifier: responsible for classifying string-tokens as word-tokens with sets of morpho-syntactic features. It returns a set of possible POS tags (or POS classes) and related morpho-syntactic features (number, case, gender, etc.).

The morphological classifier returns a set of possible POS tags when more than one tag can be assigned (e.g., book).
(iii) Morphological Disambiguator: chooses a single POS tag according to the context.
Organising the Lexicon:
1. The word-list lexicon: each word is declaratively stored together with its morpho-syntactic features.

2. The morphological lexicon: the base forms of words (stems) are provided, with rules for the formation of their inflectional and derivational variants.
POS Guesser: no lexicon contains all possible words. When the morphological classifier comes across a word that is not in the lexicon, a POS guesser tries to guess the POS class of the unknown word.
Disambiguator: word tokens together with their POS tags are sent to the morphological disambiguator, which chooses a single POS tag according to the context.
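A minimal sketch of this three-stage pipeline (tokeniser, classifier with guesser fallback, disambiguator). The lexicon entries, the suffix heuristic, and the contextual rule are toy assumptions for illustration, not the author's implementation.

```python
# Sketch of the tokeniser -> morphological classifier (+ guesser) -> disambiguator pipeline.
LEXICON = {                      # word-list lexicon: word -> set of possible POS tags
    "the": {"DET"},
    "light": {"NOUN", "VERB", "ADJ"},
    "book": {"NOUN", "VERB"},
}

def tokenise(text):
    """Segment the input text into word tokens (naive whitespace tokeniser)."""
    return text.lower().split()

def guess(token):
    """POS guesser for out-of-lexicon words, using a crude suffix heuristic."""
    return {"VERB"} if token.endswith("ing") else {"NOUN"}

def classify(token):
    """Return the set of candidate POS tags; fall back to the guesser for unknown words."""
    return LEXICON.get(token) or guess(token)

def disambiguate(tokens, candidates):
    """Choose a single tag per token from its candidate set using local context."""
    tags = []
    for i, cands in enumerate(candidates):
        if len(cands) == 1:
            tags.append(next(iter(cands)))
        elif i > 0 and tags[i - 1] == "DET" and "NOUN" in cands:
            tags.append("NOUN")  # toy contextual rule: a determiner is usually followed by a noun
        else:
            tags.append(sorted(cands)[0])
    return tags

tokens = tokenise("The light book")
candidates = [classify(t) for t in tokens]
print(list(zip(tokens, disambiguate(tokens, candidates))))
```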

Automatic POS Tagging. In terms of the degree of automation of the training and tagging process, there are two broad approaches to automatic POS tagging:
1. Supervised
2. Unsupervised
Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating the tools used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities, and/or the rule set.

Unsupervised Tagger. Unsupervised taggers do not require a pre-tagged corpus; instead they use computational methods to automatically induce word groupings (i.e. tag sets). Based on these automatic groupings, they either calculate the probabilistic information needed by stochastic taggers or induce the context rules needed by rule-based systems.
Pros and Cons. A fully automated approach to POS tagging is extremely portable.

Automatic POS taggers tend to perform best when they are trained and tested on the same kind (or genre) of text. The unfortunate reality is that pre-tagged corpora are not readily available for many of the languages and genres one might wish to tag. Full automation of the tagging process addresses the need to accurately tag previously untagged genres and languages, given that hand-tagging of training data is a costly and time-consuming process.
Drawbacks: the word clusterings resulting from such unsupervised methods are very coarse; one loses the fine distinctions found in the carefully designed tag sets used by supervised methods.

POS taggers can be characterized as:
1. Rule-based
2. Stochastic
Rule-based taggers use hand-written rules to resolve tag ambiguity, i.e. constraints that eliminate tags inconsistent with the context.

Stochastic Taggers: Hidden Markov Model (HMM) based
- choose a tag sequence for a whole sentence rather than for a single word
- choose the tag sequence that maximizes the product of word likelihood and tag sequence probability

Rule-based POS Tagging
- Use a dictionary to find all possible parts of speech for a word
- Use disambiguation rules (e.g., det X n => X/adj)
- Typically hundreds of constraints are designed manually

Rule-Based POS Tagging. Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules. As an example, a context frame rule might say: if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, then tag it as an adjective.

Rule-Based POS Tagger (RBPT). In addition to contextual information, an RBPT might use morphological information to aid the disambiguation process. For example, V X (ends in -ing) => X/verb. Going beyond contextual and morphological information, we can also include rules referring to factors such as capitalization (possibly identifying a proper noun) and punctuation.
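A minimal sketch of such context-frame, morphological, and capitalization rules. The rule set below is an illustrative assumption, not the author's tagger.

```python
# Sketch of rule-based disambiguation for an ambiguous/unknown word using
# hand-written context-frame, suffix, and capitalization rules (all illustrative).
def apply_rules(word, prev_tag, next_tag):
    """Return a tag for an ambiguous/unknown word using simple hand-written rules."""
    if prev_tag == "DET" and next_tag == "NOUN":   # context-frame rule: det _ noun => adjective
        return "ADJ"
    if prev_tag == "VERB" and word.endswith("ing"):  # morphological rule: V X(-ing) => verb
        return "VERB"
    if word[0].isupper():                          # capitalization rule: likely a proper noun
        return "PROPER-NOUN"
    return "NOUN"                                  # default open-class fallback

print(apply_rules("shiny", prev_tag="DET", next_tag="NOUN"))    # -> ADJ
print(apply_rules("running", prev_tag="VERB", next_tag=None))   # -> VERB
```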

Rule-Based POS Tagging - Disambiguation rule for adverbial "that"
Given input: "that"
If (+1 ADJ/ADV);        /* next word is an adjective or adverb */
   (+2 S-BND);          /* followed by a sentence boundary */
   (NOT -1 VAAOC);      /* previous word is not a verb allowing adjectives as object complements */
Then eliminate non-ADV tags   /* "that" is an adverbial intensifier */
Else eliminate ADV tag        /* "that" is a complementizer */
{e.g. It isn't that odd. I believe / think / consider that odd.}

STOCHASTIC TAGGING. A stochastic tagger (ST) is an approach to POS tagging that incorporates frequency or probability, i.e. statistics. The simplest ST disambiguates a word solely on the probability that it occurs with a particular tag: the tag the word takes most frequently in the training set is the one assigned to an ambiguous instance of that word.
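A minimal sketch of this most-frequent-tag baseline, assuming a small hand-tagged training list (the training data below is invented for illustration).

```python
# Sketch of the simplest stochastic tagger: assign each word the tag it
# occurred with most frequently in the (toy) training set.
from collections import Counter, defaultdict

training = [("the", "DET"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
            ("they", "PRP"), ("race", "VB"), ("the", "DET"), ("race", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def unigram_tag(word):
    """Return the most frequent training-set tag for a known word."""
    return tag_counts[word].most_common(1)[0][0]

print(unigram_tag("race"))   # "NN": seen twice as NN, once as VB in the toy data
```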

ST - the problem: while this may yield a valid tag for a given word, it can yield an inadmissible sequence of tags. An alternative to the word-frequency approach is to calculate the probability of a given sequence of tags occurring. This is referred to as the N-gram Approach (NGA): the best tag for a given word is determined by the probability that it occurs with the N previous tags. The Viterbi algorithm implements an NGA.

Hidden Markov Model (HMM) - a stochastic tagger that uses both tag sequence probabilities and word frequency measurements.
Assumptions: each hidden tag state produces a word in the sentence. Each word is:
1. uncorrelated with all the other words and their tags;
2. dependent, probabilistically, only on the N previous tags.

Limitations: an HMM cannot be used in a fully automated tagging scheme as it stands, because it relies on the calculation of statistics over output sequences (tag states); in that sense an HMM cannot be trained automatically.
Solution: employ the Baum-Welch algorithm (also known as the Forward-Backward algorithm). This algorithm uses word rather than tag information to iteratively construct a tagging so as to improve the probability of the training data.

Unknown Words. How should unknown words be dealt with? Certain rules in rule-based taggers are equipped to address this issue, but what happens in the stochastic models? How can one calculate the probability that a given word occurs with a given tag if that word is unknown to the tagger?
Solutions:
1. Assign a set of default tags (typically the open classes: N, V, Adj, Adv) to unknown words, and disambiguate using the probabilities that those tags occur at the end of the n-gram in question.

2. The tagger calculates the probability that a suffix of an unknown word occurs with a particular tag. If an HMM is being used, the probability that a word containing that suffix occurs with a particular tag in the given sequence is calculated.
Steps in STOCHASTIC TAGGING: we must make the measurements and calculations needed to determine the n-gram-based transitional frequency values.

To create a matrix of transitional probabilities, it is necessary to begin with a tagged corpus on which to base the estimates of those probabilities. We base our estimates on the immediate context of a word and do not consider any context further than one word away (bigram model). The first step in this process is to determine the probability of each category's occurrence: to determine the probability of a noun occurring in a given corpus, we divide the total number of nouns by the total number of words. The next step is to determine transitional probabilities for sequences of words (conditional probabilities).

For example, to determine the probability of a noun following a determiner:
P(noun | det) = P(det & noun) / P(det)    (1)
We read this as: the probability of a noun occurring, given the occurrence of a determiner, is equal to the probability of a determiner and a noun occurring together, divided by the probability of a determiner occurring. Prof. Allen (1995) uses the category frequencies instead of the category probabilities:
P(Cat_i = noun | Cat_i-1 = det) = Count(det at i-1 & noun at i) / Count(det at i-1)    (2)
This is the bigram transitional probability.

Flaws in equation (2): words that occur with high frequency, such as nouns, get favoured too heavily during the disambiguation process, which decreases the precision of the system. The problem is that the frequency of the category at position i was never taken into account. The solution is to slightly modify the equation to include that frequency as well:
P(Cat_i = noun | Cat_i-1 = det) = Count(det at i-1 & noun at i) / (Count(det at i-1) * Count(noun at i))    (3)
The denominator is now the product of the frequencies of the categories in the bigram, rather than just the frequency of the context category.
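A small worked sketch of Equations (2) and (3) from raw bigram counts; all counts below are invented for illustration.

```python
# Worked sketch of the bigram transitional estimates (Eqns. 2 and 3) from toy counts.
count_det = 100            # Count(det at position i-1)
count_noun = 250           # Count(noun at position i)
count_det_noun = 80        # Count(det at i-1 & noun at i)

# Eqn. (2): frequency-based estimate of P(Cat_i = noun | Cat_i-1 = det)
p_bigram = count_det_noun / count_det
print(p_bigram)            # 0.8

# Eqn. (3): normalised by both frequencies, so very frequent categories
# such as nouns are not favoured too heavily
p_adjusted = count_det_noun / (count_det * count_noun)
print(p_adjusted)          # 0.0032
```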

The final step in the basic probabilistic disambiguation process is to use the transitional probabilities (Eqn. 3) to determine the optimal path through the search space, from one unambiguous tag to the next. In other words, we need a search algorithm that puts the calculations just made to use in the disambiguation process. In these algorithms we use the products of the transitional probabilities at each node. The principle which allows this type of formula to be used is known as the Markov assumption.
Markov assumption: takes for granted that the probability of a particular category occurring depends solely on the category immediately preceding it.

Markov Models: algorithms which rely on the Markov assumption to determine the optimal path are known as Markov models.
Hidden Markov Models: a Markov model is hidden when we cannot determine the state sequence it has passed through on the basis of the outputs we observe.
Efficiency of a Markov Model: it is best exploited when used in conjunction with some form of best-first search algorithm, so as to avoid the polynomial-time problem.
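A minimal sketch of a bigram Viterbi search over a toy model, relying only on the Markov assumption stated above. The tag set, transition, and emission probabilities are taken from (or invented around) the race example that follows; this is not the author's implementation.

```python
# Sketch of Viterbi search for a bigram HMM tagger (toy probabilities assumed).
def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence under a bigram HMM."""
    # v[i][t] = best probability of any tag sequence for words[:i+1] ending in tag t
    v = [{t: start.get(t, 0.0) * emit.get((words[0], t), 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: v[i - 1][p] * trans.get((p, t), 0.0))
            v[i][t] = (v[i - 1][best_prev] * trans.get((best_prev, t), 0.0)
                       * emit.get((words[i], t), 0.0))
            back[i][t] = best_prev
    # Follow back-pointers from the best final tag
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

tags = ["TO", "VB", "NN"]
start = {"TO": 1.0}
trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emit = {("to", "TO"): 1.0, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}
print(viterbi(["to", "race"], tags, trans, emit, start))   # ['TO', 'VB']
```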

HMM Tagger Example
Ram <NP> is <VBZ> expected <VBN> to <TO> race <VB> tomorrow <ADV>
People <NNS> continue <VBP> to <TO> inquire <VB> the <DT> reason <NN> for <IN> the <DT> race <NN> for <IN> outer <JJ> space <NN>
t_i = argmax_j P(t_j | t_i-1) P(w_i | t_j)    ; HMM equation
where P(t_j | t_i-1) is a tag sequence probability and P(w_i | t_j) is a word likelihood.
Compare P(VB | TO) P(race | VB) with P(NN | TO) P(race | NN).
{TO: to + VB (to sleep), to + NN (to school)}

P(NN | TO) = .021    /* from the combined Brown and Switchboard corpora */
P(VB | TO) = .34
P(race | NN) = .00041
P(race | VB) = .00003
P(VB | TO) P(race | VB) = .00001
P(NN | TO) P(race | NN) = .00000861
P(race | VB) → if we are expecting a verb, how likely is it that this verb would be race?
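The same comparison as a quick arithmetic check, using the corpus probabilities quoted above.

```python
# Arithmetic check of the race example using the probabilities from the slide.
p_vb = 0.34 * 0.00003      # P(VB|TO) * P(race|VB)
p_nn = 0.021 * 0.00041     # P(NN|TO) * P(race|NN)
print(p_vb, p_nn)          # ~0.0000102 vs 0.00000861
print("VB" if p_vb > p_nn else "NN")   # the HMM prefers the verb reading of "race"
```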

Transformation-Based Tagging (TBT)
- also called Brill tagging; an instance of Transformation-Based Learning (TBL)
- inspired by both the rule-based (RB) and stochastic (ST) taggers
- like RB taggers, TBL is based on rules that specify what tags should be assigned to what words
- but like ST taggers, TBL is a machine learning technique: the rules are automatically induced from the data
- TBL is a supervised learning technique: it assumes a pre-tagged training corpus

TBL has a set of tagging rules. A corpus is first tagged using the broadest rule (i.e., the one that applies to the most cases). Then a slightly more specific rule is chosen to change some of the original tags. Next an even narrower rule changes a smaller number of tags (some of which might be previously changed tags).

TBL
Sentence 1: Ram is expected to race tomorrow.
Sentence 2: The race for outer space is high.
The tagger labels every word with its most likely tag (most likely tags are taken from a tagged corpus). From the Brown corpus, race is most likely to be a noun: P(NN | race) = 0.98 and P(VB | race) = 0.02. Thus the word race (in sentences 1 and 2) initially gets tagged as NN. After selecting the most likely tags, Brill's tagger applies its transformation rules. The tagger has learned a rule that applies exactly to this mistagging of race in sentence 1:

Change NN to VB when the previous tag is TO.
This rule changes race/NN to race/VB in sentence 1, since it is preceded by to/TO.
The TBL algorithm has three major stages (a small sketch of stages 1 and 3 follows this list):
1. It first labels every word with its most likely tag.
2. It then examines every possible transformation and selects the one that results in the most improved tagging.
3. Finally, it re-tags the data according to this rule.
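A minimal sketch of the initial most-likely tagging followed by the learned transformation. The lexical probabilities and the rule come from the slides; the code structure itself is an illustrative assumption.

```python
# Sketch of Brill-style tagging: most-likely initial tags, then apply the
# learned transformation "change NN to VB when the previous tag is TO".
MOST_LIKELY = {"ram": "NP", "is": "VBZ", "expected": "VBN", "to": "TO",
               "race": "NN", "tomorrow": "ADV"}   # race: P(NN|race) = 0.98 in Brown

def brill_tag(words):
    tags = [MOST_LIKELY[w] for w in words]        # stage 1: most-likely tags
    for i in range(1, len(tags)):                 # stage 3: apply the learned rule
        if tags[i] == "NN" and tags[i - 1] == "TO":
            tags[i] = "VB"
    return list(zip(words, tags))

print(brill_tag("ram is expected to race tomorrow".split()))
# "race" is first tagged NN, then changed to VB because the previous tag is TO.
```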

These three stages of TBL are repeated until some stopping criterion is reached, such as insufficient improvement over the previous pass. Note that stage two requires TBL to know the correct tag of each word; that is, TBL is a supervised learning algorithm. The output of the TBL process is an ordered list of transformations, which constitutes a tagging procedure that can be applied to a new corpus. TBL needs to consider every possible transformation in order to pick the best one on each pass through the algorithm.

Thus the TBL algorithm needs a way to limit the set of transformations. This is done by designing a small set of templates (abstracted transformations), as in the sketch after this list. Example of a set of templates (each begins with "Change tag a to tag b when:"; the variables a, b, z and w range over POS tags):
- The preceding (following) word is tagged z.
- The word two before (after) is tagged z.
- One of the two preceding (following) words is tagged z.
- One of the three preceding (following) words is tagged z.
- The preceding word is tagged z and the following word is tagged w.
- The preceding (following) word is tagged z and the word two before (after) is tagged w.
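A minimal sketch of one learning pass: candidate transformations are instantiated from the first template ("the preceding word is tagged z") and scored against the reference tags. The tiny scoring corpus is an invented assumption.

```python
# Sketch of one TBL learning pass with the template
# "Change tag a to tag b when the preceding word is tagged z" (toy data assumed).
from itertools import product

current = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]   # initial tagging
correct = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]   # reference tags
tagset = {"TO", "NN", "VB", "DT"}

def score(rule, tagged, gold):
    """Count how many tags the rule (a, b, z) fixes minus how many it breaks."""
    a, b, z = rule
    gain = 0
    for i in range(1, len(tagged)):
        if tagged[i][1] == a and tagged[i - 1][1] == z:
            gain += (b == gold[i][1]) - (tagged[i][1] == gold[i][1])
    return gain

rules = [(a, b, z) for a, b, z in product(tagset, repeat=3) if a != b]
best = max(rules, key=lambda r: score(r, current, correct))
print(best)   # expected: ('NN', 'VB', 'TO') -- change NN to VB after TO
```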

Unknown Words. The likelihood of an unknown word:
P(w_i | t_i) = P(unknown-word | t_i) * P(capital | t_i) * P(endings/hyphenation | t_i)
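A small sketch of combining these factors into an unknown-word likelihood; the individual probabilities below are invented for illustration.

```python
# Sketch of the unknown-word likelihood as a product of independent features;
# the probability values here are invented, not corpus estimates.
def unknown_word_likelihood(p_unknown_given_tag, p_capital_given_tag, p_suffix_given_tag):
    """P(w_i | t_i) ~= P(unknown-word | t_i) * P(capital | t_i) * P(endings/hyph | t_i)."""
    return p_unknown_given_tag * p_capital_given_tag * p_suffix_given_tag

# e.g. an unseen capitalised word ending in "-tion", scored for the tag NN
print(unknown_word_likelihood(0.2, 0.05, 0.1))   # 0.001
```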

References
- Eric Brill, "A Simple Rule-Based Part of Speech Tagger", Proceedings of the Third Annual Conference on Applied Natural Language Processing, ACL.
- James Allen, Natural Language Understanding, Benjamin/Cummings, 1995.
- Stephen J. DeRose, "Grammatical Category Disambiguation by Statistical Optimization", Computational Linguistics, 14(1): 31-39, 1988.
- Daniel Jurafsky & James H. Martin, Speech and Language Processing, 3rd Ed., Pearson.
- R. Weischedel et al., "Coping with Ambiguity and Unknown Words through Probabilistic Models", Computational Linguistics, 19(2): 359-382, 1993.
Presented by Goutam Kumar Saha (Scientist F, Senior Member IEEE), who can be reached via <goutam.k.saha@cdackolkata.com> or <gksaha@rediffmail.com>.