Lecture Outline. Word-Classes and Part-of-Speech Tagging. Definition. An Example


Word-Classes and Part-of-Speech Tagging

Christopher Brewster
University of Sheffield, Computer Science Department
Natural Language Processing Group
C.Brewster@dcs.shef.ac.uk

Lecture Outline
  Definition and Example
  Motivation
  Word-classes
  A Basic Tagging System
  Transformation-Based Tagging
  Tagging Unknown Words

Definition
  "the process of assigning a part-of-speech or other lexical class marker to each word in a corpus"
  (D. Jurafsky and J.H. Martin, 2000, Speech and Language Processing)

  WORDS: the girl kissed the boy on the cheek
  TAGS:  N, V, P, ART

An Example
  word     lemma    tag
  The      the      +DET
  girl     girl     +NOUN
  kissed   kiss     +VPAST
  the      the      +DET
  boy      boy      +NOUN
  on       on       +PREP
  the      the      +DET
  cheek    cheek    +NOUN
  from http://www.xrce.xerox.com/research/mltt/toolhome.html

Motivation: the uses of Tagging
  Speech synthesis: pronunciation
  Speech recognition: class-based N-grams
  Information retrieval: stemming
  Word-sense disambiguation
  Corpus analysis of language & lexicography

Word Classes
  Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, ...
  Open vs. closed classes.
  Closed classes, e.g.
    determiners: a, an, the
    pronouns: she, he, I, others
    prepositions: on, under, over, near, by, at, from, to, with

Word Classes: Tag sets
  Vary in number of tags: a dozen to over 200
  Size of tag sets depends on language, objectives and purpose
  Simple morphology = more ambiguity = fewer tags
  Some tagging approaches (e.g. constraint grammar based) make fewer distinctions, e.g. conflating adverbs, particles and interjections

Word Classes: Tag set example
  Tag   Description                Example
  CC    coordinating conjunction   and, but, or
  CD    cardinal number            one, two, three
  DT    determiner                 a, the
  EX    existential there          there
  FW    foreign word               mea culpa
  IN    preposition                of, in, by
  JJ    adjective                  yellow
  JJR   adjective, comparative     bigger
  NN    noun, singular or mass     llama
  NNS   noun, plural               llamas
  from the Penn Treebank part-of-speech tag set.

The Problem
  Words often have more than one word class, e.g. this:
    This is a nice day      = PR
    This day is nice        = ADJ
    You can go this far.    = ADV

Word Class Ambiguity (in the Brown Corpus)
  Unambiguous (1 tag)      35,340
  Ambiguous (2-7 tags)      4,100
    2 tags                  3,760
    3 tags                    264
    4 tags                     61
    5 tags                     12
    6 tags                      2
    7 tags                      1  (still)
  from DeRose (1988)

A Basic System: the PARTS program
  PARTS: A System for Assigning Word Classes to English Texts, L.L. Cherry
  Uses a list of function words, and a list of suffixes and auxiliaries, as key sources of information
  Many combination classes, e.g. noun_adj
  Words that are members of more than 2 classes are initially assigned unk

PARTS: input
  List of function words and irregular verbs with tags:
    able, adj       will, aux     or, conj       outside, prep
    every, adj      do, auxv      but, conj      up, prep
    own, adj        be, be        begun, ed      over, prep
    ago, adj_adv    and, conj     bitten, ed     until, prep_adv
  List of suffixes with the most probable tag for words with that suffix:
    ic, adj      ship, noun    age, noun    ment, noun
    ance, noun   ant, noun_adj ize, verb    ary, adj
    Suffixes are chosen by hand
    If most words with a suffix have only 1 or 2 tags, this single or combined class is assigned, and exceptions are added to an exception list
    The exception list has many obscure words
  A text
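The two PARTS input lists above can be read as lookup tables. The following is a minimal sketch of that idea, not Cherry's implementation: the function name `lookup`, the longest-suffix-first policy, and the use of `None` for "no class assigned yet" are all assumptions made for illustration. The table contents are copied from the slide.

```python
# Function words and irregular verbs, as listed on the slide.
FUNCTION_WORDS = {
    "able": "adj", "will": "aux", "or": "conj", "outside": "prep",
    "every": "adj", "do": "auxv", "but": "conj", "up": "prep",
    "own": "adj", "be": "be", "begun": "ed", "over": "prep",
    "ago": "adj_adv", "and": "conj", "bitten": "ed", "until": "prep_adv",
}

# Suffixes with their most probable class, as listed on the slide.
SUFFIXES = {
    "ic": "adj", "ship": "noun", "age": "noun", "ment": "noun",
    "ance": "noun", "ant": "noun_adj", "ize": "verb", "ary": "adj",
}


def lookup(word):
    """Return a class from the function-word list, else from the suffix list.

    Returns None when neither list applies (hypothetical convention).
    """
    w = word.lower()
    if w in FUNCTION_WORDS:
        return FUNCTION_WORDS[w]
    # Assumption: try longer suffixes first, and never match the whole word.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) > len(suffix):
            return SUFFIXES[suffix]
    return None


for w in ["until", "friendship", "organize", "important", "llama"]:
    print(w, lookup(w))
```

Words that come back as `None` are the ones a later stage would have to classify from sentence context.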

Step 1: pre-processing
  1. tokenises words and sentences
       word = string of characters separated by blanks or punctuation
       sentence = string of words ending in . ? ! (other punctuation is treated as a comma)
  2. marks capitalised words not starting sentences as noun_adj
  3. marks hyphenated words as noun_adj
  4. looks up function words & irregular verbs in the list

Step 2: suffix analysis
  1. applies to words NOT assigned tags in step 1
  2. looks up the suffix list
  3. unassigned words go on to step 3

Step 3: word class assignment
  1. finds the verb in the sentence (using the auxiliary)
  2. finds nouns
  3. applies a set of rules of the form:
       verb_adj & ~a => verb
     i.e. if the word has been assigned the class verb_adj and the verb has not been recognised in the sentence, assign verb to it

Results and example
  95% correct assignment
  41.5% of errors arise from noun-adjective confusion
  Example:
    words:    They    act    as         messengers   for        the    legislators
    initial:  pronp   unk    prep_adv   nv_pl        prep_adv   art    nv_pl
    final:    pron    verb   prep       noun         prep       art    noun
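Parts of the steps above can be sketched in a few lines. This is an illustrative reading only, not the PARTS code: the function names `preprocess` and `apply_verb_adj_rule` are hypothetical, the tokenising regular expression is a simplification, and `~a` is interpreted here as "no verb has been recognised in the sentence so far".

```python
import re


def preprocess(sentence):
    """Step 1 sketch: tokenise one sentence, then mark capitalised
    (non-initial) and hyphenated words as noun_adj."""
    # Simplified tokeniser: letter runs, optionally joined by hyphens.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)
    marked = []
    for i, tok in enumerate(tokens):
        if "-" in tok:
            marked.append((tok, "noun_adj"))      # hyphenated word
        elif i > 0 and tok[0].isupper():
            marked.append((tok, "noun_adj"))      # capitalised, not sentence-initial
        else:
            marked.append((tok, None))            # left for the later steps
    return marked


def apply_verb_adj_rule(classes):
    """Step 3 sketch of 'verb_adj & ~a => verb' over one sentence.

    Assumption: once a verb_adj word has been promoted to verb, the
    verb counts as recognised and later verb_adj words are left alone.
    """
    verb_found = "verb" in classes
    result = []
    for c in classes:
        if c == "verb_adj" and not verb_found:
            result.append("verb")
            verb_found = True
        else:
            result.append(c)
    return result


print(preprocess("They met Mr Smith in a well-known cafe."))
print(apply_verb_adj_rule(["art", "noun", "verb_adj", "prep", "verb_adj"]))
```

The rule function only handles the single rule quoted on the slide; the full rule set of the original system is not reproduced here.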

Other methods: Stochastic Tagging
  Not based on rules, but on the probability of a certain tag occurring, given various possibilities.
  Necessitates a TRAINING CORPUS, i.e. a hand-tagged text from which to derive the probabilities.
  Problem: no probabilities for words not in the corpus
  Problem: bad results if the training corpus is very different from the test corpus

Stochastic tagging
  Method: choose the most frequent tag in the training text for each word.
  Result: 90% accuracy
  Reason: cf. the figures on word class ambiguity, where about 90% of word types (35,340 of 39,440) have only one tag
  Therefore: this is a baseline, and any other method must do significantly better
  cf. HMM tagging (lecture of Nick Webb)

Transformation-Based Learning Tagging (Brill Tagging)
  Combination of rule-based AND stochastic tagging methodologies
  Like rule-based tagging, because rules are used to specify tags in a certain environment
  Like the stochastic approach, because machine learning is applied to a tagged corpus
  Input:
    a tagged corpus
    a dictionary (with the most frequent tags)

TBL: Rule Application
  Example rule: Change NN to VB when the previous tag is TO
  For example, race has the following probabilities in the Brown corpus:
    P(NN | race) = .98
    P(VB | race) = .02
  so
    is/VBZ expected/VBN to/TO race/NN tomorrow/NN
  becomes
    is/VBZ expected/VBN to/TO race/VB tomorrow/NN
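The rule application just described can be sketched as a single pass over a tagged sentence. This is an illustration under stated assumptions, not Brill's code: the function name `apply_rule` is hypothetical, and the triggering context is read from the input tagging rather than from tags rewritten earlier in the same pass.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag.

    `tagged` is a list of (word, tag) pairs; a new list is returned.
    """
    result = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        # Context comes from the original tagging (an assumption).
        if tag == from_tag and tagged[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result


sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
retagged = apply_rule(sentence, "NN", "VB", "TO")
print(" ".join(f"{w}/{t}" for w, t in retagged))
# prints: is/VBZ expected/VBN to/TO race/VB tomorrow/NN
```

Only race changes: tomorrow is also tagged NN, but its previous tag is NN, not TO, so the rule does not fire there.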

TBL: Rule Learning
  2 parts to a rule:
    Triggering environment
    Rewrite rule
  The range of triggering environments, or templates (from Manning & Schütze 1999:363):
    Schema   ti-3  ti-2  ti-1  ti  ti+1  ti+2  ti+3
    1 to 9   [cell markings not recoverable]

TBL: Rule Learning (2)
  Templates are like underspecified rules:
    Replace tag X with tag Y, provided tag Z or word Z appears in some position
  Rules are learned in an ordered sequence: whichever rule gives the best net improvement is chosen at each iteration of the learning algorithm.
  Rules may interact, i.e. Rule 1 may make a change which provides the context for Rule 2 to fire.
  Rules are compact (a few hundred) and can be inspected by humans (vs. the impossibility of inspecting HMM transition probabilities)

TBL: the Algorithm
  Step 1: Label every word with its most likely tag (from the dictionary)
  Step 2: Check every possible transformation and select the one which most improves the tagging (with respect to the hand-tagged corpus)
  Step 3: Re-tag the corpus applying the rules
  Repeat steps 2-3 until some stopping criterion is reached, e.g. x% correct with respect to the training corpus
  RESULT: a sequence of transformation rules

TBL: Problems
  Execution speed: the TBL tagger is slow compared to the HMM approach
    Solution: compile the rules to a Finite State Transducer (FST)
  Learning speed: Brill's implementation takes over a day (600k tokens)
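The learning loop above can be sketched as a greedy search. This is a toy illustration, not Brill's implementation: it uses only one template ("change X to Y when the previous tag is Z"), builds the dictionary from the training corpus itself, and stops when no rule reduces the error count or after `max_rules` rules. All function names and the stopping criterion are assumptions.

```python
from collections import Counter, defaultdict


def most_frequent_tags(gold):
    """Dictionary of word -> most frequent tag in the hand-tagged corpus."""
    counts = defaultdict(Counter)
    for sent in gold:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag) to one sentence's tag list."""
    from_tag, to_tag, prev_tag = rule
    return [to_tag if i > 0 and t == from_tag and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]


def count_errors(current, gold_tags):
    return sum(c != g for cs, gs in zip(current, gold_tags)
               for c, g in zip(cs, gs))


def learn(gold, max_rules=10):
    lexicon = most_frequent_tags(gold)
    gold_tags = [[t for _, t in sent] for sent in gold]
    # Step 1: label every word with its most likely tag.
    current = [[lexicon[w] for w, _ in sent] for sent in gold]
    tagset = sorted({t for sent in gold_tags for t in sent})
    rules = []
    for _ in range(max_rules):
        best_rule, best_err = None, count_errors(current, gold_tags)
        # Step 2: try every instantiation of the single template.
        for a in tagset:
            for b in tagset:
                if a == b:
                    continue
                for z in tagset:
                    rule = (a, b, z)
                    err = count_errors([apply_rule(s, rule) for s in current],
                                       gold_tags)
                    if err < best_err:
                        best_rule, best_err = rule, err
        if best_rule is None:      # assumed stopping criterion: no improvement
            break
        # Step 3: re-tag the corpus with the chosen rule.
        current = [apply_rule(s, best_rule) for s in current]
        rules.append(best_rule)
    return lexicon, rules


gold = [
    [("the", "DT"), ("race", "NN")],
    [("a", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]
lexicon, rules = learn(gold)
print(lexicon["race"], rules)
```

On this three-sentence corpus, race is initially tagged NN everywhere, and the one rule learned is the "change NN to VB after TO" transformation. The exhaustive search over every rule at every iteration is also a small-scale picture of why learning is slow on a large corpus.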

Tagging Unknown Words
  New words are added to (newspaper) language at a rate of 20+ per month, plus many proper names.
  Unknown words increase error rates by 1-2%
  Method 1: assume they are nouns
  Method 2: assume the unknown words have a probability distribution similar to hapax legomena
  Method 3: use capitalisation, suffixes, etc. This works very well for morphologically complex languages

Further Reading
  Introductory:
    Jurafsky, Daniel & James H. Martin, Speech and Language Processing, Prentice Hall: 2000. Chapter 8, pp285-322
    Manning, Christopher & Hinrich Schütze, Foundations of Statistical Natural Language Processing, Chap 10, pp341-380
  Texts:
    Brill, Eric. Transformation-based error-driven learning and natural language processing: A case-study in part-of-speech tagging. Computational Linguistics 21:543-565, 1995
    Cherry, L. PARTS: a system for assigning word classes to English text. AT&T memorandum. 1978
    Church, K. A stochastic parts program and noun phrase parser for unrestricted text. Second Conference on Applied NLP, Austin, 1988
    Garside, Roger, Geoffrey Sampson and Geoffrey Leech (eds). The Computational Analysis of English: a corpus-based approach. London: 1987
  Also check the papers referred to in the Introductory references.
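Returning to Method 3 under Tagging Unknown Words: a guesser based on capitalisation and suffixes might look like the sketch below. Everything here is illustrative. The function name `guess_tag`, the suffix table, and the choice of Penn Treebank tags (NNP, VBG, VBD, RB, NNS, NN) are assumptions, and the fallback to NN simply reuses Method 1.

```python
# Hypothetical suffix table for English, checked in order.
SUFFIX_TAGS = [
    ("ing", "VBG"),
    ("ed", "VBD"),
    ("ly", "RB"),
    ("s", "NNS"),
]


def guess_tag(word, sentence_initial=False):
    """Guess a tag for a word that is not in the dictionary."""
    # Capitalised in mid-sentence: probably a proper name.
    if word[0].isupper() and not sentence_initial:
        return "NNP"
    w = word.lower()
    for suffix, tag in SUFFIX_TAGS:
        if w.endswith(suffix) and len(w) > len(suffix):
            return tag
    return "NN"   # Method 1 as the fallback: assume a noun


for w in ["Sheffield", "blogging", "googled", "quickly", "llamas", "blog"]:
    print(w, guess_tag(w))
```

A real system would estimate the suffix-to-tag associations from a training corpus rather than list them by hand.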