Pre-processing and annotation


Raw data from a linguistic source can't be exploited directly. We first have to perform:
- pre-processing: identify the basic units in the corpus: tokenization; sentence boundary detection;
- annotation: add task-specific information: parts of speech; syntactic structure; dialogue structure, prosody, etc.

Tokenization
Tokenization: divide the raw textual data into tokens (words, numbers, punctuation marks).
Word: a continuous string of alphanumeric characters delineated by whitespace (space, tab, newline).
Example: potentially difficult cases:
- amazon.com, Micro$oft
- John's, isn't, rock 'n' roll
- child-as-required-yuppie-possession (as in: "The idea of a child-as-required-yuppie-possession must be motivating them.")
- cul de sac
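A minimal Python sketch of the whitespace definition above (not from the slides; the function and example string are illustrative), showing how the difficult cases defeat it:

def naive_tokenize(text):
    # A maximal run of non-whitespace characters counts as one token.
    return text.split()

print(naive_tokenize("John's dog isn't at amazon.com."))
# ["John's", 'dog', "isn't", 'at', 'amazon.com.']
# Clitics ("John's", "isn't") stay fused with their hosts, and the
# final token wrongly absorbs the sentence-ending full stop.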

Sentence boundary detection
Sentence boundary detection: identify the start and end of sentences.
Sentence: string of words ending in a full stop, question mark or exclamation mark. This is correct 90% of the time.
Example: potentially difficult cases:
- Dr. Foster went to Gloucester.
- He said "rubbish!".
- He lost cash on lastminute.com.
The detection of word and sentence boundaries is particularly difficult for spoken data.
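The 90%-correct rule can be written down directly; a sketch (the regular expression is my assumption, not part of the slides), showing the failure on abbreviations:

import re

def naive_sentences(text):
    # Split after '.', '?' or '!' whenever whitespace follows.
    return re.split(r"(?<=[.?!])\s+", text.strip())

print(naive_sentences("Dr. Foster went to Gloucester. He lost cash on lastminute.com."))
# ['Dr.', 'Foster went to Gloucester.', 'He lost cash on lastminute.com.']
# The abbreviation "Dr." is wrongly treated as a sentence boundary;
# "lastminute.com." survives only because no whitespace follows its dots.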

Corpus annotation
Annotation: adds information that is not explicit in the data itself; increases its usefulness (often application-specific).
Annotation scheme: basis for annotation; consists of a tag set and annotation guidelines.
Tag set: an inventory of labels for markup.
Annotation guidelines: tell annotators (domain experts) how the tag set is to be applied; ensure consistency across different annotators.

Part-of-speech (POS) annotation
Part-of-speech (POS) tagging is the most basic kind of linguistic annotation. Each linguistic token is assigned a code indicating its part of speech, i.e., its basic grammatical status. Examples of POS information:
- singular common noun;
- comparative adjective;
- past participle.
POS tagging forms a basic first step in the disambiguation of homographs. E.g., it distinguishes between the verb "boot" and the noun "boot". But it does not distinguish between "boot" meaning kick and "boot" as in "boot a computer", both of which are transitive verbs.
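For illustration only (assuming the NLTK library and its default English model are available; exact output may differ across versions), an off-the-shelf tagger makes exactly this distinction:

import nltk
nltk.download('averaged_perceptron_tagger')  # default tagger model; resource name may vary by NLTK version

print(nltk.pos_tag("I boot the computer".split()))
# e.g. [('I', 'PRP'), ('boot', 'VBP'), ('the', 'DT'), ('computer', 'NN')]
print(nltk.pos_tag("He wore one boot".split()))
# e.g. [('He', 'PRP'), ('wore', 'VBD'), ('one', 'CD'), ('boot', 'NN')]
# Verb vs. noun 'boot' is separated, but 'boot' = kick and
# 'boot' = start a computer would both receive a verb tag.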

Example POS tag sets
- CLAWS tag set (Constituent Likelihood Automatic Word-tagging System; used for the BNC): 62 tags;
- Brown tag set (used for the Brown corpus): 87 tags;
- Penn tag set (used for the Penn Treebank): 45 tags.

Category               Examples                  CLAWS  Brown  Penn
Adjective              happy, bad                AJ0    JJ     JJ
Adverb                 often, badly              AV0    RB     RB
Determiner             this, each                DT0    DT     DT
Noun                   aircraft, data            NN0    NN     NN
Noun singular          woman, book               NN1    NN     NN
Noun plural            women, books              NN2    NNS    NNS
Noun proper singular   London, Michael           NP0    NP     NNP
Noun proper plural     Australians, Methodists   NP0    NPS    NNPS

POS tagging
Idea: automate POS tagging by looking up the POS of a word in a dictionary.
Problem: POS ambiguity: words can have several possible POS tags, e.g.:
  Time flies like an arrow. (1)
- time: singular noun or a verb;
- flies: plural noun or a verb;
- like: singular noun, verb, or preposition.
Combinatorial explosion: (1) can be assigned 2 × 2 × 3 = 12 different POS sequences. We need more information to resolve such ambiguities. It might seem that higher-level meaning (semantics) would be needed, but in fact great improvement is possible using the probabilities of different POS tags alone.
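The explosion is easy to verify by enumeration; a small sketch (Penn-style tag codes chosen for illustration):

from itertools import product

# Candidate POS tags for each ambiguous word in "Time flies like an arrow."
options = [["NN", "VB"],        # time
           ["NNS", "VBZ"],      # flies
           ["NN", "VB", "IN"]]  # like

sequences = list(product(*options))
print(len(sequences))  # 2 * 2 * 3 = 12 candidate POS sequences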

Probabilistic POS tagging
Observation: words can have more than one POS, but one of them is more frequent than the others.
Idea: assign each word its most frequent POS (get the frequencies from manually annotated training data). Accuracy: around 90%.
Improvement: use frequencies of POS sequences, and other context clues. Accuracy: 96-98%.
Example output from a POS tagger (not XML format!):
  Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/, and/CC so/RB are/VB we/PRP ./.
  They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NN ,/, and/CC neither/DT do/VB we/PRP ./.
  (George W. Bush)
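A minimal sketch of the most-frequent-POS baseline described above (the toy counts are invented for illustration; in practice they come from an annotated corpus):

from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs from hand-annotated text.
training = [("time", "NN"), ("time", "NN"), ("time", "VB"),
            ("flies", "VBZ"), ("flies", "NNS"), ("flies", "VBZ"),
            ("like", "IN"), ("like", "IN"), ("like", "VB")]

counts = defaultdict(Counter)
for word, pos in training:
    counts[word][pos] += 1

def tag(word):
    # Assign each known word its single most frequent POS.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([(w, tag(w)) for w in "time flies like".split()])
# [('time', 'NN'), ('flies', 'VBZ'), ('like', 'IN')]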

Use of markup languages
An important general application of markup languages, such as XML, is to separate data from metadata. In a corpus, this serves to keep different types of information apart.
Data is just the raw data; in a corpus, this is the text itself.
Metadata is data about the data; in a corpus, this is the various annotations.
Nowadays, XML is the most widely used markup language for corpora. The example on the next slide is taken from the BNC XML Edition, which was released only in 2007. (The previous BNC World Edition was formatted in SGML.)

Example from the BNC XML Edition

<wtext type="FICTION">
 <div level="1">
  <head>
   <s n="1">
    <w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w>
    <w c5="CRD" hw="1" pos="ADJ">1</w>
   </s>
  </head>
  <p>
   <s n="2">
    <c c5="PUQ">'</c>
    <w c5="CJC" hw="but" pos="CONJ">But</w>
    <c c5="PUN">,</c>
    <c c5="PUQ">'</c>
    <w c5="VVD" hw="say" pos="VERB">said </w>
    <w c5="NP0" hw="owen" pos="SUBST">Owen</w>
    <c c5="PUN">,</c>
    <c c5="PUQ">'</c>
    <w c5="AVQ" hw="where" pos="ADV">where </w>
    <w c5="VBZ" hw="be" pos="VERB">is </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="body" pos="SUBST">body</w>
    <c c5="PUN">?</c>
    <c c5="PUQ">'</c>
   </s>
  </p>
  ...
 </div>
</wtext>

Aspects of this example
This example is the opening text of J10, a novel by Michael Pearce. Some aspects of the tagging:
- The wtext element stands for "written text". The attribute type indicates the genre.
- The head element tags a portion of header text (in this case a chapter heading).
- The s element tags sentences. (N.B., a chapter heading counts as a sentence.) Sentences are numbered via the attribute n.
- The w element tags words. The attribute pos is a POS tag, with more detailed POS information given by the c5 attribute, which contains the CLAWS code. The attribute hw represents the root form of the word (e.g., the root form of "said" is "say").
- The c element tags punctuation.
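Because the annotation is plain XML, it can be read back with any XML library; a sketch using Python's standard xml.etree module (the fragment is abbreviated from the example above):

import xml.etree.ElementTree as ET

fragment = """<s n="2">
  <w c5="CJC" hw="but" pos="CONJ">But</w>
  <w c5="VVD" hw="say" pos="VERB">said </w>
  <w c5="NP0" hw="owen" pos="SUBST">Owen</w>
</s>"""

# Print each word with its CLAWS code, root form (hw) and coarse POS.
for w in ET.fromstring(fragment).iter("w"):
    print(w.text.strip(), w.get("c5"), w.get("hw"), w.get("pos"))
# But CJC but CONJ
# said VVD say VERB
# Owen NP0 owen SUBST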

Syntactic annotation (parsing)
Syntactic annotation: information about the structure of sentences; a prerequisite for computing meaning. Linguists use phrase markers to indicate which parts of a sentence belong together:
- noun phrase (NP): noun and its adjectives, determiners, etc.;
- verb phrase (VP): verb and its objects;
- prepositional phrase (PP): preposition and its NP;
- sentence (S): VP and its subject.
Phrase markers group hierarchically in a syntax tree. Syntactic annotation can be automated. Accuracy: around 90%.

Example syntax tree
Sentence from the Penn Treebank corpus (the original slide shows a tree diagram; here it is given as the equivalent labelled bracketing):

(S (NP (PRP They))
   (VP (VB saw)
       (NP (NP (DT the) (NN president))
           (PP (IN of)
               (NP (DT the) (NN company))))))

The same syntax tree in XML:

<s>
  <np><w pos="PRP">They</w></np>
  <vp><w pos="VB">saw</w>
    <np>
      <np><w pos="DT">the</w> <w pos="NN">president</w></np>
      <pp><w pos="IN">of</w>
        <np><w pos="DT">the</w> <w pos="NN">company</w></np>
      </pp>
    </np>
  </vp>
</s>

Note the conventions used in the above document: phrase markers are represented as elements, whereas POS tags are given as attribute values.
N.B. The tree on the previous slide is not the XML element tree generated by this document.
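The phrase structure can be recovered from such a document by a recursive walk; a minimal sketch (element and attribute names as in the example above):

import xml.etree.ElementTree as ET

xml = """<s><np><w pos="PRP">They</w></np>
<vp><w pos="VB">saw</w>
<np><np><w pos="DT">the</w> <w pos="NN">president</w></np>
<pp><w pos="IN">of</w>
<np><w pos="DT">the</w> <w pos="NN">company</w></np></pp></np></vp></s>"""

def brackets(node):
    if node.tag == "w":                          # leaf: (POS word)
        return f"({node.get('pos')} {node.text})"
    kids = " ".join(brackets(c) for c in node)   # phrase marker + children
    return f"({node.tag.upper()} {kids})"

print(brackets(ET.fromstring(xml)))
# (S (NP (PRP They)) (VP (VB saw) (NP (NP (DT the) (NN president))
#  (PP (IN of) (NP (DT the) (NN company))))))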