Pre-processing and annotation. Tokenization. Sentence Boundary Detection

Inf1-DA 2010–2011 II: 83 / 119
Pre-processing and annotation

Raw data from a linguistic source can't be exploited directly. We first have to perform:
- pre-processing: identify the basic units in the corpus: tokenization; sentence boundary detection;
- annotation: add task-specific information: parts of speech; syntactic structure; dialogue structure, prosody, etc.

Inf1-DA 2010–2011 II: 84 / 119
Tokenization

Tokenization: divide the raw textual data into tokens (words, numbers, punctuation marks).
Word: a continuous string of alphanumeric characters delineated by whitespace (space, tab, newline).
Example: potentially difficult cases:
- amazon.com, Micro$oft
- John's, isn't, rock'n'roll
- child-as-required-yuppie-possession (as in: "The idea of a child-as-required-yuppie-possession must be motivating them.")
- cul de sac

Inf1-DA 2010–2011 II: 85 / 119
Sentence Boundary Detection

Sentence boundary detection: identify the start and end of sentences.
Sentence: a string of words ending in a full stop, question mark or exclamation mark. This heuristic is correct about 90% of the time.
Example: potentially difficult cases:
- Dr. Foster went to Gloucester.
- He said "rubbish!".
- He lost cash on lastminute.com.
The detection of word and sentence boundaries is particularly difficult for spoken data.
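A minimal sketch of both steps in Python. The token pattern, the boundary rule, and the abbreviation list are illustrative assumptions, not part of the slides; real systems handle many more cases:

import re

# Naive tokenizer: a run of alphanumerics (plus $), optionally extended by
# hyphens, apostrophes or internal dots, or a single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z0-9$]+(?:[-'.][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenize(text):
    return TOKEN_RE.findall(text)

# Naive sentence splitter: ., ? or ! followed by whitespace and a capital
# letter ends a sentence -- unless the last token is a known abbreviation
# (the list below is a hypothetical stand-in for a real one).
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!](?=\s+[A-Z])", text):
        candidate = text[start:m.end()].strip()
        if candidate.split()[-1] not in ABBREVIATIONS:
            sentences.append(candidate)
            start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(tokenize("John's brother isn't on amazon.com."))
# ["John's", 'brother', "isn't", 'on', 'amazon.com', '.']
print(split_sentences("Dr. Foster went to Gloucester. He lost cash on lastminute.com."))
# ['Dr. Foster went to Gloucester.', 'He lost cash on lastminute.com.']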

Inf1-DA 2010–2011 II: 86 / 119
Corpus Annotation

Annotation: adds information that is not explicit in the data itself and increases its usefulness (often application-specific).
Annotation scheme: the basis for annotation; consists of a tag set and annotation guidelines.
Tag set: an inventory of labels for markup.
Annotation guidelines: tell annotators (domain experts) how the tag set is to be applied; ensure consistency across different annotators.

Inf1-DA 2010–2011 II: 87 / 119
Part-of-speech (POS) annotation

Part-of-speech (POS) tagging is the most basic kind of linguistic annotation. Each linguistic token is assigned a code indicating its part of speech, i.e., its basic grammatical status. Examples of POS information:
- singular common noun;
- comparative adjective;
- past participle.
POS tagging forms a basic first step in the disambiguation of homographs. E.g., it distinguishes between the verb "boot" and the noun "boot". But it does not distinguish between "boot" meaning kick and "boot" as in booting a computer, both of which are transitive verbs.

Inf1-DA 2010–2011 II: 88 / 119
Example POS tag sets

- CLAWS (Constituent Likelihood Automatic Word-tagging System) tag set, used for the BNC: 62 tags;
- Brown tag set, used for the Brown corpus: 87 tags;
- Penn tag set, used for the Penn Treebank: 45 tags.

Category              Examples                 CLAWS  Brown  Penn
Adjective             happy, bad               AJ0    JJ     JJ
Adverb                often, badly             AV0    RB     RB
Determiner            this, each               DT0    DT     DT
Noun                  aircraft, data           NN0    NN     NN
Noun singular         woman, book              NN1    NN     NN
Noun plural           women, books             NN2    NNS    NNS
Noun proper singular  London, Michael          NP0    NP     NNP
Noun proper plural    Australians, Methodists  NP0    NPS    NNPS
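The Penn tags in the rightmost column are what NLTK's default tagger produces. A small sketch of the "boot" example from slide 87; this assumes NLTK is installed locally, and resource names may differ slightly across NLTK versions:

import nltk

# One-off model downloads (an assumption about the local setup;
# skip if the resources are already installed).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["The boot was muddy.", "They boot the computer."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Expected: boot/NN (noun) in the first sentence, boot/VBP (verb) in the
# second -- but no tag separates "kick" from "start a computer".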

Inf1-DA 2010–2011 II: 89 / 119
POS Tagging

Idea: automate POS tagging by looking up the POS of a word in a dictionary.
Problem: POS ambiguity: words can have several possible POSs, e.g.:

  Time flies like an arrow. (1)

- time: singular noun or a verb;
- flies: plural noun or a verb;
- like: singular noun, verb, or preposition.
Combinatorial explosion: (1) can be assigned 2 × 2 × 3 = 12 different POS sequences.
We need more information to resolve such ambiguities. It might seem that higher-level meaning (semantics) would be needed, but in fact great improvement is possible using the probabilities of the different POSs.

Inf1-DA 2010–2011 II: 90 / 119
Probabilistic POS tagging

Observation: words can have more than one POS, but one of them is more frequent than the others.
Idea: assign each word its most frequent POS (get the frequencies from manually annotated training data). Accuracy: around 90%.
Improvement: use frequencies of POS sequences, and other context clues. Accuracy: 96–98%.
Example output from a POS tagger (not XML format!):

  Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/,
  and/CC so/RB are/VBP we/PRP ./. They/PRP never/RB stop/VBP thinking/VBG
  about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC
  our/PRP$ people/NNS ,/, and/CC neither/DT do/VBP we/PRP ./.
  (George W. Bush)

Inf1-DA 2010–2011 II: 91 / 119
Use of markup languages

An important general application of markup languages, such as XML, is to separate data from metadata. In a corpus, this serves to keep different types of information apart:
- Data is just the raw data. In a corpus this is the text itself.
- Metadata is data about the data. In a corpus this is the various annotations.
Nowadays, XML is the most widely used markup language for corpora. The example on the next slide is taken from the BNC XML Edition, which was released only in 2007. (The previous BNC World Edition was formatted in SGML.)
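A minimal sketch of the most-frequent-POS baseline from slide 90. The tiny "manually annotated" training set is invented for illustration; a real tagger would estimate frequencies from a corpus such as the Penn Treebank:

from collections import Counter, defaultdict

# Toy training data: (word, POS) pairs, with counts chosen so that each
# ambiguous word has one clearly dominant tag.
training = [
    ("time", "NN"), ("time", "NN"), ("time", "VB"),
    ("flies", "VBZ"), ("flies", "VBZ"), ("flies", "NNS"),
    ("like", "IN"), ("like", "IN"), ("like", "VB"),
    ("an", "DT"), ("arrow", "NN"),
]

# Count how often each word occurs with each tag ...
counts = defaultdict(Counter)
for word, pos in training:
    counts[word][pos] += 1

# ... and keep only the most frequent tag per word.
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens, default="NN"):
    """Assign each token its most frequent POS (default for unseen words)."""
    return [(t, most_frequent.get(t, default)) for t in tokens]

print(tag("time flies like an arrow".split()))
# [('time', 'NN'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]

Because the lookup ignores context, every occurrence of a word gets the same tag; the 96–98% figure on slide 90 comes from adding POS-sequence frequencies on top of this baseline.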

Inf1-DA 2010–2011 II: 92 / 119
Example from the BNC XML Edition

<wtext type="FICTION">
 <div level="1">
  <head>
   <s n="1">
    <w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w>
    <w c5="CRD" hw="1" pos="ADJ">1</w>
   </s>
  </head>
  <p>
   <s n="2">
    <c c5="PUQ">‘</c>
    <w c5="CJC" hw="but" pos="CONJ">But</w>
    <c c5="PUN">,</c>
    <c c5="PUQ">’</c>
    <w c5="VVD" hw="say" pos="VERB">said </w>
    <w c5="NP0" hw="owen" pos="SUBST">Owen</w>
    <c c5="PUN">,</c>
    <c c5="PUQ">‘</c>
    <w c5="AVQ" hw="where" pos="ADV">where </w>
    <w c5="VBZ" hw="be" pos="VERB">is </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="body" pos="SUBST">body</w>
    <c c5="PUN">?</c>
    <c c5="PUQ">’</c>
   </s>
  </p>
  ...
 </div>
</wtext>

Inf1-DA 2010–2011 II: 93 / 119
Aspects of this example

This example is the opening text of J10, a novel by Michael Pearce. Some aspects of the tagging:
- The wtext element stands for written text. The attribute type indicates the genre.
- The head element tags a portion of header text (in this case a chapter heading).
- The s element tags sentences. (N.B., a chapter heading counts as a sentence.) Sentences are numbered via the attribute n.
- The w element tags words. The attribute pos is a POS tag, with more detailed POS information given by the c5 attribute, which contains the CLAWS code. The attribute hw represents the root form of the word (e.g., the root form of "said" is "say").
- The c element tags punctuation.

Inf1-DA 2010–2011 II: 94 / 119
Syntactic annotation (parsing)

Syntactic annotation: information about the structure of sentences. It is a prerequisite for computing meaning. Linguists use phrase markers to indicate which parts of a sentence belong together:
- noun phrase (NP): noun and its adjectives, determiners, etc.;
- verb phrase (VP): verb and its objects;
- prepositional phrase (PP): preposition and its NP;
- sentence (S): VP and its subject NP.
Phrase markers group hierarchically in a syntax tree. Syntactic annotation can be automated. Accuracy: around 90%.
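As an aside, annotated XML like the BNC example on slide 92 is straightforward to process programmatically. A minimal sketch using Python's standard xml.etree.ElementTree library; the fragment is a shortened, well-formed adaptation of that example:

import xml.etree.ElementTree as ET

# A shortened, well-formed fragment in the style of the BNC example.
FRAGMENT = """<s n="2">
  <w c5="VVD" hw="say" pos="VERB">said </w>
  <w c5="NP0" hw="owen" pos="SUBST">Owen</w>
  <c c5="PUN">,</c>
</s>"""

sentence = ET.fromstring(FRAGMENT)
for element in sentence:
    if element.tag == "w":      # words carry CLAWS code, headword and POS
        print(element.text.strip(), element.get("c5"), element.get("hw"))
    elif element.tag == "c":    # punctuation carries only the CLAWS code
        print(element.text, element.get("c5"))
# said VVD say
# Owen NP0 owen
# , PUN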

Inf1-DA 2010–2011 II: 95 / 119
Example syntax tree

Sentence from the Penn Treebank corpus, shown here as a bracketed syntax tree:

(S (NP (PRP They))
   (VP (VB saw)
       (NP (NP (DT the) (NN president))
           (PP (IN of)
               (NP (DT the) (NN company))))))

Inf1-DA 2010–2011 II: 96 / 119
The same syntax tree in XML:

<s>
  <np><w pos="prp">They</w></np>
  <vp><w pos="vb">saw</w>
    <np>
      <np><w pos="dt">the</w> <w pos="nn">president</w></np>
      <pp><w pos="in">of</w>
        <np><w pos="dt">the</w> <w pos="nn">company</w></np>
      </pp>
    </np>
  </vp>
</s>

Note the conventions used in the above document: phrase markers are represented as elements, whereas POS tags are given as attribute values.
N.B. The tree on the previous slide is not the XML element tree generated by this document.
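The final N.B. can be made concrete with a short sketch: walking the XML element tree (elements s, np, vp, ...) reconstructs the bracketed syntax tree. The rendering function below is illustrative, not part of the slides:

import xml.etree.ElementTree as ET

# The slide's tree encoded in XML, compacted onto a few lines.
XML_TREE = """<s><np><w pos="prp">They</w></np><vp><w pos="vb">saw</w>
<np><np><w pos="dt">the</w><w pos="nn">president</w></np>
<pp><w pos="in">of</w><np><w pos="dt">the</w><w pos="nn">company</w></np></pp>
</np></vp></s>"""

def bracketed(element):
    """Render an XML element as a bracketed syntax tree, upper-casing labels."""
    if element.tag == "w":                      # leaf: POS tag plus word
        return "(%s %s)" % (element.get("pos").upper(), element.text)
    children = " ".join(bracketed(child) for child in element)
    return "(%s %s)" % (element.tag.upper(), children)

print(bracketed(ET.fromstring(XML_TREE)))
# (S (NP (PRP They)) (VP (VB saw) (NP (NP (DT the) (NN president))
#  (PP (IN of) (NP (DT the) (NN company))))))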