Introduction to NLP. The Penn Treebank


Description
Background: from the early 1990s; developed at the University of Pennsylvania (Marcus, Santorini, and Marcinkiewicz 1993)
Size: 40,000 training sentences, 2,400 test sentences
Genre: mostly Wall Street Journal news stories, plus some spoken conversations
Importance: helped launch modern automatic parsing methods

External Links
Treebank-3: http://catalog.ldc.upenn.edu/ldc99t42
Original version: http://catalog.ldc.upenn.edu/ldc95t7
Tokenization guidelines: http://www.cis.upenn.edu/~treebank/tokenization.html
The American National Corpus: http://www.americannationalcorpus.org/oanc/penn.html

Penn Treebank tagset (1/2)
Tag    Description                              Example
CC     coordinating conjunction                 and
CD     cardinal number                          1, third
DT     determiner                               the
EX     existential there                        there is
FW     foreign word                             d'oeuvre
IN     preposition/subordinating conjunction    in, of, like
JJ     adjective                                green
JJR    adjective, comparative                   greener
JJS    adjective, superlative                   greenest
LS     list marker                              1)
MD     modal                                    could, will
NN     noun, singular or mass                   table
NNS    noun, plural                             tables
NNP    proper noun, singular                    John
NNPS   proper noun, plural                      Vikings
PDT    predeterminer                            both the boys
POS    possessive ending                        friend's

Penn Treebank tagset (2/2)
Tag    Description                              Example
PRP    personal pronoun                         I, he, it
PRP$   possessive pronoun                       my, his
RB     adverb                                   however, usually, naturally, here, good
RBR    adverb, comparative                      better
RBS    adverb, superlative                      best
RP     particle                                 give up
TO     to                                       to go, to him
UH     interjection                             uh, huh
VB     verb, base form                          take
VBD    verb, past tense                         took
VBG    verb, gerund/present participle          taking
VBN    verb, past participle                    taken
VBP    verb, non-3rd person singular present    take
VBZ    verb, 3rd person singular present        takes
WDT    wh-determiner                            which
WP     wh-pronoun                               who, what
WP$    possessive wh-pronoun                    whose
WRB    wh-adverb                                where, when
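As a quick illustration of working with the tagset, here is a minimal sketch that collapses the fine-grained tags into coarse word classes, a common preprocessing step. The `coarse_tag` helper and the grouping it uses are a convention chosen for illustration, not part of the Treebank distribution.

```python
def coarse_tag(ptb_tag: str) -> str:
    """Map a fine-grained Penn Treebank POS tag to a coarse class.

    The grouping below is a hypothetical convention for illustration only.
    """
    if ptb_tag.startswith("VB"):
        return "VERB"   # VB, VBD, VBG, VBN, VBP, VBZ
    if ptb_tag.startswith("NN"):
        return "NOUN"   # NN, NNS, NNP, NNPS
    if ptb_tag.startswith("JJ"):
        return "ADJ"    # JJ, JJR, JJS
    if ptb_tag.startswith("RB") or ptb_tag == "WRB":
        return "ADV"    # RB, RBR, RBS, WRB
    return "OTHER"      # everything else (DT, IN, CC, ...)

tagged = [("takes", "VBZ"), ("greener", "JJR"), ("Vikings", "NNPS"), ("usually", "RB")]
print([(word, coarse_tag(tag)) for word, tag in tagged])
```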

Example Sentence (WSJ/12/WSJ_1273.MRG, sentence 11)
Because the CD had an effective yield of 13.4 % when it was issued in 1984, and interest rates in general had declined sharply since then, part of the price Dr. Blumenfeld paid was a premium -- an additional amount on top of the CD 's base value plus accrued interest that represented the CD 's increased market value.

Parsed sentence (excerpt)
(S (SBAR-PRP (IN Because)
     (S (S (NP-SBJ (DT the) (NNP CD))
           (VP (VBD had)
               (NP (NP (DT an) (JJ effective) (NN yield))
                   (PP (IN of) (NP (CD 13.4) (NN %))))
               (SBAR-TMP (WHADVP-4 (WRB when))
                         (S (NP-SBJ-1 (PRP it))
                            (VP (VBD was)
                                (VP (VBN issued)
                                    (NP (-NONE- *-1))
                                    (PP-TMP (IN in) (NP (CD 1984)))
                                    (ADVP-TMP (-NONE- *T*-4)))))))) ...

(S (SBAR-PRP (IN Because)
     (S (S (NP-SBJ (DT the) (NNP CD))
           (VP (VBD had)
               (NP (NP (DT an) (JJ effective) (NN yield))
                   (PP (IN of) (NP (CD 13.4) (NN %))))
               (SBAR-TMP (WHADVP-4 (WRB when))
                         (S (NP-SBJ-1 (PRP it))
                            (VP (VBD was)
                                (VP (VBN issued)
                                    (NP (-NONE- *-1))
                                    (PP-TMP (IN in) (NP (CD 1984)))
                                    (ADVP-TMP (-NONE- *T*-4))))))))
        (, ,) (CC and)
        (S (NP-SBJ (NP (NN interest) (NNS rates))
                   (PP (IN in) (ADJP (JJ general))))
           (VP (VBD had)
               (VP (VBN declined)
                   (ADVP-MNR (RB sharply))
                   (PP-TMP (IN since) (RB then)))))))
   (, ,)
   (NP-SBJ (NP (NN part))
           (PP (IN of)
               (NP (NP (DT the) (NN price))
                   (SBAR (WHNP-3 (-NONE- 0))
                         (S (NP-SBJ (NNP Dr.) (NNP Blumenfeld))
                            (VP (VBD paid)
                                (NP (-NONE- *T*-3))))))))
   (VP (VBD was)
       (NP-PRD (NP (DT a) (NN premium))
               (: --)
               (NP (NP (NP (DT an) (JJ additional) (NN amount))
                       (PP-LOC (IN on)
                               (NP (NP (NN top))
                                   (PP (IN of)
                                       (NP (NP (DT the) (NNP CD) (POS 's))
                                           (NN base) (NN value))))))
                   (CC plus)
                   (NP (VBN accrued) (NN interest)))
               (SBAR (WHNP-2 (WDT that))
                     (S (NP-SBJ (-NONE- *T*-2))
                        (VP (VBD represented)
                            (NP (NP (DT the) (NNP CD) (POS 's))
                                (VBN increased) (NN market) (NN value)))))))
   (. .))


Peculiarities
Complementizers: e.g., that
Gaps: -NONE- (null elements)
SBAR → COMP S, e.g., that -NONE- represented the CD market value
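The bracketed notation used in the trees above can be read with a small recursive-descent parser. The sketch below is illustrative only (the helpers `parse_ptb` and `tagged_words` are not the official Treebank tools): it turns a bracketed string into nested (label, children) tuples and extracts the POS-tagged leaves.

```python
import re

def parse_ptb(s):
    """Parse a bracketed Penn Treebank-style string into (label, children) tuples.

    Leaves appear as bare strings under their POS-tag node, e.g. ("DT", ["the"]).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def tagged_words(tree):
    """Yield (word, tag) pairs from the tree's leaves, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from tagged_words(child)
        else:
            yield (child, label)

t = parse_ptb("(S (NP-SBJ (DT the) (NNP CD)) "
              "(VP (VBD had) (NP (DT an) (JJ effective) (NN yield))))")
print(list(tagged_words(t)))
```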

tgrep
A < B      A immediately dominates B
A << B     A dominates B
A <- B     B is the last child of A
A <<, B    B is a leftmost descendant of A
A <<` B    B is a rightmost descendant of A
A . B      A immediately precedes B
A .. B     A precedes B
A $ B      A and B are sisters
A $. B     A and B are sisters and A immediately precedes B
A $.. B    A and B are sisters and A precedes B
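The two dominance relations can be made concrete with a short sketch. This is not the tgrep tool itself: it expresses `A < B` and `A << B` over simple (label, children) tuples chosen here for illustration.

```python
def immediately_dominates(a, b_label):
    """tgrep 'A < B': some direct child of a carries label b_label."""
    label, children = a
    return any(isinstance(c, tuple) and c[0] == b_label for c in children)

def dominates(a, b_label):
    """tgrep 'A << B': some descendant of a, at any depth, carries label b_label."""
    label, children = a
    for c in children:
        if isinstance(c, tuple):
            if c[0] == b_label or dominates(c, b_label):
                return True
    return False

# Toy tree for "had the yield": S -> VP -> {VBD, NP}
np = ("NP", [("DT", ["the"]), ("NN", ["yield"])])
vp = ("VP", [("VBD", ["had"]), np])
s = ("S", [vp])

print(immediately_dominates(s, "NP"))  # False: NP is not a direct child of S
print(dominates(s, "NP"))              # True: NP is a descendant of S
```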

The Use of Treebanks
Disadvantages
  Much more work to annotate 40K+ sentences than to write a grammar
Advantages
  Statistics about different constituents and phenomena
  Training systems
  Evaluating systems
  Multilingual extensions

Introduction to NLP. Parsing Evaluation

Evaluation Methodology (1/2)
Classification tasks
  Document retrieval
  Part-of-speech tagging
  Parsing
Data split
  Training
  Dev-test
  Test

Evaluation Methodology (2/2)
Baselines
  Dumb baseline
  Intelligent baseline
  Human performance (ceiling)
  New method
Evaluation methods
  Accuracy
  Precision and recall
  Multiple references
  Interjudge agreement

Kappa
κ = (P(A) − P(E)) / (1 − P(E))
Agreement vs. expected agreement
  P(A) is the observed agreement between the judges
  P(E) is the expected probability of agreement by chance
When κ > 0.7, agreement is considered high
Question: judge agreement on a binary classification task is 60%; is this high?

Answer
κ = (P(A) − P(E)) / (1 − P(E))
Data: P(A) = 0.6, P(E) = 0.5
Kappa: κ = 0.1 / 0.5 = 0.2, which is not high
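The kappa arithmetic can be checked with a one-line function. This sketch takes the aggregate agreement figures as inputs, as on the slide, rather than computing them from a confusion matrix.

```python
def kappa(p_a, p_e):
    """Cohen's kappa from observed agreement P(A) and chance agreement P(E)."""
    return (p_a - p_e) / (1 - p_e)

# Two judges agreeing 60% of the time on a balanced binary task (P(E) = 0.5):
print(round(kappa(0.6, 0.5), 2))  # 0.2 -- well below the 0.7 threshold
```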

Parsing Evaluation
Precision and recall: get the proper constituents
Labeled precision and recall: also get the correct non-terminal labels
F1: harmonic mean of precision and recall
Crossing brackets: (A (B C)) vs. ((A B) C)
PTB corpus split: training sections 02-21, development section 22, test section 23

Evaluation Example
GOLD = (S (NP (DT The) (JJ Japanese) (JJ industrial) (NNS companies))
          (VP (MD should)
              (VP (VB know)
                  (ADVP (JJR better))))
          (. .))
CHAR = (S (NP (DT The) (JJ Japanese) (JJ industrial) (NNS companies))
          (VP (MD should)
              (VP (VB know))
              (S (ADVP (RBR better))))
          (. .))
Bracketing Recall    = 80.00
Bracketing Precision = 66.67
Bracketing FMeasure  = 72.73
Complete match       = 0.00
No crossing          = 100.00
Tagging accuracy     = 87.50
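PARSEVAL-style bracketing scores like those above can be computed from sets of labeled spans. The sketch below is illustrative, not the evalb tool: the `(label, start, end)` spans are hand-derived from the GOLD and CHAR trees over the seven words, under the usual assumption that punctuation is excluded from bracketing.

```python
def prf(gold, test):
    """Labeled bracketing precision, recall, and F1 over sets of
    (label, start, end) constituent spans."""
    correct = len(gold & test)
    precision = correct / len(test)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Spans over "The Japanese industrial companies should know better" (words 0-6):
gold = {("S", 0, 7), ("NP", 0, 4), ("VP", 4, 7), ("VP", 5, 7), ("ADVP", 6, 7)}
test = {("S", 0, 7), ("NP", 0, 4), ("VP", 4, 7), ("VP", 5, 6), ("S", 6, 7),
        ("ADVP", 6, 7)}

p, r, f = prf(gold, test)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")  # P=66.67% R=80.00% F1=72.73%
```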
