University of Toronto, Department of Computer Science CSC 485/2501F Computational Linguistics, Fall Assignment 1

Similar documents
Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm

CS 598 Natural Language Processing

1/20 idea. We ll spend an extra hour on 1/21. based on assigned readings. so you ll be ready to discuss them in class

Introduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions.

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many

Chapter 4: Valence & Agreement CSLI Publications

Ch VI- SENTENCE PATTERNS.

Parsing of part-of-speech tagged Assamese Texts

Basic Parsing with Context-Free Grammars. Some slides adapted from Julia Hirschberg and Dan Jurafsky 1

Proof Theory for Syntacticians

Context Free Grammars. Many slides from Michael Collins

Writing a composition

Words come in categories

a) analyse sentences, so you know what s going on and how to use that information to help you find the answer.

Emmaus Lutheran School English Language Arts Curriculum

ENGBG1 ENGBL1 Campus Linguistics. Meeting 2. Chapter 7 (Morphology) and chapter 9 (Syntax) Pia Sundqvist

The presence of interpretable but ungrammatical sentences corresponds to mismatches between interpretive and productive parsing.

Basic Syntax. Doug Arnold We review some basic grammatical ideas and terminology, and look at some common constructions in English.

Taught Throughout the Year Foundational Skills Reading Writing Language RF.1.2 Demonstrate understanding of spoken words,

Approaches to control phenomena handout Obligatory control and morphological case: Icelandic and Basque

SAMPLE. Chapter 1: Background. A. Basic Introduction. B. Why It s Important to Teach/Learn Grammar in the First Place

Senior Stenographer / Senior Typist Series (including equivalent Secretary titles)

Developing Grammar in Context

Grammars & Parsing, Part 1:

Part I. Figuring out how English works

ELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading

BULATS A2 WORDLIST 2

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature

Derivational: Inflectional: In a fit of rage the soldiers attacked them both that week, but lost the fight.

Inleiding Taalkunde. Docent: Paola Monachesi. Blok 4, 2001/ Syntax 2. 2 Phrases and constituent structure 2. 3 A minigrammar of Italian 3

First Grade Curriculum Highlights: In alignment with the Common Core Standards

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation

BASIC ENGLISH. Book GRAMMAR

Natural Language Processing. George Konidaris

Case government vs Case agreement: modelling Modern Greek case attraction phenomena in LFG

LING 329 : MORPHOLOGY

Reading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5-

Today we examine the distribution of infinitival clauses, which can be

Universal Grammar 2. Universal Grammar 1. Forms and functions 1. Universal Grammar 3. Conceptual and surface structure of complex clauses

Adjectives tell you more about a noun (for example: the red dress ).

Constraining X-Bar: Theta Theory

Chunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.

Opportunities for Writing Title Key Stage 1 Key Stage 2 Narrative

Developing a TT-MCTAG for German with an RCG-based Parser

5 th Grade Language Arts Curriculum Map

An Interactive Intelligent Language Tutor Over The Internet

Dear Teacher: Welcome to Reading Rods! Reading Rods offer many outstanding features! Read on to discover how to put Reading Rods to work today!

Comprehension Recognize plot features of fairy tales, folk tales, fables, and myths.

4 th Grade Reading Language Arts Pacing Guide

Derivational and Inflectional Morphemes in Pak-Pak Language

Advanced Grammar in Use

Pseudo-Passives as Adjectival Passives

Informatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

Character Stream Parsing of Mixed-lingual Text

Subject: Opening the American West. What are you teaching? Explorations of Lewis and Clark

Argument structure and theta roles

Written by: YULI AMRIA (RRA1B210085) ABSTRACT. Key words: ability, possessive pronouns, and possessive adjectives INTRODUCTION

Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures

The College Board Redesigned SAT Grade 12

Using a Native Language Reference Grammar as a Language Learning Tool

2017 national curriculum tests. Key stage 1. English grammar, punctuation and spelling test mark schemes. Paper 1: spelling and Paper 2: questions

Presentation skills. Bojan Jovanoski, project assistant. University Skopje Business Start-up Centre

Course Outline for Honors Spanish II Mrs. Sharon Koller

Thornhill Primary School - Grammar coverage Year 1-6

Houghton Mifflin Reading Correlation to the Common Core Standards for English Language Arts (Grade1)

On the Notion Determiner

California Department of Education English Language Development Standards for Grade 8

Control and Boundedness

ELD CELDT 5 EDGE Level C Curriculum Guide LANGUAGE DEVELOPMENT VOCABULARY COMMON WRITING PROJECT. ToolKit

Sample Goals and Benchmarks

Theoretical Syntax Winter Answers to practice problems

BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS

The Structure of Multiple Complements to V

Course Syllabus Advanced-Intermediate Grammar ESOL 0352

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Myths, Legends, Fairytales and Novels (Writing a Letter)

GERM 3040 GERMAN GRAMMAR AND COMPOSITION SPRING 2017

Building an HPSG-based Indonesian Resource Grammar (INDRA)

Construction Grammar. University of Jena.

Compositional Semantics

Loughton School s curriculum evening. 28 th February 2017

Let's Learn English Lesson Plan

Oakland Unified School District English/ Language Arts Course Syllabus

Reading Project. Happy reading and have an excellent summer!

Procedia - Social and Behavioral Sciences 154 ( 2014 )

What the National Curriculum requires in reading at Y5 and Y6

STUDENT MOODLE ORIENTATION

Houghton Mifflin Online Assessment System Walkthrough Guide

AQUA: An Ontology-Driven Question Answering System

Using dialogue context to improve parsing performance in dialogue systems

LTAG-spinal and the Treebank

Impact of Controlled Language on Translation Quality and Post-editing in a Statistical Machine Translation Environment

TABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

EAGLE: an Error-Annotated Corpus of Beginning Learner German

The building blocks of HPSG grammars. Head-Driven Phrase Structure Grammar (HPSG) HPSG grammars from a linguistic perspective

National Literacy and Numeracy Framework for years 3/4

HISTORY COURSE WORK GUIDE 1. LECTURES, TUTORIALS AND ASSESSMENT 2. GRADES/MARKS SCHEDULE

Transcription:

University of Toronto, Department of Computer Science CSC 485/2501F Computational Linguistics, Fall 2017 Assignment 1 Due date: 14:10, Friday 6 October 2017, in tutorial. Late assignments will not be accepted without a valid medical certificate or other documentation of an emergency. This assignment is worth either 25% (CSC 2501) or 33% (CSC 485) of your final grade. Fill out both sides of the assignment cover sheet, and staple together all answer sheets (in order) with the cover sheet (sparse side up) on the front. (Don t turn in a copy of this handout.) Please type your answers/reports in no less than 12pt font; if you wish to include diagrams and tree structures in your reports, they may be drawn with software or neatly by hand. Any clarifications to the problems will be posted on the course website announcements page. You are responsible for taking into account in your solutions any information that is posted there, or discussed in class, so you should check the page regularly between now and the due date. What you turn in must be your own work. You may not work with anyone else on any of the problems in this assignment. If you need assistance, contact the instructor or the TA for the assignment. 1

1. Grammars [10 marks] The grammar given below covers only declarative sentences, those that assert a proposition: People walk their dogs in parks. Two other kinds of sentences in English are imperative (commands): Walk your dogs. and interrogative (questions): Who walk their dogs in parks? What will people walk in parks? Where should people walk their dogs? Should people walk their dogs in parks? Modify the grammar by adding or changing rules so that it accounts for these kinds of sentences. Your grammar should be reasonably principled and produce good sentences without producing bad ones. For example, it should not produce sentences such as these: *What people walk in parks? *What should people walk their dogs in parks? *Where walk their dogs in parks? (where the * indicates ill-formedness). Furthermore, you should only extend the lexicon of this grammar with the words, who, what, where, should and will. Note: You do not need to account for verb endings due to person and tense (such as the difference between walk, walked and walks). We will see how to deal with these later. This grammar uses plural nouns so that you do not have to think about that. Nevertheless, some sentences will be nonsensical, although grammatical, e.g. Parks walk dogs in people. This is fine what we are after here is grammaticality. Note: This grammar is so simple that it cannot attach in parks to their dogs (i.e., people walking dogs that they keep in parks), although this is not how we would normally interpret this sentence in English anyway. This is also good enough for our purposes. Submit the source code of your grammar (in NLTK format) in a file called q1.txt. 2

Grammar: S NP VP NP N NP Det N NP Adj N PP P NP VP V VP V NP VP V NP PP Lexicon: Det the their your Adj old red happy N dogs parks statues people V saw walk ate P in to on under with 3

2. Playing with NLTK [6 marks] Try out the interactive recursive-descent and shift-reduce parser demos in NLTK (nltk. app.srparser(),nltk.app.rdparser()), and answer the following questions. a. With the shift-reduce parser and its default grammar, parse the sentence my dog saw a man in the park with a statue as many different ways as you can. How many trees are there? b. The default grammar with the shift-reduce parser includes the rule NP NP PP, but that of the recursive-descent parser does not. Replace the NP rule of the latter with this rule: NP NP PP Det N and parse the default sentence. What happens? Now try: NP Det N NP PP Is this order change an adequate solution in general? Submit your answers in a file called q2.txt. 4

3. A simple context-free grammar of English [30 marks] Your task is to develop two context-free grammars for the chart parser in NLTK, each covering an interesting and not-entirely-trivial subset of English, along with a lexicon and a set of test sentences. In the first step, you ll build a base, which you ll then use for separate grammars in each of the next two steps. After testing your grammars, you ll write a report on what they can and can t do. 3.1 Very simple sentences To get started, begin with very simple sentences consisting of a subject noun phrase (NP), an intransitive verb in simple past tense (like ate or arrived), plus some modifiers. To begin introducing recursive rules in your grammar, allow the NPs to be modified by prepositional phrases as well as adjectives. You should be able to parse sentences such as the following. (Note that punctuation and sentence-initial capitalization should not be used in the test sentences; but words that are names can always be capitalized.) Nadia left immediately the cat with the long soft fur slowly ate she arrived You should not, however, accept the following kinds of sentences. (Remember that means a sentence is ungrammatical.) Nadia with the long soft fur slowly ate (Can t attach a PP to a name) the cat with the tall her arrived (Pronoun can t take an adjective) These are obviously not the only sentences your parser and grammar should accept or reject. Part of the point of the assignment is for you to decide what grammar you need in order for your parser to correctly accept or reject a range of constructions of this kind. Note that this does not mean simply listing lots of different words in the lexicon; rather, you need to think about similar grammatical constructions. 3.2 The auxiliary system Agreement is one of the complex phenomena in natural language that makes writing good grammars difficult. This shows up in a particularly interesting way in the auxiliary system in English. In the next stage of development, extend your grammar to handle valid sequences of auxiliaries and verbs in English, such as those below. Nadia will leave Nadia has left 5

Nadia may have been leaving Nadia will left Nadia has could leave Nadia has had left See the lecture slides and Jurafsky & Martin 12.3.6 for additional possible combinations of auxiliaries. Your analysis of the English auxiliary system must include the existential passive, which means that you will need to include the past participles of some transitive verbs in your lexicon as well. Note: To keep things simple for the next part of the grammar in section 3.3, keep one rule S -> NP VP for sentences containing verbs in the simple past tense (without auxiliaries). 3.3 Subcategorization Next, increase the coverage of your base grammar to deal with verbs other than those that are intransitive. A very important aspect of language in which one constituent places strong constraints on other constituents is the phenomenon of subcategorization. Subcategorization refers to the specification that a particular word places on the possible complements (objects) that it can occur with (see lecture notes #4). Verbs take a wide variety of combinations of complements. Here are some example sentences your parser should be able to handle correctly; you should also think of other types of verbs (or get ideas from Jurafsky & Martin, 12.3.5) that you should be able to parse. Nadia fondled the eggplant the handsome poodle brought Ross to the autoclave Nadia brought a package for the cheese they told her to jump onto the elephant she believed that Ross was already on the hovercraft she really wanted help she really wanted to help cheese was always on the menu the eggplant reminded Nadia of Ross Nadia found Ross brought to him they told to jump onto the elephant There are many other complement possibilities (and impossibilities!). Note: To keep your grammar a manageable size, do not try to combine the rules for the auxiliary system from section 3.2 with the rules for verbs with different subcategorizations. That is, DO NOT add grammar rules that generate auxiliary+verb constructions except for intransitives (which have the simplest of subcategorization requirements). Complements should only be generated for verb phrases in the active voice, simple past tense. Note: You should use standard context-free rules with atomic categories, not rules with features, to account for subcategorization. We will look at features later on in the class. 6

3.4 Testing Your grammar should handle all the test cases given in this handout. They are available at: http://www.cs.toronto.edu/ gpenn/csc485/a1-test.txt, and a list of the words that they use is available at: http://www.cs.toronto.edu/ gpenn/csc485/a1-vocab.txt; note that some words can have more than one syntactic category and hence will have more than one entry in the lexicon. Also, you will have to add words to this set for your own testing. In particular, several verbs appear in this file without all of their inflected forms. Your lexicon should add the missing forms, as well as all of the inflected forms for the verbs that you add. In addition to the sentences given here, you will need to develop an additional set of your own test sentences to fully demonstrate that the grammar meets its specifications and to reveal over- and undergeneration that is, types of sentences that it shouldn t accept but does, and types of sentences that it should accept but doesn t. You will need to write testing code that uses NLTK to parse your test sentences with your grammar. A simple way to use NLTK s chart parser is demonstrated in the following: import nltk grammar = nltk.grammar.cfg.fromstring(""" S -> NP VP PP -> P NP NP -> DET N N NP PP VP -> V NP VP PP DET -> the N -> Nadia man eggplant V -> rewarded P -> with """) sentence = nltk.tokenize.word_tokenize(""" Nadia rewarded the man with the eggplant """) parser = nltk.parse.bottomupchartparser(grammar) for t in parser.parse_all(sentence): print(t) This produces two parses for the given sentence: (S (NP (N Nadia)) (VP (VP (V rewarded) (NP (DET the) (N man))) 7

(PP (P with) (NP (DET the) (N eggplant))))) (S (NP (N Nadia)) (VP (V rewarded) (NP (NP (DET the) (N man)) (PP (P with) (NP (DET the) (N eggplant)))))) Refer to the NLTK API documentation for details. Your test code will also need to read input and write output, as described in section 3.6.4 below. Note: Your code must run on the teach.cs.toronto.edu servers. The command python runs an old version of Python (2.7.13) but with the current version (3.2.4) of NLTK. Your code for this assignment must use Python 3.5 (typed python3.5), which comes with NLTK 3.2.1. 3.5 Limitations At this point, you ll be able to parse lots of interesting sentences. But there will be lots of sentences, even ones similar to those above, that you won t be able to accept or reject correctly, because your grammar will necessarily be limited. For example, you aren t asked to implement subject verb agreement. In your report, you should discuss the kinds of limitations that still exist in your final grammar, including shortcomings revealed by your tests. This section of your report is not intended to cover every aspect of English that you can t deal with correctly (that would be a very long report). Rather, you should focus your report on constructions that are very similar to those you have been asked to deal with. For example, aspects of NPs and VPs would be reasonable to mention. 3.6 Implementation details In order for the grader to be able to semi-automatically test your work, each file must have the exact name and format specified in the following subsections. Some overall specifications for your files: The first line of each file must be a comment with your name, login ID, and student ID. Each grammar rule, lexical entry, or sentence must appear on a separate line in its appropriate file. The symbol % at the beginning of a line in your grammar, lexicon, and sentence files should indicate a comment. You should organize and comment your grammar and lexicon, just as you would your code, to make it easily understandable. 8

3.6.1 Grammar Name of the file: Grammar An example: % Your name, login ID and student ID go here. S -> NP VP NP -> NPrp NP -> Det N % NLTK allows abbreviations like the following VP -> V V NP VP -> V NP PP PP -> P NP The start symbol for the grammar must be S. Please use reasonable names for the other grammar symbols (nonterminals and parts-of-speech), following the conventions below. You might need symbols that aren t listed here devise reasonable names for them based on your reading from the textbook or the terms we use in class. Nouns: NP, N, NPrp (proper noun), NPro (pronoun), NDem (demonstrative pronoun) Verbs: VP, V, Aux, Modal Adjectives: AdjP, Adj Prepositions: PP, P Other: Det (article or determiner), Dem (demonstrative determiner), Adv (adverb) It is particularly important that your three grammars be compatible with each other, in the sense that non-terminals refer to the same kind of word or phrase across all three grammars. This is because they will reside in this one file, and when we test your grammar, we will load all of this file at once. 3.6.2 Lexicon Name of the file: Lexicon An example: % Your name, login ID and student ID go here. Det -> a an N -> elephant rutabaga autopoiesis shot NPro -> i NPrp -> Nadia Marseilles Google V -> won smiled demanded shot Note: If a word has n possible parts-of-speech (lexical category designations), then it is listed n times in the lexicon, once for each part-of-speech, as with shot above. 9

3.6.3 Test sentences Name of the file: Positive An example: % Your name, login ID and student ID go here. Nadia won an elephant I could have demanded a rutabaga autopoiesis always reminded her of Marseilles Note: The sentences should not contain any punctuation marks. Note that proper names are capitalized, but the first word of a sentence is not (unless it is a proper name). In the same format, include files Negative, with negative examples that your grammar correctly excludes, Overgen, with ungrammatical sentences that your grammar incorrectly includes, and Undergen, with grammatical sentences that your grammar incorrectly excludes. Ungrammatical examples should not have asterisks the file names make it clear whether they are ungrammatical. 3.6.4 Output parse trees Your output parse trees should be in the pretty-print format already provided by NLTK. If a sentence is not accepted by the grammar, the file should contain the line No parses. All your output should be directed to a file name named ParseTrees. The file should have your name, ID, and student number as the first line. It should then contain each sentence and the parses, as shown below: Name of the output file: ParseTrees An example output file: % mygivenname myfamilyname, myloginid, 999999999 I saw Nadia on a boat with my elephant (S (NP I) (VP (VP (VP (V saw) (NP Nadia)) (PP (P on) (NP (Det a) (N boat)))) (PP (P with) (NP (Det my) (N elephant))))) (S (NP I) (VP (VP (V saw) (NP Nadia)) (PP (P on) (NP (NP (Det a) (N boat)) (PP (P with) (NP (Det my) (N elephant))))))) (S (NP I) 10

(VP (VP (V saw) (NP (NP Nadia) (PP (P on) (NP (Det a) (N boat))))) (PP (P with) (NP (Det my) (N elephant))))) (S (NP I) (VP (V saw) (NP (NP (NP Nadia) (PP (P on) (NP (Det a) (N boat)))) (PP (P with) (NP (Det my) (N elephant)))))) (S (NP I) (VP (V saw) (NP (NP Nadia) (PP (P on) (NP (NP (Det a) (N boat)) (PP (P with) (NP (Det my) (N elephant)))))))) 3.7 What to submit 3.7.1 On paper Staple together (no paperclips or folded corners, please) the following items in the following order: The project cover sheet (attached to the back of this handout). A written report describing the limitations of your grammar (see section 3.5 above) and your testing strategy in particular, cite specific instances of overgeneration and undergeneration. This report should be no more than one page. 3.7.2 Electronically In addition to your paper submission, you must submit your grammar and your output electronically. Please include: All the input files (Grammar, Lexicon, Positive, Negative, Overgen, Undergen) that you used to test and evaluate your parser. Your output file, ParseTrees Note: You do not need to submit the actual code that you use to run your tests. Submit all required files using the submit command on teach.cs: % submit -c <course> -a A1 <filename-1>...<filename-n> 11

where <course> is either csc485h or csc2501h, and <filename-1> to <filename-n> are the n files you are submitting. Make sure every file you turn in contains a comment at the top that gives your name, your login ID on teach.cs, and your student ID number. Grading scheme We will test your grammar on the examples in this handout as well as on some held-out test sentences (i.e., sentences that you haven t seen). We will test your grammars on your sentences using a computer. It is imperative that you follow the formatting requirements provided in this handout. Remarks will not be granted to work that did not adhere to these requirements. Grammar 1: simple sentences 2 marks Grammar 2: auxiliaries and modals 2 marks Grammar 3: subcategorization 2 marks Testing (including output parses) by us and by you: meeting specifications; tests of overgeneration and undergeneration 20 marks Your report: description of your grammars limitations and your testing strategy 4 marks Total 30 marks 12

CSC 2501 / 485, Fall 2017: Assignment 1 Family name: Given name: Staple to assignment this side up

CSC 2501 / 485, Fall 2017: Assignment 1 Family name: Given name: Student #: I declare that this assignment, both my paper and electronic submissions, is my own work, and is in accordance with the University of Toronto Code of Behaviour on Academic Matters and the Code of Student Conduct. Signature: Grade: Q1 / 10 Q2 / 6 Q4 / 30 TOTAL / 46