UNIVERSITY OF EDINBURGH
COLLEGE OF SCIENCE AND ENGINEERING
SCHOOL OF INFORMATICS

INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

Wednesday 12th August 2015, 09:30 to 11:30

INSTRUCTIONS TO CANDIDATES

1. Answer all five questions in Part A, and two out of three questions in Part B. Each question in Part A is worth 10% of the total exam mark; each question in Part B is worth 25%.
2. Use a single script book for all questions.
3. Calculators may be used in this exam.

Convener: D. K. Arvind
External Examiner: C. Johnson

THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

PART A

Answer ALL questions in Part A.

1. (a) Three stages in the language processing pipeline, for a statically-typed programming language such as Haskell, are lexing, parsing, and type-checking. Briefly describe what the input to and output from each stage is. Illustrate your answers with tiny but well-chosen examples. [9 marks]

   (b) Name one other stage in the language processing pipeline of a statically-typed programming language such as Haskell. [1 mark]

2. Suppose M1 and M2 are deterministic finite automata (DFAs), over the alphabet Σ = {a, b}, recognising the languages L1 and L2 respectively.

   (a) Using M1 and M2, describe how to construct a DFA that recognises the language L1 ∩ (Σ* − L2) of all strings that are in L1 but not in L2. [4 marks]

   (b) Illustrate your construction from part (a) in the case that M1 and M2 are the two DFAs below. [4 marks]

   (c) Suppose L ⊆ Σ* is a language generated by a context-free grammar (CFG) with set of terminals Σ. Is it possible, in general, to find a CFG that generates the language Σ* − L? Briefly justify your answer. [2 marks]

3. Consider a pushdown automaton (PDA) with two control states Q = {q1, q2}, start state q1, input alphabet Σ = {a, b}, stack alphabet Γ = {⊥, a} (where ⊥ is the start symbol), and transition relation:

       q1 --(a, ⊥ : a)--> q1
       q1 --(a, a : aa)--> q1
       q1 --(b, a : a)--> q2
       q2 --(a, a : ε)--> q2
       q2 --(ε, a : ε)--> q2

   The automaton accepts on empty stack. (In the above description, we use the general notation q --(c, x : α)--> q′ to mean that when the automaton is in control state q ∈ Q and x ∈ Γ is popped from the top of the stack, the input symbol or empty string c ∈ Σ ∪ {ε} can be read to reach control state q′ ∈ Q with α ∈ Γ* pushed onto the stack.)

   (a) Describe in detail an execution of the above PDA that accepts the string aaaba. [8 marks]

   (b) Give a concise mathematical definition of the language L recognised by the PDA above. [2 marks]

4. (a) State Zipf's law for the frequency of word occurrences within a typical corpus of natural language text.

   (b) Suppose that in a corpus that conforms closely to Zipf's law (even for the most common word types), the four most common word types account for 2500 of the tokens. Estimate how many tokens are covered, in total, by the 10 word types with frequency rank in the range 21-30 inclusive. An approximate calculation is acceptable.

   (c) The Viterbi tagging algorithm requires a matrix of transition probabilities and another matrix of emission probabilities. Explain briefly how such probabilities may be extracted from a corpus of tagged text. [2 marks]

   (d) Suppose that we have extracted probabilities from a given corpus as in part (c), and now run the Viterbi algorithm on the text of that same corpus. Would you expect a higher proportional accuracy rate (i.e. proportion of correct taggings) for occurrences of the 10 most frequent word types, or the 10 least frequent ones? Briefly justify your answer. [2 marks]

5. Consider the following context-free grammar with start symbol S:

       S → NP VP
       NP → NP NP
       VP → V NP
       NP → boat | man | watches | fish
       V → man | watches | fish

   Construct a CYK parse chart for the sentence

       boat man watches fish

   Beside each entry in the chart, write the number of possible ways of parsing the corresponding word sequence that yield the non-terminal in question. For example, if the segment "man watches fish" can be analysed as NP via three different parse trees, you should write NP 3 in the appropriate cell. If a cell contains entries for several different non-terminals, each of these should be annotated separately with the corresponding number of parse trees. You should take account of all possible analyses of segments, whether or not they contribute to a parse of the complete sentence. [10 marks]
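The empty-stack acceptance condition in Question 3 is easy to check mechanically. What follows is a minimal sketch, in Python, of a nondeterministic simulator for the PDA above; it is not part of any required answer, the names TRANS and accepts are this sketch's own, and the transition table is the one printed above.

    # Transitions: (state, input symbol or '' for epsilon, popped symbol)
    #   -> list of (next state, string pushed, top-of-stack first).
    TRANS = {
        ('q1', 'a', '⊥'): [('q1', 'a')],   # first a: replace ⊥ by a
        ('q1', 'a', 'a'): [('q1', 'aa')],  # each later a: push one more a
        ('q1', 'b', 'a'): [('q2', 'a')],   # read b, keep the stack, move to q2
        ('q2', 'a', 'a'): [('q2', '')],    # each a read in q2 pops one a
        ('q2', '',  'a'): [('q2', '')],    # silently discard leftover a's
    }

    def accepts(s: str) -> bool:
        # Depth-first search over configurations (state, rest of input, stack).
        # The stack is a string whose first character is the top; acceptance
        # is on empty stack once the whole input has been consumed.
        seen = set()

        def run(state, rest, stack):
            if (state, rest, stack) in seen:
                return False
            seen.add((state, rest, stack))
            if not rest and not stack:
                return True          # input consumed, stack empty: accept
            if not stack:
                return False         # stack empty but input remains: stuck
            top, below = stack[0], stack[1:]
            for q, push in TRANS.get((state, '', top), []):   # epsilon moves
                if run(q, rest, push + below):
                    return True
            if rest:                                          # consuming moves
                for q, push in TRANS.get((state, rest[0], top), []):
                    if run(q, rest[1:], push + below):
                        return True
            return False

        return run('q1', s, '⊥')

    assert accepts('aaaba')        # the string from part (a)
    assert not accepts('abaa')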
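For Question 5, the chart annotation follows a simple recurrence: the count for a nonterminal A over a span is the sum, over split points and binary rules A → B C, of the product of the counts for B and C on the two halves. The sketch below implements that recurrence in Python for the grammar above; it is only an aid for checking a hand-built chart, and the names LEXICAL and BINARY are this sketch's own.

    # CYK with parse-tree counts for the grammar of Question 5
    # (already in Chomsky normal form).
    LEXICAL = {
        'boat': ['NP'],
        'man': ['NP', 'V'],
        'watches': ['NP', 'V'],
        'fish': ['NP', 'V'],
    }
    BINARY = [('S', 'NP', 'VP'), ('NP', 'NP', 'NP'), ('VP', 'V', 'NP')]

    def cyk_counts(words):
        n = len(words)
        # chart[(i, j)] maps a nonterminal to its number of parse trees
        # over words[i:j].
        chart = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
        for i, w in enumerate(words):
            for cat in LEXICAL[w]:
                chart[(i, i + 1)][cat] = chart[(i, i + 1)].get(cat, 0) + 1
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):                # split point
                    for lhs, b, c in BINARY:
                        trees = (chart[(i, k)].get(b, 0)
                                 * chart[(k, j)].get(c, 0))
                        if trees:
                            chart[(i, j)][lhs] = chart[(i, j)].get(lhs, 0) + trees
        return chart

    for span, cell in sorted(cyk_counts('boat man watches fish'.split()).items()):
        if cell:
            print(span, cell)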

PART B

Answer TWO questions in Part B.

6. The context-free grammar below has a single nonterminal S, and terminals

       a b ( )

   The productions are:

       S → a | S b | ( S S )

   (a) This grammar is not LL(1). Justify this assertion. [1 mark]

   (b) Write out an LL(1) grammar that generates the same language as the grammar above. [5 marks]

   (c) Write out the first set of every nonterminal in your LL(1) grammar.

   (d) Write out the follow set of every nonterminal in your LL(1) grammar.

   (e) Write out the parse table for your LL(1) grammar. [6 marks]

   (f) Describe the step-by-step execution of the LL(1) predictive parsing algorithm, when parsing the expression ( a a ) using your parse table from (e) above. [7 marks]
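For parts (c) and (d), first sets can be checked mechanically: they are the least solution of a fixed-point system, grown by repeatedly absorbing the first set of each production's prefix until nothing changes (follow sets are computed by an analogous iteration). Here is a minimal Python sketch of the first-set iteration. The grammar encoded below is just one possible answer to part (b), obtained by removing the left recursion with a new nonterminal T; the nonterminal T and the marker 'eps' for the empty string are this sketch's own assumptions.

    # One candidate LL(1) grammar for part (b):
    #   S -> a T | ( S S ) T        T -> b T | eps
    GRAMMAR = {
        'S': [['a', 'T'], ['(', 'S', 'S', ')', 'T']],
        'T': [['b', 'T'], []],
    }
    TERMINALS = {'a', 'b', '(', ')'}

    def first_sets(grammar):
        first = {nt: set() for nt in grammar}
        changed = True
        while changed:
            changed = False
            for nt, prods in grammar.items():
                for prod in prods:
                    nullable = True   # does the prefix seen so far derive eps?
                    for sym in prod:
                        add = {sym} if sym in TERMINALS else first[sym] - {'eps'}
                        if not add <= first[nt]:
                            first[nt] |= add
                            changed = True
                        if sym in TERMINALS or 'eps' not in first[sym]:
                            nullable = False
                            break
                    if nullable and 'eps' not in first[nt]:
                        first[nt].add('eps')
                        changed = True
        return first

    print(first_sets(GRAMMAR))   # {'S': {'a', '('}, 'T': {'b', 'eps'}}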

7. In English morphology, a simple version of the Y-replacement rule says that the letter y is replaced by ie if it occurs at the end of a stem, is immediately preceded by any letter except a, e, i, o, u, and is immediately followed by the suffix s.

   (a) Design a finite state transducer that applies the Y-replacement rule to translate from an intermediate (i.e. segmented) word form to a surface form, e.g.

       horse^s#   →  horses#
       pony^s#    →  ponies#
       donkey^s#  →  donkeys#
       ass#       →  ass#

   Your transducer should have input alphabet {a, ..., z, ^, #}, where ^ denotes a morpheme boundary and # denotes a word boundary, and output alphabet {a, ..., z, #}. The replacement rule should be applied only when the y in question is immediately followed by ^s#. In contrast to the definition of finite state transducers given in lectures, it is permissible for your transducer to output a sequence of several characters on a single transition. You may use abbreviating conventions in your presentation of the transducer, but should include an explanation of these in your answer. [12 marks]

   (b) Now suppose that your transducer were applied in reverse in order to split surface forms into their constituent morphemes. For each of the three words in the sentence

       he# plies# trades#

   list all possible intermediate forms that the transducer may output. [4 marks]
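Stated as string rewriting rather than as a transducer, the Y-replacement rule is a one-line substitution, which is handy for generating test pairs for part (a). A minimal Python sketch follows, assuming the ^ and # conventions above; it would not itself answer the question, since part (a) asks for a transducer.

    import re

    def y_replacement(intermediate: str) -> str:
        # y -> ie when stem-final (i.e. directly before '^s#') and
        # preceded by any letter except a, e, i, o, u.
        surface = re.sub(r'(?<=[b-df-hj-np-tv-z])y\^s#', 'ies#', intermediate)
        # elsewhere, the morpheme boundary is simply deleted
        return surface.replace('^', '')

    for w in ['horse^s#', 'pony^s#', 'donkey^s#', 'ass#']:
        print(w, '->', y_replacement(w))   # reproduces the pairs above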

   (c) Suppose that our lexicon contains the following entries for the POS categories PRN, N, V:

       PRN: he
       N:   plie, trade
       V:   hear, ply, trade

   (The example is somewhat artificial: plie enters the lexicon as the noun plié (denoting a position in ballet) with the accent stripped off.)

   Suppose now that the outputs from part (b) are passed through a further (nondeterministic) transducer that replaces each word stem by its possible POS categories, and furthermore applies the transformations

       N ^s → NPL
       V ^s → V3S

   Thus, for example, this second transducer will convert the intermediate form hear to V, and hear^s to V3S; in these cases these are the unique possible outputs. For each of the three words

       he# plies# trades#

   list all possible POS tags that may result once both the above phases have been applied. [3 marks]

   (d) We may now use a version of the Viterbi algorithm to determine the most probable sequence of tags for the above phrase, given the set of possibilities established by part (c). We here use a simplified form of the algorithm that is concerned purely with tags, not with words, so that no emission probabilities are involved. Determine the most probable tag sequence for "he plies trades" that is consistent with your solution to (c), using the following transition probability matrix. Show your working, and include backtrace pointers in your Viterbi matrix. [6 marks]

                    to PRN   to N   to NPL   to V   to V3S
       from start     0.4     0.2     0.3     0.1     0.0
       from PRN       0.1     0.1     0.1     0.3     0.4
       from N         0.1     0.2     0.1     0.1     0.5
       from NPL       0.1     0.1     0.2     0.5     0.1
       from V         0.2     0.2     0.5     0.1     0.0
       from V3S       0.2     0.2     0.5     0.1     0.0
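The simplified Viterbi computation of part (d) keeps, for each position and each candidate tag, the probability of the best tag path ending in that tag, together with a backpointer to the best predecessor. The Python sketch below hard-codes the matrix above; the candidate tag sets supplied at the bottom are placeholders standing in for an answer to part (c).

    TAGS = ['PRN', 'N', 'NPL', 'V', 'V3S']
    P = {   # P[prev][next]: transition probabilities from the table above
        'start': {'PRN': .4, 'N': .2, 'NPL': .3, 'V': .1, 'V3S': .0},
        'PRN':   {'PRN': .1, 'N': .1, 'NPL': .1, 'V': .3, 'V3S': .4},
        'N':     {'PRN': .1, 'N': .2, 'NPL': .1, 'V': .1, 'V3S': .5},
        'NPL':   {'PRN': .1, 'N': .1, 'NPL': .2, 'V': .5, 'V3S': .1},
        'V':     {'PRN': .2, 'N': .2, 'NPL': .5, 'V': .1, 'V3S': .0},
        'V3S':   {'PRN': .2, 'N': .2, 'NPL': .5, 'V': .1, 'V3S': .0},
    }

    def viterbi(allowed):
        # allowed: one set of candidate tags per word position
        cols = [{t: (P['start'][t], None) for t in allowed[0]}]
        for tags in allowed[1:]:
            prev, col = cols[-1], {}
            for t in tags:
                best = max(prev, key=lambda p: prev[p][0] * P[p][t])
                col[t] = (prev[best][0] * P[best][t], best)
            cols.append(col)
        # follow backpointers from the best final tag
        tag = max(cols[-1], key=lambda t: cols[-1][t][0])
        path = [tag]
        for col in reversed(cols[1:]):
            path.insert(0, col[path[0]][1])
        return path

    # placeholder tag sets for he / plies / trades; substitute your (c) answer
    print(viterbi([{'PRN'}, {'NPL', 'V3S'}, {'NPL', 'V3S'}]))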

8. This question concerns the semantics of a language for describing family relationships. Consider the following context-free grammar with start symbol S.

       S → Name is NP
       NP → Name
       NP → NP 's Rel
       Rel → Rel1
       Rel1 → father | daughter | brother | grandchild
       Name → Alice | Brian | Calum | Diana

   The reason for the two non-terminals Rel and Rel1 will emerge in part (d) below.

   This grammar allows us to generate sentences such as "Alice is Brian's daughter". We shall here consider this to be saying only that Alice is a daughter of Brian's, with no implication that she is the only daughter. Likewise, the sentence "Calum is Diana's brother's grandchild" leaves open the possibility that Diana has several brothers, each of whom has several grandchildren.

   For the purpose of our semantics, we shall use two base types: the usual type t of truth values, and a type p of people; for the latter, we have the logical constant symbols Alice, Brian, Calum, Diana. Our logical language will also include unary relations Male, Female and binary relations Father, Mother on the type p: for example, we may write Male(x) to mean that x is male, and Mother(y, z) to mean that y is the mother of z. We also have the equality predicate = on the type p.

   We wish to give a semantics that associates a term of type t with a complete sentence (i.e., a phrase of category S). Phrases of category NP should receive interpretations of type <p, t>, since, for example, "Brian's daughter" denotes not a particular person, but rather a property of people (several people might fit the description). Phrases of category Rel and Rel1 should receive interpretations of type <p, <p, t>>, since these denote relationships between two people.

   (a) Provide semantic attachments for each of the rules of the above grammar so as to assign the intended meaning to each phrase in a compositional way. You may use the usual notations of first order logic and λ-calculus for this purpose. The following two examples are given to get you started (you need not copy these out):

       NP → Name        {λx. x = Name.Sem}
       Rel1 → daughter  {λxy. Female(x) & (Father(y, x) ∨ Mother(y, x))}

   [12 marks]

   (b) Draw the parse tree for the sentence "Brian is Alice's father", allowing plenty of room for annotations. Then annotate each node with the interpretation given by your semantics from part (a). You should not perform any β-reductions at this stage. [7 marks]
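The typing discipline in part (a) can be sanity-checked by mimicking the λ-terms with ordinary functions: an NP denotes a function of type p → t, and a Rel a function of type p → (p → t). A minimal Python sketch over a tiny hand-coded model follows; the model facts (PEOPLE, FEMALE, FATHER, MOTHER) are this sketch's own assumptions, not data given in the question.

    PEOPLE = {'Alice', 'Brian', 'Calum', 'Diana'}
    FEMALE = {'Alice', 'Diana'}
    FATHER = {('Brian', 'Alice')}    # (x, y): x is the father of y
    MOTHER = set()

    def daughter(x):                 # Rel1 -> daughter : <p, <p, t>>
        return lambda y: x in FEMALE and ((y, x) in FATHER or (y, x) in MOTHER)

    def np_name(name):               # NP -> Name : <p, t>
        return lambda x: x == name

    def np_poss(np, rel):            # NP -> NP 's Rel : <p, t>
        # x fits "NP's Rel" iff x stands in Rel to some y fitting NP
        return lambda x: any(np(y) and rel(x)(y) for y in PEOPLE)

    def s_is(name, np):              # S -> Name is NP : t
        return np(name)

    # "Alice is Brian's daughter" evaluates to True in this model
    print(s_is('Alice', np_poss(np_name('Brian'), daughter)))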

   (c) Show explicitly how the λ-expression associated with the S node in part (b) reduces to a normal form via a sequence of β-reduction steps.

   (d) Now suppose that we augment our grammar with the new rule

       Rel → only Rel1

   Given any binary relation R(x, y), write a logical formula that defines a new relation R′(x, y), which holds exactly when x is the only element w such that R(w, y) holds. Use this idea to provide an appropriate semantic attachment for the above clause.