Course Roadmap. Informatics 2A: Lecture 2. Mary Cryan, Shay Cohen

Course Roadmap
Informatics 2A: Lecture 2
Mary Cryan, Shay Cohen
School of Informatics, University of Edinburgh
mcryan@inf.ed.ac.uk, scohen@inf.ed.ac.uk
19 September 2018

What is Inf2A about?
- Formal and natural languages
- The language processing pipeline
- Comparison between FLs and NLs
- Course overview
- Levels of language complexity
- Formal language component
- Natural language component

Formal and natural languages
This course is about methods for describing, specifying and processing languages of various kinds:
- Formal (computer) languages, e.g. Java, Haskell, HTML, SQL, PostScript, ...
- Natural (human) languages, e.g. English, Greek, Japanese.
- Languages that represent the possible legal behaviours of some machine or system. E.g. for a vending machine, the following sequence might be legal: insert50p . pressbutton1 . delivermarsbar
- Languages that represent the legal sequences of moves in a game, e.g. chess.
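The vending machine example can be made concrete with a small finite state machine. The following is only a sketch: the three action names come from the slide, but the state names (idle, paid, dispensing) and the choice of accepting state are assumptions made for illustration.

```python
# Hypothetical transition table for the vending machine example.
# The state names are illustrative assumptions; only the three
# action names appear in the lecture.
TRANSITIONS = {
    ("idle", "insert50p"): "paid",
    ("paid", "pressbutton1"): "dispensing",
    ("dispensing", "delivermarsbar"): "idle",
}


def is_legal(actions, start="idle", accepting=("idle",)):
    """Return True if the action sequence is a legal behaviour of the machine."""
    state = start
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            return False  # no transition defined, so the sequence is illegal
        state = TRANSITIONS[key]
    return state in accepting


print(is_legal(["insert50p", "pressbutton1", "delivermarsbar"]))  # True
print(is_legal(["pressbutton1", "insert50p"]))  # False
```

The set of legal action sequences is then exactly the language accepted by this machine, which is the sense in which a system's behaviours form a language.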

A common theoretical core
We'll be focusing on certain theoretical concepts that can be applied to each of the above domains:
- regular languages
- finite state machines
- context-free languages, syntax trees
- types, compositional semantics
The fact that the same underlying theory can be applied in such diverse contexts suggests that the theory is somehow fundamental, and worth learning about!
Mostly, we'll be looking at various aspects of formal languages (mainly MC) and natural languages (mainly SC). As we'll see, there are some important similarities between formal and natural languages, and some important differences.

Syntax trees: a central concept
In both FLs and NLs, phrases have structure that can be represented via syntax trees.
[Figure: two syntax trees. One is for the command x2 = -x1, with root Com and nodes Var, Assg and Expr. The other is for the sentence "The sun shone", with root S, an NP (Det N) and a VP (V).]
Determining the structure of a phrase is an important first step towards doing other things with it. Much of this course will be about describing and computing syntax trees for phrases of some given language.

The language processing pipeline (FL version)
Think about the phases in which a Java program is processed:
Raw source text (e.g. x2=-x1)
  ↓ lexing
Stream of tokens (e.g. x2, =, -, x1)
  ↓ parsing
Syntax tree (as on previous slide)
  ↓ typechecking etc.
Annotated syntax tree
  ↓ compiling
Java bytecode
  ↓ linking
JVM state
  ↓ running
Program behaviour
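As an illustration of the first phase, here is a minimal lexer sketch that turns the raw text x2=-x1 into a stream of tokens. The token classes (VAR, NUM, OP) and the regular expressions are assumptions chosen for this toy example; lexing real Java is considerably richer.

```python
import re

# Assumed token classes for a toy language, not the Java lexical grammar.
TOKEN_SPEC = [
    ("VAR", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUM", r"[0-9]+"),
    ("OP", r"[=+\-*/]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def lex(source):
    """Split source text into (token_class, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(lex("x2=-x1"))
# [('VAR', 'x2'), ('OP', '='), ('OP', '-'), ('VAR', 'x1')]
```

Each token class is described by a regular expression, which is one reason regular languages matter for this stage of the pipeline.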

Language processing for programming languages
In the case of programming languages, the pipeline typically works in a very pure way: each phase depends only on the output from the previous phase.
In this course, we'll be concentrating mainly on the first half of this pipeline: lexing, parsing, typechecking (especially parsing).
We'll be looking both at the theoretical concepts involved (e.g. what is a syntax tree?) and at algorithms for the various phases (e.g. how do we construct the syntax tree for a given program?).
We won't say much about techniques for compilation etc. However, we'll briefly touch on how the intended runtime behaviour of programs (i.e. their semantics) may be specified.

Language processing for natural languages
We'll look at fundamental parts of the NL processing pipeline. Our main focus is on how to get computers to perform these tasks, for applications such as:
- machine translation (e.g. Google Translate)
- speech recognition and dialogue systems (e.g. Siri, Google Voice)
- question answering (e.g. IBM Watson)
- text summarization and simplification
- speech synthesis
But there'll also be a couple of lectures on scientific studies of how we as humans perform them.

The language processing pipeline (NL version)
A broadly similar pipeline may be considered, e.g. for spoken English:
Raw soundwaves
  ↓ phonetics
Phones (e.g. [pʰ] in pot, [p] in spot)
  ↓ phonology
Phonemes (e.g. /p/, /b/)
  ↓ segmentation, tagging
Words, morphemes
  ↓ parsing
Parse tree
  ↓ agreement checking etc.
Annotated parse tree
  ↓ semantics
Logical form or meaning

Comparison between FLs and NLs
There are close relationships between these two pipelines. However, there are also important differences:
- FLs can be pinned down by a precise definition. NLs are fluid, fuzzy at the edges, and constantly evolving.
- NLs are riddled with ambiguity at all levels. This is normally avoidable in FLs.
- For FLs the pipeline is typically pure. In NLs, information from later stages is sometimes used to resolve ambiguities at earlier stages, e.g. "Time flies like an arrow." vs. "Fruit flies like a banana."

Kinds of ambiguity in NL
- Phonological ambiguity: e.g. "an ice lolly" vs. "a nice lolly".
- Lexical ambiguity: e.g. "fast" has many senses (as noun, verb, adjective, adverb).
- Syntactic ambiguity: e.g. two possible syntax trees for "complaints about referees multiplying".
- Semantic ambiguity: e.g. "Please use all available doors when boarding the train."
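Syntactic ambiguity can be made measurable by counting how many syntax trees a grammar assigns to a sentence. The sketch below uses a CKY-style chart to count trees for "time flies like an arrow". The grammar and lexicon are toy assumptions written for this one sentence (in particular, NP is listed directly as a category of "time" to keep all rules binary), so they are not a serious grammar of English.

```python
from collections import defaultdict

# Toy lexicon and binary rules; both are illustrative assumptions.
LEXICON = {
    "time": {"N", "NP"},
    "flies": {"N", "V"},
    "like": {"V", "P"},
    "an": {"Det"},
    "arrow": {"N"},
}
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(words, root="S"):
    """Count the syntax trees rooted in `root` that span the whole sentence."""
    n = len(words)
    # chart[start, end][category] = number of trees for words[start:end]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, ()):
            chart[i, i + 1][category] += 1
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            for split in range(start + 1, end):
                for parent, left, right in RULES:
                    count = chart[start, split][left] * chart[split, end][right]
                    if count:
                        chart[start, end][parent] += count
    return chart[0, n][root]


print(count_parses("time flies like an arrow".split()))  # 2
```

The two trees correspond to reading "flies" as the verb (time passes quickly) and reading "time flies" as a noun compound with "like" as the verb (a kind of fly is fond of an arrow).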

More on the NL pipeline
In the case of natural languages, one could in principle think of the pipeline...
- either as a model for how an artificial speech processing system might be structured,
- or as a proposed (crude) model for what naturally goes on in human minds.
In this course, we mostly emphasize the former perspective.
Also, in the NL setting, it's equally sensible to think of running the pipeline backwards: starting with a logical form or meaning and generating a speech utterance to express it. But we won't say much about this in this course.

Levels of language complexity
Some languages / language features are more complex (harder to describe, harder to process) than others. In fact, we can classify languages on a scale of complexity (the Chomsky hierarchy):
- Regular languages: those whose phrases can be recognized by a finite state machine (cf. Informatics 1).
- Context-free languages. The basic structure of most programming languages, and many aspects of natural languages, can be described at this level.
- Context-sensitive languages. Some NLs involve features of this level of complexity.
- Recursively enumerable languages: all languages that can in principle be defined via mechanical rules.
Roughly speaking, we'll start with regular languages and work our way up the hierarchy. Context-free languages get most attention.

The Chomsky hierarchy (picture)
[Figure: nested regions, from innermost to outermost: regular, context-free, context-sensitive, recursively enumerable.]

Formal language component: overview
Regular languages:
- Definition using finite state machines (as in Inf1A).
- Equivalence of deterministic FSMs, non-deterministic FSMs, regular expressions.
- Applications: pattern matching, lexing, morphology.
- The pumping lemma: proving a given language isn't regular.
Context-free languages:
- Context-free grammars, syntax trees.
- The corresponding machines: pushdown automata.
- Parsing: constructing the syntax tree for a given phrase.
- A parsing algorithm for LL(1) languages, in detail.
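To give a flavour of LL(1) parsing, here is a small recursive-descent sketch for the toy grammar S -> a S b | epsilon, which generates the strings a^n b^n (a standard example of a language that is context-free but not regular). Both the grammar and the recursive-descent style are assumptions made for illustration; the algorithm treated in detail in the course may be formulated differently, e.g. with an explicit parse table and stack.

```python
def parse_anbn(text):
    """Build a syntax tree for a string of the form a^n b^n.

    Toy grammar (an illustrative assumption):  S -> a S b | epsilon
    Raises SyntaxError if the input is not in the language.
    """
    pos = 0

    def peek():
        # One symbol of lookahead, or None at the end of the input.
        return text[pos] if pos < len(text) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r} at position {pos}")
        pos += 1

    def parse_s():
        # The lookahead symbol alone decides which production to apply.
        if peek() == "a":
            expect("a")
            inner = parse_s()
            expect("b")
            return ("S", "a", inner, "b")
        return ("S",)  # epsilon production

    tree = parse_s()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree


print(parse_anbn("aabb"))
# ('S', 'a', ('S', 'a', ('S',), 'b'), 'b')
```

The parser never backtracks: a single symbol of lookahead is enough to choose between the two productions, which is the defining feature of an LL(1) grammar.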

Formal language component: overview (continued)
After a break to cover some NL material, we'll glance briefly at some concepts from further down the pipeline: e.g. typechecking and semantics for programming languages.
Then we continue up the Chomsky hierarchy:
- Context-sensitive languages: definition, examples. Relationship to linear bounded automata.
- Recursively enumerable languages: Turing machines; theoretical limits of what's computable in principle. Undecidable problems.

Natural language component: overview
Some specific topics:
- Complexity of human languages: e.g. whereabouts do human languages sit in the Chomsky hierarchy?
- Parsing algorithms: because NLs differ from FLs in various ways, it turns out that different kinds of parsing algorithms are suitable.
- Probabilistic versions of FL concepts: in NL, because of ambiguity, we're typically looking for the most likely way of analysing a phrase. For this purpose, probabilistic analogues of e.g. finite state machines or context-free grammars are useful.
- Use of text corpora: rather than building in all the relevant knowledge of the language by hand, we sometimes get an NLP system to learn it for itself from some large sample of pre-existing text.
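The last two topics can be illustrated together: instead of hand-coding which part of speech a word has, a system can estimate tag probabilities from tagged text and pick the most likely one. The sketch below is deliberately simplistic. The tagged sample is made up for illustration and stands in for a real corpus, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# A made-up, hand-tagged toy sample standing in for a real corpus.
TAGGED_SAMPLE = [
    ("time", "N"), ("flies", "V"), ("like", "P"), ("an", "Det"), ("arrow", "N"),
    ("fruit", "N"), ("flies", "N"), ("like", "V"), ("a", "Det"), ("banana", "N"),
    ("time", "N"), ("flies", "V"),
]


def train(tagged_words):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts


def tag_probabilities(counts, word):
    """Relative-frequency estimate of P(tag | word); empty for unseen words."""
    tag_counts = counts.get(word, Counter())
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def most_likely_tag(counts, word):
    """Return the most frequent tag for the word, or None if it was never seen."""
    tag_counts = counts.get(word)
    if not tag_counts:
        return None
    return tag_counts.most_common(1)[0][0]


counts = train(TAGGED_SAMPLE)
print(most_likely_tag(counts, "flies"))  # V
print(most_likely_tag(counts, "pear"))   # None
```

Choosing a tag for each word in isolation ignores context, which is exactly why probabilistic analogues of finite state machines and grammars, which take neighbouring words into account, are needed in practice.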

Natural language semantics
Consider the sentence: Every student has access to a computer.
The meaning of this can be expressed by a logical formula:
$\forall x.\, (\mathrm{student}(x) \rightarrow \exists y.\, (\mathrm{computer}(y) \wedge \mathrm{hasaccessto}(x, y)))$
Or perhaps:
$\exists y.\, (\mathrm{computer}(y) \wedge \forall x.\, (\mathrm{student}(x) \rightarrow \mathrm{hasaccessto}(x, y)))$
Problem: how can (either of) these formulae be mechanically generated from a syntax tree for the original sentence? This is what semantics is all about.
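The two readings of the sentence really do differ, which can be checked by evaluating them in a small model. In the sketch below, the individuals (alice, bob, pc1, pc2) and the access relation are invented for illustration, and the function names are hypothetical. The first function corresponds to the reading where each student has access to some computer, the second to the reading where one particular computer is accessible to every student.

```python
# A tiny invented model: two students, two computers, and who can use what.
STUDENTS = {"alice", "bob"}
COMPUTERS = {"pc1", "pc2"}
HAS_ACCESS = {("alice", "pc1"), ("bob", "pc2")}


def each_has_some(students, computers, access):
    # Reading 1: for every student x there exists a computer y
    # such that x has access to y.
    return all(any((x, y) in access for y in computers) for x in students)


def one_shared_by_all(students, computers, access):
    # Reading 2: there exists a computer y such that every student x
    # has access to y.
    return any(all((x, y) in access for x in students) for y in computers)


print(each_has_some(STUDENTS, COMPUTERS, HAS_ACCESS))      # True
print(one_shared_by_all(STUDENTS, COMPUTERS, HAS_ACCESS))  # False
```

In this model the first reading is true and the second is false, so the sentence is genuinely ambiguous between two distinct meanings.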

The Python programming language
- Invented by Guido van Rossum.
- Object-oriented programming language (like Java): has classes and objects.
- Dynamic typing (unlike Java). More flexibility but more chance of run-time errors.
- Clear and powerful syntax: very succinct (unlike Java).
- Especially convenient for string processing.
- Typically driven interactively via a console session (like Haskell).
- Interfaces to many system calls, libraries, window systems, and other programming languages.

Natural language processing with Python
NLTK: Natural Language Toolkit
- Developed by Steven Bird, Ewan Klein and Edward Loper.
- Mainly addresses education and research.
- The book is online: http://www.nltk.org
The NLTK provides support for many parts of the NL processing pipeline, e.g.
- Part-of-speech tagging
- Parsing
- Meaning extraction (semantics)
Lab sessions will introduce you to both Python and NLTK.
In Assignment 2, we'll show how one can fit these together to construct a (very simple) natural language dialogue system.

Summary
What is Inf2A about?
- We will learn about formal and natural languages.
- We will discuss their similarities and differences.
- We will cover finite state machines, context-free grammars, syntax trees, parsing, POS tagging, ambiguity.
- We will use Python for natural language processing.
- We will have lots of fun!
Next lecture: Finite state machines (revision)
Reading: Kozen chapters 1, 2; J&M [2nd Ed] chapter 1