AUTOMATA AND GRAMMARS. (Linguistic machinery)

3 From the inbox
We are looking for enthusiastic and ambitious people to join our Semantics team in Munich! We won the title of the best FinTech startup of 2015. Here we are developing a smart personal assistant which will do all the paperwork for you and make all the payments magically simple. If you are looking for a place to try out your ideas (crazy ideas are welcome!), to grow professionally and personally, and not to be bound by a slow-moving, hierarchical and conservative organization, we are the right place for you.
Semantics Engineer
Your playground
  - You optimize and further develop our smart extraction system for information retrieval from financial documents
  - You construct domain-specific knowledge bases from unstructured data and natural language texts
  - You develop methods to constantly improve our extraction system using user feedback
  - You implement information extraction, knowledge base population and machine learning algorithms
  - You help to make sense of semantic information from multiple documents
Your Profile
  - You have at least 1-2 years of work experience as a Semantic Specialist
  - You are familiar with compiling ontologies and taxonomies to be built into NLP pipelines
  - You have good knowledge of Scala & Java and solid skills in scripting languages, such as Python, Perl, etc.
  - You develop test-driven software and comply with data privacy policies
  - You have hands-on experience with NLP toolkits (e.g., NLTK, Stanford NLP, GATE, Factorie, Elastic Search)
  - You feel comfortable working with relational and schematic databases and triplestores
  - You are experienced in the fields of information extraction, machine learning and automata theory
  - You are comfortable writing and testing RegExp
  - You are passionate about high-quality code and pay attention to details

4 Finite-state automaton
  - Finite set of states (initial, final)
  - Finite set of symbols (labels, alphabet)
  - Finite set of transitions (state x symbol)
  - Fairly easily implemented
  - Representable as network, table
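The components listed on this slide map fairly directly onto a small data structure. The sketch below is one possible encoding and is not taken from the slides: the class name `FSA`, its field names and the `as_table` helper are illustrative, and the automaton is a four-arc example over the words "a big dog yawned".

```python
from dataclasses import dataclass


@dataclass
class FSA:
    """A deterministic finite-state automaton (all names are illustrative)."""
    states: set
    alphabet: set
    transitions: dict   # (state, symbol) -> next state
    initial: int
    finals: set

    def as_table(self):
        """Return the automaton as a state-by-symbol transition table."""
        symbols = sorted(self.alphabet)
        rows = []
        for state in sorted(self.states):
            # None marks a missing transition for that symbol.
            row = [self.transitions.get((state, sym)) for sym in symbols]
            rows.append((state, row))
        return symbols, rows


# The same automaton seen as a network: each entry is one labelled arc.
SAMPLE = FSA(
    states={0, 1, 2, 3, 4},
    alphabet={"a", "big", "dog", "yawned"},
    transitions={(0, "a"): 1, (1, "big"): 2, (2, "dog"): 3, (3, "yawned"): 4},
    initial=0,
    finals={4},
)

symbols, rows = SAMPLE.as_table()
print("state", *symbols)
for state, row in rows:
    print(state, *("-" if target is None else target for target in row))
```

The dictionary of arcs is the "network" view and `as_table` produces the "table" view; both carry the same information.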

5 Sample FSA
  [Figure: network with arcs labelled a, big, dog, yawned]

6 Acceptors
  - Given: fsa, input string
  - Traverse automaton
  - Report success/failure
  - Limitations: Recognition only
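The acceptor procedure on this slide can be sketched in a few lines. This is a minimal illustration for a deterministic automaton, assuming transitions are stored as a dictionary keyed by (state, symbol); the function name `accepts` and the example arcs are hypothetical.

```python
def accepts(transitions, initial, finals, symbols):
    """Traverse a deterministic FSA and report success or failure."""
    state = initial
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False          # no arc for this symbol: reject
        state = transitions[key]
    return state in finals        # succeed only if we stop in a final state


# Hypothetical automaton over the words "a big dog yawned".
TRANSITIONS = {(0, "a"): 1, (1, "big"): 2, (2, "dog"): 3, (3, "yawned"): 4}

print(accepts(TRANSITIONS, 0, {4}, "a big dog yawned".split()))
print(accepts(TRANSITIONS, 0, {4}, "a dog yawned".split()))
```

The function returns only a yes/no answer, which is the "recognition only" limitation: nothing about the structure of the input is reported.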

7 {c*b{a b c}*c c}

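8 Swahili morphology FSA
  [Figure: network with states 1-5 and arcs labelled Subj, Tense, Object, Root]
  - Subj: ni, u, a, tu, wa
  - Tense: ta, na, me, li
  - Object: ni, ku, m, tu, wa
  - Root: penda, piga, sumbua, lipa
  Transition table (rows: source state; columns: target state; 0: no arc)
        1    2     3      4    5
  1.    0    Subj  0      0    0
  2     0    0     Tense  0    0
  3     0    0     0      Obj  0
  4     0    0     0      0    Root
  5:    0    0     0      0    0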
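As a rough illustration of how this automaton could be used, the sketch below walks the four slots in order (Subj, Tense, Object, Root) and returns every way a word can be split into morphemes from the slide's lists. The function name `segmentations` and the `SLOTS` layout are assumptions made for the example, and every slot is treated as obligatory, as in the transition table.

```python
# One entry per arc of the chain 1 -> 2 -> 3 -> 4 -> 5, in order.
SLOTS = [
    ("Subj", ["ni", "u", "a", "tu", "wa"]),
    ("Tense", ["ta", "na", "me", "li"]),
    ("Object", ["ni", "ku", "m", "tu", "wa"]),
    ("Root", ["penda", "piga", "sumbua", "lipa"]),
]


def segmentations(word, slot=0):
    """Return all analyses of `word` as a list of (slot, morpheme) lists."""
    if slot == len(SLOTS):
        # Final state reached: succeed only if the whole word was consumed.
        return [[]] if word == "" else []
    name, morphemes = SLOTS[slot]
    results = []
    for morpheme in morphemes:
        if word.startswith(morpheme):
            for rest in segmentations(word[len(morpheme):], slot + 1):
                results.append([(name, morpheme)] + rest)
    return results


print(segmentations("ninakupenda"))
print(segmentations("atampiga"))
print(segmentations("ninapenda"))   # no object marker, so no analysis here
```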

9 Determinism
  - Deterministic: no choice at any transition(s)
  - Nondeterministic: choice at some transition(s)
    - >1 arc from given state for given symbol
    - Empty arc: labelled with the empty-string symbol (e.g. ε)
  - Any NFA: can construct equivalent DFA
  - Regular languages: representable by finite-state automata
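The claim that any NFA has an equivalent DFA is usually shown with the subset construction. The sketch below is a textbook-style version under stated assumptions: the NFA is a dictionary from (state, symbol) to a set of states, the empty string `""` stands for an empty arc, and the names `epsilon_closure`, `determinize` and `dfa_accepts` are illustrative.

```python
EPSILON = ""   # assumed label for empty arcs


def epsilon_closure(states, delta):
    """All states reachable from `states` using only empty arcs."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def determinize(delta, start, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = epsilon_closure({start}, delta)
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            moved = {t for s in current for t in delta.get((s, symbol), ())}
            if not moved:
                continue
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_finals = {subset for subset in seen if subset & set(finals)}
    return dfa_delta, start_set, dfa_finals


def dfa_accepts(dfa, string):
    delta, state, finals = dfa
    for symbol in string:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in finals


# Example NFA: strings over {a, b} that end in "ab" (two arcs on "a" from 0).
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
DFA = determinize(NFA, 0, {2}, "ab")
print(dfa_accepts(DFA, "aab"), dfa_accepts(DFA, "aba"))
```

In the worst case the resulting DFA can have exponentially more states than the NFA, which is the usual price of removing the choice points.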

10 Sample NDFA
  [Figure: network with arcs labelled my, furry, cat, sneezed, a, yawned, dog, the, happy, sleeps]

11 Equivalence
  - Regular expressions
  - NFAs with ε-transitions
  - NFAs
  - DFAs
  - Same class of languages (called regular sets)
  - Transition tables

12 Dealing with non-determinism
  - Backup
  - Save search-state, search
  - Stacks, queues
  - Look-ahead
  - Parallelism
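Two of these strategies can be contrasted in a short sketch. The first function saves search states on a stack and backs up when a path fails; the second follows all paths in parallel by tracking a set of current states. The NFA, the word arcs and the function names are hypothetical, and empty arcs are left out to keep the example small.

```python
def accepts_backtracking(delta, start, finals, symbols):
    """Backup: keep saved search states (state, input position) on a stack."""
    stack = [(start, 0)]
    while stack:
        state, pos = stack.pop()
        if pos == len(symbols):
            if state in finals:
                return True
            continue                       # dead end: back up
        for target in delta.get((state, symbols[pos]), ()):
            stack.append((target, pos + 1))
    return False


def accepts_parallel(delta, start, finals, symbols):
    """Parallelism: advance every live state at once."""
    current = {start}
    for symbol in symbols:
        current = {t for s in current for t in delta.get((s, symbol), ())}
        if not current:
            return False
    return bool(current & set(finals))


# Hypothetical NFA with a choice point: two arcs labelled "the" leave state 0.
DELTA = {
    (0, "the"): {1, 2},
    (2, "happy"): {1},
    (1, "dog"): {3},
    (3, "sleeps"): {4},
    (3, "yawned"): {4},
}

for sentence in ["the dog sleeps", "the happy dog yawned", "the happy sleeps"]:
    words = sentence.split()
    print(sentence,
          accepts_backtracking(DELTA, 0, {4}, words),
          accepts_parallel(DELTA, 0, {4}, words))
```

Popping from the end of the list gives depth-first search; taking saved states from the front instead (a queue) would give breadth-first search over the same search space.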

13 Transducers
  - Given: fsa, input string
  - Traverse automaton
  - Report (mapped) output string
  - Applications:
    - Sublanguage translation
    - TTS transduction for some languages
    - Morphological processing
  - Limitation: non-recursive

14 Transducer
  [Figure: network with states 1-6 and arcs labelled WH+, BV, DetF, NF, DetM, NM]
  - WH+: where → où
  - BV: is → est
  - DetF: the → la
  - DetM: the → le
  - NF: exit → sortie, shop → boutique, toilet → toilette
  - NM: bar → bar, policeman → gendarme
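A transducer like this one can be sketched as an acceptor whose arcs also emit output. The word pairs below come from the slide; the arc topology (1 to 2 on WH+, 2 to 3 on BV, then either DetF and NF via state 4 or DetM and NM via state 6, ending in state 5) is an assumption, since the figure itself is not recoverable, and the function names are illustrative.

```python
# Input word -> output word, grouped by arc label (pairs from the slide).
LEXICON = {
    "WH+": {"where": "où"},
    "BV": {"is": "est"},
    "DetF": {"the": "la"},
    "DetM": {"the": "le"},
    "NF": {"exit": "sortie", "shop": "boutique", "toilet": "toilette"},
    "NM": {"bar": "bar", "policeman": "gendarme"},
}

# Assumed topology: state -> list of (arc label, target state).
ARCS = {
    1: [("WH+", 2)],
    2: [("BV", 3)],
    3: [("DetF", 4), ("DetM", 6)],
    4: [("NF", 5)],
    6: [("NM", 5)],
}
FINAL = {5}


def transduce(words, state=1):
    """Return every output word list that the transducer maps `words` to."""
    if not words:
        return [[]] if state in FINAL else []
    outputs = []
    for label, target in ARCS.get(state, []):
        out = LEXICON[label].get(words[0])
        if out is not None:
            for rest in transduce(words[1:], target):
                outputs.append([out] + rest)
    return outputs


def translate(sentence):
    return [" ".join(out) for out in transduce(sentence.lower().split())]


print(translate("where is the exit"))
print(translate("where is the policeman"))
```

The word "the" is a choice point (DetF or DetM); the wrong branch dies at the noun, so only the gender-consistent output survives.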

15 Language is recursive
  - John said that Mary is a genius.
  - Fred thinks that I doubt Fred thinks that
  - The man fell.
  - The woman who saw the man fell.
  - The man who the woman I met saw fell.

16 Recursive transition networks
  - Not just categories, but name of another network on arc
  - 4 types of arcs: category, empty, seek, exit
  - Uses a pushdown stack
  - Pros/cons:
    - Modularity, more powerful than FSAs
    - Linearity constraint, computationally costly

17 Sample RTN
  [Figure: networks NP, AdjP, S, VP and CP, with arcs labelled NP, VP, Det, AdjP, N, V, Compz, S and the words big, dog, yawned]
  - N: dog, cat, boy, girl, ...
  - Det: a, the, my, your, seven, every, ...
  - Adj: happy, tall, big, uninteresting, ...
  - V: sneezed, saw, died, barks, ...
  - Compz: that, because, whether, ...
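A recursive transition network recognizer can be sketched with the three traversing arc types (category, empty, seek) plus exit states. The network shapes below are assumptions, because the figure's arcs cannot be recovered from the transcript; the lexicon follows the slide, with "yawned" added to V. Python's call stack plays the role of the pushdown stack.

```python
LEXICON = {
    "N": {"dog", "cat", "boy", "girl"},
    "Det": {"a", "the", "my", "your", "seven", "every"},
    "Adj": {"happy", "tall", "big", "uninteresting"},
    "V": {"sneezed", "saw", "died", "barks", "yawned"},
    "Compz": {"that", "because", "whether"},
}

# Assumed networks: name -> (arcs by state, exit states).
# Each arc is (kind, label, target); every network starts in state 0.
NETWORKS = {
    "S": ({0: [("seek", "NP", 1)], 1: [("seek", "VP", 2)]}, {2}),
    "NP": ({0: [("cat", "Det", 1)],
            1: [("seek", "AdjP", 2), ("empty", None, 2)],
            2: [("cat", "N", 3)]}, {3}),
    "AdjP": ({0: [("cat", "Adj", 1)], 1: [("cat", "Adj", 1)]}, {1}),
    "VP": ({0: [("cat", "V", 1)],
            1: [("seek", "NP", 2), ("seek", "CP", 2)]}, {1, 2}),
    "CP": ({0: [("cat", "Compz", 1)], 1: [("seek", "S", 2)]}, {2}),
}


def traverse(net, state, words, pos):
    """Yield every input position at which `net` can exit."""
    arcs, exits = NETWORKS[net]
    if state in exits:
        yield pos                                   # exit arc
    for kind, label, target in arcs.get(state, []):
        if kind == "cat":
            if pos < len(words) and words[pos] in LEXICON[label]:
                yield from traverse(net, target, words, pos + 1)
        elif kind == "empty":
            yield from traverse(net, target, words, pos)
        elif kind == "seek":
            # Push: run the named network, then resume here where it exits.
            for resume in traverse(label, 0, words, pos):
                yield from traverse(net, target, words, resume)


def accepts(sentence):
    words = sentence.split()
    return any(end == len(words) for end in traverse("S", 0, words, 0))


print(accepts("the big dog yawned"))
print(accepts("my cat saw that the girl sneezed"))
print(accepts("dog the yawned"))
```

Because CP seeks S and S (through VP) can seek CP again, the networks handle embedding to any depth, which a plain FSA cannot express modularly.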

18 Pushdown transducers
  - RTN + paired transitions
  - Output string via traversal

19 Problems
  - No memory: straight input, output
  - Language has gaps:
    - What movie did you see?
    - Where did you go?
    - I saw the dog that barked.
    - Who will Bill date?
    - What is Fred so very upset about?

20 ATNs
  - Add register(s)
  - Add procedures to arcs
  - Procedural in nature
  - Pros/cons:
    - Relaxes linearity constraint
    - Nonlocal dependencies (gapping, particles)
    - Contradicts prevalence of declarative formalisms
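The idea of registers and arc procedures can be illustrated with a deliberately tiny example. The sketch below is not a full ATN: it is a single Det-N-V network in which each arc runs a procedure that tests and sets a NUM register, so that number agreement is carried across the sentence. The lexicon, register name and function names are all hypothetical.

```python
# Toy lexicon: word -> (category, number); None means unmarked for number.
LEXICON = {
    "the": ("Det", None), "a": ("Det", "sg"),
    "dog": ("N", "sg"), "dogs": ("N", "pl"),
    "barks": ("V", "sg"), "bark": ("V", "pl"),
}


def check_number(registers, number):
    """Arc procedure: test the word's number against NUM, then set NUM."""
    if number is None:
        return True
    if registers.get("NUM") in (None, number):
        registers["NUM"] = number
        return True
    return False


# state -> list of (category, target state, procedure run on the arc)
ARCS = {
    0: [("Det", 1, check_number)],
    1: [("N", 2, check_number)],
    2: [("V", 3, check_number)],
}
FINAL = {3}


def accepts(sentence):
    registers, state = {}, 0
    for word in sentence.split():
        if word not in LEXICON:
            return False
        category, number = LEXICON[word]
        for arc_category, target, procedure in ARCS.get(state, []):
            if arc_category == category and procedure(registers, number):
                state = target
                break
        else:
            return False              # no arc accepted this word
    return state in FINAL


print(accepts("the dogs bark"))
print(accepts("the dogs barks"))
```

The register gives the network a memory that plain arcs lack; the cost, as the slide notes, is that the grammar now contains procedures rather than purely declarative statements.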

21 Building automata
  - Specialized languages (lex, yacc)
  - Generalized toolkits, utilities
    - http://odur.let.rug.nl/~vannoord/fsa/
    - GraphViz
    - Foma (follow the forward link to BitBucket for the code)
  - Generalized applications (PC-Kimmo)
  - Specific toolkits

22 Checking dates
  - Years 1-9999: 3.7 million days
  - Leap years, calendar changes
  - Accept valid ones, reject invalid ones
  - Xerox: 1346 states, 21,006 arcs
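To give a feel for the size of the set such an automaton has to accept, the sketch below counts the valid dates in years 1-9999 using plain Gregorian leap-year rules. It deliberately ignores the calendar changes the slide mentions, so it is an approximation of the task rather than a description of the Xerox automaton; the function names are illustrative.

```python
def is_leap(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year, month):
    if month == 2:
        return 29 if is_leap(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def is_valid_date(year, month, day):
    """Accept valid dates, reject invalid ones."""
    return (1 <= year <= 9999
            and 1 <= month <= 12
            and 1 <= day <= days_in_month(year, month))


total = sum(days_in_month(y, m) for y in range(1, 10000) for m in range(1, 13))
print(total)                          # 3652059
print(is_valid_date(2000, 2, 29))     # True
print(is_valid_date(1900, 2, 29))     # False
```

Under these assumptions the count comes to about 3.65 million, close to the rounded figure on the slide.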

23 Applications
  - Phonology: speech applications
  - Morphology: word formation
  - Syntax: phrase structure
  - Discourse: dialogue structure
  - Text mining: word spotting, threat detection
  - Text tools: grammar checking, correction

24 AT&T's FSM toolkit
    digraph finite_state_machine {
        rankdir=LR;
        size="8,5";
        orientation=land;
        node [shape = doublecircle]; LR_0 LR_3 LR_4 LR_8;
        node [shape = circle];
        LR_0 -> LR_2 [ label = "SS(B)" ];
        LR_0 -> LR_1 [ label = "SS(S)" ];
        LR_1 -> LR_3 [ label = "S($end)" ];
        LR_2 -> LR_6 [ label = "SS(b)" ];
        LR_2 -> LR_5 [ label = "SS(a)" ];
        LR_2 -> LR_4 [ label = "S(A)" ];
        LR_5 -> LR_7 [ label = "S(b)" ];
        LR_5 -> LR_5 [ label = "S(a)" ];
        LR_6 -> LR_6 [ label = "S(b)" ];
        LR_6 -> LR_5 [ label = "S(a)" ];
        LR_7 -> LR_8 [ label = "S(b)" ];
        LR_7 -> LR_5 [ label = "S(a)" ];
        LR_8 -> LR_6 [ label = "S(b)" ];
        LR_8 -> LR_5 [ label = "S(a)" ];
    }

25 Graphing via FSM

26 What is a grammar?
  - A type of knowledge representation
  - Description of combinations of constituents
  - Provides no procedural information (how)
  - Provides implicit structural description (strings => structure)
  - System of rules and categories that underlies some level of language

27 Formal language theory
  - Alphabets
  - String operations: concatenation, reversal
  - Set operations: intersection, difference, union, complementation
  - Closure
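These operations can be tried out on small finite languages represented as Python sets of strings. The sketch below makes two assumptions: complement and closure are truncated to strings up to a maximum length (the real sets are infinite), and "closure" is read as Kleene closure. All function names are illustrative.

```python
from itertools import product


def concat(l1, l2):
    """Concatenation of two languages: every x followed by every y."""
    return {x + y for x in l1 for y in l2}


def reverse(lang):
    return {word[::-1] for word in lang}


def universe(alphabet, max_len):
    """All strings over `alphabet` of length <= max_len."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(sorted(alphabet), repeat=n)}


def complement(lang, alphabet, max_len):
    return universe(alphabet, max_len) - lang


def closure(lang, max_len):
    """Kleene closure L*, truncated to strings of length <= max_len."""
    result, frontier = {""}, {""}
    while frontier:
        longer = {w for w in concat(frontier, lang) if len(w) <= max_len}
        frontier = longer - result
        result |= frontier
    return result


A = {"a", "ab"}
B = {"b", ""}

print(sorted(concat(A, B)))
print(sorted(reverse(A)))
print(sorted(A | B), sorted(A & B), sorted(A - B))
print(sorted(complement({"a"}, {"a", "b"}, 1)))
print(sorted(closure({"ab"}, 4)))
```

Union, intersection and difference need no helper functions because Python's set operators already implement them.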