Natural Language Processing CS 6320 Lecture 5 Words and Transducers. Instructor: Sanda Harabagiu



Morphology and Finite-State Transducers 1/2
Morphology is the study of the way words are built from smaller units called morphemes. A morpheme is a minimal meaning-bearing unit in language. Example: the word DOG has a single morpheme; the word CATS has two: (1) CAT and (2) -S.
There are two classes of morphemes: stems and affixes. Stems are the main morphemes of words. Affixes add additional meaning to the stems to modify their meanings and grammatical functions.

Morphology and Finite-State Transducers 2/2
There are four forms of affixes:
1. Prefixes precede the stem
2. Suffixes follow the stem
3. Infixes are inserted inside the stem
4. Circumfixes both precede and follow the stem.
Examples: Prefixes - un-, a-; Suffixes - the plural -s, -ing; Infixes - not common in English; Circumfixes - unbelievably = un + believe + able + ly

Kinds of Morphology
The use of prefixes and suffixes concatenated to the stem creates a concatenative morphology. When morphemes are combined in more complex ways, we have a non-concatenative morphology.
Inflectional morphology is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class.
Special case: templatic morphology, also known as root-and-pattern morphology, used in Semitic languages, e.g. Arabic and Hebrew.

Example of Templatic Morphology
In Hebrew, a verb is constructed using two components:
o A root (usually consisting of 3 consonants, CCC) carrying the main meaning
o A template which gives the ordering of consonants and vowels and specifies more semantic information about the resulting verb, e.g. the voice (active, passive)
o Example: the tri-consonantal root lmd (meaning: learn, study)
  can be combined with the template CaCaC for active voice to produce the word lamad = he studied
  can be combined with the template CiCeC for intensive to produce the word limed = he taught
  can be combined with the template CuCaC for passive voice to produce the word lumad = he was taught
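
To make the template mechanics concrete, here is a minimal sketch (not from the lecture) that interleaves a tri-consonantal root with a CV template; the root lmd and the templates CaCaC / CiCeC / CuCaC are the ones from the example above.

```python
def apply_template(root, template):
    """Interleave a tri-consonantal root with a CV template:
    each 'C' slot is filled with the next root consonant, and
    the template's vowels are copied through unchanged."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

# Root lmd ("learn/study") with the three templates from the slide:
print(apply_template("lmd", "CaCaC"))  # lamad -> "he studied"
print(apply_template("lmd", "CiCeC"))  # limed -> "he taught"
print(apply_template("lmd", "CuCaC"))  # lumad -> "he was taught"
```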

Producing words from morphemes
By inflection: inflectional morphology is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class.
By derivation: derivational morphology is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class.
By compounding: combining multiple words together, e.g. doghouse.
By cliticization: combining a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word, e.g. I've.

Cliticization
A clitic is a unit whose status lies between that of an affix and a word!
The phonological behavior of clitics is like affixes: they tend to be short and unaccented.
The syntactic behavior is more like words: they often act as pronouns, articles, conjunctions or verbs.

Inflectional Morphology 1/2
Nouns have an affix for plural and an affix for possessive.
Plural: the suffix -s, with the alternative spelling -es, for regular nouns:

Singular   Plural       Singular    Plural
dog        dogs         ibis        ibises
farm       farms        waltz       waltzes
school     schools      box         boxes
car        cars         butterfly   butterflies

Irregular nouns:
Singular   Plural
mouse      mice
ox         oxen

Possessive: for words not ending in -s the affix is 's (children / children's, llama / llama's); for words ending in -s the affix is a lone apostrophe (llamas / llamas', Euripides / Euripides' comedies).
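
As a rough illustration of the regular plural rules above (a small sketch, not part of the lecture), the suffix choice can be written as a function; irregular forms have to be listed explicitly, which is exactly what the finite-state lexicons later in the lecture do.

```python
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen", "goose": "geese"}

def pluralize(noun):
    """Regular English plural: -es after s/z/x/ch/sh, -ies after a
    consonant + y, plain -s otherwise; irregulars are looked up."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "z", "x", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

print(pluralize("dog"), pluralize("box"), pluralize("butterfly"), pluralize("ox"))
# dogs boxes butterflies oxen
```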

Inflectional Morphology 2/2
Verbal inflection is more complex than nominal inflection.
There are three classes of verbs in English:
Main verbs (eat, sleep, walk)
Modal verbs (can, will, should)
Primary verbs (be, have, do)
Regular / irregular verbs

Spanish Verb System
"To love" in Spanish: some of the inflected forms of the verb amar in European Spanish. There are 50 distinct verb forms for each regular verb. Example: the verb amar = to love.
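
For reference, a handful of the present-indicative forms of amar (standard Spanish conjugation, not listed on the slide); the full paradigm spans tense, mood, person and number, which is where the 50 distinct forms come from.

```python
# Present indicative of amar ("to love"); only 6 of the 50 forms.
AMAR_PRESENT = {
    "yo": "amo", "tú": "amas", "él/ella": "ama",
    "nosotros": "amamos", "vosotros": "amáis", "ellos/ellas": "aman",
}
print(AMAR_PRESENT["nosotros"])  # amamos
```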

Derivational Morphology
Changes of word class.
Nominalization: the formation of new nouns from verbs or adjectives.
Adjectives can also be derived from verbs.

Morphological Parsing
In general, parsing means taking an input and producing a structure for it.
Morphological parsing takes a word as input and produces a structure that reveals its morphological features.

Building a Morphological Parser
We need:
1. Lexicon - the list of stems and affixes and basic information about them
2. Morphotactics - the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word
3. Orthographic rules - or spelling rules; they model the changes that occur in a word when morphemes are combined

Morphology and FSAs
We'd like to use the machinery provided by FSAs to capture these facts about morphology:
Accept strings that are in the language
Reject strings that are not
And do so in a way that doesn't require us to, in effect, list all the words in the language.

Start Simple
Regular singular nouns are OK.
Regular plural nouns have an -s on the end.
Irregulars are OK as is.

Building a finite-state lexicon: simple rules
A lexicon is a repository of words. Possibilities:
List all words in the language.
Computational lexicons: list all stems and affixes of a language + a representation of the morphotactics that tells us how they fit together.
An example of morphotactics: the finite-state automaton (FSA) for English nominal inflection (nouns).
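
A toy recognizer mirroring this FSA (a sketch with a tiny hand-picked lexicon, not the lecture's implementation): regular nouns optionally take -s, and irregular singulars and plurals are accepted only as listed.

```python
REG_NOUNS = {"dog", "farm", "school", "car"}
IRREG_SG_NOUNS = {"goose", "mouse", "ox"}
IRREG_PL_NOUNS = {"geese", "mice", "oxen"}

def accept_noun(word):
    """FSA-style acceptance for English nominal inflection:
    reg-noun (-s)?  |  irreg-sg-noun  |  irreg-pl-noun"""
    if word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True
    if word in REG_NOUNS:                               # regular singular
        return True
    if word.endswith("s") and word[:-1] in REG_NOUNS:   # regular plural
        return True
    return False

print(accept_noun("dogs"), accept_noun("geese"), accept_noun("gooses"))
# True True False
```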

Now Plug in the Words

FSA for English Verb Inflection

Models for derivational morphology
More complex than inflectional morphology; the FSAs tend to be more complex.
Simple case: the morphotactics of English adjectives. Examples from Antworth (1990):
big, bigger, biggest
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
clear, clearer, clearest, clearly, unclear, unclearly
cool, cooler, coolest, coolly
red, redder, reddest
real, unreal, really
While this FSA will recognize most of the adjectives, it will also recognize ungrammatical forms like unbig, redly, realest.
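
A sketch of the naive adjective morphotactics (assuming the FSA has the shape (un-)? root (-er | -est | -ly)?, with roots taken from the examples above); like the FSA on the slide it over-accepts forms such as unbig and redly, and it also ignores spelling changes (bigger, happier), which are the job of the orthographic rules introduced later.

```python
ADJ_ROOTS = {"big", "happy", "clear", "cool", "red", "real"}
PREFIXES = {"", "un"}
SUFFIXES = {"", "er", "est", "ly"}

def accept_adjective(word):
    """Naive morphotactics: (un-)? root (-er | -est | -ly)?.
    Over-accepts (unbig, redly) because roots are not subclassed,
    and misses forms needing spelling changes (bigger, happier)."""
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                end = len(word) - len(suf)
                core = word[len(pre):end]
                if core in ADJ_ROOTS:
                    return True
    return False

print(accept_adjective("unclearly"), accept_adjective("unbig"))  # True True
```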

Derivational Rules
If everything is an accept state, how do things ever get rejected?

Parsing/Generation vs. Recognition
We can now run strings through these machines to recognize strings in the language.
But recognition is usually not quite what we need:
Often if we find some string in the language we might like to assign a structure to it (parsing).
Or we might have some structure and we want to produce a surface form for it (production/generation).
Example: from cats to cat +N +PL

Finite-State Transducers
A transducer maps between one representation and another. A finite-state transducer (FST) is a type of finite automaton which maps between two sets of symbols.
An FST is a two-tape automaton which recognizes or generates pairs of strings. Thus we can label each arc in the finite-state machine with two symbols, one for each tape.

Finite State Transducers
The simple story: add another tape, and add extra symbols to the transitions.
On one tape we read cats; on the other we write cat +N +PL.

Transitions
c:c  a:a  t:t  +N:ε  +PL:s
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
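
A minimal hand-coded transducer whose arcs carry exactly the pairs above (an illustrative sketch, not the book's implementation): run over the lexical string c a t +N +PL, it writes the surface string cats.

```python
# Each arc: (state, input symbol) -> (next state, output string)
ARCS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "+N"): (4, ""),    # +N:eps -- write nothing
    (4, "+PL"): (5, "s"),  # +PL:s  -- write the plural -s
}
FINAL_STATES = {5}

def transduce(symbols):
    """Follow the arcs for each input symbol, collecting the outputs;
    reject (return None) on a missing arc or a non-final end state."""
    state, output = 0, []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None
        state, piece = ARCS[(state, sym)]
        output.append(piece)
    return "".join(output) if state in FINAL_STATES else None

print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
```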

Typical Uses
Typically, we'll read from one tape using the first symbol on the machine transitions (just as in a simple FSA), and we'll write to the second tape using the other symbols on the transitions.

Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state; it didn't matter which path was actually traversed.
In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result.

FSTs and FSAs
FSTs have a more general function than FSAs:
An FSA defines a formal language by defining a set of strings.
An FST defines a relation between sets of strings.
Another view: an FST is a machine that reads one string and generates another one.

A four-fold way of thinking
FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string pair is in the string-pair language, and reject if it is not.
FST as generator: a machine that outputs pairs of strings of the language. The output is a yes or a no, and a pair of output strings.
FST as translator: a machine that reads a string and outputs another string.
FST as set relater: a machine that computes relations between sets.
Parsing is done with the FST. A transducer maps one set of symbols into another; an FST does so using a finite automaton.
Two-level morphology: represents each word as a correspondence between a lexical level and a surface level.

Formal definition of an FST
Defined by 7 parameters:
1. Q: a finite set of N states q0, q1, ..., qN-1
2. Σ: a finite set corresponding to the input alphabet
3. Δ: a finite set corresponding to the output alphabet
4. q0 ∈ Q: the start state
5. F ⊆ Q: the set of final states
6. δ(q, w): the transition function (or transition matrix) between states
7. σ(q, w): the output function, giving the set of all possible output strings for each state and input.
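
The 7-tuple can be held in a plain container; a minimal sketch (the names and the toy instance below are illustrative only):

```python
from typing import NamedTuple

class FST(NamedTuple):
    """The 7 parameters from the definition above."""
    Q: frozenset      # finite set of states
    Sigma: frozenset  # input alphabet
    Delta: frozenset  # output alphabet
    q0: str           # start state, a member of Q
    F: frozenset      # final states, a subset of Q
    delta: dict       # transition function: (state, input) -> next states
    sigma: dict       # output function: (state, input) -> output strings

# A two-state toy instance that reads 'a' and writes 'b'.
toy = FST(frozenset({"q0", "q1"}), frozenset({"a"}), frozenset({"b"}),
          "q0", frozenset({"q1"}),
          {("q0", "a"): {"q1"}}, {("q0", "a"): {"b"}})
```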

Regular Relations
Remember: FSAs are isomorphic to regular languages. FSTs are isomorphic to regular relations.
Definition: regular relations are sets of pairs of strings (an extension of regular languages, which are sets of strings).
Operations:
Inversion: the inversion of a transducer T, written T^-1, switches the input and output labels.
Composition: if T1 is a transducer from I1 to O1 and T2 is a transducer from O1 to O2, then T1 ∘ T2 is a transducer from I1 to O2.
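
Inversion is easy to see on a toy letter transducer (a sketch assuming a deterministic dict representation, so colliding output labels are not handled): swapping each arc's input and output labels turns a generator into a parser.

```python
def invert(delta):
    """T^-1: swap the input and output label on every arc.
    delta maps (state, in_sym) -> (next_state, out_sym)."""
    return {(q, out): (q2, inp) for (q, inp), (q2, out) in delta.items()}

# Generation direction: lexical +PL is written as surface s ...
gen = {(0, "c"): (1, "c"), (1, "a"): (2, "a"),
       (2, "t"): (3, "t"), (3, "+PL"): (4, "s")}
# ... inverted, surface s is read and lexical +PL is written.
print(invert(gen)[(3, "s")])  # (4, '+PL')
```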

FSTs for Morphological Parsing

Applications
The kind of parsing we're talking about is normally called morphological analysis.
It can either be:
An important stand-alone component of many applications (spelling correction, information retrieval)
Or simply a link in a chain of further linguistic analysis.

Morphological Parsing with Finite-State Transducers
In finite-state morphology, we represent a word as a correspondence between a lexical level (a concatenation of morphemes) and a surface level (a concatenation of the letters which make up the spelling of the word).
For finite-state morphology, it is convenient to view an FST as having two tapes. The lexical tape is composed of characters from one alphabet Σ. The surface tape is composed of characters from another alphabet Δ. The two alphabets can be combined into a new alphabet of complex symbols Σ'. Each complex symbol is composed of an input-output pair (i, o), with i ∈ Σ and o ∈ Δ; thus Σ' ⊆ Σ × Δ. Note that Σ and Δ may include the epsilon symbol ε.

Feasible pairs
The pairs of symbols in Σ' are called feasible pairs. Meaning: each feasible pair symbol a:b in Σ' expresses how the symbol a from one tape is mapped to the symbol b on the other tape. Example: a:ε means that a on the upper tape will correspond to nothing on the lower tape. Pairs a:a are called default pairs and we refer to them by the single letter a.
An FST morphological parser can be built from a morphotactic FSA by adding an extra lexical tape and the appropriate morphological features. How?

How?
Use nominal morphological features (+Sg and +Pl) to augment the FSA for nominal inflection.
The symbol ^ indicates a morpheme boundary; the symbol # indicates a word boundary.

Expanding FSTs with lexicons
Update the lexicon so that irregular plurals such as geese will parse into the correct stem: goose +N +Pl.
The lexicon can also have two levels. Example: g:g o:e o:e s:s e:e, or g o:e o:e s e.
The lexicon will then be more complex.
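
The two-level entry above, written out as data (a small illustration, not the lecture's lexicon format): reading off the lexical side gives the stem goose, reading off the surface side gives geese.

```python
# The lexicon entry g:g o:e o:e s:s e:e as lexical:surface pairs.
GOOSE_ENTRY = [("g", "g"), ("o", "e"), ("o", "e"), ("s", "s"), ("e", "e")]

lexical = "".join(lex for lex, _ in GOOSE_ENTRY)   # goose
surface = "".join(srf for _, srf in GOOSE_ENTRY)   # geese
print(lexical, "->", surface)
```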

Expanding a nominal inflection FST: T_num and T_lex

Intermediary Tapes

Transducers and Orthographic Rules
Problem: English often requires spelling changes at morpheme boundaries.
Solution: introduce spelling rules (orthographic rules) to handle context-specific spelling changes.
cat +N +PL -> cats (this is OK)
fox +N +PL -> foxs (this is not OK)

Example of lexical, intermediate and surface tapes. Between each pair of tapes there is a two-level transducer!
An FST between the lexical and intermediate levels; the E-insertion rule between the intermediate and the surface level.

E-insertion rule
How do we formalize it?

Chomsky and Halle's Notation
a -> b / c __ d : rewrite a as b when it occurs between c and d.
The E-insertion rule: ε -> e / {x, s, z} ^ __ s#
Insert e on the surface tape when the lexical tape has a morpheme ending in x, s or z and the next morpheme is -s.
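
The lecture implements this rule as a transducer (next slide); as a quick sanity check, the same rewrite can be sketched directly on the intermediate tape with a regular expression (an illustration only, assuming ^ and # mark morpheme and word boundaries as above).

```python
import re

def e_insertion(intermediate):
    """e-insertion as a direct string rewrite: insert e between a
    morpheme ending in x, s or z and a word-final -s morpheme, then
    erase the ^ and # boundary symbols to obtain the surface form."""
    surface = re.sub(r"([xsz])\^(s#)", r"\1^e\2", intermediate)
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats
```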

q0 models having seen only default pairs, unrelated to the rule
q1 models having seen z, s or x
q2 models having seen the morpheme boundary after the z, s, x
q3 models having just seen the E-insertion; it is not an accepting state
q5 is there to ensure that e is always inserted when needed


Combining FST Lexicon and Rules


Some difficulties
Parsing the lexical tape from the surface tape, and generating the surface tape from the lexical tape.
Parsing has to deal with ambiguity. Disambiguation requires some external evidence.
Example: I saw two foxes yesterday - fox is a noun!
That trickster foxes me every time! - fox is a verb!

Automaton Intersection
Transducers in parallel can be combined by automaton intersection: take the Cartesian product of the states and create a new set of output states.

Composition
1. Create a set of new states that correspond to each pair of states from the original machines. (New states are called (x, y), where x is a state from M1 and y is a state from M2.)
2. Create a new FST transition table for the new machine according to the following intuition:

Composition
There should be a transition between two states in the new machine if it's the case that the output for a transition from a state of M1 is the same as the input to a transition from a state of M2.

Composition
δ3((x_a, y_a), i:o) = (x_b, y_b) iff
there exists c such that δ1(x_a, i:c) = x_b AND δ2(y_a, c:o) = y_b
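
A sketch of this definition over toy letter transducers (assuming a simplified deterministic dict representation with one arc per (state, input) pair): the composed machine steps both originals in lockstep through the shared intermediate symbol c.

```python
def compose(delta1, delta2):
    """delta3((x, y), i:o) = (x', y') iff there is a c with
    delta1(x, i:c) = x' and delta2(y, c:o) = y'.
    Each delta maps (state, in_sym) -> (next_state, out_sym)."""
    delta3 = {}
    for (x, i), (x2, c) in delta1.items():
        for (y, c2), (y2, o) in delta2.items():
            if c == c2:
                delta3[((x, y), i)] = ((x2, y2), o)
    return delta3

# M1 rewrites a -> b, M2 rewrites b -> c; their composition rewrites a -> c.
m1 = {(0, "a"): (1, "b")}
m2 = {(0, "b"): (1, "c")}
print(compose(m1, m2))  # {((0, 0), 'a'): ((1, 1), 'c')}
```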