Advanced NLP. Lecture 4: Morphology.

Morphological Segmentation

Basic task: segment an utterance into a sequence of morphemes, the smallest meaningful linguistic units.

Example: unresolved -> un + resolv + ed

Extensions:
- Identify the role of each morpheme (stem vs. affix).
- Identify the canonical form of each morpheme (e.g., the root of "unresolved" is "resolve"; the root of "took" is "take").
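As a minimal sketch of the basic task, segmentation can be approximated by stripping affixes from small hand-written inventories. The prefix and suffix lists below are made up for illustration; this is not a real morphological analyzer.

```python
# Hypothetical, tiny affix inventories for illustration only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ing", "ed", "s"]

def segment(word):
    """Split a word into (optional prefix, stem, optional suffix)."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            suffix = s
            word = word[:-len(s)]
            break
    morphemes.append(word)  # what remains is treated as the stem
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unresolved"))  # ['un', 'resolv', 'ed']
```

Note that this already shows why the extensions matter: the recovered stem "resolv" is not the canonical root "resolve".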

Related Problem: Word Segmentation

Task: divide text into a sequence of words. A word is "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks" (Kucera and Francis).

The problem is relatively easy for English, though tricky cases exist:
- "Wash." vs. "wash"
- "won't", "John's"
- "pro-Arab"
- "the-idea-of-a-child-as-required-yuppie-possession"

It is hard for other languages (Chinese, Arabic, ...) in which words are not separated by whitespace.

Morphological Segmentation: A Cross-Lingual Perspective

The distinction between the notions of word and morpheme is vague across languages:
- In English, "in" is a word, while in Hebrew it is a prefix.
- In English, the passive is realized using an auxiliary ("have"), while in Hebrew it is part of the stem.

Languages vary greatly in how morphemes are combined to produce words.
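One way to read the Kucera and Francis definition above is as a regular expression: runs of alphanumeric characters, optionally joined by internal hyphens or apostrophes. The pattern below is a simplified sketch (ASCII only, no handling of abbreviations like "Wash."):

```python
import re

# Alphanumeric runs, optionally joined by internal hyphens/apostrophes.
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def words(text):
    """Return the list of word tokens found in `text`."""
    return WORD.findall(text)

print(words("John's pro-Arab friends won't wash."))
```

The trailing period of "wash." is correctly dropped, but the pattern cannot distinguish the abbreviation "Wash." from sentence-final "wash", which is exactly the kind of ambiguity noted above.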

Morphological Structures

Two classes of morphemes:
- Stems: the main morpheme of the word, which carries its semantic meaning.
- Affixes: auxiliary morphemes that carry additional semantic and grammatical functions.
  - Prefix: precedes the stem (English: "un-resolved")
  - Suffix: follows the stem (English: "unresolv-ed")
  - Infix: inside the stem (Tagalog: "h-um-ingi")
  - Circumfix: combines a prefix and a suffix (German: "ge-sag-t")

Morphological Compounding

- Inflectional: grammatical transformations within the same grammatical category. Example: computer + s = computers
- Derivational: production of words in a different class. Example: computer + ation = computerization
- Compounding: combination of multiple word stems. Example: dog + house = doghouse
- Cliticization: combination of a stem with a clitic. Example: I + 've = I've

Prefixing vs. Suffixing in Inflectional Morphology
[figure]

Human Morphological Processing

How do humans store morphological variants?
- Hypothesis 1: full words are stored as units.
- Hypothesis 2: stems and affixes are stored separately.

Experimental methods:
- Reading time: measure the reading time for each word. Finding: reading time depends on the size of the morphological family.
- Priming: measure the change in recognition time when morphologically related words are repeated. Finding: regularly inflected forms are not distinct in the lexicon from their stems.
- Analysis of speech errors: analyze speech errors (slips of the tongue). Finding: inflectional and derivational suffixes appear separately from their stems.

How Do Children Learn Morphology?

Saffran, Newport & Aslin (1996):
- Children estimate the probability of each syllable in the language conditioned on its predecessor.
- Children segment utterances at low points of transitional probability.

Computational Approaches to Morphological Segmentation

Harris (1954): the successors of letters within words tend to be more constrained than the successors of letters at the ends of words.
Example: compare the possible fillings for the two strings "dog?" vs. "zeb?".

Idea:
1. Compute the surprisingness of each letter.
2. Place boundaries at local maxima of these values.
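Harris's idea can be sketched with successor variety: the number of distinct letters that can follow a prefix in a vocabulary, with boundaries hypothesized at local maxima. The toy vocabulary below is made up for illustration.

```python
# Toy vocabulary (hypothetical) for computing successor varieties.
VOCAB = ["dogs", "dog", "dogged", "doghouse", "zebra", "zebras",
         "read", "reads", "reading", "reader"]

def successor_variety(prefix):
    """Number of distinct letters that follow `prefix` in the vocabulary."""
    succ = {w[len(prefix)] for w in VOCAB
            if w.startswith(prefix) and len(w) > len(prefix)}
    return len(succ)

def boundaries(word):
    """Score each prefix of `word` and cut at local maxima of the scores."""
    scores = [successor_variety(word[:i]) for i in range(1, len(word))]
    cuts = [i + 1 for i in range(1, len(scores) - 1)
            if scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]]
    return scores, cuts

print(boundaries("reading"))  # the cut after 4 letters gives read | ing
```

After "read", three different letters can follow in the vocabulary (s, i, e), so the surprisingness spikes there and a boundary is placed, recovering "read + ing".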

Learning Word Segmentation: A Non-Probabilistic Approach

Ando and Lee (2001), "Mostly Unsupervised Statistical Segmentation of Japanese: Application to Kanji":
- Identifies word boundaries in Japanese.
- Does not assume the presence of a lexicon (i.e., it is knowledge-lean).
- Uses simple n-gram statistics to place boundaries.
- Optimization criterion inspired by Harris.
- Outperforms lexicon- and grammar-based morphological analyzers.

Word Segmentation
[figure]

Example
[figure]

Algorithm
[figure]

Algorithm (Cont.)
[figure]

Experimental Set-Up

- Corpus: 150 megabytes of 1993 Nikkei newswire.
- Manual annotations: 50 sequences for development (parameter tuning) and 50 sequences for test data.
- Compared against two manually crafted word segmenters (ChaSen and JUMAN).

Evaluation Measures

- Precision (P): percentage of system-identified words that are correct.
- Recall (R): percentage of words actually present in the input that were correctly identified by the system.
- F-measure (F):

  F = 2PR / (P + R)

Results
[figure]
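These measures can be computed by comparing the character spans induced by the gold and system segmentations of the same text. The gold and system word lists below are made-up toy data.

```python
def spans(words):
    """Map a word sequence over one text to a set of (start, end) spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def prf(gold_words, system_words):
    """Precision, recall, and F-measure over word spans."""
    g, s = spans(gold_words), spans(system_words)
    correct = len(g & s)
    p = correct / len(s)
    r = correct / len(g)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical gold and system segmentations of the text "thedogsrun".
gold = ["the", "dog", "s", "run"]
sysout = ["the", "dogs", "run"]
print(prf(gold, sysout))  # P = 2/3, R = 1/2, F = 4/7
```

Only "the" and "run" match in both span and position, giving P = 2/3, R = 1/2, and F = 2PR/(P + R) = 4/7.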

Learning Morphology: A Probabilistic Approach

Creutz and Lagus (2002), "Unsupervised Discovery of Morphemes":
- Identifies morpheme boundaries in Finnish; successfully applied to many other languages.
- Does not assume the repository of morphemes is known a priori.
- Objective: find a concise morpheme repository that yields a concise representation of the data.
- Formulated in a Bayesian framework.
- Delivers state-of-the-art performance for several languages.

Model Structure

Notation:
- D: a corpus of words w_1 ... w_n (morphologically unsegmented)
- S: a segmentation over D
- Lex: a lexicon listing the set of allowed morphemes m along with their probabilities θ(m)

Goal: find the lexicon and segmentation

  Lex*, S* = argmax_{Lex,S} P(Lex, S | D)

(note this is a MAP estimate). Then:

  argmax_{Lex,S} P(Lex, S | D) = argmax_{Lex,S} P(D | Lex, S) P(Lex, S)
                               = argmax_{Lex,S} P(Lex, S)
                               = argmax_{Lex,S} P(Lex) P(S | Lex)

where we assume that P(D | Lex, S) = 1 if segmentation S is consistent with corpus D.

The Model: Estimating P(S | Lex)

Let D = w_1 ... w_n, where w_i = m_i1 ... m_il_i, and let θ(m) be the probability of morpheme m specified by Lex. The likelihood of corpus D with segmentation S given Lex is:

  P(S | Lex) = Π_{i=1..n} Π_{j=1..l_i} θ(m_ij)

The Model: Estimating P(Lex)

The prior P(Lex) incorporates our beliefs about the form of the lexicon: its size, the length and letter composition of morphemes, and the frequency distribution of morphemes in text. The prior of our model encodes:
- lexicon size is distributed uniformly;
- letters in morphemes are selected based on their frequency in text;
- morpheme length follows a Gamma distribution;
- morpheme frequency follows a Zipfian distribution.
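In log space the product above is a sum of per-morph log-probabilities. The sketch below transcribes the formula directly, with made-up morph probabilities θ(m):

```python
import math

# Hypothetical morph probabilities theta(m) from a toy lexicon.
THETA = {"walk": 0.4, "ed": 0.3, "ing": 0.3}

def log_p_segmentation(segmented_corpus):
    """log P(S | Lex): sum over words i and morph positions j of log theta(m_ij)."""
    return sum(math.log(THETA[m])
               for word in segmented_corpus
               for m in word)

# A segmented corpus: each word is a list of morphs m_i1 .. m_il_i.
corpus = [["walk"], ["walk", "ed"], ["walk", "ing"]]
print(log_p_segmentation(corpus))
```

This is the quantity the search procedure later trades off against the lexicon prior P(Lex).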

The Model: Estimating P(Lex)

Assuming a lexicon of M morphemes:

  P(Lex) = M! P(M, N) [ Π_{i=1..M} P(l_i) Π_{j=1..l_i} P(c_ij) ] P(Θ | N)

- M! accounts for the different orders in which the morphemes in the lexicon could be generated.
- P(M, N) is the probability that the number of morpheme types in Lex is M and the number of morpheme tokens is N. We assume that P(M, N) is constant for all reasonable M and N.
- P(l_i) is the probability that morpheme i has length l_i, modeled using a Gamma distribution with hyperparameters α and β:

  P(l) = l^(α-1) e^(-l/β) / (Γ(α) β^α)

The Gamma distribution peaks at (α − 1)β, and β controls the skewness of the distribution. If the most frequent morpheme length is 4, then we set α = 5; we set β = 1.
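The Gamma length prior can be evaluated directly from the formula above; the sketch below uses the slide's setting α = 5, β = 1, so the density peaks at length (α − 1)β = 4.

```python
import math

ALPHA, BETA = 5.0, 1.0  # slide setting: most frequent morpheme length is 4

def length_prior(l):
    """Gamma density P(l) = l^(alpha-1) exp(-l/beta) / (Gamma(alpha) beta^alpha)."""
    return (l ** (ALPHA - 1) * math.exp(-l / BETA)
            / (math.gamma(ALPHA) * BETA ** ALPHA))

# The density peaks at l = (ALPHA - 1) * BETA = 4:
print(length_prior(3), length_prior(4), length_prior(5))
```

With these hyperparameters, very short and very long candidate morphemes are penalized relative to morphemes of around four letters.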

The Model: Estimating P(Lex) (cont.)

- P(c_ij) is the probability of character c_ij appearing in a morpheme:

  p(c) = count(c) / count of all characters

- The probability of a morpheme's spelling is thus computed using a unigram language model over characters.

The Model: Estimating P(Lex) (cont.)

- P(Θ | N) is a prior on the probabilities of morpheme occurrence; this distribution ensures Zipfian behaviour:

  P(Θ_i | N) = (Θ_i N)^(log2(1−h)) − (Θ_i N + 1)^(log2(1−h))

where h is the probability that a morph type is expected to occur only once in the corpus.

Search

- Start with a segmentation where each word corresponds to a single morpheme.
- Consider all possible splits of the i-th word in the corpus: select the split with the highest probability P(Lex, S | D) across all possible splits, or no split.
- In the case of a split, continue recursively on the two fragments.
- Compute the MLE lexicon for the given segmentation.
- Repeat the previous steps until convergence.

This is a greedy search with no theoretical guarantees. In a few lectures, we will study more effective search strategies.
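The greedy search can be sketched as follows. This is a simplified stand-in, not the full Creutz and Lagus model: instead of the complete Bayesian prior it uses a crude MDL-style cost (negative log-likelihood of the morph tokens plus a per-letter storage cost for the lexicon), and the corpus is a made-up toy.

```python
from collections import Counter
import math

# Hypothetical toy corpus: word -> frequency.
WORD_COUNTS = Counter({"walk": 5, "walked": 5, "walking": 5,
                       "talk": 5, "talked": 5, "talking": 5,
                       "jump": 5, "jumped": 5, "jumping": 5})

CHAR_COST = math.log(27)  # crude per-letter cost of storing a morph in the lexicon

def cost(seg):
    """MDL-style stand-in for -log P(Lex, S | D): token negative
    log-likelihood plus a storage cost for the morph lexicon."""
    tokens = Counter()
    for word, morphs in seg.items():
        for m in morphs:
            tokens[m] += WORD_COUNTS[word]
    n = sum(tokens.values())
    data = -sum(c * math.log(c / n) for c in tokens.values())
    lexicon = sum(len(m) * CHAR_COST for m in tokens)
    return data + lexicon

def resegment(word, seg):
    """Greedy recursive splitting of one word: accept the binary split of any
    current morph that most lowers the global cost; repeat until none does."""
    improved = True
    while improved:
        improved = False
        morphs = seg[word]
        base, best = cost(seg), None
        for idx, m in enumerate(morphs):
            for i in range(1, len(m)):
                cand = morphs[:idx] + [m[:i], m[i:]] + morphs[idx + 1:]
                seg[word] = cand
                c = cost(seg)
                if c < base:
                    base, best = c, cand
                seg[word] = morphs
        if best is not None:
            seg[word] = best
            improved = True

# Start with each word as a single morpheme; iterate to convergence.
seg = {w: [w] for w in WORD_COUNTS}
prev = cost(seg)
while True:
    for w in seg:
        resegment(w, seg)
    cur = cost(seg)
    if cur >= prev - 1e-9:
        break
    prev = cur

print({w: seg[w] for w in ["walked", "jumping", "talk"]})
```

On this toy corpus the greedy search recovers the stems and the "-ed"/"-ing" suffixes while leaving uninflected words whole, because reusing a shared morph lowers both the data cost and the lexicon cost.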

Results: Finnish
[figure]

Results: English
[figure]

Projection: Stem Prediction

David Yarowsky, Grace Ngai, Richard Wicentowski, "Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora", 2001.

Task: find the root of a word given its inflected form:
- defies -> defy
- skipped -> skip
- took -> take

Input:
- parallel text in two languages, annotated with part-of-speech tags (the tags discriminate between roots and inflections);
- a lemmatizer that connects roots and inflections for one language.

Direct-Bridge French Inflection/Root Alignment

The inflection "croyant" and the root "croire" are connected via "believing" (their English translation). This approach is limited, since translation typically preserves tense.

Multi-Bridge French Inflection/Root Alignment

Use the English lemmatizer to compute a multi-step transitive association:

  croyaient -> believed -> believe -> croire

We can build similar chains for other translations of the word of interest:

  croyaient -> thought -> think -> croire

Notation:
- E_lem_i: all English lemma forms (believed, believe, believing)
- F_inf: the foreign inflection (croyaient)
- F_root: the foreign root (croire)

Example:
[figure]
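The multi-bridge chaining can be sketched as dictionary lookups: a foreign inflection is linked to a candidate root when their English translations share a lemma. All dictionaries below are made-up miniature stand-ins for real aligned-corpus statistics and a real lemmatizer.

```python
# Hypothetical translation lexicon: foreign form -> English translations.
TRANSLATION = {"croyaient": ["believed", "thought"],
               "croire": ["believe", "think"]}

# Hypothetical English lemmatizer: English form -> lemma.
EN_LEMMA = {"believed": "believe", "thought": "think",
            "believe": "believe", "think": "think"}

def linked_roots(inflection, candidate_roots):
    """Roots reachable via inflection -> English word -> English lemma -> root."""
    lemmas = {EN_LEMMA[e] for e in TRANSLATION[inflection]}
    return [r for r in candidate_roots
            if lemmas & {EN_LEMMA[e] for e in TRANSLATION[r]}]

print(linked_roots("croyaient", ["croire"]))  # ['croire']
```

Here "croyaient" reaches "croire" through two independent bridges (believed -> believe and thought -> think), which is what makes the multi-bridge association more robust than the direct one.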

Results (our model: MProj)
[figure]

Adding More Monolingual Parallel Data
[figure]

Supervised Stem Prediction

Assume manually annotated data for stem prediction (e.g., 250 verbs and their inflections). We predict stems by considering the probabilities of different transformations.

Example:
[figure]

Summary

- Unsupervised algorithms for morphological analysis capitalize on the difference in recurrence patterns within and across morphemes.
- Probabilistic methods provide an effective means of incorporating prior beliefs about the structure of the morphological dictionary.
- The performance of unsupervised methods varies greatly across languages.