ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly Inflected Languages
Resource-light Approaches to Morphology

Overview
1. Linguistica: Intro; Signatures; Process; Evaluation & Problems
2. Yarowsky & Wicentowski 2000: Intro; Similarity measures; Combination; Resources; Problems

Linguistica (Goldsmith 2001)
http://linguistica.uchicago.edu/
- Learns signatures (paradigms) together with the roots they combine with
- Completely unsupervised: input = raw text (5K-500K tokens)
- Assumes suffix-based morphology

Signatures
Signatures are sets of suffixes that are used with a given set of stems.

  Signature        Example forms
  NULL.ed.ing      betray, betrayed, betraying
  NULL.ed.ing.s    remain, remained, remaining, remains
  NULL.s           cow, cows
  e.ed.ing.es      notice, noticed, noticing, notices

Similar to, but not the same as, paradigms:
- They include both derivational and inflectional affixes.
- They are purely corpus-based, and thus often incomplete: see NULL.ed.ing vs. NULL.ed.ing.s above (the corpus contains "remains" but not "betrays").
- They are purely concatenative, so blow/blew would be analyzed as bl + ow/ew (if analyzed at all).

Top English signatures

  Rank  Signature        #Stems    Rank  Signature        #Stems
  1     NULL.ed.ing      69        16    e.es.ing         7
  2     e.ed.ing         35        17    NULL.ly.ness     7
  3     NULL.s           253       18    NULL.ness        20
  4     NULL.ed.s        30        19    e.ing            18
  5     NULL.ed.ing.s    14        20    NULL.ly.s        6
  6     's.NULL.s        23        21    NULL.y           17
  7     NULL.ly          105       22    NULL.er          16
  8     NULL.ing.s       18        23    e.ed.es.ing      4
  9     NULL.ed          89        24    NULL.ed.er.ing   4
  10    NULL.ing         77        25    NULL.es          16
  11    ed.ing           74        26    NULL.ful         13
  12    's.NULL          65        27    NULL.e           13
  13    e.ed             44        28    ed.s             13
  14    e.es             42        29    e.ed.es          5
  15    NULL.er.est.ly   5         30    ed.es.ing        5

Process
1. A set of heuristics is used to generate candidate signatures (together with the roots they combine with).
2. The MDL metric is used to accept or reject them.

Step 1: Candidate generation
Word segmentation uses heuristics to generate a list of potential affixes:
- Collect all word-tails up to length six.
- For each tail n_1 n_2 ... n_k, compute the following metric, where C(.) is a corpus count and N_k is the total number of tails of length k:

\[
\frac{C(n_1, n_2 \ldots n_k)}{N_k} \, \log \frac{C(n_1, n_2 \ldots n_k)}{C(n_1)\, C(n_2) \cdots C(n_k)}
\]

- The 100 top-ranking candidates are chosen.
- Other heuristics are possible.
Words in the corpus are segmented according to these candidates. For each stem, collect the list of associated suffixes (incl. NULL), i.e., the signature for that stem. All signatures associated with only one stem or with only one suffix are dropped.
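The signature-collection step can be sketched as follows. This is a minimal illustration, not Linguistica's implementation. It assumes the candidate suffix list is already given, accepts every possible split instead of choosing one best parse per word, and uses a toy word list. The function name `collect_signatures` is hypothetical.

```python
from collections import defaultdict

def collect_signatures(words, suffixes):
    """Group stems by the set of candidate suffixes they occur with.

    Simplifying assumption: every split licensed by a candidate suffix is
    accepted; the empty string stands for the NULL suffix.
    """
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        for suffix in suffixes:
            if suffix == "":
                stem_to_suffixes[word].add("")
            elif word.endswith(suffix) and len(word) > len(suffix):
                stem_to_suffixes[word[:-len(suffix)]].add(suffix)

    signatures = defaultdict(set)
    for stem, found in stem_to_suffixes.items():
        if len(found) < 2:          # drop signatures with only one suffix
            continue
        name = ".".join(s or "NULL" for s in sorted(found))
        signatures[name].add(stem)

    # drop signatures associated with only one stem
    return {sig: sorted(stems) for sig, stems in signatures.items()
            if len(stems) > 1}

# Toy word list (illustrative only, not taken from any corpus).
toy_words = ["betray", "betrayed", "betraying", "jump", "jumped", "jumping",
             "remain", "remained", "remaining", "remains",
             "walk", "walked", "walking", "walks",
             "cow", "cows", "dog", "dogs"]
toy_signatures = collect_signatures(toy_words, ["", "ed", "ing", "s"])
for sig, stems in sorted(toy_signatures.items()):
    print(sig, stems)
```

On this toy list the stems betray and jump share NULL.ed.ing, remain and walk share NULL.ed.ing.s, and cow and dog share NULL.s. Unanalyzed forms such as "betrayed" occur only with NULL and are filtered out.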

Step 2: Candidate evaluation
Not all suggested signatures are useful, so they need to be evaluated. Minimum Description Length is used to filter them.

Minimum description length (MDL)
- A criterion for selecting among models.
- Developed by Rissanen (1989); see also Kazakov (1997) and Marcken (1995).
- According to MDL, the best model is the one that gives the most compact description of the data, including the description of the model itself.
- In our case: a grammar (the model) can be used to compress a corpus. The better the morphological description, the better the compression.
- The size of the grammar and of the corpus is measured in bits.
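To make the compression idea concrete, here is a toy description-length comparison. It is only a sketch under strong assumptions and is not the encoding Goldsmith uses: every character costs a fixed log2(27) bits (26 letters plus an end marker), and each word in the analyzed version costs one pointer into the stem list and one into the suffix list. The function names are hypothetical.

```python
from math import log2

BITS_PER_CHAR = log2(27)  # assumption: 26 letters plus an end-of-entry marker

def unanalyzed_length(words):
    """Bits needed to spell out every word letter by letter."""
    return sum(len(w) + 1 for w in words) * BITS_PER_CHAR

def analyzed_length(analyses):
    """Bits for a stem list, a suffix list, and one pointer pair per word."""
    if not analyses:
        return 0.0
    stems = {stem for stem, _ in analyses}
    suffixes = {suffix for _, suffix in analyses}
    model_chars = (sum(len(s) + 1 for s in stems)
                   + sum(len(x) + 1 for x in suffixes))
    pointer_bits = len(analyses) * (log2(len(stems)) + log2(len(suffixes)))
    return model_chars * BITS_PER_CHAR + pointer_bits

# (stem, suffix) pairs; the empty string stands for the NULL suffix.
analyses = [("betray", ""), ("betray", "ed"), ("betray", "ing"),
            ("remain", ""), ("remain", "ed"), ("remain", "ing"),
            ("remain", "s"), ("cow", ""), ("cow", "s")]
words = [stem + suffix for stem, suffix in analyses]

print(f"unanalyzed: {unanalyzed_length(words):.1f} bits")
print(f"analyzed:   {analyzed_length(analyses):.1f} bits")
```

Under these assumptions the stem-plus-suffix description is shorter than the plain word list, which is the kind of saving MDL rewards when it accepts a signature.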

Evaluation
- Applied to English, French, Italian, Spanish, and Latin.
- Identification of morpheme boundaries in a 1000-word corpus.
- Evaluated subjectively, because there is no gold standard.
- It is not always clear where the boundary should be: aboli-tion vs. abol-ish; Alexand-er, Alex-is, John-son; alumn-i.
- English: precision = 85.9%; recall = 90.4%.

Problems
- Analyzes only suffixes (easily generalizable to prefixes as well). Handling stem-internal changes would require a significant overhaul.
- All phonological/graphemic changes accompanying inflection must be factored into suffixes:
  English: hated (hate+ed) is analyzed as hat-ed.
  Russian: plak-at 'cry' (inf) and plač-et 'cry' (pres.3sg) are analyzed as pla-kat / pla-čet.
- Considers only information contained in individual words and their frequencies. Ignores any contextual information (reflecting syntactic and semantic information).

Yarowsky & Wicentowski 2000
- Resource-light induction of inflectional paradigms (suffixal and irregular).
- Tested on induction of English/Spanish present-past verb pairs.
- Forms of the same lexeme are discovered using a combination of four measures:
  - expected frequency distributions,
  - context similarity,
  - phonemic/orthographic similarity,
  - a model of suffix and stem-change probabilities.

Process
1. Estimate a probabilistic alignment between inflected forms.
2. Train a supervised morphological analysis learner on a weighted subset of these aligned pairs.
3. Use the result of Step 2 either as a stand-alone analyzer or as a probabilistic scoring component to iteratively refine the alignment in Step 1.

Frequency similarity
Two forms belong to the same lexeme when their relative frequency fits the expected distribution.

  Pair            Counts
  sing/sang       1204/1427
  sing/singed     1204/9
  singe/singed    2/9

- The distribution is approximated by the distribution of regular forms.
- Works for verbal tense, but sometimes one can expect a multimodal distribution. For example, for nouns the distribution differs for count nouns, mass nouns, pluralia tantum nouns, currency names, proper nouns, ...
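The counts above can be compared on a log scale. The sketch below only computes the log frequency ratio for the three candidate pairs. It does not model the expected distribution estimated from regular forms, which the actual method compares against. The function name `log_frequency_ratio` is hypothetical.

```python
from math import log

def log_frequency_ratio(base_count, inflected_count):
    """Natural log of the inflected-to-base frequency ratio."""
    return log(inflected_count / base_count)

# Corpus counts as given on the slide: (base count, candidate past-tense count).
candidate_pairs = {
    ("sing", "sang"): (1204, 1427),
    ("sing", "singed"): (1204, 9),
    ("singe", "singed"): (2, 9),
}

for (base, past), (c_base, c_past) in candidate_pairs.items():
    ratio = log_frequency_ratio(c_base, c_past)
    print(f"{base}/{past}: log ratio = {ratio:+.2f}")
```

The sing/singed ratio is far more extreme than the other two, which is why a frequency-based measure prefers sing/sang and singe/singed as the true pairs.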

Context similarity
- Forms of the same lemma have similar selectional preferences.
- Related verbs tend to occur with similar subjects/objects.
- Arguments are identified by simple regular expressions. Neither recall nor precision is perfect, but with a large corpus this is tolerable.
- Works well for verbs, but other parts of speech have much less strict subcategorization requirements.
- Some inflectional categories influence subcategorization, e.g., aspect in Slavic.
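One common way to compare contexts is cosine similarity over bag-of-words vectors. The sketch below swaps in a fixed word window for the regular-expression argument extraction described above, and it runs on a made-up token sequence. The window size, the helper names, and the toy text are all assumptions for illustration.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=1):
    """Count the words within `window` positions of each occurrence of target."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up toy text, for illustration only.
tokens = "birds sing songs . birds sang songs . cooks singed feathers .".split()
sing = context_vector(tokens, "sing")
print(cosine(sing, context_vector(tokens, "sang")))
print(cosine(sing, context_vector(tokens, "singed")))
```

In the toy text, sing and sang share their neighbours (birds, songs) while singed shares none, so the first similarity is close to 1 and the second is 0.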

Form similarity
Form (phonemic/graphemic) similarity is measured by a weighted Levenshtein measure (Levenshtein 1966).

Levenshtein distance (edit distance):
- The distance between two strings is the minimal number of character substitutions, insertions, or deletions needed to turn one into the other.
- Used in many different applications.
- Can be calculated by an efficient dynamic programming algorithm.
- Various modifications exist: additional operations, operation costs that depend on the characters being modified, etc.

In the weighted variant used here:
- Edit costs operate on character clusters.
- Four types of clusters are distinguished: V, V+, C, C+.
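A weighted edit distance can be sketched with the standard dynamic programming recurrence. The version below is a simplification: it works on single characters instead of the V, V+, C, C+ clusters, and its vowel set and cost values (0.5 for vowel-to-vowel substitution, 1.0 otherwise) are illustrative assumptions, not the weights used by Yarowsky & Wicentowski.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels of English

def substitution_cost(a, b):
    """Assumed costs: vowel-for-vowel changes are cheaper than other changes."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0

def weighted_levenshtein(source, target, indel_cost=1.0):
    """Weighted edit distance, computed row by row."""
    previous = [j * indel_cost for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        current = [i * indel_cost]
        for j, b in enumerate(target, start=1):
            current.append(min(
                previous[j] + indel_cost,                   # delete a
                current[j - 1] + indel_cost,                # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        previous = current
    return previous[-1]

print(weighted_levenshtein("sing", "sang"))    # one vowel change
print(weighted_levenshtein("sing", "singed"))  # two insertions
```

With these assumed weights, the irregular pair sing/sang comes out closer than sing/singed, because a stem-vowel change is cheaper than adding two characters.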

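Morphological Transformation Probabilities
In step k+1, a probabilistic generative model is trained on the basis of the analyzer obtained in step k. The probability of a stem change a → b is estimated by backing off from the last three characters of the root to progressively less specific contexts:

\begin{align*}
P(\mathit{form} \mid \mathit{root}, \mathit{suffix}, \mathit{pos})
  &= P(a \to b \mid \mathit{root}, \mathit{suffix}, \mathit{pos}) \\
P(cb{+}s \mid ca, {+}s, \mathit{pos})
  &= P(a \to b \mid ca, {+}s, \mathit{pos}) \\
  &\approx \lambda_1 P(a \to b \mid \mathit{last}_3(\mathit{root}), \mathit{suffix}, \mathit{pos}) \\
  &\quad + (1 - \lambda_1)\bigl(\lambda_2 P(a \to b \mid \mathit{last}_2(\mathit{root}), \mathit{suffix}, \mathit{pos}) \\
  &\quad + (1 - \lambda_2)\bigl(\lambda_3 P(a \to b \mid \mathit{last}_1(\mathit{root}), \mathit{suffix}, \mathit{pos}) \\
  &\quad + (1 - \lambda_3)\bigl(\lambda_4 P(a \to b \mid \mathit{suffix}, \mathit{pos}) + (1 - \lambda_4) P(a \to b)\bigr)\bigr)\bigr)
\end{align*}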
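Backoff interpolation of this kind can be sketched as a loop over estimates ordered from most to least specific context. The sketch assumes the nested form, in which the weight left over at each level is passed on to the next, less specific level. The function name and the input values are hypothetical, and the component probabilities are taken as given instead of being estimated from data.

```python
def backoff_probability(estimates, lambdas):
    """Interpolate estimates ordered from most to least specific context.

    estimates: n + 1 probabilities, e.g. conditioned on last3(root),
               last2(root), last1(root), suffix only, and nothing.
    lambdas:   n interpolation weights in [0, 1], one per non-final level.
    """
    if len(estimates) != len(lambdas) + 1:
        raise ValueError("need exactly one more estimate than lambdas")
    # Start from the least specific estimate and fold in the others,
    # moving towards the most specific context.
    prob = estimates[-1]
    for estimate, lam in zip(reversed(estimates[:-1]), reversed(lambdas)):
        prob = lam * estimate + (1.0 - lam) * prob
    return prob

# Hypothetical values: only the unconditioned estimate is non-zero.
print(backoff_probability([0.0, 0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]))
```

Because each level splits the remaining weight between its own estimate and everything below it, the effective weights always sum to one, so the result stays a valid probability whenever the inputs are.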

Combination
Of the four measures, no single model is sufficiently effective on its own.

English present-past tense verb pairs:

  Model          Iteration   Accuracy
  Frequency      1           9.8%
  Levenshtein    1           31.3%
  Context        1           28.0%
  F+L+C          1           71.6%
  F+L+C+M        1           96.5%
  F+L+C+M        conv        99.2%

Therefore, traditional classifier combination techniques are applied to merge the scores of the four models.

Required resources
1. A list of inflectional categories, each with canonical suffixes.
2. A large unannotated text corpus.
3. A list of the candidate noun, verb, and adjective base forms (typically obtainable from a dictionary).
4. A rough mechanism for identifying the candidate parts of speech of the remaining vocabulary, not based on morphological analysis.
5. A list of consonants and vowels.
6. Optionally, a list of common function words.
7. Optionally, various distance/similarity tables generated by the same algorithm on previously studied (related) languages, used as seed information.

Problems
- Suffix/tail based. Generalized by Wicentowski (2004), but no longer unsupervised.
- The rough mechanism for identifying POS relies on word-order templates. Good for English, not so much for Polish.
- Other problems mentioned above.