A Framework for Learning Morphology using Suffix Association Matrix

Mrs. Shilpa Desai, Dr. Jyoti Pawar, Prof. Pushpak Bhattacharya
The 5th Workshop on South and Southeast Asian Natural Language Processing, The 25th International Conference on Computational Linguistics, Dublin, Ireland, 23rd August 2014

Outline of the presentation
- Morphology
  - Introduction
  - Types
  - For Indian languages: Hindi and Konkani
- Approaches to Morphology Learning
- Suffix Association Matrix (SAM)
- Experimental Results Using SAM
- Learning Morphology Using SAM
- Conclusion

Morphology: A study of word structure
- Words are made up of morphemes:
  walking = walk + ing
  unplugged = un + plug + ed
- Morphemes are either stems or affixes.
- Affixes include prefixes, suffixes, infixes and circumfixes.

Types of Morphology
- Inflectional: deals with variations in the form of the same word (walk → walks, walking, ...). Gives rise to inflectional affixes.
- Derivational: deals with the production of new words (learn (verb) + er → learner (noun)). Gives rise to derivational affixes.

Morphology for Indian Languages
Hindi: affixes that apply
- Prefixes
- Suffixes
  - Inflectional suffixes: noun (moderate), verb (high)
  - Derivational suffixes (moderate)
Konkani: affixes that apply
- Prefixes (very rare)
- Suffixes (common)
  - Inflectional suffixes: noun (high, > 100), verb (very high, > 800)
  - Derivational suffixes (moderate)

Approaches used to Learn Morphology
Rule based / finite state based
- Used for word segmentation
- Used by stemmers and morphological analyzers
- Requires linguistic knowledge of the language to build
- Time consuming and costly, since linguistic experts are required
Unsupervised
- Used for word segmentation, affix identification and stemming
- Can be used for automatic paradigm generation
- Language independent
- Data-driven approach

Suffix Association Matrix (SAM)
SAM measures how many times a suffix occurs with some other suffix in the corpus.
Sample instance of SAM:

        NULL   er    ing   ed
NULL    -      46    225   129
er      46     -     22    15
ing     225    22    -     21
ed      129    15    21    -

Learning Morphology using Suffix Association Matrix (SAM)
- Unsupervised approach.
- Identifies derivational suffixes using a lexicon as input.
- Identifies inflectional and derivational suffixes using a corpus as input.
- Works for concatenative morphology.
- Generates paradigms. A paradigm is defined as the set of suffixes which go with a stem.
- For Indian languages like Konkani, where most inflectional forms have suffixes, SAM helps identify stems and suffixes.

Experimental Results: paradigms generated using a lexicon as input

Language  Suffix set          Corresponding word stems
English   {ist, y}            anarch, entomolog, metallurg, misogyn, ophthalmolog, optometr, ornitholog, ...
English   {NULL, ation, ed}   confirm, disorient, ferment, fix, infest, ...

Sample segmentation obtained: anarchist = anarch + ist

Experimental Results: paradigms generated using a lexicon as input

Language  Suffix set     Corresponding word stems
Hindi     {क, ण, त}      आरक्ष (araksh), नियंत्र (niyantr), निर्धार (nirdhar), पोष (posh), प्रदूष (pradush), शोष (shosh), ...
Hindi     {NULL, न, }    गड़बड़ (gadbad), गरम (garam), झिलमिल (zilmil), दोस्त (dost), धमक (dhamak), मालिक (malik), मेहनत (mehanat), ...

Sample segmentation obtained: नियंत्रण = नियंत्र + ण (niyantran = niyantr + n)

Experimental Results: paradigms generated using a lexicon as input

Language  Suffix set       Corresponding word stems
Konkani   {NULL, च#, }     अवतार (avtar), आर्यसमाज (aryasamaj), उपेग (upegh), एकमत (ekmath), करप (karap), गुलाब (gulab), ...
Konkani   {NULL, वप, त}    उजवाड (uzvad), कुचकुच (kuchkuch), खटखट (katkat), खडखड (khadkhad), ...

Sample segmentation obtained: उजवाडावप = उजवाड + ावप (ujvadavap = ujvad + avap)

A Framework for Learning Morphology using SAM

Learning Morphology using SAM: Step 1
Suffix Identifier Module: identifies candidate stems and candidate suffixes.
Example:
  Input L = {walk, walks, walking, talk, talks, tall, talking, take}
  Candidate stems = {walk, talk}
  Candidate suffixes = {s, ing, NULL}
Here every stem occurs with at least two suffixes and every suffix occurs with at least two stems.
To get a possible stem from two words, e.g. {walk, walking}, take their longest common prefix. Once a stem is found for a word, the remaining part of the word is considered its suffix; e.g. {walker, walking} gives the stem walk with the suffixes er and ing.
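Step 1 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `identify_candidates` is hypothetical, candidate stems are taken as the longest common prefixes of all word pairs, and the "at least two" constraints are enforced by repeated filtering until nothing changes.

```python
from collections import Counter
from itertools import combinations
from os.path import commonprefix


def identify_candidates(words):
    """Return {candidate stem: set of candidate suffixes} ("NULL" = empty suffix)."""
    words = sorted(set(words))
    # Assumption: a candidate stem is the longest common prefix of some word pair.
    stems = {commonprefix([a, b]) for a, b in combinations(words, 2)}
    stems.discard("")
    stem_suffixes = {
        stem: {word[len(stem):] or "NULL" for word in words if word.startswith(stem)}
        for stem in stems
    }
    # Filter until every suffix has >= 2 stems and every stem has >= 2 suffixes.
    while True:
        counts = Counter(s for sufs in stem_suffixes.values() for s in sufs)
        pruned = {}
        for stem, sufs in stem_suffixes.items():
            kept = {s for s in sufs if counts[s] >= 2}
            if len(kept) >= 2:
                pruned[stem] = kept
        if pruned == stem_suffixes:
            return pruned
        stem_suffixes = pruned


lexicon = ["walk", "walks", "walking", "talk", "talks", "tall", "talking", "take"]
candidates = identify_candidates(lexicon)
```

On the slide's input, the prefixes "tal" and "ta" are discarded because none of their suffixes is shared with another stem, which leaves walk and talk with the suffixes NULL, s and ing.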

Learning Morphology using SAM: Step 2
Stem Suffix Pruner Module: fixes the problem of over-stemming by applying heuristic H1.
Example:
  Input L = {addict, addiction, addictive, affirmation, affirmative, apprehension, apprehensive, contradict, contradiction, contradictive}
  Before pruning:
    Candidate stems = {addict, affirmati, apprehensi, contradict}
    Candidate suffixes = {NULL, ion, ive, on, ve}
  After pruning:
    Stems = {addict, affirmat, apprehens, contradict}
    Suffixes = {NULL, ion, ive}
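The slides do not spell out heuristic H1, so the following Python sketch uses a stand-in rule that is purely an assumption: if moving the last letter of a stem onto each of its suffixes produces only suffixes already attested with other stems, the boundary is shifted left. The name `prune_overstemming` is hypothetical. The rule reproduces the example above but should not be read as the authors' H1.

```python
def prune_overstemming(stem_suffixes):
    """Shift stem/suffix boundaries left while a hypothetical stand-in for H1 allows it.

    stem_suffixes maps each candidate stem to its set of candidate suffixes,
    with "NULL" standing for the empty suffix.
    """
    pruned = {}
    for stem, suffixes in stem_suffixes.items():
        # Suffixes attested with any *other* candidate stem.
        others = set()
        for other_stem, other_suffixes in stem_suffixes.items():
            if other_stem != stem:
                others |= other_suffixes
        new_stem, new_suffixes = stem, set(suffixes)
        # A stem that also occurs as a bare word (NULL suffix) is left alone.
        while len(new_stem) > 1 and "NULL" not in new_suffixes:
            shifted = {new_stem[-1] + suffix for suffix in new_suffixes}
            if not shifted <= others:
                break
            new_stem, new_suffixes = new_stem[:-1], shifted
        pruned.setdefault(new_stem, set()).update(new_suffixes)
    return pruned


candidates = {
    "addict": {"NULL", "ion", "ive"},
    "affirmati": {"on", "ve"},
    "apprehensi": {"on", "ve"},
    "contradict": {"NULL", "ion", "ive"},
}
stems = prune_overstemming(candidates)
```

Here affirmati + {on, ve} becomes affirmat + {ion, ive} because ion and ive are attested with addict and contradict. A further shift to {tion, tive} is blocked because no other stem takes those suffixes.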

Learning Morphology using SAM: Step 3
Primary Paradigm Generator: generates paradigms from the stem and suffix lists.
Example:
  Input L = {addict, addiction, addictive, affirmation, affirmative, apprehension, apprehensive, contradict, contradiction, contradictive}
  Stems = {addict, affirmat, apprehens, contradict}
  Suffixes = {NULL, ion, ive}
Paradigms:
  1. {NULL, ion, ive}: {addict, contradict}
  2. {ion, ive}: {affirmat, apprehens}
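The grouping in this step can be illustrated with a short Python sketch. It assumes that a paradigm is simply the set of stems sharing exactly the same suffix set, as in the example above; the function name `generate_paradigms` is hypothetical.

```python
from collections import defaultdict


def generate_paradigms(stem_suffixes):
    """Group stems by suffix set: {frozenset of suffixes: set of stems}."""
    paradigms = defaultdict(set)
    for stem, suffixes in stem_suffixes.items():
        # frozenset makes the suffix set hashable so it can serve as a key.
        paradigms[frozenset(suffixes)].add(stem)
    return dict(paradigms)


stem_suffixes = {
    "addict": {"NULL", "ion", "ive"},
    "affirmat": {"ion", "ive"},
    "apprehens": {"ion", "ive"},
    "contradict": {"NULL", "ion", "ive"},
}
paradigms = generate_paradigms(stem_suffixes)
```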

Learning Morphology using SAM: Step 4
Suffix Association Matrix (SAM) Generator: generates the suffix association matrix.
Paradigms:
  1. {NULL, ion, ive}: {addict, contradict, extort, extract, insert, intercept} (6 stems)
  2. {ion, ive}: {affirmat, apprehens} (2 stems)
SAM:

        NULL   ion   ive
NULL    -      6     6
ion     6      -     8
ive     6      8     -
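One reading that is consistent with the numbers on this slide is that each SAM cell counts the stems taking both suffixes, so (ion, ive) = 6 + 2 = 8. The Python sketch below builds the matrix under that assumption. The name `build_sam` and the representation (a counter keyed by unordered suffix pairs) are illustrative choices, not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def build_sam(paradigms):
    """Count, for every unordered suffix pair, the stems that take both suffixes."""
    sam = Counter()
    for suffixes, stems in paradigms.items():
        for a, b in combinations(sorted(suffixes), 2):
            # Every stem of this paradigm occurs with both a and b.
            sam[frozenset((a, b))] += len(stems)
    return sam


paradigms = {
    frozenset({"NULL", "ion", "ive"}): {
        "addict", "contradict", "extort", "extract", "insert", "intercept",
    },
    frozenset({"ion", "ive"}): {"affirmat", "apprehens"},
}
sam = build_sam(paradigms)
```

Keying by `frozenset` keeps the matrix symmetric by construction, so (ion, ive) and (ive, ion) are the same cell.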

Learning Morphology using SAM: Step 5
Morphology Paradigm Generator: refines the initial paradigms using the suffix association matrix, pruning chance segmentations such as
  cannot = canno + t
  cannon = canno + n

Conclusion (1/3)
Significance of the Suffix Association Matrix (SAM): SAM can be used to segment words correctly.
Example 1:
  Input word: cannon
  Possible segmentation: cannon = canno + n, if the word cannot is in the corpus.
  Check the value for (n, t) in SAM. The value will be low, so reject the segmentation cannon = canno + n.
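The check in Example 1 might look like the following Python sketch. The threshold `min_assoc`, the function name `is_supported` and the SAM values for (n, t) are all hypothetical; the slides only say that the value "will be low".

```python
from itertools import combinations


def is_supported(suffixes, sam, min_assoc=2):
    """Accept a suffix set only if every suffix pair is associated often enough."""
    return all(
        sam.get(frozenset(pair), 0) >= min_assoc
        for pair in combinations(sorted(suffixes), 2)
    )


# Illustrative values only: (n, t) is seen with a single stem ("canno").
sam = {
    frozenset(("NULL", "ion")): 6,
    frozenset(("NULL", "ive")): 6,
    frozenset(("ion", "ive")): 8,
    frozenset(("n", "t")): 1,
}

keep_canno = is_supported({"n", "t"}, sam)        # chance segmentation
keep_affirmat = is_supported({"ion", "ive"}, sam)  # well-attested paradigm
```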

Conclusion (2/3)
Significance of the Suffix Association Matrix (SAM)
Example 2:
  Input word: bother
  Possible segmentation: bother = both + er
  The value for (er, NULL) in SAM is high, so look at other suffixes that have a high association with er, such as ing.
  Check for the existence of bothing in a large corpus. If many such high-association suffix words are found, accept the segmentation; otherwise reject it.
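Example 2 can be sketched in Python as below, using the sample SAM values shown earlier in the deck. The function name `accept_segmentation`, the cut-offs `top_k` and `min_hits`, and the toy vocabulary are assumptions made for illustration; the slides do not say how many supporting words count as "many".

```python
# Sample SAM from the deck, stored as nested dictionaries.
SAM = {
    "NULL": {"er": 46, "ing": 225, "ed": 129},
    "er": {"NULL": 46, "ing": 22, "ed": 15},
    "ing": {"NULL": 225, "er": 22, "ed": 21},
    "ed": {"NULL": 129, "er": 15, "ing": 21},
}


def accept_segmentation(stem, suffix, sam, vocabulary, top_k=2, min_hits=2):
    """Accept stem + suffix if the stem also occurs with the suffix's top associates."""
    row = sam.get(suffix, {})
    # Non-NULL suffixes most strongly associated with `suffix`.
    associates = sorted((s for s in row if s != "NULL"), key=row.get, reverse=True)
    hits = sum((stem + s) in vocabulary for s in associates[:top_k])
    return hits >= min_hits


# Toy vocabulary standing in for a large corpus.
vocabulary = {"both", "bother", "walk", "walker", "walking", "walked"}

reject_bother = accept_segmentation("both", "er", SAM, vocabulary)
accept_walker = accept_segmentation("walk", "er", SAM, vocabulary)
```

With this vocabulary neither bothing nor bothed exists, so both + er is rejected, while walk + er is supported by walking and walked.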

Conclusion (3/3)
Related methods normally place a restriction on stem length. SAM removes this restriction and offers an alternative method that also works for words with short stems.

Thank You
देव बरें करूं (Dev bore koru)