Nepali Spellchecking. Laxmi Prasad Khatiwada Linguist Nepali Language Computing Project Madan Puraskar Pustakalaya

Nepali Spellchecking Laxmi Prasad Khatiwada Linguist Nepali Language Computing Project Madan Puraskar Pustakalaya www.mpp.org.np <laxmi@mpp.org.np>

Contents
- Relation of spellchecking and linguistics
- Error allocation in spelling
- Phonology
- Characters of the Nepali language
- Morphophonemics
- Types of error allocation
- Modeling orthographic rules in Nepali
- Implementation in OOo
- Nepali spelling rules for MySpell

Relation of Spellchecking and Linguistics
[Diagram relating areas of linguistics to spellchecking. Labels: Language Evolution, Pragmatics, Language Acquisition, Natural Language Processing, Semantics, Syntax, Morphology, Phonology, Grammar, Lexicography; Idea, Concept, Word, Speech.]

Error allocation in spelling
- Phonemic
- Morphemic
- Lexical
- Syntactic
- Semantic
- Pragmatic

Phonology
What is phonology? The study of the speech sounds of a language (represented by phonemes), with reference to their distribution.

Characters of Nepali Language
[Chart of Nepali characters. Labels: Vowels; Consonants: Unvoiced, Voiced, Nasal; Velar, Affricates, Palatal, Dental, Labial; Other Consonant.]

Morphology
What is morphology? The study of the words of a language (represented by morphemes), with reference to their distribution.
Morphemes: the smallest units of a language that carry meaning. A word can comprise one or more morphemes.

Types of morphemes
- Free morpheme: can occur as a simple word. Example: घर, गर, फर क,...
- Bound morpheme: can only occur in connection with other morphemes. Example: -अ स,-न, आर...

Sandhi, i.e. Morphophonemics
Philosophy: when two different sounds are uttered in close proximity, they join to give a new sound to the listener.
The rules are enumerated in full detail and with mathematical rigour, much like the rewrite rules of finite-state transducers:
Aα + βB → AγB
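As a rough illustration of the rewrite-rule view, the sketch below joins two morphemes and rewrites the boundary whenever a rule of the form Aα + βB → AγB matches. The rule table, the function name, and the transliterated example (deva + indra → devendra, the familiar a + i → e vowel sandhi) are illustrative assumptions and do not come from the slides.

```python
# Hypothetical rule table: (end of left part, start of right part) -> merged sound.
# The entries are illustrative vowel-sandhi rules on transliterated text.
SANDHI_RULES = {
    ("a", "i"): "e",
    ("a", "u"): "o",
    ("a", "a"): "aa",
}


def join_with_sandhi(left, right, rules=SANDHI_RULES):
    """Join two morphemes, rewriting the boundary if a rule matches."""
    if not left or not right:
        return left + right
    for (alpha, beta), gamma in rules.items():
        if left.endswith(alpha) and right.startswith(beta):
            # A + alpha, beta + B  ->  A + gamma + B
            return left[: len(left) - len(alpha)] + gamma + right[len(beta):]
    return left + right  # no rule applies: plain concatenation


print(join_with_sandhi("deva", "indra"))  # devendra
```

A real system would state such rules over Devanagari characters and compile them into a transducer; the dictionary lookup here only mirrors the shape of a single rewrite step.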

How do spelling mistakes occur?
1. Keyboard adjacencies
2. Shift-key characters
3. Phonetic similarity
4. Visual similarity

Types of error allocation
- Typographic errors occur when the writer knows the correct spelling of a word but mistypes it. These errors are mostly related to the keyboard.
- Cognitive errors (also called orthographic errors) occur when the writer does not know, or has forgotten, the correct spelling of a word.

Typographic errors
The writer mistypes the word; these errors are mostly related to the keyboard.
- Shift-key characters: कसरत/कसर त धनध न य/धनध य
- Phonetic similarity: फ र /फ र त र/ तर
- Visual similarity: प च/प च स ग/स ग

Cognitive errors (also called orthographic errors)
These occur when the writer does not know, or has forgotten, the correct spelling of a word.
Examples: अ थ ई / अ थ य अग य न / अज ञ न न द / न

Error correction
Automatic: the error is replaced by its correction without user intervention. Automatic error correction is a requirement for speech processing and NLP (Natural Language Processing) systems.

Lexicon lookup
Interactive: the user selects one of the suggested corrections for replacement. The spellchecker can offer multiple correction suggestions.

Single Error Correction Technique
Errors and corrections:
1. Single letter insertion
2. Single letter deletion
3. Single letter substitution
4. Single letter transposition
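The four single-error types can be combined with a lexicon lookup to produce suggestions. The following is a minimal sketch, assuming a small in-memory lexicon and a Latin alphabet for readability; the function names, the sample lexicon, and the alphabet are hypothetical. For Nepali the alphabet would be the relevant Devanagari characters.

```python
def single_edit_candidates(word, alphabet):
    """Return every string that is exactly one single-letter edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {a + c + b for a, b in splits for c in alphabet}
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (inserts | deletes | substitutes | transposes) - {word}


def suggest(word, lexicon, alphabet):
    """Return the word itself if it is in the lexicon, else its one-edit neighbours that are."""
    if word in lexicon:
        return [word]
    return sorted(single_edit_candidates(word, alphabet) & lexicon)


# Hypothetical sample data for illustration only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
LEXICON = {"spell", "spill", "shell", "smell"}

print(suggest("spel", LEXICON, ALPHABET))   # missing letter
print(suggest("sepll", LEXICON, ALPHABET))  # transposed letters
```

An interactive checker would show the returned list to the user, while an automatic one would need some ranking to pick a single replacement.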

Morphological system
- Inflectional (र प यक)
- Derivational (व य त प दक)

Inflection + ओ ब ठ प. व. ब ( व) + ई ब ठ. व र प यन + आ ब ठ व. बव

Derivational
- Affixation (सग प त)
  - Prefix (प व सग): व+कल प = वकल प
  - Suffix (परसग)
    - Primary (क त सग): ल ख +ओट= ल ख ट
    - Secondary ( तसग): न प ल+ई =न प ल
- Compound (सम सप त): ल म +ख ट ट =ल मख ट ट
- Duplication ( त वप त): झल +झल +ई =झलझल

Modeling Orthographic Rules: English
Spelling changes at morpheme boundaries:
- bus + s → buses
- watch + s → watches
- fly + s → flies
- make + ing → making
Rules:
- E-insertion takes place if the stem ends in s, z, ch, sh, etc.
- y maps to ie when the pluralization marker s is added.
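The English rules above can be sketched as a small suffix-attachment function. This is an illustrative sketch covering only the patterns on the slide; the function name is hypothetical, and restricting y → ie to stems ending in consonant + y, as well as including x among the sibilant endings, are assumptions added here.

```python
import re


def attach_suffix(stem, suffix):
    """Attach a suffix to an English stem, applying boundary spelling rules."""
    if suffix == "s":
        # E-insertion: the stem ends in a sibilant (s, z, x, ch, sh).
        if re.search(r"(s|z|x|ch|sh)$", stem):
            return stem + "es"
        # y -> ie before plural -s; limited to consonant + y (assumption).
        if re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ies"
    if suffix == "ing" and re.search(r"[^aeio]e$", stem):
        # E-deletion before -ing, as in make + ing -> making.
        return stem[:-1] + "ing"
    return stem + suffix  # default: plain concatenation


for stem, suffix in [("bus", "s"), ("watch", "s"), ("fly", "s"), ("make", "ing")]:
    print(stem, "+", suffix, "->", attach_suffix(stem, suffix))
```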

Modeling Orthographic Rules: Nepali
1. Single letter insertion
   बज उ+ छ = बज उ छ / ख + छ = ख न छ
   द + आइ = दय इ / र + आइ = र व इ
2. Single letter deletion
   उ ल + आल = उक ल / उ ल + आल = उक ल
3. Single letter substitution
   क ट + आइ = कट इ / भ ग + आइ = भग इ
4. Transposition
   उ ल +आ = उल क / वर म +ई = वम र

Implementation in OpenOffice.org
- OpenOffice.org (OOo) uses MySpell for spellchecking.
- MySpell works only with single-byte character encodings.
- Nepali characters are encoded in a multi-byte encoding system, so MySpell cannot work on them directly in OOo.
- Both the dictionary file and the affix file have to be saved in ISCII-DEVANAGARI encoding.
- For the Nepali language, the dictionary file is saved as ne_np.dic and the affix file as ne_np.aff.
- Both files are to be stored in the <OOo_dir>/share/dict/ooo directory.
- The following line needs to be added to the dictionary.lst file:
  DICT ne NP ne_np
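To make the encoding mismatch concrete, the short sketch below compares how many bytes a Latin letter and a Devanagari letter occupy. It assumes the multi-byte encoding in question is Unicode stored as UTF-8, which the slide does not state explicitly; the helper name is hypothetical.

```python
def utf8_byte_length(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# A Latin letter fits in one byte; a Devanagari letter needs several.
for ch in ("a", "क"):
    print(repr(ch), hex(ord(ch)), utf8_byte_length(ch), "byte(s)")
```

A tool that treats every byte as one character would see the Devanagari letter as several unrelated symbols, which is why a single-byte encoding is used for the dictionary and affix files.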

Sample of a MySpell .aff file
    SFX D y 1
    SFX D y ied    # imply -> implied
    SFX D y 1
    SFX D ल आल ल    # उ ल -> उक ल
Problem: Unicode / Phonetic
    उक ल > V C C> V C C V
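How a suffix rule of this kind is applied can be sketched as follows. This is a simplified illustration, not MySpell's own code: the SuffixRule type and apply_suffix function are hypothetical names, and the condition [^aeiou]y is an assumption modeled on the usual MySpell affix-line layout (strip, add, condition), since the sample above shows only the strip and add fields.

```python
import re
from typing import NamedTuple, Optional


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the stem ("0" means none)
    add: str        # characters appended after stripping
    condition: str  # pattern the end of the stem must match ("." means anything)


def apply_suffix(stem: str, rule: SuffixRule) -> Optional[str]:
    """Return the derived word, or None if the rule does not apply to this stem."""
    if not re.search("(?:" + rule.condition + ")$", stem):
        return None
    strip = "" if rule.strip == "0" else rule.strip
    if not stem.endswith(strip):
        return None
    return stem[: len(stem) - len(strip)] + rule.add


# Hypothetical rule corresponding to "imply -> implied".
y_to_ied = SuffixRule(strip="y", add="ied", condition="[^aeiou]y")
print(apply_suffix("imply", y_to_ied))  # implied
print(apply_suffix("play", y_to_ied))   # None: the condition does not match
```

The same strip-and-add mechanism is what the Nepali rule in the sample relies on, with the strip and add fields holding Devanagari characters in the file's single-byte encoding.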

Nepali Spelling Rule for MySpell
Nepali Suffix Rule
Column headings: Root, Character Position, Suffixes, Built Words
Sub-headings: Initial Character, End Character, Delete, Insert, Doubled, Initial Char, End Char, Flag
Verb(अ) v
    आ (^अ-औ) ~आ अ आ ह स आइ ह स इ
    ऊ (^अ-औ) ~ऊ उ इ दध इल द धल
    ई (^अ-औ) ~ई इ आ भ ख आर भख र
Verb(आ) v
    ज ह ~ज, ह ग, भ ए ज एर गएर ह एर भएर
Verb(इ) v
    क - इ य आ द आइ दय इ ल आइ लय इ