Nepali Spellchecking Laxmi Prasad Khatiwada Linguist Nepali Language Computing Project Madan Puraskar Pustakalaya www.mpp.org.np <laxmi@mpp.org.np>
Contents Relation of spell checking and linguistics Error allocation in spelling Phonology Characters of Nepali language Morphophonemic Types of error allocation Modeling orthographic rules in Nepali Implementation in OOo Nepali spelling rule for myspell
Relation of Spellchecking and linguistics
[Diagram; labels: Language Evolution, Pragmatics, Idea, Language Acquisition, Natural Language Processing, Semantics, Syntax, Morphology, Phonology, Grammar, Concept, Word, Speech, Lexicography]
Error allocation in spelling Phonemic Morphemic Lexical Syntactic Semantic Pragmatic
Phonology What is phonology? The study of speech sounds (Represented by phonemes) in language or a language with reference to their distribution
Characters of Nepali Language Vowels Consonants Unvoiced Voiced Nasal Velar Affricates Other Consonant Palatal Dental Labial
Morphology What is morphology? The study of words (represented by morphemes) in language or a language with reference to their distribution. Morphemes: the smallest units of a language that carry meaning. A word can consist of one or more morphemes.
Types of morphemes Free morpheme: can occur as a simple word. Example: घर, गर, फर क,... Bound morpheme: can only occur in connection with other morphemes. Example: -अ स,-न, आर...
Sandhi, i.e. Morphophonemics
Philosophy: When two different sounds are uttered in close proximity, they join to give a new sound to the listener.
Rules are enumerated in full detail and with mathematical rigour, just like the rewrite rules of Finite State Transducers:
Aα + βB → AγB
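A rewrite rule of this shape can be sketched in a few lines. The rule used here (a stem-final inherent "a" followed by an initial इ or ई is rewritten as the vowel sign े) and the example word are assumptions chosen for illustration; they are not taken from the slides, and the function names are hypothetical.

```python
# Illustrative sandhi rewrite at a morpheme boundary: a + i/ii -> e.
# The rule and the example are assumptions for illustration only.

E_SIGN = "\u0947"                 # dependent vowel sign E
I_VOWELS = {"\u0907", "\u0908"}   # independent vowels I and II


def ends_in_inherent_a(morpheme: str) -> bool:
    """True if the last code point is a bare consonant (KA..HA),
    which carries the inherent vowel 'a'."""
    return bool(morpheme) and "\u0915" <= morpheme[-1] <= "\u0939"


def join_with_sandhi(left: str, right: str) -> str:
    """Apply the rewrite A.a + i.B -> A.e.B; otherwise concatenate."""
    if ends_in_inherent_a(left) and right[:1] in I_VOWELS:
        return left + E_SIGN + right[1:]
    return left + right


print(join_with_sandhi("देव", "इन्द्र"))
```

A full treatment would hold many such rules and try them in order, which is where a finite-state transducer becomes the more natural representation.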
How do spelling mistakes occur? 1. Keyboard Adjacencies 2. Shift-key Characters 3. Phonetic Similarity 4. Visual Similarity
Types of error allocation
Typographic errors occur when the writer knows the correct spelling of a word but mistypes it; these errors are mostly related to the keyboard.
Cognitive errors (also called orthographic errors) occur when the writer does not know, or has forgotten, the correct spelling of a word or words.
Typographic errors
The writer mistypes the word; these errors are mostly related to the keyboard.
Shift-key Characters: कसरत/कसर त धनध न य/धनध य
Phonetic Similarity: फ र /फ र त र/ तर
Visual Similarity: प च/प च स ग/स ग
Cognitive errors (also called orthographic errors)
These occur when the writer does not know, or has forgotten, the correct spelling of a word or words.
अ थ ई / अ थ य अग य न / अज ञ न न द / न
Error correction Automatic: The error is replaced by its correction automatically, without user intervention. Automatic error correction is a requirement for speech processing and other NLP (Natural Language Processing) systems.
Lexicon lookup Interactive: The user interactively selects one of the suggested corrections as the replacement. The spellchecker can offer multiple correction suggestions.
Single Error Correction Technique Errors and correction 1. Single letter insertion 2. Single letter deletion 3. Single letter substitution 4. Single letter transposition
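The four single-error operations can be sketched as a candidate generator combined with a lexicon lookup. This is a minimal illustration, not the project's implementation: the function names, the toy lexicon, and the alphabet are assumptions. It works on Unicode code points, so in Devanagari a "letter" here is a single code point rather than a full syllable.

```python
def single_edit_candidates(word, alphabet):
    """Return every string one insertion, deletion, substitution,
    or transposition away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}


def suggest(word, lexicon, alphabet):
    """Return the word itself if it is in the lexicon, otherwise the
    lexicon entries reachable by a single edit (sorted)."""
    if word in lexicon:
        return [word]
    return sorted(single_edit_candidates(word, alphabet) & lexicon)


# Hypothetical toy lexicon; the alphabet is taken from its characters.
lexicon = {"घर", "गर", "कसरत"}
alphabet = set("".join(lexicon))
print(suggest("कसतर", lexicon, alphabet))  # transposed last two letters
```

Returning a list rather than a single word fits the interactive mode, where the user picks among several suggestions.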
Morphological system
Inflectional (रूपायक)
Derivational (व्युत्पादक)
Inflection + ओ ब ठ प. व. ब ( व) + ई ब ठ. व र प यन + आ ब ठ व. बव
Derivational Affixation सग प त Prefix प व सग व+कल प = वकल प Compound सम सप त ल म +ख ट ट =ल मख ट ट Duplication त वप त झल +झल +ई =झलझल Suffix परसग Primary क त सग ल ख +ओट= ल ख ट Secondary तसग न प ल+ई =न प ल
Modeling Orthographic Rules: English
Spelling changes at morpheme boundaries:
bus + s → buses, watch + s → watches, fly + s → flies, make + ing → making
Rules:
E-insertion takes place if the stem ends in s, z, ch, sh, etc.
y maps to ie when the plural marker s is added.
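The English rules above can be sketched as a small function. This is a simplified illustration with a hypothetical function name: treating x as one of the "etc." endings, restricting the y rule to a consonant before y, and the e-deletion rule behind make + ing → making are assumptions, and real English has exceptions that the sketch does not handle.

```python
import re


def attach_suffix(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying a few boundary spelling rules."""
    if suffix == "s":
        # y -> ie before the plural marker s (consonant + y only: assumption)
        if re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ies"
        # E-insertion after s, z, ch, sh (x added as an assumption)
        if re.search(r"(s|z|x|ch|sh)$", stem):
            return stem + "es"
    # e-deletion before -ing (assumed rule behind make + ing -> making)
    if suffix == "ing" and re.search(r"[^aeiou]e$", stem):
        return stem[:-1] + "ing"
    return stem + suffix


for stem, suffix in [("bus", "s"), ("watch", "s"), ("fly", "s"), ("make", "ing")]:
    print(stem, "+", suffix, "=", attach_suffix(stem, suffix))
```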
Modeling Orthographic Rules: Nepali
1. Single letter insertion
बज उ+ छ = बज उ छ / ख + छ = ख न छ
द + आइ = दय इ / र + आइ = र व इ
2. Single letter deletion
उ ल + आल = उक ल / उ ल + आल = उक ल
3. Single letter substitution
क ट + आइ = कट इ / भ ग + आइ = भग इ
4. Transposition
उ ल +आ = उल क / वर म +ई = वम र
Implementation in OpenOffice.org
OpenOffice.org (OOo) uses MySpell for spellchecking.
MySpell works only with single-byte character encodings.
Nepali characters are encoded in a multi-byte encoding system, so MySpell cannot work on them directly in OOo.
Both the dictionary file and the affix file have to be saved in ISCII-DEVANAGARI encoding.
For Nepali, the dictionary file is saved as ne_np.dic and the affix file as ne_np.aff.
Both files are stored in the <OOo_dir>/share/dict/ooo directory.
The following line needs to be added to the dictionary.lst file:
DICT ne NP ne_np
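The single-byte limitation can be illustrated with a short check. The sketch assumes the multi-byte encoding in question is Unicode stored as UTF-8, which the slide does not state explicitly; the function name is hypothetical.

```python
def utf8_byte_lengths(text: str):
    """Pair each character with the number of bytes it takes in UTF-8."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


word = "घर"
print(utf8_byte_lengths(word))
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")
```

A tool that treats every byte as one character sees such a word as a longer sequence of unrelated symbols, which is why the files are converted to a single-byte encoding such as ISCII before MySpell reads them.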
Sample of MySpell .aff file
SFX D y 1
SFX D y ied # imply -> implied
SFX D y 1
SFX D ल आल ल # उ ल -> उक ल
Problem: Unicode Phonetic
उक ल > V C C> V C C V
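MySpell suffix rule lines have the layout "SFX flag strip add condition", where a strip value of 0 means that nothing is removed. How one such rule is applied can be sketched as follows. The function name is hypothetical, and the condition [^aeiou]y used in the example is an assumption, since the sample above shows no condition field. Conditions are treated as regular expressions here, which only approximates MySpell's own simpler pattern syntax.

```python
import re


def apply_suffix_rule(word, strip, add, condition="."):
    """Apply one MySpell-style suffix rule.

    Returns the derived form, or None if the rule does not apply.
    """
    # The condition must match the end of the stem.
    if not re.search("(?:" + condition + ")$", word):
        return None
    # "0" means nothing is stripped before the suffix is added.
    if strip != "0":
        if not word.endswith(strip):
            return None
        word = word[: len(word) - len(strip)]
    return word + add


print(apply_suffix_rule("imply", "y", "ied", "[^aeiou]y"))
print(apply_suffix_rule("play", "y", "ied", "[^aeiou]y"))
```

Because the function compares code points rather than bytes, it works the same way on Devanagari strings; the byte-oriented limitation applies to MySpell itself, not to this sketch.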
Nepali Spelling rule for MySpell
Nepali Suffix Rule
Root Character Position Suffixes Built Words Initial Character End Character Delete Insert Doubled Initial Char End Char Flag
Verb(अ) v आ (^अ-औ) ~आ अ आ ह स आइ ह स इ
ऊ (^अ-औ) ~ऊ उ इ दध इल द धल
ई (^अ-औ) ~ई इ आ भ ख आर भख र
Verb(आ) v ज ह ~ज, ह ग, भ ए ज एर गएर ह एर भएर
Verb(इ) v क - इ य आ द आइ दय इ ल आइ लय इ