Vibhakti Identification Approach for Sanskrit Nouns

Shweta A. Patil
Information Technology Department, Pillai's Institute of Information Technology, New Panvel, Navi Mumbai, India.

Varunakshi Bhojane
Computer Engineering Department, Pillai's Institute of Information Technology, New Panvel, Navi Mumbai, India.

ABSTRACT
Natural language processing focuses on developing systems that allow computers to communicate with people in everyday language. Sanskrit is considered the mother language of almost all Indian (Indo-Aryan) languages and is a vast source of knowledge. It is grammatically well structured and heavily inflected; these inflections are called vibhakti in Sanskrit. Vibhakti identification is a language processing task in which the original root word is recovered from its inflected form. In Sanskrit, inflected words follow rules, and these rules can be formulated as a computational model for a system that performs vibhakti analysis. This paper describes such a vibhakti analysis system.

Keywords
Natural language processing, Vibhakti, Sanskrit

INTRODUCTION
The goal of natural language processing is to design software that understands the meaning of natural language input and generates meaningful output. Sanskrit is a huge source of knowledge in many fields, such as physics, medicine, mathematics, law, astronomy, philosophy and chemistry. Today, however, that information is not accessible to people who do not have enough exposure to the Sanskrit language. Because Sanskrit is a heritage language, there is a need to digitize and preserve the information written in it. With the help of natural language processing, it is now possible to develop language processing tools that provide access to these Sanskrit texts.

The vibhakti identification system is based on Panini's subanta formulation rules. In Sanskrit, vibhaktis are formed by combining a root word with suffixes, so they are part of Sanskrit inflectional morphology. Inflected words used in a sentence carry information about entities in terms of ending, case, stem, number, gender, case relation and so on. Extracting this information is the first step towards understanding the language. According to Panini, there are 21 morphological suffixes, which are attached to the nominal bases (pratipadika) according to syntactic category, gender and end character of the base.

The vibhakti identification system accepts Sanskrit text as input. It first checks for punctuation and removes unwanted punctuation marks. Then the indeclinables, called avyayas, and the verbs, called tinanta, are recognized and filtered out of further processing. After these words are recognized, the system treats all remaining words as subanta and sends them to the analysis process. The analysis is performed by the algorithm described in this paper.

EXISTING SYSTEM
In the past decade, vibhakti analyzers have been developed for different languages. Most of the research has been done on English language computation, although some research on Sanskrit has been carried out and work is ongoing on several computational toolkits.

1. Analyzer [1]
This analyzer was designed by Dr. Girish Nath Jha, Sudhir K. Mishra and their team at JNU, Delhi. The system identifies and analyzes inflected input text, using a rule-based approach combined with an example-based approach. It analyzes inflected noun forms and verb forms in any given sandhi-free text. The authors carried out comprehensive research on Panini's subanta rules and on developing the rule base. The system has some limitations: it does not attempt to provide the gender of the input words, and it gives multiple results in ambiguous cases.

2. TDIL - Morph Analyser [11]
This is a Sanskrit morphological analyser developed in cooperation by seven institutes. No supporting document is provided that describes which methodology or algorithm was used to develop the system.
This system accepts a single Sanskrit word as input in different encodings, e.g. WX (alphabetic) or Unicode (Devanagari). The analysis provides the original root word with its case, number and gender.

INTRODUCTION TO SUBANTA
In Sanskrit, words are classified into two categories, declinable and indeclinable. Declinable or inflectional words are those whose base form can be changed or inflected (vibhakti). Inflected nouns are called subanta padas and inflected verbs are called tinanta padas. Nominal inflection deals with the combination of bases with case suffixes. For example, the base form Rama can be inflected in 8 vibhaktis. Declinable words can be further categorized as nouns, pronouns, adverbs and verbs. Indeclinable words are those that do not change their form under any inflection; in Sanskrit they are called avyaya. Figure 1 illustrates the taxonomy of Sanskrit words.

Fig. 1. Classification of word [13]

Vibhaktis are formed by inflecting the stems. According to Panini, there are 21 morphological suffixes (seven vibhaktis combined with three numbers = 21), which are attached to the nominal bases (pratipadika) according to syntactic category, gender and end character of the base [4]. For example, the root word राम has the inflected form रामः.

VIBHAKTI IDENTIFICATION SYSTEM
The proposed system is designed to identify and analyze inflected noun words. It performs several tasks to recognize the correct root word and provide its syntactic information.

Input text: The system takes input text in the Sanskrit language. It can be a word, a sentence or a paragraph. For example: कृष्णचन्द्रः नाम कश्चित् अस्ति

Fig. 2. Recognition process (blocks: Input Text, Pre-Processor, Categorization, Recognizer, Analyzer, Analysis; data sources: Verb, Avyaya, Exception, Noun, Rules)

Recognition of punctuations: In this phase the input text is filtered to remove all punctuation. Special characters and numbers are also treated as punctuation.
For example, the input रा&&@?[[मः, दे*%@वः is cleaned to रामः देवः.

Words categorization: This is a grammatical categorization process in which input words are classified under three basic categories: nouns, verbs and indeclinables (avyaya).

Avyaya recognition: If an input word is found in the avyaya database, it is marked as avyaya and is not sent to the subanta analyzer for further processing. Examples: कश्चित्, अथवा, इति, श चववत आद न म

Verb recognition: The verb database is a stored list of verbs. If an input word is found in the verb set, it is labeled as a verb and excluded from vibhakti analysis. For example, in कृष्णचन्द्रः नाम कश्चित् अस्ति the word अस्ति is labeled अस्ति_VERB.

Noun recognition: The noun words in the given Sanskrit input are identified by this filtration process.

Exception dataset: All noun forms that cannot be analyzed according to any rule are stored in this dataset. Example: अहम् = अस्मद्, प्रथम पुरुष, प्रथमा, एकवचन

Analysis:

A vibhakti is formed by combining a Sanskrit root word with a suffix. In vibhakti analysis, the formation rules are applied in reverse to identify the original root word from its inflected form. Vowel-ending words mostly follow generalized rules, which change slightly according to gender and ending character.

ALGORITHM FOR SPLITTING THE INFLECTED NOUN
Every word in a given sentence is processed and split to find the root word and its corresponding suffix. The algorithm requires three datasets.

Noun_suffix: This dataset is the collection of all possible noun suffixes, each mapped to a unique key value. The key consists of 4 digits, and each digit represents one piece of syntactic information. For example, ः is a suffix in the Noun_suffix dataset and its key value is 1111.

Key_Mapping: This dataset stores the meaning of each digit in the key. For example, if the key value is 1111, its mapping is:
1: a-ending (अकारान्त)
1: masculine (पुँल्लिङ्ग)
1: प्रथमा
1: एकवचन

Noun: This dataset stores all possible nouns and is used to check whether an analysis is correct.

Algorithm:
Step 1: Input word w.
Step 2: Scan w from the right-hand side to find the suffix and match it against the Noun_suffix dataset.
Step 3: If a match is found in the Noun_suffix dataset:
a) Store the suffix in the set suffix list.
b) Store its respective number in the set key.
c) Extract the first and second most significant digits from the number and store them in k1 and k2 respectively.
Step 4: Depending on the values of k1 and k2, split the word into root word and suffix. Add the character given in Table 1 at the end of the root word.
Step 5: Retrieve the possible results.
Step 6: If more than one solution is obtained, look up the root word in the dictionary. An analysis whose root word is found in the dictionary is a valid root word and the final answer.
Step 7: Send the result to the next block.
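The pre-processing, categorization and splitting steps described above can be sketched roughly as follows. This is only a minimal illustration under several assumptions, not the authors' implementation: words are written in a simple ASCII romanization (capital `A` for long ā, `H` for visarga) so that the stem-final vowel can be handled as an ordinary character; the contents of `AVYAYA`, `VERBS`, `NOUNS` and `NOUN_SUFFIX` are tiny hypothetical samples; the names `clean`, `categorize`, `split_noun` and `Analysis` are invented for the example; and the dictionary check in Step 6 is assumed to compare the gender digit k2 with a gender stored for each noun.

```python
import re
from typing import NamedTuple

# Hypothetical sample datasets (assumptions, not the paper's actual data).
AVYAYA = {"iti", "athavA"}             # indeclinables
VERBS = {"asti"}                       # verb (tinanta) forms
NOUNS = {"rAma": "1"}                  # root word -> gender digit (1 = masculine)
NOUN_SUFFIX = {                        # suffix -> 4-digit keys
    "aH": ["1111"],
    "AbhyAm": ["1132", "1142", "1152", "1232"],
}
# Table 1: (k1, k2) -> character added at the end of the root word.
END_CHAR = {
    ("1", "1"): "a", ("1", "2"): "a", ("1", "3"): "a",
    ("2", "1"): "i", ("2", "2"): "i",
    ("3", "1"): "u", ("3", "2"): "u",
    ("4", "1"): "R", ("5", "1"): "t", ("6", "2"): "I",
}


class Analysis(NamedTuple):
    root: str
    suffix: str
    key: str


def clean(text: str) -> str:
    """Remove digits, punctuation and special characters; keep letters and spaces."""
    return re.sub(r"[^A-Za-z\s]", "", text)


def categorize(word: str) -> str:
    """Filter out avyayas and verbs; every remaining word is treated as a noun."""
    if word in AVYAYA:
        return "avyaya"
    if word in VERBS:
        return "verb"
    return "noun"


def split_noun(word: str) -> list[Analysis]:
    candidates = []
    # Step 2: scan from the right-hand side, trying ever longer suffixes.
    for n in range(1, len(word)):
        suffix = word[-n:]
        # Step 3: every key stored for a matching suffix gives one candidate.
        for key in NOUN_SUFFIX.get(suffix, []):
            end_char = END_CHAR.get((key[0], key[1]))   # k1, k2
            if end_char is not None:
                # Step 4: strip the suffix and add the Table 1 character.
                candidates.append(Analysis(word[:-n] + end_char, suffix, key))
    # Step 6: with several candidates, keep those confirmed by the dictionary.
    if len(candidates) > 1:
        confirmed = [c for c in candidates if NOUNS.get(c.root) == c.key[1]]
        return confirmed or candidates
    return candidates


for token in clean("rA&&@?[[maH, de*%@vaH").split():
    print(token, categorize(token), split_noun(token))
```

With this sample data, `split_noun("rAmAbhyAm")` keeps the three masculine analyses (keys 1132, 1142 and 1152) and drops the feminine key 1232, because the dictionary lists the root as masculine. The remaining ambiguity is expected, since the same dual form serves three vibhaktis.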
Table 1. Suffix for a-ending words

Value of K1    Value of K2                      Character added
1              1 (अकारान्त पुँल्लिङ्ग)            अ
1              2 (अकारान्त स्त्रीलिङ्ग)           अ
1              3 (अकारान्त नपुंसकलिङ्ग)          अ
2              1 (इकारान्त पुँल्लिङ्ग)            इ
2              2 (इकारान्त स्त्रीलिङ्ग)           इ
3              1 (उकारान्त पुँल्लिङ्ग)            उ
3              2 (उकारान्त स्त्रीलिङ्ग)           उ
4              1 (ऋकारान्त पुँल्लिङ्ग)            ऋ
5              1 (तकारान्त पुँल्लिङ्ग)            त्
6              2 (ईकारान्त स्त्रीलिङ्ग)           ई

For example, if the word is रामः, then suffix database = {'ः'} and Key = {1111}, where k1 = 1 and k2 = 1; the word is therefore identified as an a-ending word. If we split the word at the suffix ः, we get the root राम्, which becomes राम after adding अ at the end, and the suffix obtained is ः.

Let us examine one more case. When the word is रामाभ्याम्, then suffix database = { 'म ' ' म ' 'य म ' 'य म ' 'भ य म ' ' भ य म ' ' भ य म ' ' भ य म '} and Key = {1121 1321 1252 2271 1232 1132 1142 1152 1232}. When k1 and k2 are analyzed, the values are observed to come from a-ending words, as k1 = 1, and the gender can be masculine, feminine or neuter, as k2 takes the values 1, 2 and 3. After adding the end character, i.e. अ, the analysis gives the correct word from the noun database, राम, which is masculine. The split of रामाभ्याम् is as follows:
रामाभ्याम् = राम + भ्याम्

Vibhakti Analysis
The final result of vibhakti analysis contains the root word, the end character of the root word, its vibhakti number and case, and the gender of the word.
कृष्णचन्द्रः: कृष्णचन्द्र (root word), अकारान्त, प्रथमा, एकवचन, पुँल्लिङ्ग

RESULTS AND DISCUSSION
The proposed vibhakti identification system was tested using different Sanskrit inputs. In this result analysis, multiple documents were given as input. The results obtained by the system are plotted for three documents; Figure 3 shows the result.

Fig. 3. Results of vibhakti identification system

For further analysis of the accuracy of the system, precision and recall were calculated. Figure 4 shows the result analysis of the system. The possible classification cases are as follows:
True Positives (TP): number of positive examples, labeled as such.
False Positives (FP): number of negative examples, labeled as positive.
True Negatives (TN): number of negative examples, labeled as such.
False Negatives (FN): number of positive examples, labeled as negative.

Fig. 4. Result analysis of proposed system (chart labels: Verb, Avyaya, Exceptions, Nouns; Accuracy, Precision, Recall; Doc 1, Doc 2, Doc 3)

Precision: how many of the returned documents are correct. Precision = TP/(TP+FP)
Recall: how many of the positives the system returns. Recall = TP/(TP+FN)
Accuracy = (TP+TN)/(TP+FP+FN+TN)

CONCLUSION
Sanskrit is a huge repository of knowledge in different fields. Sanskrit computational tools are required to provide exposure to this knowledge. Vibhakti identification of a Sanskrit sentence is a basic requirement for processing Sanskrit text. Inflections in a sentence carry information about a word in terms of stem, ending, gender, case and number. Extracting and annotating this information is the first step towards understanding the language. Vibhakti identification is therefore a basic tool needed for any NLP application. In vibhakti identification, each input word is processed to generate its analysis. The system processes sandhi-free text. One of the main tasks in the analysis is identifying the correct root word from its inflected form. The system depends on a rule base and other linguistic data sources. It identifies root words using Panini's rules. Root words are displayed with their syntactic information, that is, case, number, gender and end character. These results are very useful in many Sanskrit-based NLP applications.
The accuracy of the system depends on appropriate vibhakti rules and effective datasets.

REFERENCES
[1] Girish Nath Jha, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar Mishra, Manji Bhadra, Surjit K. Singh, "Inflectional Morphology Analyser for Sanskrit", L.S.I. at Hyderabad University, Hyderabad, pp-34, 2008.
[2] Girish Nath Jha, Subhash, "Morphological analysis of nominal inflections in Sanskrit", Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi-67.
[3] Smita Selot, A.S. Zadgaonkar and Neeta Tripathi, "Pada Analyzer For Sanskrit", Oriental Journal of Computer Science & Technology, May 15, 2010.
[4] Akshar Bharati, Amba Kulkarni, V Sheeba, "Building a Wide Coverage Sanskrit Morphological Analyzer: A Practical Approach", IIT Kanpur, 2009.
[5] Pawan Goyal, Vipul Arora, Laxmidhar Behera, "Analysis of Sanskrit Text: Parsing And Semantic Relations", IIT Kanpur, 2009.
[6] Akshar Bharati, Amba Kulkarni, "Sanskrit and Computational Linguistics", Hyderabad, 30th Oct 2007.
[7] N. Murali, Dr. R.J. Ramasreee and Dr. K.V.R.K. Acharyulu, "Kridanta Analysis for Sanskrit", International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 3, June 2014.
[8] Huet, Gerard, "Parsing Sanskrit by Computer", XIIth World Sanskrit Conference, Helsinki.
[9] N. Shailaja, "Parser for Simple Sanskrit Sentences", M.Phil. Dissertation submitted to University of Hyderabad, 2009.

Websites
[10] http://sanskrit.jnu.ac.in
[11] http://tdil-dc.in/san/morph/morph.html