Vibhakti Identification Approach for Sanskrit Nouns


Shweta A. Patil
Information Technology Department, Pillai's Institute of Information Technology, New Panvel, Navi Mumbai, India.

Varunakshi Bhojane
Computer Engineering Department, Pillai's Institute of Information Technology, New Panvel, Navi Mumbai, India.

ABSTRACT

Natural language processing focuses on developing systems that allow computers to communicate with people using everyday language. Sanskrit is considered the mother language of almost all Indo-Aryan Indian languages and is a vast source of knowledge. It is a grammatically well-structured and heavily inflected language; these inflections are called vibhakti. Vibhakti identification is a language processing task in which the original root word is recovered from its inflected form. In Sanskrit, inflected words follow rules, and these rules can be formulated as a computational model for a system that performs vibhakti analysis. This paper describes such a vibhakti analysis system.

Keywords: Natural language processing, Vibhakti, Sanskrit

INTRODUCTION

The goal of natural language processing is to design software that understands the meaning of natural language input and generates meaningful output. Sanskrit is a huge source of knowledge in many fields, such as physics, medicine, mathematics, law, astronomy, philosophy and chemistry. Today, however, that information is not accessible to people who have not had enough exposure to the Sanskrit language. Because Sanskrit is a heritage language, there is a need to digitize and preserve the information written in it. With the help of natural language processing, it is now possible to develop language processing tools that provide access to these Sanskrit texts.

The vibhakti identification system is based on the subanta formulation rules of Panini. In Sanskrit, vibhaktis are formed by combining a root word with suffixes, hence they are part of Sanskrit inflectional morphology. Inflected words used in a sentence carry information about entities in terms of endings, case, stem, number, gender, case relation, etc. Extracting this information is the first step towards understanding the language. According to Panini, there are 21 morphological suffixes which are attached to the nominal bases (pratipadika words) according to syntactic category, gender and end character of the base.

The vibhakti identification system accepts Sanskrit text as input. First it checks for punctuation and removes unwanted punctuation marks. Then the indeclinables, called avyayas, and the verbs, called tinanta, are recognized and filtered out of further processing. The system treats all remaining words as subanta and sends them to the analysis process, which is performed on the basis of an algorithm.

EXISTING SYSTEM

In the past decade, vibhakti analyzers for different languages have been developed. Most of the research work has been done on English language computation, although some research related to Sanskrit has been done and work is under way on many computational toolkits.

1. Analyzer [1]: This analyzer was designed by Dr. Girish Nath Jha, Sudhir K. Mishra and team at JNU, Delhi. The system identifies and analyzes inflected input text. The methodology adopted combines a rule-based approach and an example-based approach. The system analyzes inflected noun forms and verb forms in any given sandhi-free text. The team did comprehensive research on Panini's subanta rules and developed the rule base. The system has some limitations: it does not attempt to provide the gender of the input words, and it gives multiple results in ambiguous cases.

2. TDIL - Morph Analyser [11]: This is a Sanskrit morphological analyser developed through the cooperation of seven institutes. Its developers have not provided supporting documentation describing the methodology or algorithm used to build the system.
This system accepts a single Sanskrit word as input in different encodings, e.g. WX (alphabetic) and Unicode (Devanagari). The analysis provides the original root word with its case, number and gender.

INTRODUCTION TO SUBANTA

In Sanskrit, words are classified into two categories: declinable and indeclinable. Declinable (inflectional) words are those whose base form can be changed or inflected (vibhakti). Inflected nouns are called subanta padas and inflected verbs are called tinanta padas. Nominal inflection deals with the combination of bases with case suffixes. For example, the base form Rama can be inflected in 8 vibhaktis. Declinable words can in turn be categorized as nouns, pronouns, adverbs and verbs. Indeclinable words are those which do not change their form under any inflection; in Sanskrit they are called avyaya. Figure 1 illustrates the taxonomy of Sanskrit words.

Fig. 1. Classification of word [13]

Vibhaktis are formed by inflecting the stems. According to Panini, there are 21 morphological suffixes (seven vibhaktis combined with three numbers = 21) which are attached to the nominal bases (pratipadika) according to syntactic category, gender and end character of the base [4]. For example, the root word राम has the inflected form रामः, etc.

VIBHAKTI IDENTIFICATION SYSTEM

The proposed system is designed to identify and analyze inflected noun words. It performs several tasks to recognize the correct root word and provide its syntactic information.

Fig. 2. Recognition process (blocks: Input Text, Pre-Processor, Categorization, Recognizer, Analyzer, Analysis; data sources: Verb, Avyaya, Exception, Noun, Rules)

Input text: The system takes input text in the Sanskrit language. It can be a word, a sentence or a paragraph. For example: क ष णचन द र न मकश च त अस त

Recognition of punctuations: In this phase the input text is filtered to remove all punctuation. Special characters and numbers are also treated as punctuation.
Example: र &&@?[[म:, द *%@व: becomes र म: द व:

Words categorization: This is a grammatical categorization process in which the input words are classified under three basic categories: nouns, verbs and indeclinables (avyaya).

Avyaya recognition: If an input word is found in the avyaya database, it is marked as avyaya and is not sent to the subanta analyzer for further processing. Examples: कश च त, ऄथव, आत, श चववत, आद न म

Verb recognition: The verb database is a stored list of verbs. If an input word is found in the verb set, it is labeled as a verb and thus excluded from vibhakti analysis. Example: in क ष णचन द र न मकश च त अस त, the word अस त is labeled अस त_VERB.

Noun recognition: The noun words in the given Sanskrit input are identified by this filtration process.

Exception dataset: All noun forms which cannot be analyzed according to any rule are stored in the exception dataset. Example: ऄहम = ऄवमद प रथम प र ष रथम एकवचन

Analysis:

Formation of vibhakti takes place by combining a Sanskrit root word with suffixes. In vibhakti analysis, the same rules are applied in reverse to identify the original root word from its inflected form. Most vowel-ending words follow generalized rules, which change slightly according to gender and ending character.

ALGORITHM FOR SPLITTING THE INFLECTED NOUN

Every word in a given sentence is processed and split to find the root word and its corresponding suffix. The algorithm requires three databases.

Noun_suffix: This dataset is the collection of all possible noun suffixes, each mapped to a key value. A key is formed by 4 digits, and each digit represents a piece of syntactic information. For example, ':' is a suffix in the Noun_suffix dataset and is stored with its key value.

Key_Mapping: This dataset stores the meaning of each digit in the key. For example, if the key value is 1111, then its mapping is:
1: a ending (अकारान्त)
1: masculine (पुल्लिङ्ग)
1: प्रथमा
1: एकवचन

Noun: This dataset stores all possible nouns and is used to check whether an analysis is correct.

Algorithm:
Step 1: Input word w.
Step 2: Scan w from the right-hand side to find the suffix and match it against the Noun_suffix dataset.
Step 3: If a match is found in the Noun_suffix dataset:
  a) Store the suffix in the set suffix list.
  b) Store its respective number in the set key.
  c) Extract the first and second most significant digits from the number and store them in k1 and k2 respectively.
Step 4: Depending on the values of k1 and k2, split the word into root word and suffix. Add the character given in Table 1 to the end of the root word.
Step 5: Retrieve the possible results.
Step 6: If more than one solution is obtained, look up each root word in the dictionary. An analysis whose root word is found in the dictionary is a valid root word and the final answer.
Step 7: Send the result to the next block.
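Steps 1 to 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses WX-style romanized input (A = long a, H = visarga, B = bh) instead of Devanagari, the dataset contents are invented toy entries, every key other than 1111 is an assumed value following the four-digit layout described above, and the names `NOUN_SUFFIX`, `ENDING_CHAR`, `NOUNS`, `decode` and `analyze` are hypothetical. The Noun dataset is also assumed to store each noun's gender code so that gender can be checked in Step 6.

```python
from typing import NamedTuple

# Hypothetical toy Noun_suffix dataset: suffix -> list of 4-digit keys.
# Digit order (from Key_Mapping): ending type, gender, vibhakti, number.
NOUN_SUFFIX = {
    "aH": ["1111"],                             # a-ending, masculine, 1st, singular
    "AByAm": ["1132", "1142", "1152", "1332"],  # assumed keys for the dual forms
    "yAm": ["2271"],                            # assumed key: i-ending feminine
}

# Table 1 in WX letters: (k1, k2) -> character appended to the stem.
ENDING_CHAR = {
    (1, 1): "a", (1, 2): "a", (1, 3): "a",
    (2, 1): "i", (2, 2): "i",
    (3, 1): "u", (3, 2): "u",
    (4, 1): "q", (5, 1): "w", (6, 2): "I",
}

# Hypothetical Noun dataset: root word -> gender code (1 masc, 2 fem, 3 neuter).
NOUNS = {"rAma": 1, "mati": 2}

GENDER = {1: "masculine", 2: "feminine", 3: "neuter"}
NUMBER = {1: "singular", 2: "dual", 3: "plural"}


class Analysis(NamedTuple):
    root: str
    suffix: str
    key: str


def decode(key: str) -> dict:
    """Expand a 4-digit key into readable fields (the Key_Mapping lookup)."""
    k1, k2, case, number = (int(d) for d in key)
    return {
        "ending": ENDING_CHAR[(k1, k2)],
        "gender": GENDER[k2],
        "vibhakti": case,
        "number": NUMBER[number],
    }


def analyze(word: str) -> list:
    """Split an inflected noun into candidate (root, suffix, key) analyses."""
    candidates = []
    # Steps 2-3: try every proper suffix of the word against Noun_suffix.
    for i in range(1, len(word)):
        suffix = word[i:]
        for key in NOUN_SUFFIX.get(suffix, []):
            k1, k2 = int(key[0]), int(key[1])
            added = ENDING_CHAR.get((k1, k2))
            if added is None:
                continue
            # Step 4: strip the suffix and restore the ending character.
            candidates.append(Analysis(word[:i] + added, suffix, key))
    # Step 6: with several candidates, keep those confirmed by the Noun dataset.
    if len(candidates) > 1:
        candidates = [c for c in candidates
                      if NOUNS.get(c.root) == int(c.key[1])]
    return candidates


if __name__ == "__main__":
    for w in ("rAmaH", "rAmAByAm"):
        for a in analyze(w):
            print(w, "=", a.root, "+", a.suffix, decode(a.key))
```

With these toy entries, `rAmaH` yields a single analysis, while `rAmAByAm` yields several candidates that the dictionary check narrows to the masculine root `rAma`; the remaining analyses differ only in vibhakti, which the suffix alone cannot disambiguate.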
Table 1. Suffix for a ending words

Value of K1 | Value of K2 | Character added
1 | 1 (अकारान्त पुल्लिङ्ग, a-ending masculine) | अ
1 | 2 (अकारान्त स्त्रीलिङ्ग, a-ending feminine) | अ
1 | 3 (अकारान्त नपुंसकलिङ्ग, a-ending neuter) | अ
2 | 1 (इकारान्त पुल्लिङ्ग, i-ending masculine) | इ
2 | 2 (इकारान्त स्त्रीलिङ्ग, i-ending feminine) | इ
3 | 1 (उकारान्त पुल्लिङ्ग, u-ending masculine) | उ
3 | 2 (उकारान्त स्त्रीलिङ्ग, u-ending feminine) | उ
4 | 1 (ऋकारान्त पुल्लिङ्ग, ṛ-ending masculine) | ऋ
5 | 1 (तकारान्त पुल्लिङ्ग, t-ending masculine) | त्
6 | 2 (ईकारान्त स्त्रीलिङ्ग, ī-ending feminine) | ई

For example, if the word is रामः, then the suffix list is {':'} and Key = {1111}, where k1 = 1 and k2 = 1; the word is therefore identified as an a-ending word. If we split the word at the suffix ':', we get the root word राम्, which becomes राम after adding अ at the end, and the suffix obtained is ':'.

Let us examine one more case. When the word is रामाभ्याम्, the suffix list is { 'म ', ' म ', 'य म ', 'य म ', 'भ य म ', ' भ य म ', ' भ य म ', ' भ य म ' } and Key = { }. When k1 and k2 are analyzed, it is observed that all the values come from a-ending words, since k1 = 1, and that the gender can be masculine, feminine or neuter, since k2 takes the values 1, 2 and 3. After adding the end character अ, the analysis gives the correct word from the noun database, राम, which is masculine. The split of रामाभ्याम् is as follows: रामाभ्याम् = राम + भ्याम्

Vibhakti Analysis

The final result of vibhakti analysis contains the root word, the end character of the root word, its vibhakti number and case, and the gender of the word. For example:
क ष णचन द र: क ष णचन द र (root word), अकारान्त, प्रथमा, एकवचन, पुल्लिङ्ग

RESULTS AND DISCUSSION

The proposed vibhakti identification system was tested using different Sanskrit inputs. In this result analysis, multiple documents were given as input. The results obtained by the system are plotted for three documents; Figure 3 shows the result.

Fig. 3. Results of vibhakti identification system

In further analysis, precision and recall are calculated in order to measure the accuracy of the system. Figure 4 shows the result analysis of the system. The possible classification cases are as follows:
True Positives (TP): number of positive examples labeled as positive.
False Positives (FP): number of negative examples labeled as positive.
True Negatives (TN): number of negative examples labeled as negative.
False Negatives (FN): number of positive examples labeled as negative.

Figure 5. Result Analysis of Proposed System (chart labels: Verb, Avyaya, Exceptions, Nouns; Accuracy, Precision, Recall; Doc 1, Doc 2, Doc 3)

Precision: how many of the returned results are correct.
Precision = TP / (TP + FP)
Recall: how many of the positives the system returns.
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)

CONCLUSION

Sanskrit is a huge repository of knowledge in different fields. To provide exposure to this knowledge, Sanskrit computational tools are required. Vibhakti identification of a Sanskrit sentence is the basic requirement for processing Sanskrit text. Inflections in a sentence carry information about each word in terms of stem, endings, gender, case and number. Extracting and annotating this information is the first step towards understanding the language. Vibhakti identification is therefore a basic tool needed for any NLP application. In vibhakti identification, each input word is processed to generate its analysis. This system processes sandhi-free text. One of the main tasks in the analysis is identifying the correct root word from its inflected form. The system depends on a rule base and other linguistic data sources. It identifies root words by using Panini's rules. Root words are displayed with their syntactic information, that is, case, number, gender and end character. These results are very useful in many Sanskrit-based NLP applications.
The accuracy of the system depends on appropriate vibhakti rules and effective datasets.

REFERENCES

[1] Girish Nath Jha, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar Mishra, Manji Bhadra, Surjit K. Singh, "Inflectional Morphology Analyser for Sanskrit", L.S.I. at Hyderabad University, Hyderabad, pp
[2] Girish Nath Jha, Subhash, "Morphological analysis of nominal inflections in Sanskrit", Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi-67.
[3] Smita Selot, A.S. Zadgaonkar and Neeta Tripathi, "Pada Analyzer for Sanskrit", Oriental Journal of Computer Science & Technology, May 15,
[4] Akshar Bharati, Amba Kulkarni, V Sheeba, "Building a Wide Coverage Sanskrit Morphological Analyzer: A Practical Approach", IIT Kanpur,
[5] Pawan Goyal, Vipul Arora, Laxmidhar Behera, "Analysis of Sanskrit Text: Parsing and Semantic Relations", IIT Kanpur,
[6] Akshar Bharati, Amba Kulkarni, "Sanskrit and Computational Linguistics", Hyderabad, 30th Oct
[7] N. Murali, Dr. R.J. Ramasree and Dr. K.V.R.K. Acharyulu, "Kridanta Analysis for Sanskrit", International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 3, June

[8] Huet, Gerard, "Parsing Sanskrit by Computer", XIIth World Sanskrit Conference, Helsinki.
[9] N. Shailaja, "Parser for Simple Sanskrit Sentences", M.Phil. dissertation submitted to University of Hyderabad,

Websites
[10]
[11]
