A Toolkit for Sanskrit Processing


Hasmukh M Parmar
Computer Science and Automation, Indian Institute of Science, Bangalore
Advisor: Prof. K. Gopinath

Abstract

Many tools are available for linguistic analysis and natural language processing (NLP) of English, such as Unix diff and Latent Dirichlet Allocation. For Sanskrit and other Indian languages, however, tool support is still insufficient: consider, for example, supporting sandhi as part of morphological analysis. We have implemented a toolkit for Indian language processing and Sanskrit linguistic analysis of documents in the Unicode character set. The toolkit provides important functionality such as word search in sandhied (Sanskrit) text and Unicode text validation (syntax). Using this toolkit we have studied three use cases, and we have implemented a digital Aṣṭādhyāyī in Java for learning. In the first use case, we compute the frequency of saṃyuktākṣaras to investigate alternative syllable-level coding schemes: by analysing the syllable frequencies of different documents, such as crawled web pages and text files, we designed a syllable-level coding scheme and compared it with Unicode in terms of search and sorting algorithm complexity as well as natural language processing. In the second use case, we discuss additive and subtractive models for encoding. In the third use case, we discuss the analysis of Sanskrit using statistical techniques and the various preprocessing problems that must be solved to make Sanskrit text suitable for such techniques.

Keywords: Sanskrit, Sandhi, Morphology, Unicode

I. INTRODUCTION

Indic languages belong to four major families: Indo-Aryan (or Indo-European), Dravidian, Austro-Asiatic and Sino-Tibetan. The majority of languages belong to the Indo-Aryan and Dravidian families. Indic writing systems are orthographic and combine phonetic and syllabic systems; each syllable is formed by a combination of vowels and consonants. Indic scripts originate from one script system, Brahmi. Consequently, some Indic languages share the same script (Hindi, Sanskrit, Marathi, Gujarati, etc.) and others have scripts that are very similar (Tamil-Malayalam, Kannada-Telugu). Linguistically, India is a unique country: no other region has a comparable variety of distinct languages and scripts. Apart from some shared general characteristics, they are different enough that developers should understand their individual characteristics.

Documents in Indian languages are available in different forms: scanned documents, web pages, and other files (.doc, .txt, etc.). Scanned documents are important because most Indian literature is available only in scanned form, but for the time being we do not consider them, since extracting text from a scanned page requires OCR and other pre-processing. We have focused only on web pages and files (.doc, .txt, etc.). Several groups are working on Indian language processing and Sanskrit linguistic analysis, and they all require the same basic functionality. As this functionality is not available as tools or libraries, everyone has to spend valuable time re-implementing it. To analyse and process these documents, we have therefore implemented a toolkit.
This toolkit offers functionality for the Devanagari script: checking whether a character is a consonant, checking whether a character is nasal, checking for nasalization, Unicode text validation (syntax), word search in sandhied (Sanskrit) text, processing segments of Pāṇini's Śivasūtra, segmenting text into syllables, transliteration from the International Alphabet of Sanskrit Transliteration (IAST) to Unicode Devanagari, and so on. The toolkit follows Unicode. Using it, we have analysed the problems related to Indian languages discussed hereafter.

The multilingual, multi-cultural environment in India demands new approaches to computing with Indian languages. There is a need for a coding scheme across the country that deals with information common to all the regions. That is, we need to define a standard coding scheme that satisfies the following requirements [1]:

1. Accommodate all basic characters. All the basic vowels and consonants must be included in the code space. All the symbols that carry information about text (Vedic symbols, accounting symbols, etc.) must be coded. Punctuation marks and the ten numerals must be included in the code space.

2. Lexical ordering. A meaningful ordering of the vowels and consonants helps in text processing. The coding structure should reflect linguistic information: codes should be assigned to the basic vowels and consonants in such a way that they convey some linguistic information. For example, the consonants in our languages are classified as cerebrals, palatals, etc., based on the sound generated.

3. Ease of data entry. The scheme proposed for data entry must allow typing all symbols without having to install additional software or use multiple keyboard schemes.

It is also important that data entry modules restrict input to strings that carry meaningful linguistic content. In the context of Unicode, a data entry scheme may permit typing any valid Unicode character even though it conveys nothing linguistically; it would therefore help if schemes allowed only linguistically valid text strings.

4. Transliteration. Government and public institutions in the country produce bilingual documentation, with English as the base and the regional language as the way to communicate with the people. Information originates in the regional language even if it is going to be transmitted in English. Ease of transliteration across scripts is important for collecting common information, setting up centralized databases, etc. There is also a need to transliterate between English and the regional script, because this helps people learn the language.

5. Handling spelling errors.

6. Dictionary lookup. When we search for the meaning of a word in a dictionary, we are not just looking for the meaning but also for synonyms, so we need to show synonyms and opposites of a word. For Sanskrit there is a knowledge-net called Amarakosha, which records relations such as synsets, ontology, holonymy, meronymy, hypernymy, hyponymy, etc.

There are two widely accepted encoding standards for Indian languages, namely Unicode and the Indian Script Code for Information Interchange (ISCII); Unicode's Indic blocks are derived from ISCII. Unicode and ISCII are character-level encodings and follow rules for rendering characters. Against the above requirements for a good encoding scheme, both Unicode and ISCII have some technical problems [1, 2]:

1. Uniform text rendering across applications is quite difficult.

2. Interpreting the syllabic content involves context-dependent processing, with a variable number of codes for each syllable.

3. Transliteration across Indian scripts is not easy to implement. The Unicode assignments for linguistically equivalent aksharas across languages are not sufficiently uniform to permit quick and effective transliteration; one requires an independent table for each pair of scripts.

The Indian writing system is syllabic: a unique shape is used for each syllable. The Acharya text editor [4] has raised important points about syllable-level encoding schemes for Indian languages. We started our analysis with the frequencies of syllables. For frequency analysis we needed Indian-language text in Unicode, so we implemented a multithreaded web crawler to download web pages and documents, and crawled Hindi, Kannada, Oriya, Bengali and Gujarati web pages. Considering network traffic and data size, we analysed the number of bits required for each syllable using syllable frequencies and Huffman coding. We have proposed a syllable-level encoding which, like UTF-8, is UNIX-safe and self-correcting. Searching and sorting are two algorithms with important applications, so we compared Unicode with our encoding in terms of both. To analyse the effect of encoding schemes on natural language processing, we considered a language with a complete grammar, Sanskrit, and analysed the effect of Unicode versus syllable-level encoding when implementing different Sanskrit tools, such as a chhandas classifier and a sandhi splitter. The analysis concerns the increase in a tool's complexity when a particular encoding is used for the text.
When we started analysing Unicode, we were curious why it uses a subtractive model instead of an additive model, so we analysed both models using the frequencies of pure consonants and consonants.

In recent years, Sanskrit computational linguistics has gained momentum. There have been several efforts towards developing computational tools for accessing Sanskrit texts [6, 7, 18]. Most of these tools handle morphological analysis and sandhi splitting; some [8, 10, 16] also perform sentential parsing. However, there has been almost no work on the statistical analysis of Sanskrit texts. The only work we know of is "Etymological trends in the Sanskrit vocabulary" [22], which examines how the etymological composition of the Sanskrit lexicon is influenced by time and whether this composition can be used to date Sanskrit texts automatically.

Suppose a researcher wants to search for a word in a Sanskrit text. Plain character comparison is not useful here, because sentences in ancient Sanskrit texts are continuous: words are combined using euphonic sandhi rules, and the characters at word boundaries get modified. To solve this problem we have implemented a Sanskrit search algorithm. There are tools such as the Sanskrit-Hindi Accessor cum Machine Translator [18] and the Sanskrit Reader [27] for Sanskrit sentence segmentation. We can use these tools to obtain the padapāṭha of a Sanskrit text; we have used the padapāṭha for statistical analysis, and through that analysis we want to find out how effective these tools are. We found two problems, namely unnecessary sandhi splits and samāsaḥ (compounds). Using n-grams, we give a solution to the samāsaḥ problem faced by the Sanskrit Reader tool.

II. BACKGROUND

A. Writing System of Indian Languages

The languages of India employ a syllabic writing method in which a unique shape is used for each syllable (known as akshara in Hindi and varna in Sanskrit). There are sufficient reasons to study the languages individually, since their variations help in understanding the problems involved in handling the scripts by computer.

Pāṇini's phonetic classification of Indian alphabets into vowels (V) and consonants (C) (known as svaras and vyañjanas in Sanskrit, Hindi, Marathi and Konkani, atchulu and hallulu in Telugu, uyir and mey in Tamil, etc.) serves as a common base for all Indian languages of non-Perso-Arabic origin. It also provides a unique encoding for any word in the language. There are differences in their written forms, as different letter shapes and different shaping rules are used. In addition to the vowels and consonants, there are also a few graphical signs (denoted by G) used for denoting nasal consonants, nasalization of vowels, etc. The effective unit of the writing system for all the Indian languages is the orthographic syllable, consisting either of a lone vowel optionally followed by a graphical sign, with the structure (V)(G), or of a consonantal syllable with a consonant-vowel (CV) core, optionally preceded by one or more consonants and optionally followed by a graphical sign. The canonical structure of a syllable is thus of the form (C(C))C(V)(G) [22]; in some syllables the number of consonants can go up to five. The methodology of combining these two basic groups (C and V) to form the various syllables is in itself a unique and scientific approach, common to all the Indian scripts. The various combinations of one (or more) consonants with each of the vowels form a perfect matrix, called barakhadi in Hindi, matralu in Telugu and varnamala in Sanskrit.

To a linguist, consonants in Indic scripts contain an inherent vowel 'a'. For example, the first consonant in most Indic scripts is 'ka' (in Devanagari, क), which is the equivalent of the 'pure' consonant sound 'k' plus the vowel 'a'. The pure consonant in Indic scripts is also known as the 'dead' consonant; it is written with a special sign called the halant placed below the character (क्) to suppress the inherent vowel of the consonant.

B. Indian Script Code for Information Interchange

ISCII [3] was proposed in the eighties, and a suitable standard evolved by 1991. Some aspects of the ISCII representation:

1. It is a single representation for all the Indian scripts.
2. Codes have been assigned in the upper ASCII region (160-255) for the aksharas of the languages.
3. The scheme also assigns codes for the matras (vowel extensions).
4. Special characters have been included to specify how a consonant in a syllable should be rendered; rendering of Devanagari has been kept in mind.
5. A special attribute character identifies the script to be used in rendering a specific section of the text.

C. Unicode for Indian Languages

Unicode [2] was the first attempt at producing a standard for multilingual documents. Unicode owes its origin to the concept of the ASCII code extended to accommodate international languages and scripts. Unicode is usually used as a generic term referring to a two-byte character-encoding scheme. The Unicode coded character set (CCS) 3.1 is officially known as the ISO Universal Multiple-Octet Coded Character Set (UCS). Unicode 3.1 adds 44,946 new encoded characters; with the 49,194 characters already in Unicode 3.0, the total is 94,140. The Unicode CCS uses a four-dimensional coding space of 128 three-dimensional groups; each group has 256 two-dimensional planes, each plane consists of 256 one-dimensional rows, and each row has 256 cells. A cell codes a character in this coding space, or the cell is declared unused.
This coding concept is called UCS-4: four octets are used to represent each character, specifying the group, plane, row and cell. The first plane (plane 00 of group 00) is the Basic Multilingual Plane (BMP). The BMP defines the characters in general use in alphabetic, syllabic and ideographic scripts, as well as various symbols and digits. Subsequent planes are used for additional characters or other coded entities not yet invented. This full range is needed to cope with all of the world's languages; in particular, some East Asian languages have almost 64,000 characters. To represent a coded character set (CCS) of more than 256 characters with eight-bit bytes, a character-encoding scheme (CES) is required.

Unicode transformations: Unicode Transformation Formats (UTFs) are CESs that support the use of Unicode by mapping each value into a multi-byte code. The most used character-encoding scheme is UTF-8. It allows full support of the entire Unicode range, all pages and planes, and still reads standard ASCII correctly. UTF-8 is becoming the dominant method for exchanging international text because it supports all of the world's languages and is compatible with ASCII. UTF-8 is a variable-width encoding: the characters numbered 0 to 0x7F (127) encode to themselves as a single byte, and larger character values are encoded into 2 to 6 bytes. It is a UNIX-safe and self-correcting encoding scheme.
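As a quick illustration (a minimal sketch, not from the paper): in Java, the variable width of UTF-8 can be observed directly, since ASCII characters take one byte while Devanagari code points (U+0900-U+097F) fall in the three-byte range.

import java.nio.charset.StandardCharsets;

public final class Utf8Demo {
    public static void main(String[] args) {
        // ASCII encodes to itself: 1 byte
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1
        // Devanagari क (U+0915) needs the 3-byte UTF-8 form
        System.out.println("\u0915".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}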

The major scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based upon the Indian national standard (ISCII) encoding for Indian scripts [28].

D. Sanskrit

The Sanskrit language (from Sanskrit saṃskṛta, 'adorned, cultivated, purified') is an Old Indo-Aryan language whose most ancient documents are the Vedas, composed in what is called Vedic Sanskrit. Over its long history, Sanskrit has been written both in Devanāgarī script and in various regional scripts, such as Śāradā from the north (Kashmir), Bāṅglā (Bengali) in the east, Gujarātī in the west, and various southern scripts, including the Grantha alphabet, which was devised especially for Sanskrit texts. Sanskrit texts continue to be published in regional scripts, although in fairly recent times Devanāgarī has become the most generally used. There is a large corpus of literature in Sanskrit, including the Mahābhārata, the Rāmāyaṇa, the Vedas, Śakuntalā, etc.

What is generally called Classical Sanskrit was elegantly described in one of the finest grammars ever produced, the Aṣṭādhyāyī ('Eight Chapters') composed by Pāṇini (c. 6th-5th century BCE). The Aṣṭādhyāyī in turn was the object of a rich commentatorial literature, documents of which are known from the time of Kātyāyana (4th-3rd century BCE) onward.

Fig. 1 Śivasūtra

To convey the greatness of Pāṇini's Sanskrit grammar, consider just one component, the Śivasūtra: a table which defines the natural classes of phonological segments in Sanskrit by intervals. Pāṇini wanted to make the Aṣṭādhyāyī as brief as possible, and one of the ways he did so was by organizing the Sanskrit sounds into groups; since many rules deal with particular sounds, these groups made it easier to refer to them. Wiebke Petersen presents a formal argument in [12] which shows that, under the representation method of [12], Pāṇini's way of ordering the phonological segments to represent the natural classes is optimal. The argument is based on a strictly set-theoretic point of view, depending only on the set of natural classes; it does not explicitly take into account the phonological features of the segments, which are, however, implicitly given in the way a language clusters its phonological inventory.

III. RELATED WORK

Acharya text editor [4]: Acharya is a multi-platform text editor that supports Asamiya, Bangla, Devanagari, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu. It achieves this by storing Indic text in syllabic units instead of characters, as most other editors do. Although it uses a custom encoding, the editor supports conversion of text to standard encodings like ISCII and Unicode. The encoding tries to capture the syllabic nature of Indic scripts: each syllable is viewed as one of the combinations V, C, CV, CCV, CCCV, CCCCV, where the initial C is the base consonant and the subsequent Cs represent conjunct combinations. The memory representation of each syllable is a 16-bit value with the bit distribution V-4, CNJ-5, C-6.

Etymological trends in the Sanskrit vocabulary [22]: this work uses Linear Discriminant Analysis and neural networks (NN) for dating Sanskrit texts automatically.

Sanskrit segmentation [8]: this work examines how to solve, by computer software, the problem of identifying in a Sanskrit sentence the division of a continuous enunciation into a sequence of discrete word forms.

Comparing Sanskrit texts for critical editions [21]: a critical edition takes into account all the different known versions of the same text in order to show the differences between any two distinct versions, in terms of words missing, changed or omitted.

Sanskrit compound processor [11]: Sanskrit is very rich in compound formation, and typically a compound does not code the relation between its components explicitly. In this paper the authors discuss the automatic segmentation and type identification of a compound using simple statistics derived from manually annotated data.

Automatic Sanskrit segmentizer using finite state transducers [15]: the authors propose a novel method for the automatic segmentation of a Sanskrit string, encoded either as a Unicode string or as a Roman transliterated string; the output is a set of possible splits with a weight associated with each.

SanskritTagger, a stochastic lexical and POS tagger for Sanskrit [18]: SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text.
The tagger tokenises text with a Markov model and performs part-of-speech tagging with a hidden Markov model.

IV. TOOLKIT

For the ASCII character set, functions like isalpha(), isalnum(), diff, etc. are available for general text processing of English documents. For processing documents and web pages in Indian languages, such functions are not available. We have implemented a toolkit in Java which provides basic functionality for Indian-language text processing; currently we support only documents in Devanagari script. While reading the literature on Indian language processing, we noticed that most implementations work on transliterated text (IAST, WX-alphabetic, Kyoto-Harvard, Velthuis, etc.) instead of directly on Unicode Devanagari. Our toolkit uses only Unicode, so no unnecessary conversion is needed. The toolkit provides the following functions:

boolean isconsonant(char): returns true if the character is a Devanagari consonant; otherwise false.
boolean isshortvowel(char): returns true if the character is a Devanagari short vowel; otherwise false.
boolean islongvowel(char): returns true if the character is a Devanagari long vowel; otherwise false.
boolean isvowel(char): returns true if the character is a Devanagari vowel; otherwise false.
boolean isdevalpha(char): returns true if the character is either a Devanagari vowel or a consonant; otherwise false.
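Predicates like these reduce to range checks over the Unicode Devanagari block (U+0900-U+097F). A minimal sketch in Java, assuming those standard ranges (not the toolkit's actual source); the remaining functions listed below follow the same pattern:

public final class Devanagari {
    // Ranges from the Unicode Devanagari block; extended consonants
    // (U+0958-U+095F) and Vedic signs are omitted for brevity.
    public static boolean isConsonant(char c)  { return c >= '\u0915' && c <= '\u0939'; } // क .. ह
    public static boolean isVowel(char c)      { return c >= '\u0904' && c <= '\u0914'; } // ऄ .. औ
    public static boolean isVowelSign(char c)  { return c >= '\u093E' && c <= '\u094C'; } // ा .. ौ
    public static boolean isHalant(char c)     { return c == '\u094D'; }                  // ्
    public static boolean isDevNumeric(char c) { return c >= '\u0966' && c <= '\u096F'; } // ० .. ९
    public static boolean isDevAlpha(char c)   { return isConsonant(c) || isVowel(c); }
}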

boolean ishalant(char): returns true if the character is a halant; otherwise false.
boolean isdevnumeric(char): returns true if the character is a Devanagari digit; otherwise false.
boolean isvowelsign(char): returns true if the character is a Devanagari vowel sign; otherwise false.

There are similar functions for signs: isanuswara(char), isnukta(char), isavagraha(char), isom(char), ischandrabindu(char), isinvchandrabindu(char), and a general issign(char) to detect any sign.

boolean isnasal(char): returns true if the character belongs to {ङ, ञ, ण, न, म}; otherwise false.
boolean ishardconsonant(char): returns true if the character belongs to {क, च, ट, त, प, ख, छ, ठ, थ, फ, श, ष, स}; otherwise false.
boolean issoftconsonant(char): returns true if the character belongs to {ग, ज, ड, द, ब, घ, झ, ढ, ध, भ, ह}; otherwise false.

Similar to the above, there are functions to detect semivowels and diphthongs, and to classify consonants into the five classes Gutturals, Palatals, Cerebrals, Dentals and Labials.

syllablize(text): segments the input text into syllables.
getnextsamyuktakshara(): returns the next saṃyuktākṣara.
isconsonantconjunct(text): returns true if the text is a consonant conjunct; otherwise false.
converttolaghuguru(text): converts text into a laghu/guru sequence.
validateunicode(text): validates the syntax of Unicode text.

getnextsamyuktakshara() and validateunicode() are implemented using the following syllable patterns [36]:

Consonant syllable:
{C+[N]+<H+[<ZWNJ ZWJ>] <ZWNJ ZWJ>+H>}+C+[N]+[A]+[<H+[<ZWNJ ZWJ>] {M}+[N]+[H]>]+[SM]+[(VD)]

Vowel-based syllable:
[Ra+H]+V+[N]+[<[<ZWJ ZWNJ>]+H+C ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

Stand-alone cluster (at the start of a word only):
#[Ra+H]+NBSP+[N]+[<[<ZWJ ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

where {} denotes zero or more occurrences, [] an optional occurrence, <> one of, and () one or two occurrences; C denotes a consonant, V an independent vowel, N nukta, H halant/virama, ZWNJ the zero-width non-joiner, ZWJ the zero-width joiner, M a matra (up to one of each type: pre-, above-, below- or post-base), SM syllable modifier signs, VD Vedic signs, A anudatta (U+0952), and NBSP the NO-BREAK SPACE.

Using validateunicode(), we found invalid character sequences in existing Unicode documents:

Errors in the Mahabharata: Book 1, chapter 171; Book 2, chapter 59; Book 10, chapter 1 (here the problem is that Unicode treats the avagraha ऽ as a sign, the Devanagari sign avagraha); Book 13, chapter 33; Book 1, chapter 114.
Errors in Devanagari web documents (1.4 GB): invalid sequences involving श, म, etc.
Errors in the Rig-Veda: invalid sequences involving भ, म ण, ह, र म, ग, ध, र, etc.

We explained the importance of Pāṇini's Śivasūtra in Section II, and the toolkit provides a function to check whether a given character belongs to a given segment or pratyāhāra:

boolean charinpratyahara(char, pratyahara)

sanskritwordsearch(): Sanskrit text is written as a continuous string of letters without spaces rather than as a sequence of words; the text thus consists of a very long sequence of phonemes. Words in Sanskrit text are combined by applying sandhi rules to the initial and final characters of adjacent words, so the original words are modified by the sandhi rules.
When you search for a word in Sanskrit text using a simple search algorithm, you may not find all occurrences of the word. For example [11]:

जर द म ऩकऩ थ ब तम गऩ थ त = जर- द- म ऩक-ऩ थ - ब - तम ग- ऩ थ त

When you search for द in the sentence above, you will not find it, even though it is present. There are two solutions to this problem. The first is segmentation: splitting the Sanskrit text into its constituent words. This is a complex and time-consuming task. So we developed a Sanskrit search algorithm instead. The algorithm works for words of length two or more; for words of length one there are too many possibilities, because sandhi rules apply recursively, and the result table would contain wrong word indices.

Fig. 2 Sandhi of words

In Fig. 2, W2 is the search word and W1 and W3 are the adjacent words. Because of the sandhi rules, all three words may be modified: v + x = p and y + z = q, where v is the final part of W1, x the initial and y the final part of W2, and z the initial part of W3. The substring R of W2 is not changed, so we can use R for partial searching (the case where R is empty is discussed later). For the partial search we use the Knuth-Morris-Pratt (KMP) algorithm [17]. KMP is a string-searching algorithm that finds the occurrences of a pattern P within a text T; it is a tightened analysis of the naive search algorithm. After a shift of the pattern, the naive algorithm forgets all information about previously matched symbols, whereas KMP uses the information gained by previous symbol comparisons and never re-compares a text symbol that has already matched a pattern symbol. Running KMP with input R gives a table of results that may contain invalid hits, because R is only a substring of W2; to filter the table further we use the known values of x and y.

We have a list of all the sandhi rules [19] and use two HashMaps, InitialMap and FinalMap. InitialMap represents the first equation and maps x to the list of possible p; FinalMap represents the second equation and maps y to the list of possible q. The idea is to keep x and y constant in the equations above and apply the sandhi rules for all possible v and z respectively. For example, from final-sandhi rules such as ए + ई = ए, FinalMap maps the final ए of a word to the list of surface forms it can take; InitialMap is built in the same way from the rules for word-initial characters.

Algorithm 1 Sanskrit Word Search
Procedure SanskritSearch(w, T)
Input: w - Sanskrit word; T - text
Output: table - the index of each occurrence of w in T
  sandhirule <- read(sandhirule.txt)
  initialize InitialMap and FinalMap using sandhirule
  x <- getinitial(w); y <- getfinal(w); R <- getmiddle(w)
  plist <- InitialMap.get(x); qlist <- FinalMap.get(y)
  if R is empty then
      for each p in plist do
          for each q in qlist do
              add the results of KMP(p + q) to table
      return table
  Rtable <- KMP(R)
  for each index i in Rtable do
      if T.substring(i - x.len, i) = x then
          // initial unchanged: the final may be unchanged or sandhied
          if T.substring(i + R.len, i + R.len + y.len) = y then
              add i - x.len to table
          for each q in qlist do
              if T.substring(i + R.len, i + R.len + q.len) = q then
                  add i - x.len to table
      if T.substring(i + R.len, i + R.len + y.len) = y then
          // final unchanged: the initial may be sandhied
          for each p in plist do
              if T.substring(i - p.len, i) = p then
                  add i - p.len to table
      // both the initial and the final may be sandhied
      initialflag <- 0; finalflag <- 0
      for each p in plist do
          if T.substring(i - p.len, i) = p then initialflag <- 1
      for each q in qlist do
          if T.substring(i + R.len, i + R.len + q.len) = q then finalflag <- 1
      if initialflag = 1 and finalflag = 1 then
          add i to table
  return table
end Procedure
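A minimal sketch of how the two maps might be built. The rule-file format used here (one rule per line, "left+right=result") is an assumption for illustration, not the toolkit's actual format:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class SandhiMaps {
    // FinalMap: final segment y of the search word -> its possible surface forms q.
    // InitialMap is built symmetrically, mapping the initial segment x -> forms p.
    public static Map<String, List<String>> buildFinalMap(List<String> rules) {
        Map<String, List<String>> finalMap = new HashMap<>();
        for (String rule : rules) {              // e.g. "ए+ई=ए"
            String[] parts = rule.split("[+=]"); // -> { y, z, q }
            finalMap.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[2]);
        }
        return finalMap;
    }
}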

When R is non-empty, running KMP gives a result table containing the indices of occurrences of R. To get the indices of the word, for each index in the KMP result table we check for the strings xRy and (v1 + x) R (y + z1) = p1Rq1, p2Rq2, ..., which are all the possible versions of the word after applying the sandhi rules. Now consider the case when R is empty: we cannot use KMP. We have the initial x and final y of the word, and when the words of a Sanskrit sentence are joined by sandhi rules, both x and y may change; to find the word we have to search for xy and (v1 + x)(y + z1) = p1q1, p2q2, ..., all the possible versions of the word after applying the sandhi rules.

When implementing the getinitial and getfinal procedures of Algorithm 1, note that they do not always return simply the first and last characters of the word: if the last character is a consonant, the inherent vowel is returned, and if the last character is a visarga, the second-to-last character must also be considered, because of rules such as visarga sandhi before श.

Using the Sanskrit word search algorithm, we have implemented a Firefox add-on (Fig. 3). We used the Add-on Builder, a web-based development environment that provides additional functionality for working with the Add-on SDK, to develop and test it. All add-ons created with the Add-on Builder or the Add-on SDK are restartless by default, so users do not have to interrupt their browsing to begin using the add-on.

Fig. 3 Add-on details

Two further toolkit functions: string getunicodestring(codepoint) returns the Unicode character string for a given code point, and transliterate(text, from, to) currently transliterates from IAST to Unicode Devanagari (we are going to extend it to transliterate between Indian scripts).

Web pages are difficult to analyse and process because they contain HTML tags, text in other languages, etc. Before processing a web page we therefore extract its Devanagari text. The toolkit provides extractdevanagari(pagelink) and gotofirstdevchar(pagelink) for this, and iterativegetdev(pagelink) iteratively returns Devanagari text.
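A minimal sketch of such extraction, assuming the page's HTML has already been fetched as a string; the tag stripping here is deliberately crude and is an illustration, not the toolkit's implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class DevExtractor {
    // A run of Devanagari (U+0900-U+097F), allowing whitespace inside the run
    private static final Pattern DEV_RUN =
        Pattern.compile("[\\u0900-\\u097F][\\u0900-\\u097F\\s]*");

    public static List<String> extractDevanagari(String html) {
        String text = html.replaceAll("<[^>]*>", " "); // crude tag removal
        List<String> runs = new ArrayList<>();
        Matcher m = DEV_RUN.matcher(text);
        while (m.find()) runs.add(m.group().trim());
        return runs;
    }
}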
We use this toolkit to analyse the problems discussed in the sections that follow.

V. USE CASES

We have used our toolkit to investigate problems related to Indian language processing and Sanskrit linguistic analysis. The details are given below.

E. Encoding Issues

The linguistic unit for analysing text in Indian languages is the syllable. We can get some idea of the commonly occurring sounds in speech by analysing syllable frequencies. From the point of view of grammar, too, the syllable is significant, since syllables affixed or prefixed to root forms result in words that conform to the rules of grammar; it should therefore be possible to perform computations on a string of syllables to arrive at the underlying structure of a sentence and, in the process, understand the sentence as well. In Indic languages, when a compound word appears in text, its meaning can be determined from the connected words, and the word can be broken in multiple ways, all leading to well-formed but different meanings. Indian poets cleverly used compounds to hide the true meaning of a sentence from all but those who could correctly split the word; in the Mahabharata epic, consisting of 100,000 ślokas, every thousandth śloka can be interpreted in two ways. Studying the frequency of occurrence of sounds could lead to interesting results and could explain how changes in the language took place. We therefore analysed the text of the Mahabharata, the Rig-Veda and Devanagari web documents (1.4 GB). The analysis yields information on syllables of the forms V, C, CV, CCV, CCCV and CCCC...V.

For the Devanagari web documents the most frequent syllables are य, क, न, स, ऩ, and the least frequent are म (1), द (1), म (1), न (2). For the Mahabharata the most frequent are भ, त, य (95883), स (94388), न, and the least frequent are झ (1), म (1), म (1), फ (2), म (3). For the Rig-Veda the most frequent are भ (17700), य (13411), त (13315), न, and the least frequent are ट (1), न (1), ज (2), त (2), ध (3).

Huffman coding: Algorithm 2 shows how the Huffman tree is formed. We used Huffman coding to check whether the syllable with the minimum frequency can fit into 2 bytes.

For the Mahabharata, however, the Huffman code for the least frequent syllable (म, with frequency 1) is 21 bits long, so 2 bytes do not suffice.

Algorithm 2 Huffman Coding
Procedure HUFFMAN(C)
Input: C - syllables with their frequencies
Output: root - the root of the Huffman tree
  n <- |C|
  Q <- C
  for i <- 1 to n-1 do
      allocate a new node z
      left[z] <- x <- Extract_Min(Q)
      right[z] <- y <- Extract_Min(Q)
      f[z] <- f[x] + f[y]
      Insert(Q, z)
  end for
  return Extract_Min(Q)
end Procedure

We have proposed a coding scheme at the syllable level, assigning codes to syllables in dictionary order to make sorting easy and to remove the ambiguity between the traditional sorting order and the sorting order according to Unicode. The code lengths follow a UTF-8-style pattern:

8-bit code: 0xxxxxxx, giving 2^7 codes. The most significant bit of 0 is important because it makes the scheme UNIX-safe.
16-bit code: 110xxxxx 10xxxxxx, representing 2^11 syllables.
24-bit code: 1110xxxx 10xxxxxx 10xxxxxx, representing 2^16 syllables; here the first byte also serves as an escape sequence that carries information about the script.

Just as UTF-8 is UNIX-safe and self-synchronizing, this coding scheme is UNIX-safe and self-synchronizing.
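A sketch of an encoder for this scheme under the UTF-8-style bit patterns above. The exact code assignment is a reconstruction for illustration; the paper's own assignment may differ:

public final class SyllableCodec {
    // index (0 .. 2^16-1) -> 1, 2 or 3 bytes:
    // 0xxxxxxx | 110xxxxx 10xxxxxx | 1110xxxx 10xxxxxx 10xxxxxx
    public static byte[] encode(int index) {
        if (index < (1 << 7))                       // 2^7 one-byte codes
            return new byte[] { (byte) index };
        if (index < (1 << 11))                      // 2^11 two-byte codes
            return new byte[] { (byte) (0xC0 | (index >>> 6)),
                                (byte) (0x80 | (index & 0x3F)) };
        // 2^16 three-byte codes; the lead byte doubles as the script escape
        return new byte[] { (byte) (0xE0 | (index >>> 12)),
                            (byte) (0x80 | ((index >>> 6) & 0x3F)),
                            (byte) (0x80 | (index & 0x3F)) };
    }
}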
Comparison between Unicode and our coding scheme:

1. Sorting and searching. Text searching and sorting is one of the most well-researched areas of computer science, and sorting and searching non-English text presents a number of challenges. A primary source of difficulty is accents, which have very different meanings in different languages, and sometimes even within the same language. Letters like é in café are treated as minor variants of the letter that is accented, in this case e; sometimes the accented form is treated as a distinct letter for comparison (in Danish, Æ comes after Z). Other difficulties arise when multiple characters are compared as if they were one: in traditional Spanish, ch is treated as a single letter that sorts just after c and before d. In the case of Indian languages there are the Unicode two-part vowels; the Tamil vowel sign O, for instance, can be composed as E + AA. Though the resulting output looks identical, this adds additional logic to sorting, searching, replacing, and so on. These difficulties are addressed by the Unicode Collation Element Table [30, 31].

Java uses Unicode as its character set. In Unicode there are 65,535 distinct characters covering all modern languages of the world. In general this is good: it makes developing global applications a great deal easier. However, algorithms like Boyer-Moore that rely on an array indexed by character codes are very wasteful of memory in this environment and take a long time to initialize. To make the Boyer-Moore algorithm work [31], first consider what happens when a letter occurs twice in the pattern: there are two possible shift distances for that letter, one for each occurrence, and we must always enter the smaller of the two in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are fine: as long as we never shift the pattern too far, we are safe. This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift-table indices (say, 256). If two or more elements in the pattern happen to collide and end up with the same index, it is not a problem, as long as the smaller of the shift distances is entered in the table. With this approach, the complexity of Java's sorting remains O(n log n) and its searching remains linear-time.

Another problem with Unicode sorting for Indian languages is that Unicode code-point order is admittedly not intended to provide culturally acceptable sorting [32]. However, sorting is frequently a source of confusion, and providing a default collating order for each script would help clarify this development issue.

With the proposed syllable-level encoding, none of the difficulties discussed above arise, so it is easy to implement sorting and searching (using the same logic as the Java searching algorithm). The complexity of sorting is O(n log n) and searching is linear-time.
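A sketch of this folding technique using the Horspool variant of the shift table; the class name and the table size of 256 are illustrative:

import java.util.Arrays;

public final class FoldedShiftTable {
    private static final int SIZE = 256;
    private final int[] shift = new int[SIZE];

    // patternElements: the collation elements of the pattern, as non-negative ints
    public FoldedShiftTable(int[] patternElements) {
        int m = patternElements.length;
        Arrays.fill(shift, m);                    // default: shift the whole pattern
        for (int i = 0; i < m - 1; i++) {
            int idx = patternElements[i] % SIZE;  // fold the element into a small index
            int d = m - 1 - i;                    // Horspool shift for this position
            shift[idx] = Math.min(shift[idx], d); // on collision, keep the safe (smaller) shift
        }
    }

    public int shiftFor(int element) { return shift[element % SIZE]; }
}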

2. Analysis of encoding schemes with respect to natural language processing. In this section we discuss the impact of Unicode and syllable-level encoding schemes on language technology. The efficiency of the two schemes is analysed for tools such as a sandhi splitter, a sandhi generator, Sanskrit chhandas classification, Sanskrit word search, a Sanskrit morphological analyser and the Sanskrit Reader.

Consider Sanskrit chhandas [24], which is used to find the laghu/guru syllable sequence of Sanskrit poetry; using this sequence we can classify poems into chhandas (metres). An akshara is as much of a word as can be pronounced distinctly at once, by one effort of the voice, so a vowel with or without one or more consonants is considered one syllable. A syllable is laghu or guru depending on whether its vowel is short or long.

Laghu syllables: the vowels अ (a), इ (i), उ (u), ऋ (ṛ), ऌ (ḷ) are laghu. Whenever any of these is used in a verse, separately or with one or more consonants, it is considered a short syllable; for example, क (ka), कि (ki), etc. are laghu syllables. The vowels आ (ā), ई (ī), ऊ (ū), ॠ (ṝ), ए (e), ऐ (ai), ओ (o) and औ (au) are guru; whenever any of these is used in a verse, separately or with one or more consonants, it is considered guru. However, a laghu vowel becomes long under the three conditions given below:

a. If the vowel is followed by an anusvāra, e.g. कं (kaṃ).
b. If the vowel is followed by a visarga, e.g. कः (kaḥ).
c. If the vowel is followed by a conjunct consonant, e.g. गन्ध (gandha): even though ग is laghu, it is considered guru because it is followed by the conjunct consonant न्ध (ndha).

Now let us compare the two encoding schemes with respect to Sanskrit chhandas. The algorithm is simple (a Java sketch is given after this discussion):

1. syllablize(text)
2. for each syllable, check laghu or guru using the isshortsyllable(syllable) and islongsyllable(syllable) functions respectively

The first step can be solved in a single pass in both encoding schemes. But in the second step, finding whether a particular conjunct consonant makes a syllable laghu or guru is difficult in the syllable-level encoding, because a conjunct consonant there is a single unit: to find the last character of a conjunct we have to either generate conjuncts by trying all combinations or keep a table of all possible syllables with their characters. As there are more than 15,000 syllables, this consumes extra time and memory.
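A minimal sketch of step 2 for Unicode text, assuming syllables produced by the toolkit's syllablize() with conjunct consonants kept in the following syllable; this is an illustration under standard Devanagari code points, not the paper's implementation:

import java.util.List;

public final class Chhandas {
    private static final String LONG_VOWELS = "\u0906\u0908\u090A\u0960\u090F\u0910\u0913\u0914"; // आ ई ऊ ॠ ए ऐ ओ औ
    private static final String LONG_SIGNS  = "\u093E\u0940\u0942\u0944\u0947\u0948\u094B\u094C"; // ा ी ू ॄ े ै ो ौ

    // Guru if the syllable has a long vowel, an anusvara or a visarga,
    // or if the next syllable begins with a conjunct (contains a halant).
    public static boolean isGuru(String syl, String next) {
        for (char c : syl.toCharArray())
            if (LONG_VOWELS.indexOf(c) >= 0 || LONG_SIGNS.indexOf(c) >= 0
                    || c == '\u0902' || c == '\u0903')
                return true;
        return next != null && next.indexOf('\u094D') >= 0;
    }

    public static String toLaghuGuru(List<String> syllables) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String next = (i + 1 < syllables.size()) ? syllables.get(i + 1) : null;
            sb.append(isGuru(syllables.get(i), next) ? 'G' : 'L');
        }
        return sb.toString();
    }
}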
Consider a sandhi generator [19]. Sanskrit grammar contains a set of euphonic rules, called sandhi rules, which when applied cause phonological changes at word or morph boundaries; these rules are given in Pāṇini's Aṣṭādhyāyī:

W1x + yW2 -> W1zW2
W1x + W2 -> W1x' + W2

In these equations, x and y are the final character of W1 and the initial character of W2 respectively, and z is the euphonic transformation of x and y. Now consider the case when the final character x of W1 is a conjunct consonant and the initial character y of W2 is also a conjunct consonant. In Unicode it is simple to get the last character of a word, but in the syllable-level encoding it is difficult to find the initial and final characters of a word, as explained above for Sanskrit chhandas.

F. Subtractive vs. Additive Model

When we write a word in any Indian language using Unicode characters and its rules, we use a subtractive model: for a pure consonant such as क्, we first write क and then add the halant to remove the inherent vowel. We want to compare this subtractive model with an additive model, in which the code is assigned to the pure consonant instead. For example, assign a code to क् instead of क; to write the full consonant, use a code for the inherent vowel (IHV) and add it to the pure consonant (PC): PC + IHV = C.

To compare the additive and subtractive models we need the frequencies of pure consonants and consonants, because we want to compare the models in terms of document size. Size is an important attribute of text because of its impact on storage, network file transfers, etc. As Table 1 shows, the frequency of each consonant is higher than the frequency of the corresponding pure consonant in both the Bhagavadgita and the Mahabharata, with one exception in the Mahabharata, where the pure consonant ङ् (frequency 5380) is more frequent than the consonant ङ (frequency 110). The analysis of pure-consonant and consonant frequencies in crawled Hindi web documents is similar. From Table 1 we can therefore conclude that a Unicode (subtractive) document is smaller than a document using the additive model: the frequency of most consonants is higher than that of pure consonants, and in the subtractive model a consonant takes one code point and a pure consonant two, while in the additive model a consonant takes two code points and a pure consonant one.
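As a worked illustration with the Bhagavadgita counts from Table 1: क occurs 879 times and क् 446 times. Under the subtractive model this pair costs 879 + 2 x 446 = 1771 code points, while under the additive model it costs 2 x 879 + 446 = 2204, so the subtractive model is smaller whenever the consonant outnumbers its pure form.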

G. Analysing Sanskrit Using Statistical Techniques

In ancient India, Sanskrit was orally transmitted from generation to generation. This process caused euphonic changes in the text at the word boundaries. As mentioned in the previous section, Sanskrit text is continuous: there is no space between words. There are two forms of Sanskrit text, saṃhitāpāṭha and padapāṭha. Saṃhitāpāṭha is the continuous text according to the mode of recitation, and padapāṭha is the text with words separated by spaces after sandhi vicched; padapāṭha is for understanding the meaning of the text.

Table 1: Frequency of pure consonants and consonants (counts lost in this copy are left blank)

Bhagavadgita, consonants: क 879, ख 101, ग 467, घ 25, च 690, छ 70, ज 414, झ 1, ञ 191, ट 83, ठ 47, ड 25, ढ 32, ण 379, त 2725, थ 337, द 748, ध 461, न 1750, प 871, फ 35, ब 168, भ 629, म 2287, य 2669, र 1853, ल 280, व 2113, श 401, ष 405, स 1245, ह 602

Bhagavadgita, pure consonants: क् 446, ख् 15, ग् 74, घ् 7, ङ् 99, च् 131, छ् 9, ज् 255, ञ् 58, ट् 27, ड् 2, ढ् 1, ण् 53, त् 1194, थ् 3, द् 503, ध् 83, न् 825, प् 458, ब् 83, भ् 50, म् 388, य् 10, र् 980, ल् 39, व् 134, श् 401, ष् 296, स् 737, ह् 118

Mahabharata, consonants: क, ख 7417, ग, घ, च, छ 8688, ज, झ 71, ञ 9022, ट 9855, ठ 6951, ड 8660, ढ 1181, ण, त, थ, द, ध, न, प, फ 2130, ब, भ, म, य, र, ल, व, श, ष, स, ह

Mahabharata, pure consonants: क्, ख् 1910, ग् 7083, घ्, च् 7939, छ् 387, ज्, झ् 1, ञ् 4143, ट् 4578, ठ् 43, ड् 603, ढ् 63, ण्, त्, थ् 745, द्, ध् 6767, न्, प्, फ् 3, ब् 3727, भ् 5692, म् 7385, य् 353, र्, ल् 4612, व् 7519, श्, ष्, स्, ह् 8313

Natural language processing is concerned with the design and implementation of effective natural-language input and output components for computational systems [29]. Most modern natural language processing depends heavily on statistics and complex statistical models: language modelling for automatic speech recognition uses smoothed n-grams to find the most probable string of words w1, ..., wn out of a set of candidate strings compatible with the acoustic data; part-of-speech tagging uses hidden Markov models to find the most probable tag sequence t1, ..., tn given a word sequence w1, ..., wn; and word-sense disambiguation uses Bayesian classifiers to find the most probable sense s for a word w in a context C. Sentiment mining refers to the application of NLP, computational linguistics and text analytics to identify and extract subjective information from source material; sentiment analysis determines the polarity of a given text at the document or sentence level. There are different methods for sentiment analysis, such as Latent Dirichlet Allocation (LDA) [21] and Support Vector Machines (SVM), as well as approaches using word-nets and knowledge-nets (Amarakosha [19]).

Most NLP and computational linguistics applications depend on words, but Sanskrit sentences are continuous, so we have to split them into words first. This task of splitting a sentence into its constituent words, segmentation, is not simple. There are two different works on this problem, namely the Sanskrit-Hindi Accessor cum Machine Translator [19, 15] and Sanskrit Segmentation [8]. Segmentation is hard because there are many possible segmentations and it requires morphological analysis. For example, the sentence "त व स य व च" has 28 possible segmentations, because we cannot directly find the boundaries of the words: at every character we can apply a sandhi rule and split the sentence. This over-generation problem is solved by prioritizing solutions based on frequencies in [15] and by a graph-matching algorithm in [10].

Now we have words, but we still cannot use them directly, because we cannot find their meanings in a dictionary; knowledge-nets like Amarakosha also take only prātipadikas (uninflected forms) as input. Morphology, in linguistics, is the study of the forms of words and the ways in which words are related to other words of the same language.
Formal differences among words serve a variety of purposes, from the creation of new lexical items to the indication of grammatical structure. Sanskrit has rich inflectional as well as derivational morphology, and it has a formal grammar in the form of the Aṣṭādhyāyī. One might therefore think it would be a trivial task to build a morphological analyser based on this grammar, but it is not so; the issues in developing a Sanskrit morphological analyser are well described in [9]. "In Sanskrit every word (except adverbs and particles) is inflected and the grammatical inflection itself shows the relation in which one word stands to another. Thus grammatically speaking, there is no order as such that need be much attended to. ... But if there is no grammatical order, there is a sort of logical sequence of ideas, which must follow one another in a particular order. ... Words must be so arranged that the ideas will follow one another in their natural order, and the words in their natural connection. ..." [26]. There is no fixed sentence structure or word order in Sanskrit, so NLP problems like part-of-speech tagging are difficult.

Word formation in Sanskrit is described by Fig. 4.

Fig. 4 Sanskrit Word Formation [11]

We used the Bhagavadgita, as provided by GRETIL [33], for statistical analysis. The file is in Unicode and contains many typing mistakes, which we corrected manually by comparing it with a printed Bhagavadgita [34]. The Sanskrit Reader is an online tool which we used to obtain the padapāṭha of the Bhagavadgita. The paper describing this tool gives no information about its accuracy, so we tested it on the Bhagavadgita. We transliterated the Bhagavadgita from Unicode Devanagari to the WX-alphabetic encoding, because the Sanskrit Reader accepts only the WX-alphabetic, Velthuis, Kyoto-Harvard and SLP1 encodings. The accuracy of the tool is 80.90% over the 1,403 sentences (some removed) of the Bhagavadgita.

To improve on this, we integrated the tool with a dictionary [35], a sandhi splitter [19] and a morphological analyser [19]; the accuracy after integration is 99.71% (some words were removed from some sentences or processed manually). Algorithm 3 gives the details of the integrated tool (S' denotes the sentence after the modification). We joined the three output files of IntegratedTool with the Bhagavadgita file by line number.

Algorithm 3 Integrated Tool
Procedure IntegratedTool(S)
Input: S - Sanskrit sentence
Output: padapatha - padapāṭha of the Sanskrit sentence
  padapatha <- SanskritReader(S)
  if padapatha contains "No solution to chunk" then
      chunk <- getchunk(padapatha)
      word <- chunk[chunk.length]
      if Dictionary contains word then
          remove&writeToDictRm(word)
          return SanskritReader(S')
      morphword <- MorphAnalyser(word)
      if morphword is empty then
          sandhiword <- SandhiSplitter(word)
          if sandhiword contains "No Split" then
              remove&writeToSplitRm(word)
              return SanskritReader(S')
          replace word with sandhiword
          return SanskritReader(S')
      replace word with morphword
      return SanskritReader(S')
  return padapatha
end Procedure

Now that we have the padapāṭha of the Bhagavadgita, we can use statistical methods for analysis. We used Latent Dirichlet Allocation (LDA). In documents, words are related to each other in terms of synonyms, hypernyms, hyponyms, etc., and their co-occurrence in a document reflects these relations; they are the key to the hidden semantics of a document. Different probabilistic and non-probabilistic techniques are used to discover these semantics, and LDA is one of them. It is a probabilistic graphical model used to find hidden semantics in documents, based on projecting words (the basic units of representation) onto topics (groups of correlated words). Being a generative model, it finds the probabilities of features (words) generating data points (documents): it finds a topical structure in a set of documents, so that each document may be viewed as a mixture of various topics. Another way of looking at LDA is to view each document as a weighted mixture of topics, where the probability distribution of topics for the document gives the weights, which sum to 1; similarly, each topic can be viewed as a weighted mixture of words, with the topic's probability distribution over the words as the weights.

We want to compare the results of applying LDA to the padapāṭha of the Bhagavadgita obtained with the Sanskrit Reader against the results of applying LDA to the padapāṭha of the Bhagavadgita provided by the Bhaktivedanta VedaBase Network [25].
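In the standard LDA formulation (a textbook statement, not specific to this paper), the probability of a word w appearing in a document d decomposes over the T topics as

p(w | d) = sum over t = 1..T of p(w | t) * p(t | d), with the weights p(t | d) summing to 1,

where p(t | d) are the per-document topic weights and p(w | t) the per-topic word weights mentioned above.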
The padapāṭha of the Bhagavadgita from the Bhaktivedanta VedaBase Network is written manually. There are 18 chapters in the Bhagavadgita, and we treated each chapter as a document. The results are given in Table 2; they are for documents without morphological analysis. There is a huge difference between the two sets of results, and the reasons for the difference are samāsaḥ and unnecessary sandhi splits. Consider unnecessary sandhi splits such as:

सजय = [सन {nom. sg. m.} [सत {ppr. [2] ac.} [स _1]] <n j>] [जय {voc. sg. m. | voc. sg. n.} [जय]]

पस य = [पसम <m g] [ग य {abs.} [गम]]
ध क त = [ध {iic.} [ध {pp.} [धष]] <>] [क त {nom. sg. m.} [क त]]
क शर ज = [क श {iic.} [क श] <>] [र ज {acc. pl. m. | nom. pl. m. | g. sg. m. | abl. sg. m. | g. sg. n. | abl. sg. n. | acc. pl. f. | nom. pl. f. | g. sg. f. | abl. sg. f.} [र ज _2]]
पद य = [{loc. sg. m.} [_3] <>] [पद {loc. sg. m. | acc. du. n. | nom. du. n. | loc. sg. n.} [पद] <>] [य {acc. pl. f. | nom. pl. f.} [य_1]]

Table 2: LDA Results (Topic 1 and Topic 2 for the Sanskrit Reader padapāṭha and for the Bhaktivedanta VedaBase Network padapāṭha; the topic words, partly garbled in this copy, are:)
कर म र र यत य सर म वर न आ प र म ब रह म आहर त आ र ह आस म इदर आ परर वर र र ह इदर तत तम त आर जम र य भ रत त र र आयर तमय कर म सर म य र त ब रह म आ त श र य ग उच यत

A samāsaḥ is a compound word, created by combining two or more words, that conveys the same meaning as the collection of its component words. For example, ल ब दर = ल ब + उदर, meaning गणश; व रप ष = व र + प ष. Semantically, Pāṇini classifies Sanskrit samāsaḥ [11] into four major types:

Tatpuruṣaḥ (endocentric, with head typically to the right)
Bahuvrīhiḥ (exocentric)
Dvandvaḥ (copulative)
Avyayībhāvaḥ (endocentric, with head typically to the left; behaves as an indeclinable)

The problem is to detect whether a particular word is a samāsaḥ, in order to prevent over-generation during segmentation. For example, a bahuvrīhiḥ samāsaḥ such as न लकणठ (श व) may be used in a gender which its right component does not admit by itself. The Sanskrit Reader has only a temporary solution for samāsaḥ: recording them in the lexicon.

We offer one of many possible solutions to this problem: word-sequence analysis. We analysed the frequencies of unigrams, bigrams and trigrams. An n-gram is a contiguous sequence of n items from a given sequence of text; the items can be letters, phonemes, syllables or words, depending on the application. Knowing the bigram and trigram frequencies, if two or three consecutive words occur frequently in the padapāṭha of the Bhagavadgita obtained with the Sanskrit Reader, we can assume a high probability that the compound of those words is a samāsaḥ. The full bigram and trigram results for the padapāṭha of the Bhagavadgita are large; we analysed them manually and found several samāsaḥ, for example परम तप with frequency 9 and सख द ख with frequency 6. The Sanskrit Reader is also not consistent in its sandhi splits; for example, सखम द खम vs. सख द ख, and मह ब ह vs. मह ब ह.
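A minimal sketch of the bigram counting behind this heuristic; the frequency threshold is illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class BigramCounter {
    public static Map<String, Integer> countBigrams(List<String> padapathaWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < padapathaWords.size(); i++)
            counts.merge(padapathaWords.get(i) + " " + padapathaWords.get(i + 1), 1, Integer::sum);
        return counts;
    }

    // Word pairs whose bigram frequency reaches the threshold are samasa candidates
    public static List<String> samasaCandidates(Map<String, Integer> counts, int threshold) {
        List<String> out = new ArrayList<>();
        counts.forEach((bigram, n) -> { if (n >= threshold) out.add(bigram); });
        return out;
    }
}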
H. Digital Aṣṭādhyāyī

India has a rich heritage in linguistic studies. Of the six vedāṅgas (fields of study necessary for studying the Vedas), viz. śikṣā, vyākaraṇa, chanda, nirukta, jyotiṣa and kalpa, the first four are concerned with language studies: śikṣā deals with pronunciation, vyākaraṇa with the grammatical aspects, chanda with prosody and nirukta with etymology. Though all of these are important aspects of linguistics, it is vyākaraṇa and nirukta that play the major role in understanding how a language communicates thoughts from one human being to another. Pāṇini consolidated all the earlier grammars of Sanskrit and presented a concise and almost exhaustive descriptive coverage of the prevalent Sanskrit language. The goal of the Pāṇinian enterprise is to construct a theory of human communication using natural language [37]. Pāṇinian grammar, like any other grammar formalism, gives a very good theory for identifying the relations among words in a sentence. Its importance lies in Pāṇini's minute observations regarding the coding of information in a language.

Pāṇini's Aṣṭādhyāyī represents the first attempt in the history of the world to describe and analyse the components of a language on scientific lines. It is the earliest complete grammar of Classical Sanskrit, and is in fact of a brevity and completeness unmatched in any ancient grammar of any language. It takes material from the lexical lists as input and describes algorithms to be applied to it for the generation of well-formed words. It is highly systematized and technical; inherent in its approach are the concepts of the phoneme, the morpheme and the root. Its rules have a reputation for perfection: they are claimed to describe Sanskrit morphology fully, without any redundancy. The Aṣṭādhyāyī consists of about 4,000 sūtras (sūtrāṇi) or rules (3,983 in the Kāśikāvṛtti), distributed among eight chapters, each subdivided into four sections or pādas.

The Aṣṭādhyāyī is difficult to understand because of its complexity and its technical structure [38]. In recent years, research aimed at cracking the structure of the Aṣṭādhyāyī has gained momentum. The Aṣṭādhyāyī is available only in linear form, with no indexing of rules, concepts, etc. To support researchers, we have implemented a digital Aṣṭādhyāyī for navigation and computation.

In Section II(D) we explained the Śivasūtra. The Śivasūtra is the component that lists the phonological segments of the language and their grouping into natural phonological classes, designated by pratyāhāras; the Aṣṭādhyāyī refers to these phonological classes in hundreds of rules. The Śivasūtra identifies 42 phonological segments and consists of a sequence of 14 sūtras (the rows in Fig. 1), each of which is a sequence of phonological segments bounded by a marker (the characters in the last column) called an anubandha. Phonological classes are denoted by abbreviations, the pratyāhāras, consisting of a phonological segment and an anubandha; for example, इक् = {इ, उ, ऋ, ऌ}. Using the Sanskrit word search algorithm, we constructed an index of all 42 pratyāhāras used in the Aṣṭādhyāyī.

Sūtras are verb-less sentences, unlike those of natural language, and give the impression of formulae or program-like code. They are of the following types (Pāṇini himself did not hint at any such classification; it is strictly post-Pāṇinian):

1. vidhi, or operational rules: these form the core of the grammar; all other rules assist the operational rules.
2. saṃjñā, or definitions: a term such as vṛddhi is given a specialized meaning that exists only within the scope of the Aṣṭādhyāyī.
3. paribhāṣā, or meta-rules: rules which provide a check on the operational rules so that they do not suffer from over-application, under-application or impossible application.
4. adhikāra, or heading rules: these are similar to headings in modern books. Adhikāras have domains which are not always well defined, and commentaries like the Kāśikāvṛtti must be consulted to understand their scope.
5. atideśa, or extension rules: a rule is an atideśa if it transfers certain qualities or operations to something that did not previously qualify for them.
6. niyama, or restriction rules: rules which restrict the scope of other rules.

Using a basic meta-index of the Aṣṭādhyāyī [39] and the Sanskrit word search algorithm, we constructed a definition (saṃjñā) use chain. We built a linked-list-based data structure for the Aṣṭādhyāyī which includes the pratyāhāra index and the saṃjñā use chain, and which also carries information about the other types of sūtras; its main advantage is that further information can be added as required. Using this data structure, we implemented a web interface (Fig. 5) for the Aṣṭādhyāyī in Java.

Fig. 5 Digital Aṣṭādhyāyī

The web interface supports navigation and additionally provides a dictionary [35], a sandhi splitter [19] and a morphological analyser [19] in the same interface. We want to construct a graph of the Aṣṭādhyāyī to understand its structure and to build an executable model of it; this is a very challenging problem, not yet solved.
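A sketch of the sūtra node behind the linked-list data structure described above; the field names are our assumptions, not the implementation's:

import java.util.ArrayList;
import java.util.List;

public final class Sutra {
    enum Type { VIDHI, SAMJNA, PARIBHASHA, ADHIKARA, ATIDESHA, NIYAMA }

    final String number;                                 // e.g. "1.1.1" (adhyaya.pada.sutra)
    final String text;                                   // sutra text in Devanagari
    Type type;                                           // one of the six post-Paninian classes
    Sutra next;                                          // linear order of the Ashtadhyayi
    final List<Sutra> samjnaUses = new ArrayList<>();    // definition (samjna) use chain
    final List<String> pratyaharas = new ArrayList<>();  // pratyaharas occurring in this sutra

    Sutra(String number, String text) { this.number = number; this.text = text; }
}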
VI. CONCLUSION

We have implemented part of a toolkit for Indian language processing and for supporting Sanskrit linguistic analysis, e.g., word search in the presence of sandhi. Using this toolkit, we have analysed the frequency of syllables in Devanagari web documents, the Mahabharata and the Rig-Veda. Using this frequency analysis, we have designed a syllable level encoding. Unicode has some problems, but when we compared it with our syllable level encoding, we found that Unicode is a good encoding. We have also analysed Sanskrit with statistical techniques using the toolkit; we found problems with the Sanskrit Reader, namely unnecessary sandhi splits and the handling of samāsa (compounds), and we have proposed a solution for the latter. Finally, we have developed a digital prototype of the Aṣṭādhyāyī to navigate the text and to store discovered/computed properties.

REFERENCES
[1]
[2]
[3] Indian Script Code for Information Interchange (ISCII). iscii91.pdf
[4] Krishnakumar, V., Roy, I.: Acharya: A Text Editor and Framework for Working with Indic Scripts. In: Proc. IJCNLP '08.
[5] Wissink, C.: Issues in Indic Language Collation.
[6] Huet, G.: Formal Structure of Sanskrit Text: Requirements Analysis for a Mechanical Sanskrit Processor. In: Sanskrit Computational Linguistics.
[7] Scharf, P.M.: Levels in Pāṇini's Aṣṭādhyāyī. In: Sanskrit Computational Linguistics, Lecture Notes in Computer Science, vol. 5406 (2009).
[8] Huet, G.: Sanskrit Segmentation. In: South Asian Languages Analysis Roundtable XXVIII, Denton, Texas (October 2009).
[9] Kulkarni, A.P., Shukla, D.: Sanskrit Morphological Analyser: Some Issues. To appear in Bh.K. Festschrift volume by LSI (2009).
[10] Huet, G.: Shallow Syntax Analysis in Sanskrit Guided by Semantic Nets Constraints. In: Proceedings of the International Workshop on Research Issues in Digital Libraries, Kolkata (2006).
[11] Kumar, A., Mittal, V., Kulkarni, A.: Sanskrit Compound Processor.
