A Toolkit for Sanskrit Processing


Hasmukh M Parmar
Computer Science and Automation, Indian Institute of Science, Bangalore
Advisor: Prof. K. Gopinath

Abstract

Many tools are available for linguistic analysis and natural language processing (NLP) of English, such as Unix diff and Latent Dirichlet Allocation. For Sanskrit and other Indian languages, however, tool support is still insufficient: consider, for example, supporting sandhi as part of morphological analysis. We have implemented a toolkit for Indian language processing and Sanskrit linguistic analysis of documents in the Unicode character set. The toolkit provides important functionality such as word search in sandhied (Sanskrit) text and Unicode text validation (syntax). Using this toolkit we have studied three use cases, and we have implemented a digital Aṣṭādhyāyī in Java for learning. In the first use case, we compute the frequency of saṃyuktākṣaras to investigate alternative syllable-level coding schemes: by analysing the syllable frequencies of different documents, such as crawled web pages and text files, we designed a syllable-level coding scheme and compared it with Unicode in terms of search and sorting algorithm complexity as well as natural language processing. In the second use case, we discuss additive and subtractive models for encoding. In the third use case, we discuss the analysis of Sanskrit using statistical techniques and the various preprocessing problems that must be solved to make Sanskrit text suitable for such techniques.

Keywords: Sanskrit, Sandhi, Morphology, Unicode

I. INTRODUCTION

Indic languages belong to four major families: Indo-Aryan (or Indo-European), Dravidian, Austro-Asiatic and Sino-Tibetan. The majority of languages belong to the Indo-Aryan and Dravidian families. Indic writing systems are orthographic and combine phonetic and syllabic systems; each syllable is formed by a combination of vowels and consonants. Indic scripts originate from one script system, Brahmi. Consequently, some Indic languages share the same script (Hindi, Sanskrit, Marathi, Gujarati, etc.) and others have scripts that are very similar (Tamil-Malayalam, Kannada-Telugu). Linguistically, India is a unique country: no other region has a comparable variety of distinct languages and scripts. Apart from some shared general characteristics, they are different enough that developers should understand their individual characteristics.

Documents in Indian languages are available in different forms: scanned documents, web pages, and other files (.doc, .txt, etc.). Scanned documents are important because most Indian literature is available only in scanned form, but for the time being we do not consider them, since extracting text from a scanned page requires OCR and other pre-processing. We have focused only on web pages and files (.doc, .txt, etc.). Several groups are working on Indian language processing and Sanskrit linguistic analysis, and they all require the same basic functionality. As this functionality is not available as tools or libraries, everyone has to spend valuable time re-implementing it. To analyse and process these documents, we have therefore implemented a toolkit.
This toolkit offers functionality for the Devanagari script: checking whether a character is a consonant, checking whether a character is nasal, checking for nasalization, Unicode text validation (syntax), word search in sandhied (Sanskrit) text, processing segments of Pāṇini's Śivasūtra, segmenting text into syllables, transliteration from the International Alphabet of Sanskrit Transliteration (IAST) to Unicode Devanagari, and so on. The toolkit follows Unicode. Using it, we have analysed the problems related to Indian languages discussed hereafter.

The multilingual, multi-cultural environment in India demands new approaches to computing with Indian languages. There is a need for a coding scheme across the country that deals with information common to all the regions. That is, we need to define a standard coding scheme that satisfies the following requirements [1]:

1. Accommodate all basic characters. All the basic vowels and consonants must be included in the code space. All the symbols that carry information about text (Vedic symbols, accounting symbols, etc.) must be coded. Punctuation marks and the ten numerals must be included in the code space.

2. Lexical ordering. A meaningful ordering of the vowels and consonants helps in text processing. The coding structure should reflect linguistic information: codes should be assigned to the basic vowels and consonants in such a way that they convey some linguistic information. For example, the consonants in our languages are classified as cerebrals, palatals, etc., based on the sound generated.

3. Ease of data entry. The scheme proposed for data entry must allow typing all symbols without having to install additional software or use multiple keyboard schemes.

It is also important that data entry modules restrict input to strings that carry meaningful linguistic content. In the context of Unicode, a data entry scheme may permit typing any valid Unicode character even though it conveys nothing linguistically; it would therefore help if schemes allowed only linguistically valid text strings.

4. Transliteration. Government and public institutions in the country produce bilingual documentation, with English as the base and the regional language as the way to communicate with the people. Information originates in the regional language even if it is going to be transmitted in English. Ease of transliteration across scripts is important for collecting common information, setting up centralized databases, etc. There is also a need to transliterate between English and the regional script, because this helps people learn the language.

5. Handling spelling errors.

6. Dictionary lookup. When we search for the meaning of a word in a dictionary, we are not just looking for the meaning but also for synonyms, so we need to show synonyms and opposites of a word. For Sanskrit there is a knowledge-net called Amarakosha, which records relations such as synsets, ontology, holonymy, meronymy, hypernymy, hyponymy, etc.

There are two widely accepted encoding standards for Indian languages, namely Unicode and the Indian Script Code for Information Interchange (ISCII); Unicode's Indic blocks are derived from ISCII. Unicode and ISCII are character-level encodings and follow rules for rendering characters. Against the above requirements for a good encoding scheme, both Unicode and ISCII have some technical problems [1, 2]:

1. Uniform text rendering across applications is quite difficult.

2. Interpreting the syllabic content involves context-dependent processing, with a variable number of codes for each syllable.

3. Transliteration across Indian scripts is not easy to implement. The Unicode assignments for linguistically equivalent aksharas across languages are not sufficiently uniform to permit quick and effective transliteration; one requires an independent table for each pair of scripts.

The Indian writing system is syllabic: a unique shape is used for each syllable. The Acharya text editor [4] has raised important points about syllable-level encoding schemes for Indian languages. We started our analysis with the frequencies of syllables. For frequency analysis we needed Indian-language text in Unicode, so we implemented a multithreaded web crawler to download web pages and documents, and crawled Hindi, Kannada, Oriya, Bengali and Gujarati web pages. Considering network traffic and data size, we analysed the number of bits required for each syllable using syllable frequencies and Huffman coding. We have proposed a syllable-level encoding which, like UTF-8, is UNIX-safe and self-correcting. Searching and sorting are two algorithms with important applications, so we compared Unicode with our encoding in terms of both. To analyse the effect of encoding schemes on natural language processing, we considered a language with a complete grammar, Sanskrit, and analysed the effect of Unicode versus syllable-level encoding when implementing different Sanskrit tools, such as a chhandas classifier and a sandhi splitter. The analysis concerns the increase in a tool's complexity when a particular encoding is used for the text.
When we started analysing Unicode, we were curious why it uses a subtractive model instead of an additive model, so we analysed both models using the frequencies of pure consonants and consonants.

In recent years, Sanskrit computational linguistics has gained momentum. There have been several efforts towards developing computational tools for accessing Sanskrit texts [6, 7, 18]. Most of these tools handle morphological analysis and sandhi splitting; some [8, 10, 16] also perform sentential parsing. However, there has been almost no work on the statistical analysis of Sanskrit texts. The only work we know of is "Etymological trends in the Sanskrit vocabulary" [22], which examines how the etymological composition of the Sanskrit lexicon is influenced by time and whether this composition can be used to date Sanskrit texts automatically.

Suppose a researcher wants to search for a word in a Sanskrit text. Plain character comparison is not useful here, because sentences in ancient Sanskrit texts are continuous: words are combined using euphonic sandhi rules, and the characters at word boundaries get modified. To solve this problem we have implemented a Sanskrit search algorithm. There are tools such as the Sanskrit-Hindi Accessor cum Machine Translator [18] and the Sanskrit Reader [27] for Sanskrit sentence segmentation. We can use these tools to obtain the padapāṭha of a Sanskrit text; we have used the padapāṭha for statistical analysis, and through that analysis we want to find out how effective these tools are. We found two problems, namely unnecessary sandhi splits and samāsaḥ (compounds). Using n-grams, we give a solution to the samāsaḥ problem faced by the Sanskrit Reader tool.

II. BACKGROUND

A. Writing System of Indian Languages

The languages of India employ a syllabic writing method in which a unique shape is used for each syllable (known as akshara in Hindi and varna in Sanskrit). There are sufficient reasons to study the languages individually, since their variations help in understanding the problems involved in handling the scripts by computer.

Pāṇini's phonetic classification of Indian alphabets into vowels (V) and consonants (C) (known as svaras and vyañjanas in Sanskrit, Hindi, Marathi and Konkani, atchulu and hallulu in Telugu, uyir and mey in Tamil, etc.) serves as a common base for all Indian languages of non-Perso-Arabic origin. It also provides a unique encoding for any word in the language. There are differences in their written forms, as different letter shapes and different shaping rules are used. In addition to the vowels and consonants, there are also a few graphical signs (denoted by G) used for denoting nasal consonants, nasalization of vowels, etc. The effective unit of the writing system for all the Indian languages is the orthographic syllable, consisting either of a lone vowel optionally followed by a graphical sign, with the structure (V)(G), or of a consonantal syllable with a consonant-vowel (CV) core, optionally preceded by one or more consonants and optionally followed by a graphical sign. The canonical structure of a syllable is thus of the form (C(C))C(V)(G) [22]; in some syllables the number of consonants can go up to five. The methodology of combining these two basic groups (C and V) to form the various syllables is in itself a unique and scientific approach, common to all the Indian scripts. The various combinations of one (or more) consonants with each of the vowels form a perfect matrix, called barakhadi in Hindi, matralu in Telugu and varnamala in Sanskrit.

To a linguist, consonants in Indic scripts contain an inherent vowel 'a'. For example, the first consonant in most Indic scripts is 'ka' (in Devanagari, क), which is the equivalent of the 'pure' consonant sound 'k' plus the vowel 'a'. The pure consonant in Indic scripts is also known as the 'dead' consonant; it is written with a special sign called the halant placed below the character (क्) to suppress the inherent vowel of the consonant.

B. Indian Script Code for Information Interchange

ISCII [3] was proposed in the eighties, and a suitable standard evolved by 1991. Some aspects of the ISCII representation:

1. It is a single representation for all the Indian scripts.
2. Codes have been assigned in the upper ASCII region (160-255) for the aksharas of the languages.
3. The scheme also assigns codes for the matras (vowel extensions).
4. Special characters have been included to specify how a consonant in a syllable should be rendered; rendering of Devanagari has been kept in mind.
5. A special attribute character identifies the script to be used in rendering a specific section of the text.

C. Unicode for Indian Languages

Unicode [2] was the first attempt at producing a standard for multilingual documents. Unicode owes its origin to the concept of the ASCII code extended to accommodate international languages and scripts. Unicode is usually used as a generic term referring to a two-byte character-encoding scheme. The Unicode coded character set (CCS) 3.1 is officially known as the ISO Universal Multiple-Octet Coded Character Set (UCS). Unicode 3.1 adds 44,946 new encoded characters; with the 49,194 characters already in Unicode 3.0, the total is 94,140. The Unicode CCS uses a four-dimensional coding space of 128 three-dimensional groups; each group has 256 two-dimensional planes, each plane consists of 256 one-dimensional rows, and each row has 256 cells. A cell codes a character in this coding space, or the cell is declared unused.
This coding concept is called UCS-4: four octets are used to represent each character, specifying the group, plane, row and cell. The first plane (plane 00 of group 00) is the Basic Multilingual Plane (BMP). The BMP defines the characters in general use in alphabetic, syllabic and ideographic scripts, as well as various symbols and digits. Subsequent planes are used for additional characters or other coded entities not yet invented. This full range is needed to cope with all of the world's languages; in particular, some East Asian languages have almost 64,000 characters. To represent a coded character set (CCS) of more than 256 characters with eight-bit bytes, a character-encoding scheme (CES) is required.

Unicode transformations: Unicode Transformation Formats (UTFs) are CESs that support the use of Unicode by mapping each value into a multi-byte code. The most used character-encoding scheme is UTF-8. It allows full support of the entire Unicode range, all pages and planes, and still reads standard ASCII correctly. UTF-8 is becoming the dominant method for exchanging international text because it supports all of the world's languages and is compatible with ASCII. UTF-8 is a variable-width encoding: the characters numbered 0 to 0x7F (127) encode to themselves as a single byte, and larger character values are encoded into 2 to 6 bytes. It is a UNIX-safe and self-correcting encoding scheme.
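As a quick illustration (a minimal sketch, not from the paper): in Java, the variable width of UTF-8 can be observed directly, since ASCII characters take one byte while Devanagari code points (U+0900-U+097F) fall in the three-byte range.

import java.nio.charset.StandardCharsets;

public final class Utf8Demo {
    public static void main(String[] args) {
        // ASCII encodes to itself: 1 byte
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1
        // Devanagari क (U+0915) needs the 3-byte UTF-8 form
        System.out.println("\u0915".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}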

The major scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based upon the Indian national standard (ISCII) encoding for Indian scripts [28].

D. Sanskrit

The Sanskrit language (from Sanskrit saṃskṛta, 'adorned, cultivated, purified') is an Old Indo-Aryan language whose most ancient documents are the Vedas, composed in what is called Vedic Sanskrit. Over its long history, Sanskrit has been written both in Devanāgarī script and in various regional scripts, such as Śāradā from the north (Kashmir), Bāṅglā (Bengali) in the east, Gujarātī in the west, and various southern scripts, including the Grantha alphabet, which was devised especially for Sanskrit texts. Sanskrit texts continue to be published in regional scripts, although in fairly recent times Devanāgarī has become the most generally used. There is a large corpus of literature in Sanskrit, including the Mahābhārata, the Rāmāyaṇa, the Vedas, Śakuntalā, etc.

What is generally called Classical Sanskrit was elegantly described in one of the finest grammars ever produced, the Aṣṭādhyāyī ('Eight Chapters') composed by Pāṇini (c. 6th-5th century BCE). The Aṣṭādhyāyī in turn was the object of a rich commentatorial literature, documents of which are known from the time of Kātyāyana (4th-3rd century BCE) onward.

Fig. 1 Śivasūtra

To convey the greatness of Pāṇini's Sanskrit grammar, consider just one component, the Śivasūtra: a table which defines the natural classes of phonological segments in Sanskrit by intervals. Pāṇini wanted to make the Aṣṭādhyāyī as brief as possible, and one of the ways he did so was by organizing the Sanskrit sounds into groups; since many rules deal with particular sounds, these groups made it easier to refer to them. Wiebke Petersen presents a formal argument in [12] which shows that, under the representation method of [12], Pāṇini's way of ordering the phonological segments to represent the natural classes is optimal. The argument is based on a strictly set-theoretic point of view, depending only on the set of natural classes; it does not explicitly take into account the phonological features of the segments, which are, however, implicitly given in the way a language clusters its phonological inventory.

III. RELATED WORK

Acharya text editor [4]: Acharya is a multi-platform text editor that supports Asamiya, Bangla, Devanagari, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu. It achieves this by storing Indic text in syllabic units instead of characters, as most other editors do. Although it uses a custom encoding, the editor supports conversion of text to standard encodings like ISCII and Unicode. The encoding tries to capture the syllabic nature of Indic scripts: each syllable is viewed as one of the combinations V, C, CV, CCV, CCCV, CCCCV, where the initial C is the base consonant and the subsequent Cs represent conjunct combinations. The memory representation of each syllable is a 16-bit value with the bit distribution V-4, CNJ-5, C-6.

Etymological trends in the Sanskrit vocabulary [22]: this work uses Linear Discriminant Analysis and neural networks (NN) for dating Sanskrit texts automatically.

Sanskrit segmentation [8]: this work examines how to solve, by computer software, the problem of identifying in a Sanskrit sentence the division of a continuous enunciation into a sequence of discrete word forms.

Comparing Sanskrit texts for critical editions [21]: a critical edition takes into account all the different known versions of the same text in order to show the differences between any two distinct versions, in terms of words missing, changed or omitted.

Sanskrit compound processor [11]: Sanskrit is very rich in compound formation, and typically a compound does not code the relation between its components explicitly. In this paper the authors discuss the automatic segmentation and type identification of a compound using simple statistics derived from manually annotated data.

Automatic Sanskrit segmentizer using finite state transducers [15]: the authors propose a novel method for the automatic segmentation of a Sanskrit string, encoded either as a Unicode string or as a Roman transliterated string; the output is a set of possible splits with a weight associated with each.

SanskritTagger, a stochastic lexical and POS tagger for Sanskrit [18]: SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text.
The tagger tokenises text with a Markov model and performs part-of-speech tagging with a hidden Markov model.

IV. TOOLKIT

For the ASCII character set, functions like isalpha(), isalnum(), diff, etc. are available for general text processing of English documents. For processing documents and web pages in Indian languages, such functions are not available. We have implemented a toolkit in Java which provides basic functionality for Indian-language text processing; currently we support only documents in Devanagari script. While reading the literature on Indian language processing, we noticed that most implementations work on transliterated text (IAST, WX-alphabetic, Kyoto-Harvard, Velthuis, etc.) instead of directly on Unicode Devanagari. Our toolkit uses only Unicode, so no unnecessary conversion is needed. The toolkit provides the following functions:

boolean isconsonant(char): returns true if the character is a Devanagari consonant; otherwise false.
boolean isshortvowel(char): returns true if the character is a Devanagari short vowel; otherwise false.
boolean islongvowel(char): returns true if the character is a Devanagari long vowel; otherwise false.
boolean isvowel(char): returns true if the character is a Devanagari vowel; otherwise false.
boolean isdevalpha(char): returns true if the character is either a Devanagari vowel or a consonant; otherwise false.
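Predicates like these reduce to range checks over the Unicode Devanagari block (U+0900-U+097F). A minimal sketch in Java, assuming those standard ranges (not the toolkit's actual source); the remaining functions listed below follow the same pattern:

public final class Devanagari {
    // Ranges from the Unicode Devanagari block; extended consonants
    // (U+0958-U+095F) and Vedic signs are omitted for brevity.
    public static boolean isConsonant(char c)  { return c >= '\u0915' && c <= '\u0939'; } // क .. ह
    public static boolean isVowel(char c)      { return c >= '\u0904' && c <= '\u0914'; } // ऄ .. औ
    public static boolean isVowelSign(char c)  { return c >= '\u093E' && c <= '\u094C'; } // ा .. ौ
    public static boolean isHalant(char c)     { return c == '\u094D'; }                  // ्
    public static boolean isDevNumeric(char c) { return c >= '\u0966' && c <= '\u096F'; } // ० .. ९
    public static boolean isDevAlpha(char c)   { return isConsonant(c) || isVowel(c); }
}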

boolean ishalant(char): returns true if the character is a halant; otherwise false.
boolean isdevnumeric(char): returns true if the character is a Devanagari digit; otherwise false.
boolean isvowelsign(char): returns true if the character is a Devanagari vowel sign; otherwise false.

There are similar functions for signs: isanuswara(char), isnukta(char), isavagraha(char), isom(char), ischandrabindu(char), isinvchandrabindu(char), and a general issign(char) to detect any sign.

boolean isnasal(char): returns true if the character belongs to {ङ, ञ, ण, न, म}; otherwise false.
boolean ishardconsonant(char): returns true if the character belongs to {क, च, ट, त, प, ख, छ, ठ, थ, फ, श, ष, स}; otherwise false.
boolean issoftconsonant(char): returns true if the character belongs to {ग, ज, ड, द, ब, घ, झ, ढ, ध, भ, ह}; otherwise false.

Similar to the above, there are functions to detect semivowels and diphthongs, and to classify consonants into the five classes Gutturals, Palatals, Cerebrals, Dentals and Labials.

syllablize(text): segments the input text into syllables.
getnextsamyuktakshara(): returns the next saṃyuktākṣara.
isconsonantconjunct(text): returns true if the text is a consonant conjunct; otherwise false.
converttolaghuguru(text): converts text into a laghu/guru sequence.
validateunicode(text): validates the syntax of Unicode text.

getnextsamyuktakshara() and validateunicode() are implemented using the following syllable patterns [36]:

Consonant syllable:
{C+[N]+<H+[<ZWNJ ZWJ>] <ZWNJ ZWJ>+H>}+C+[N]+[A]+[<H+[<ZWNJ ZWJ>] {M}+[N]+[H]>]+[SM]+[(VD)]

Vowel-based syllable:
[Ra+H]+V+[N]+[<[<ZWJ ZWNJ>]+H+C ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

Stand-alone cluster (at the start of a word only):
#[Ra+H]+NBSP+[N]+[<[<ZWJ ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

where {} denotes zero or more occurrences, [] an optional occurrence, <> one of, and () one or two occurrences; C denotes a consonant, V an independent vowel, N nukta, H halant/virama, ZWNJ the zero-width non-joiner, ZWJ the zero-width joiner, M a matra (up to one of each type: pre-, above-, below- or post-base), SM syllable modifier signs, VD Vedic signs, A anudatta (U+0952), and NBSP the NO-BREAK SPACE.

Using validateunicode(), we found invalid character sequences in existing Unicode documents:

Errors in the Mahabharata: Book 1, chapter 171; Book 2, chapter 59; Book 10, chapter 1 (here the problem is that Unicode treats the avagraha ऽ as a sign, the Devanagari sign avagraha); Book 13, chapter 33; Book 1, chapter 114.
Errors in Devanagari web documents (1.4 GB): invalid sequences involving श, म, etc.
Errors in the Rig-Veda: invalid sequences involving भ, म ण, ह, र म, ग, ध, र, etc.

We explained the importance of Pāṇini's Śivasūtra in Section II, and the toolkit provides a function to check whether a given character belongs to a given segment or pratyāhāra:

boolean charinpratyahara(char, pratyahara)

sanskritwordsearch(): Sanskrit text is written as a continuous string of letters without spaces rather than as a sequence of words; the text thus consists of a very long sequence of phonemes. Words in Sanskrit text are combined by applying sandhi rules to the initial and final characters of adjacent words, so the original words are modified by the sandhi rules.
When you search for a word in Sanskrit text using a simple search algorithm, you may not find all occurrences of the word. For example [11]:

जर द म ऩकऩ थ ब तम गऩ थ त = जर- द- म ऩक-ऩ थ - ब - तम ग- ऩ थ त

When you search for द in the sentence above, you will not find it, even though it is present. There are two solutions to this problem. The first is segmentation: splitting the Sanskrit text into its constituent words. This is a complex and time-consuming task. So we developed a Sanskrit search algorithm instead. The algorithm works for words of length two or more; for words of length one there are too many possibilities, because sandhi rules apply recursively, and the result table would contain wrong word indices.

Fig. 2 Sandhi of words

In Fig. 2, W2 is the search word and W1 and W3 are the adjacent words. Because of the sandhi rules, all three words may be modified: v + x = p and y + z = q, where v is the final part of W1, x the initial and y the final part of W2, and z the initial part of W3. The substring R of W2 is not changed, so we can use R for partial searching (the case where R is empty is discussed later). For the partial search we use the Knuth-Morris-Pratt (KMP) algorithm [17]. KMP is a string-searching algorithm that finds the occurrences of a pattern P within a text T; it is a tightened analysis of the naive search algorithm. After a shift of the pattern, the naive algorithm forgets all information about previously matched symbols, whereas KMP uses the information gained by previous symbol comparisons and never re-compares a text symbol that has already matched a pattern symbol. Running KMP with input R gives a table of results that may contain invalid hits, because R is only a substring of W2; to filter the table further we use the known values of x and y.

We have a list of all the sandhi rules [19] and use two HashMaps, InitialMap and FinalMap. InitialMap represents the first equation and maps x to the list of possible p; FinalMap represents the second equation and maps y to the list of possible q. The idea is to keep x and y constant in the equations above and apply the sandhi rules for all possible v and z respectively. For example, from final-sandhi rules such as ए + ई = ए, FinalMap maps the final ए of a word to the list of surface forms it can take; InitialMap is built in the same way from the rules for word-initial characters.

Algorithm 1 Sanskrit Word Search
Procedure SanskritSearch(w, T)
Input: w - Sanskrit word; T - text
Output: table - the index of each occurrence of w in T
  sandhirule <- read(sandhirule.txt)
  initialize InitialMap and FinalMap using sandhirule
  x <- getinitial(w); y <- getfinal(w); R <- getmiddle(w)
  plist <- InitialMap.get(x); qlist <- FinalMap.get(y)
  if R is empty then
      for each p in plist do
          for each q in qlist do
              add the results of KMP(p + q) to table
      return table
  Rtable <- KMP(R)
  for each index i in Rtable do
      if T.substring(i - x.len, i) = x then
          // initial unchanged: the final may be unchanged or sandhied
          if T.substring(i + R.len, i + R.len + y.len) = y then
              add i - x.len to table
          for each q in qlist do
              if T.substring(i + R.len, i + R.len + q.len) = q then
                  add i - x.len to table
      if T.substring(i + R.len, i + R.len + y.len) = y then
          // final unchanged: the initial may be sandhied
          for each p in plist do
              if T.substring(i - p.len, i) = p then
                  add i - p.len to table
      // both the initial and the final may be sandhied
      initialflag <- 0; finalflag <- 0
      for each p in plist do
          if T.substring(i - p.len, i) = p then initialflag <- 1
      for each q in qlist do
          if T.substring(i + R.len, i + R.len + q.len) = q then finalflag <- 1
      if initialflag = 1 and finalflag = 1 then
          add i to table
  return table
end Procedure
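A minimal sketch of how the two maps might be built. The rule-file format used here (one rule per line, "left+right=result") is an assumption for illustration, not the toolkit's actual format:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class SandhiMaps {
    // FinalMap: final segment y of the search word -> its possible surface forms q.
    // InitialMap is built symmetrically, mapping the initial segment x -> forms p.
    public static Map<String, List<String>> buildFinalMap(List<String> rules) {
        Map<String, List<String>> finalMap = new HashMap<>();
        for (String rule : rules) {              // e.g. "ए+ई=ए"
            String[] parts = rule.split("[+=]"); // -> { y, z, q }
            finalMap.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[2]);
        }
        return finalMap;
    }
}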

When R is non-empty, running KMP gives a result table containing the indices of occurrences of R. To get the indices of the word, for each index in the KMP result table we check for the strings xRy and (v1 + x) R (y + z1) = p1Rq1, p2Rq2, ..., which are all the possible versions of the word after applying the sandhi rules. Now consider the case when R is empty: we cannot use KMP. We have the initial x and final y of the word, and when the words of a Sanskrit sentence are joined by sandhi rules, both x and y may change; to find the word we have to search for xy and (v1 + x)(y + z1) = p1q1, p2q2, ..., all the possible versions of the word after applying the sandhi rules.

When implementing the getinitial and getfinal procedures of Algorithm 1, note that they do not always return simply the first and last characters of the word: if the last character is a consonant, the inherent vowel is returned, and if the last character is a visarga, the second-to-last character must also be considered, because of rules such as visarga sandhi before श.

Using the Sanskrit word search algorithm, we have implemented a Firefox add-on (Fig. 3). We used the Add-on Builder, a web-based development environment that provides additional functionality for working with the Add-on SDK, to develop and test it. All add-ons created with the Add-on Builder or the Add-on SDK are restartless by default, so users do not have to interrupt their browsing to begin using the add-on.

Fig. 3 Add-on details

Two further toolkit functions: string getunicodestring(codepoint) returns the Unicode character string for a given code point, and transliterate(text, from, to) currently transliterates from IAST to Unicode Devanagari (we are going to extend it to transliterate between Indian scripts).

Web pages are difficult to analyse and process because they contain HTML tags, text in other languages, etc. Before processing a web page we therefore extract its Devanagari text. The toolkit provides extractdevanagari(pagelink) and gotofirstdevchar(pagelink) for this, and iterativegetdev(pagelink) iteratively returns Devanagari text.
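A minimal sketch of such extraction, assuming the page's HTML has already been fetched as a string; the tag stripping here is deliberately crude and is an illustration, not the toolkit's implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class DevExtractor {
    // A run of Devanagari (U+0900-U+097F), allowing whitespace inside the run
    private static final Pattern DEV_RUN =
        Pattern.compile("[\\u0900-\\u097F][\\u0900-\\u097F\\s]*");

    public static List<String> extractDevanagari(String html) {
        String text = html.replaceAll("<[^>]*>", " "); // crude tag removal
        List<String> runs = new ArrayList<>();
        Matcher m = DEV_RUN.matcher(text);
        while (m.find()) runs.add(m.group().trim());
        return runs;
    }
}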
We use this toolkit to analyse the problems discussed in the sections that follow.

V. USE CASES

We have used our toolkit to investigate problems related to Indian language processing and Sanskrit linguistic analysis. The details are given below.

E. Encoding Issues

The linguistic unit for analysing text in Indian languages is the syllable. We can get some idea of the commonly occurring sounds in speech by analysing syllable frequencies. From the point of view of grammar, too, the syllable is significant, since syllables affixed or prefixed to root forms result in words that conform to the rules of grammar; it should therefore be possible to perform computations on a string of syllables to arrive at the underlying structure of a sentence and, in the process, understand the sentence as well. In Indic languages, when a compound word appears in text, its meaning can be determined from the connected words, and the word can be broken in multiple ways, all leading to well-formed but different meanings. Indian poets cleverly used compounds to hide the true meaning of a sentence from all but those who could correctly split the word; in the Mahabharata epic, consisting of 100,000 ślokas, every thousandth śloka can be interpreted in two ways. Studying the frequency of occurrence of sounds could lead to interesting results and could explain how changes in the language took place. We therefore analysed the text of the Mahabharata, the Rig-Veda and Devanagari web documents (1.4 GB). The analysis yields information on syllables of the forms V, C, CV, CCV, CCCV and CCCC...V.

For the Devanagari web documents the most frequent syllables are य, क, न, स, ऩ, and the least frequent are म (1), द (1), म (1), न (2). For the Mahabharata the most frequent are भ, त, य (95883), स (94388), न, and the least frequent are झ (1), म (1), म (1), फ (2), म (3). For the Rig-Veda the most frequent are भ (17700), य (13411), त (13315), न, and the least frequent are ट (1), न (1), ज (2), त (2), ध (3).

Huffman coding: Algorithm 2 shows how the Huffman tree is formed. We used Huffman coding to check whether the syllable with the minimum frequency can fit into 2 bytes.

For the Mahabharata, however, the Huffman code for the least frequent syllable (म, with frequency 1) is 21 bits long, so 2 bytes do not suffice.

Algorithm 2 Huffman Coding
Procedure HUFFMAN(C)
Input: C - syllables with their frequencies
Output: root - the root of the Huffman tree
  n <- |C|
  Q <- C
  for i <- 1 to n-1 do
      allocate a new node z
      left[z] <- x <- Extract_Min(Q)
      right[z] <- y <- Extract_Min(Q)
      f[z] <- f[x] + f[y]
      Insert(Q, z)
  end for
  return Extract_Min(Q)
end Procedure

We have proposed a coding scheme at the syllable level, assigning codes to syllables in dictionary order to make sorting easy and to remove the ambiguity between the traditional sorting order and the sorting order according to Unicode. The code lengths follow a UTF-8-style pattern:

8-bit code: 0xxxxxxx, giving 2^7 codes. The most significant bit of 0 is important because it makes the scheme UNIX-safe.
16-bit code: 110xxxxx 10xxxxxx, representing 2^11 syllables.
24-bit code: 1110xxxx 10xxxxxx 10xxxxxx, representing 2^16 syllables; here the first byte also serves as an escape sequence that carries information about the script.

Just as UTF-8 is UNIX-safe and self-synchronizing, this coding scheme is UNIX-safe and self-synchronizing.
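A sketch of an encoder for this scheme under the UTF-8-style bit patterns above. The exact code assignment is a reconstruction for illustration; the paper's own assignment may differ:

public final class SyllableCodec {
    // index (0 .. 2^16-1) -> 1, 2 or 3 bytes:
    // 0xxxxxxx | 110xxxxx 10xxxxxx | 1110xxxx 10xxxxxx 10xxxxxx
    public static byte[] encode(int index) {
        if (index < (1 << 7))                       // 2^7 one-byte codes
            return new byte[] { (byte) index };
        if (index < (1 << 11))                      // 2^11 two-byte codes
            return new byte[] { (byte) (0xC0 | (index >>> 6)),
                                (byte) (0x80 | (index & 0x3F)) };
        // 2^16 three-byte codes; the lead byte doubles as the script escape
        return new byte[] { (byte) (0xE0 | (index >>> 12)),
                            (byte) (0x80 | ((index >>> 6) & 0x3F)),
                            (byte) (0x80 | (index & 0x3F)) };
    }
}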
Comparison between Unicode and our coding scheme:

1. Sorting and searching. Text searching and sorting is one of the most well-researched areas of computer science, and sorting and searching non-English text presents a number of challenges. A primary source of difficulty is accents, which have very different meanings in different languages, and sometimes even within the same language. Letters like é in café are treated as minor variants of the letter that is accented, in this case e; sometimes the accented form is treated as a distinct letter for comparison (in Danish, Æ comes after Z). Other difficulties arise when multiple characters are compared as if they were one: in traditional Spanish, ch is treated as a single letter that sorts just after c and before d. In the case of Indian languages there are the Unicode two-part vowels; the Tamil vowel sign O, for instance, can be composed as E + AA. Though the resulting output looks identical, this adds additional logic to sorting, searching, replacing, and so on. These difficulties are addressed by the Unicode Collation Element Table [30, 31].

Java uses Unicode as its character set. In Unicode there are 65,535 distinct characters covering all modern languages of the world. In general this is good: it makes developing global applications a great deal easier. However, algorithms like Boyer-Moore that rely on an array indexed by character codes are very wasteful of memory in this environment and take a long time to initialize. To make the Boyer-Moore algorithm work [31], first consider what happens when a letter occurs twice in the pattern: there are two possible shift distances for that letter, one for each occurrence, and we must always enter the smaller of the two in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are fine: as long as we never shift the pattern too far, we are safe. This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift-table indices (say, 256). If two or more elements in the pattern happen to collide and end up with the same index, it is not a problem, as long as the smaller of the shift distances is entered in the table. With this approach, the complexity of Java's sorting remains O(n log n) and its searching remains linear-time.

Another problem with Unicode sorting for Indian languages is that Unicode code-point order is admittedly not intended to provide culturally acceptable sorting [32]. However, sorting is frequently a source of confusion, and providing a default collating order for each script would help clarify this development issue.

With the proposed syllable-level encoding, none of the difficulties discussed above arise, so it is easy to implement sorting and searching (using the same logic as the Java searching algorithm). The complexity of sorting is O(n log n) and searching is linear-time.
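A sketch of this folding technique using the Horspool variant of the shift table; the class name and the table size of 256 are illustrative:

import java.util.Arrays;

public final class FoldedShiftTable {
    private static final int SIZE = 256;
    private final int[] shift = new int[SIZE];

    // patternElements: the collation elements of the pattern, as non-negative ints
    public FoldedShiftTable(int[] patternElements) {
        int m = patternElements.length;
        Arrays.fill(shift, m);                    // default: shift the whole pattern
        for (int i = 0; i < m - 1; i++) {
            int idx = patternElements[i] % SIZE;  // fold the element into a small index
            int d = m - 1 - i;                    // Horspool shift for this position
            shift[idx] = Math.min(shift[idx], d); // on collision, keep the safe (smaller) shift
        }
    }

    public int shiftFor(int element) { return shift[element % SIZE]; }
}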

2. Analysis of encoding schemes with respect to natural language processing. In this section we discuss the impact of Unicode and syllable-level encoding schemes on language technology. The efficiency of the two schemes is analysed for tools such as a sandhi splitter, a sandhi generator, Sanskrit chhandas classification, Sanskrit word search, a Sanskrit morphological analyser and the Sanskrit Reader.

Consider Sanskrit chhandas [24], which is used to find the laghu/guru syllable sequence of Sanskrit poetry; using this sequence we can classify poems into chhandas (metres). An akshara is as much of a word as can be pronounced distinctly at once, by one effort of the voice, so a vowel with or without one or more consonants is considered one syllable. A syllable is laghu or guru depending on whether its vowel is short or long.

Laghu syllables: the vowels अ (a), इ (i), उ (u), ऋ (ṛ), ऌ (ḷ) are laghu. Whenever any of these is used in a verse, separately or with one or more consonants, it is considered a short syllable; for example, क (ka), कि (ki), etc. are laghu syllables. The vowels आ (ā), ई (ī), ऊ (ū), ॠ (ṝ), ए (e), ऐ (ai), ओ (o) and औ (au) are guru; whenever any of these is used in a verse, separately or with one or more consonants, it is considered guru. However, a laghu vowel becomes long under the three conditions given below:

a. If the vowel is followed by an anusvāra, e.g. कं (kaṃ).
b. If the vowel is followed by a visarga, e.g. कः (kaḥ).
c. If the vowel is followed by a conjunct consonant, e.g. गन्ध (gandha): even though ग is laghu, it is considered guru because it is followed by the conjunct consonant न्ध (ndha).

Now let us compare the two encoding schemes with respect to Sanskrit chhandas. The algorithm is simple (a Java sketch is given after this discussion):

1. syllablize(text)
2. for each syllable, check laghu or guru using the isshortsyllable(syllable) and islongsyllable(syllable) functions respectively

The first step can be solved in a single pass in both encoding schemes. But in the second step, finding whether a particular conjunct consonant makes a syllable laghu or guru is difficult in the syllable-level encoding, because a conjunct consonant there is a single unit: to find the last character of a conjunct we have to either generate conjuncts by trying all combinations or keep a table of all possible syllables with their characters. As there are more than 15,000 syllables, this consumes extra time and memory.
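A minimal sketch of step 2 for Unicode text, assuming syllables produced by the toolkit's syllablize() with conjunct consonants kept in the following syllable; this is an illustration under standard Devanagari code points, not the paper's implementation:

import java.util.List;

public final class Chhandas {
    private static final String LONG_VOWELS = "\u0906\u0908\u090A\u0960\u090F\u0910\u0913\u0914"; // आ ई ऊ ॠ ए ऐ ओ औ
    private static final String LONG_SIGNS  = "\u093E\u0940\u0942\u0944\u0947\u0948\u094B\u094C"; // ा ी ू ॄ े ै ो ौ

    // Guru if the syllable has a long vowel, an anusvara or a visarga,
    // or if the next syllable begins with a conjunct (contains a halant).
    public static boolean isGuru(String syl, String next) {
        for (char c : syl.toCharArray())
            if (LONG_VOWELS.indexOf(c) >= 0 || LONG_SIGNS.indexOf(c) >= 0
                    || c == '\u0902' || c == '\u0903')
                return true;
        return next != null && next.indexOf('\u094D') >= 0;
    }

    public static String toLaghuGuru(List<String> syllables) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String next = (i + 1 < syllables.size()) ? syllables.get(i + 1) : null;
            sb.append(isGuru(syllables.get(i), next) ? 'G' : 'L');
        }
        return sb.toString();
    }
}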
Consider a sandhi generator [19]. Sanskrit grammar contains a set of euphonic rules, called sandhi rules, which when applied cause phonological changes at word or morph boundaries; these rules are given in Pāṇini's Aṣṭādhyāyī:

W1x + yW2 -> W1zW2
W1x + W2 -> W1x' + W2

In these equations, x and y are the final character of W1 and the initial character of W2 respectively, and z is the euphonic transformation of x and y. Now consider the case when the final character x of W1 is a conjunct consonant and the initial character y of W2 is also a conjunct consonant. In Unicode it is simple to get the last character of a word, but in the syllable-level encoding it is difficult to find the initial and final characters of a word, as explained above for Sanskrit chhandas.

F. Subtractive vs. Additive Model

When we write a word in any Indian language using Unicode characters and its rules, we use a subtractive model: for a pure consonant such as क्, we first write क and then add the halant to remove the inherent vowel. We want to compare this subtractive model with an additive model, in which the code is assigned to the pure consonant instead. For example, assign a code to क् instead of क; to write the full consonant, use a code for the inherent vowel (IHV) and add it to the pure consonant (PC): PC + IHV = C.

To compare the additive and subtractive models we need the frequencies of pure consonants and consonants, because we want to compare the models in terms of document size. Size is an important attribute of text because of its impact on storage, network file transfers, etc. As Table 1 shows, the frequency of each consonant is higher than the frequency of the corresponding pure consonant in both the Bhagavadgita and the Mahabharata, with one exception in the Mahabharata, where the pure consonant ङ् (frequency 5380) is more frequent than the consonant ङ (frequency 110). The analysis of pure-consonant and consonant frequencies in crawled Hindi web documents is similar. From Table 1 we can therefore conclude that a Unicode (subtractive) document is smaller than a document using the additive model: the frequency of most consonants is higher than that of pure consonants, and in the subtractive model a consonant takes one code point and a pure consonant two, while in the additive model a consonant takes two code points and a pure consonant one.
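As a worked illustration with the Bhagavadgita counts from Table 1: क occurs 879 times and क् 446 times. Under the subtractive model this pair costs 879 + 2 x 446 = 1771 code points, while under the additive model it costs 2 x 879 + 446 = 2204, so the subtractive model is smaller whenever the consonant outnumbers its pure form.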

G. Analysing Sanskrit Using Statistical Techniques

In ancient India, Sanskrit was orally transmitted from generation to generation. This process caused euphonic changes in the text at the word boundaries. As mentioned in the previous section, Sanskrit text is continuous: there is no space between words. There are two forms of Sanskrit text, saṃhitāpāṭha and padapāṭha. Saṃhitāpāṭha is the continuous text according to the mode of recitation, and padapāṭha is the text with words separated by spaces after sandhi vicched; padapāṭha is for understanding the meaning of the text.

Table 1: Frequency of pure consonants and consonants (counts lost in this copy are left blank)

Bhagavadgita, consonants: क 879, ख 101, ग 467, घ 25, च 690, छ 70, ज 414, झ 1, ञ 191, ट 83, ठ 47, ड 25, ढ 32, ण 379, त 2725, थ 337, द 748, ध 461, न 1750, प 871, फ 35, ब 168, भ 629, म 2287, य 2669, र 1853, ल 280, व 2113, श 401, ष 405, स 1245, ह 602

Bhagavadgita, pure consonants: क् 446, ख् 15, ग् 74, घ् 7, ङ् 99, च् 131, छ् 9, ज् 255, ञ् 58, ट् 27, ड् 2, ढ् 1, ण् 53, त् 1194, थ् 3, द् 503, ध् 83, न् 825, प् 458, ब् 83, भ् 50, म् 388, य् 10, र् 980, ल् 39, व् 134, श् 401, ष् 296, स् 737, ह् 118

Mahabharata, consonants: क, ख 7417, ग, घ, च, छ 8688, ज, झ 71, ञ 9022, ट 9855, ठ 6951, ड 8660, ढ 1181, ण, त, थ, द, ध, न, प, फ 2130, ब, भ, म, य, र, ल, व, श, ष, स, ह

Mahabharata, pure consonants: क्, ख् 1910, ग् 7083, घ्, च् 7939, छ् 387, ज्, झ् 1, ञ् 4143, ट् 4578, ठ् 43, ड् 603, ढ् 63, ण्, त्, थ् 745, द्, ध् 6767, न्, प्, फ् 3, ब् 3727, भ् 5692, म् 7385, य् 353, र्, ल् 4612, व् 7519, श्, ष्, स्, ह् 8313

Natural language processing is concerned with the design and implementation of effective natural-language input and output components for computational systems [29]. Most modern natural language processing depends heavily on statistics and complex statistical models: language modelling for automatic speech recognition uses smoothed n-grams to find the most probable string of words w1, ..., wn out of a set of candidate strings compatible with the acoustic data; part-of-speech tagging uses hidden Markov models to find the most probable tag sequence t1, ..., tn given a word sequence w1, ..., wn; and word-sense disambiguation uses Bayesian classifiers to find the most probable sense s for a word w in a context C. Sentiment mining refers to the application of NLP, computational linguistics and text analytics to identify and extract subjective information from source material; sentiment analysis determines the polarity of a given text at the document or sentence level. There are different methods for sentiment analysis, such as Latent Dirichlet Allocation (LDA) [21] and Support Vector Machines (SVM), as well as approaches using word-nets and knowledge-nets (Amarakosha [19]).

Most NLP and computational linguistics applications depend on words, but Sanskrit sentences are continuous, so we have to split them into words first. This task of splitting a sentence into its constituent words, segmentation, is not simple. There are two different works on this problem, namely the Sanskrit-Hindi Accessor cum Machine Translator [19, 15] and Sanskrit Segmentation [8]. Segmentation is hard because there are many possible segmentations and it requires morphological analysis. For example, the sentence "त व स य व च" has 28 possible segmentations, because we cannot directly find the boundaries of the words: at every character we can apply a sandhi rule and split the sentence. This over-generation problem is solved by prioritizing solutions based on frequencies in [15] and by a graph-matching algorithm in [10].

Now we have words, but we still cannot use them directly, because we cannot find their meanings in a dictionary; knowledge-nets like Amarakosha also take only prātipadikas (uninflected forms) as input. Morphology, in linguistics, is the study of the forms of words and the ways in which words are related to other words of the same language.
Formal differences among words serve a variety of purposes, from the creation of new lexical items to the indication of grammatical structure. Sanskrit has rich inflectional as well as derivational morphology, and it has a formal grammar in the form of the Aṣṭādhyāyī. One might therefore think it would be a trivial task to build a morphological analyser based on this grammar, but it is not so; the issues in developing a Sanskrit morphological analyser are well described in [9]. "In Sanskrit every word (except adverbs and particles) is inflected and the grammatical inflection itself shows the relation in which one word stands to another. Thus grammatically speaking, there is no order as such that need be much attended to. ... But if there is no grammatical order, there is a sort of logical sequence of ideas, which must follow one another in a particular order. ... Words must be so arranged that the ideas will follow one another in their natural order, and the words in their natural connection. ..." [26]. There is no fixed sentence structure or word order in Sanskrit, so NLP problems like part-of-speech tagging are difficult.

Word formation in Sanskrit is described by Fig. 4.

Fig. 4 Sanskrit Word Formation [11]

We used the Bhagavadgita, as provided by GRETIL [33], for statistical analysis. The file is in Unicode and contains many typing mistakes, which we corrected manually by comparing it with a printed Bhagavadgita [34]. The Sanskrit Reader is an online tool which we used to obtain the padapāṭha of the Bhagavadgita. The paper describing this tool gives no information about its accuracy, so we tested it on the Bhagavadgita. We transliterated the Bhagavadgita from Unicode Devanagari to the WX-alphabetic encoding, because the Sanskrit Reader accepts only the WX-alphabetic, Velthuis, Kyoto-Harvard and SLP1 encodings. The accuracy of the tool is 80.90% over the 1,403 sentences (some removed) of the Bhagavadgita.

To improve on this, we integrated the tool with a dictionary [35], a sandhi splitter [19] and a morphological analyser [19]; the accuracy after integration is 99.71% (some words were removed from some sentences or processed manually). Algorithm 3 gives the details of the integrated tool (S' denotes the sentence after the modification). We joined the three output files of IntegratedTool with the Bhagavadgita file by line number.

Algorithm 3 Integrated Tool
Procedure IntegratedTool(S)
Input: S - Sanskrit sentence
Output: padapatha - padapāṭha of the Sanskrit sentence
  padapatha <- SanskritReader(S)
  if padapatha contains "No solution to chunk" then
      chunk <- getchunk(padapatha)
      word <- chunk[chunk.length]
      if Dictionary contains word then
          remove&writeToDictRm(word)
          return SanskritReader(S')
      morphword <- MorphAnalyser(word)
      if morphword is empty then
          sandhiword <- SandhiSplitter(word)
          if sandhiword contains "No Split" then
              remove&writeToSplitRm(word)
              return SanskritReader(S')
          replace word with sandhiword
          return SanskritReader(S')
      replace word with morphword
      return SanskritReader(S')
  return padapatha
end Procedure

Now that we have the padapāṭha of the Bhagavadgita, we can use statistical methods for analysis. We used Latent Dirichlet Allocation (LDA). In documents, words are related to each other in terms of synonyms, hypernyms, hyponyms, etc., and their co-occurrence in a document reflects these relations; they are the key to the hidden semantics of a document. Different probabilistic and non-probabilistic techniques are used to discover these semantics, and LDA is one of them. It is a probabilistic graphical model used to find hidden semantics in documents, based on projecting words (the basic units of representation) onto topics (groups of correlated words). Being a generative model, it finds the probabilities of features (words) generating data points (documents): it finds a topical structure in a set of documents, so that each document may be viewed as a mixture of various topics. Another way of looking at LDA is to view each document as a weighted mixture of topics, where the probability distribution of topics for the document gives the weights, which sum to 1; similarly, each topic can be viewed as a weighted mixture of words, with the topic's probability distribution over the words as the weights.

We want to compare the results of applying LDA to the padapāṭha of the Bhagavadgita obtained with the Sanskrit Reader against the results of applying LDA to the padapāṭha of the Bhagavadgita provided by the Bhaktivedanta VedaBase Network [25].
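In the standard LDA formulation (a textbook statement, not specific to this paper), the probability of a word w appearing in a document d decomposes over the T topics as

p(w | d) = sum over t = 1..T of p(w | t) * p(t | d), with the weights p(t | d) summing to 1,

where p(t | d) are the per-document topic weights and p(w | t) the per-topic word weights mentioned above.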
The padapāṭha of the Bhagavadgita from the Bhaktivedanta VedaBase Network is written manually. There are 18 chapters in the Bhagavadgita, and we treated each chapter as a document. The results are given in Table 2; they are for documents without morphological analysis. There is a huge difference between the two sets of results, and the reasons for the difference are samāsaḥ and unnecessary sandhi splits. Consider unnecessary sandhi splits such as:

सजय = [सन {nom. sg. m.} [सत {ppr. [2] ac.} [स _1]] <n j>] [जय {voc. sg. m. | voc. sg. n.} [जय]]

पस य = [पसम <m g] [ग य {abs.} [गम]]
ध क त = [ध {iic.} [ध {pp.} [धष]] <>] [क त {nom. sg. m.} [क त]]
क शर ज = [क श {iic.} [क श] <>] [र ज {acc. pl. m. | nom. pl. m. | g. sg. m. | abl. sg. m. | g. sg. n. | abl. sg. n. | acc. pl. f. | nom. pl. f. | g. sg. f. | abl. sg. f.} [र ज _2]]
पद य = [{loc. sg. m.} [_3] <>] [पद {loc. sg. m. | acc. du. n. | nom. du. n. | loc. sg. n.} [पद] <>] [य {acc. pl. f. | nom. pl. f.} [य_1]]

Table 2: LDA Results (Topic 1 and Topic 2 for the Sanskrit Reader padapāṭha and for the Bhaktivedanta VedaBase Network padapāṭha; the topic words, partly garbled in this copy, are:)
कर म र र यत य सर म वर न आ प र म ब रह म आहर त आ र ह आस म इदर आ परर वर र र ह इदर तत तम त आर जम र य भ रत त र र आयर तमय कर म सर म य र त ब रह म आ त श र य ग उच यत

A samāsaḥ is a compound word, created by combining two or more words, that conveys the same meaning as the collection of its component words. For example, ल ब दर = ल ब + उदर, meaning गणश; व रप ष = व र + प ष. Semantically, Pāṇini classifies Sanskrit samāsaḥ [11] into four major types:

Tatpuruṣaḥ (endocentric, with head typically to the right)
Bahuvrīhiḥ (exocentric)
Dvandvaḥ (copulative)
Avyayībhāvaḥ (endocentric, with head typically to the left; behaves as an indeclinable)

The problem is to detect whether a particular word is a samāsaḥ, in order to prevent over-generation during segmentation. For example, a bahuvrīhiḥ samāsaḥ such as न लकणठ (श व) may be used in a gender which its right component does not admit by itself. The Sanskrit Reader has only a temporary solution for samāsaḥ: recording them in the lexicon.

We offer one of many possible solutions to this problem: word-sequence analysis. We analysed the frequencies of unigrams, bigrams and trigrams. An n-gram is a contiguous sequence of n items from a given sequence of text; the items can be letters, phonemes, syllables or words, depending on the application. Knowing the bigram and trigram frequencies, if two or three consecutive words occur frequently in the padapāṭha of the Bhagavadgita obtained with the Sanskrit Reader, we can assume a high probability that the compound of those words is a samāsaḥ. The full bigram and trigram results for the padapāṭha of the Bhagavadgita are large; we analysed them manually and found several samāsaḥ, for example परम तप with frequency 9 and सख द ख with frequency 6. The Sanskrit Reader is also not consistent in its sandhi splits; for example, सखम द खम vs. सख द ख, and मह ब ह vs. मह ब ह.
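A minimal sketch of the bigram counting behind this heuristic; the frequency threshold is illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class BigramCounter {
    public static Map<String, Integer> countBigrams(List<String> padapathaWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < padapathaWords.size(); i++)
            counts.merge(padapathaWords.get(i) + " " + padapathaWords.get(i + 1), 1, Integer::sum);
        return counts;
    }

    // Word pairs whose bigram frequency reaches the threshold are samasa candidates
    public static List<String> samasaCandidates(Map<String, Integer> counts, int threshold) {
        List<String> out = new ArrayList<>();
        counts.forEach((bigram, n) -> { if (n >= threshold) out.add(bigram); });
        return out;
    }
}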
H. Digital Aṣṭādhyāyī

India has a rich heritage in linguistic studies. Of the six vedāṅgas (fields of study necessary for studying the Vedas), viz. śikṣā, vyākaraṇa, chanda, nirukta, jyotiṣa and kalpa, the first four are concerned with language studies: śikṣā deals with pronunciation, vyākaraṇa with the grammatical aspects, chanda with prosody and nirukta with etymology. Though all of these are important aspects of linguistics, it is vyākaraṇa and nirukta that play the major role in understanding how a language communicates thoughts from one human being to another. Pāṇini consolidated all the earlier grammars of Sanskrit and presented a concise and almost exhaustive descriptive coverage of the prevalent Sanskrit language. The goal of the Pāṇinian enterprise is to construct a theory of human communication using natural language [37]. Pāṇinian grammar, like any other grammar formalism, gives a very good theory for identifying the relations among words in a sentence. Its importance lies in Pāṇini's minute observations regarding the coding of information in a language.

Pāṇini's Aṣṭādhyāyī represents the first attempt in the history of the world to describe and analyse the components of a language on scientific lines. It is the earliest complete grammar of Classical Sanskrit, and is in fact of a brevity and completeness unmatched in any ancient grammar of any language. It takes material from the lexical lists as input and describes algorithms to be applied to it for the generation of well-formed words. It is highly systematized and technical; inherent in its approach are the concepts of the phoneme, the morpheme and the root. Its rules have a reputation for perfection: they are claimed to describe Sanskrit morphology fully, without any redundancy. The Aṣṭādhyāyī consists of about 4,000 sūtras (sūtrāṇi) or rules (3,983 in the Kāśikāvṛtti), distributed among eight chapters, each subdivided into four sections or pādas.

The Aṣṭādhyāyī is difficult to understand because of its complexity and its technical structure [38]. In recent years, research aimed at cracking the structure of the Aṣṭādhyāyī has gained momentum. The Aṣṭādhyāyī is available only in linear form, with no indexing of rules, concepts, etc. To support researchers, we have implemented a digital Aṣṭādhyāyī for navigation and computation.

In Section II(D) we explained the Śivasūtra. The Śivasūtra is the component that lists the phonological segments of the language and their grouping into natural phonological classes, designated by pratyāhāras; the Aṣṭādhyāyī refers to these phonological classes in hundreds of rules. The Śivasūtra identifies 42 phonological segments and consists of a sequence of 14 sūtras (the rows in Fig. 1), each of which is a sequence of phonological segments bounded by a marker (the characters in the last column) called an anubandha. Phonological classes are denoted by abbreviations, the pratyāhāras, consisting of a phonological segment and an anubandha; for example, इक् = {इ, उ, ऋ, ऌ}. Using the Sanskrit word search algorithm, we constructed an index of all 42 pratyāhāras used in the Aṣṭādhyāyī.

Sūtras are verb-less sentences, unlike those of natural language, and give the impression of formulae or program-like code. They are of the following types (Pāṇini himself did not hint at any such classification; it is strictly post-Pāṇinian):

1. vidhi, or operational rules: these form the core of the grammar; all other rules assist the operational rules.
2. saṃjñā, or definitions: a term such as vṛddhi is given a specialized meaning that exists only within the scope of the Aṣṭādhyāyī.
3. paribhāṣā, or meta-rules: rules which provide a check on the operational rules so that they do not suffer from over-application, under-application or impossible application.
4. adhikāra, or heading rules: these are similar to headings in modern books. Adhikāras have domains which are not always well defined, and commentaries like the Kāśikāvṛtti must be consulted to understand their scope.
5. atideśa, or extension rules: a rule is an atideśa if it transfers certain qualities or operations to something that did not previously qualify for them.
6. niyama, or restriction rules: rules which restrict the scope of other rules.

Using a basic meta-index of the Aṣṭādhyāyī [39] and the Sanskrit word search algorithm, we constructed a definition (saṃjñā) use chain. We built a linked-list-based data structure for the Aṣṭādhyāyī which includes the pratyāhāra index and the saṃjñā use chain, and which also carries information about the other types of sūtras; its main advantage is that further information can be added as required. Using this data structure, we implemented a web interface (Fig. 5) for the Aṣṭādhyāyī in Java.

Fig. 5 Digital Aṣṭādhyāyī

The web interface supports navigation and additionally provides a dictionary [35], a sandhi splitter [19] and a morphological analyser [19] in the same interface. We want to construct a graph of the Aṣṭādhyāyī to understand its structure and to build an executable model of it; this is a very challenging problem, not yet solved.
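A sketch of the sūtra node behind the linked-list data structure described above; the field names are our assumptions, not the implementation's:

import java.util.ArrayList;
import java.util.List;

public final class Sutra {
    enum Type { VIDHI, SAMJNA, PARIBHASHA, ADHIKARA, ATIDESHA, NIYAMA }

    final String number;                                 // e.g. "1.1.1" (adhyaya.pada.sutra)
    final String text;                                   // sutra text in Devanagari
    Type type;                                           // one of the six post-Paninian classes
    Sutra next;                                          // linear order of the Ashtadhyayi
    final List<Sutra> samjnaUses = new ArrayList<>();    // definition (samjna) use chain
    final List<String> pratyaharas = new ArrayList<>();  // pratyaharas occurring in this sutra

    Sutra(String number, String text) { this.number = number; this.text = text; }
}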
VI. CONCLUSION

We have implemented part of a toolkit for Indian language processing and for supporting Sanskrit linguistic analysis, e.g., word search in the presence of sandhi. Using this toolkit, we have analysed the frequency of syllables in Devanagari web documents, the Mahabharata and the Rig-Veda. Using this frequency analysis, we have designed a syllable level encoding. Unicode has some problems, but when we compared it with our syllable level encoding, we found that Unicode is a good encoding. We have also analysed Sanskrit with statistical techniques using the toolkit; we found problems with the Sanskrit Reader, namely unnecessary sandhi splits and the handling of samāsa (compounds), and we have proposed a solution for the latter. Finally, we have developed a digital prototype of the Aṣṭādhyāyī to navigate the text and to store discovered/computed properties.

REFERENCES
[1]
[2]
[3] Indian Script Code for Information Interchange (ISCII). iscii91.pdf
[4] Krishnakumar, V., Roy, I.: Acharya: A Text Editor and Framework for Working with Indic Scripts. In: Proc. IJCNLP '08.
[5] Wissink, C.: Issues in Indic Language Collation.
[6] Huet, G.: Formal Structure of Sanskrit Text: Requirements Analysis for a Mechanical Sanskrit Processor. In: Sanskrit Computational Linguistics.
[7] Scharf, P.M.: Levels in Pāṇini's Aṣṭādhyāyī. In: Sanskrit Computational Linguistics, Lecture Notes in Computer Science, vol. 5406 (2009).
[8] Huet, G.: Sanskrit Segmentation. In: South Asian Languages Analysis Roundtable XXVIII, Denton, Texas (October 2009).
[9] Kulkarni, A.P., Shukla, D.: Sanskrit Morphological Analyser: Some Issues. To appear in Bh.K. Festschrift volume by LSI (2009).
[10] Huet, G.: Shallow Syntax Analysis in Sanskrit Guided by Semantic Nets Constraints. In: Proceedings of the International Workshop on Research Issues in Digital Libraries, Kolkata (2006).
[11] Kumar, A., Mittal, V., Kulkarni, A.: Sanskrit Compound Processor.
