Information Theoretical Complexities in Developing a Bilingual Corpus: Critical comparison Hindi and Marathi

Sonal Khosla
Symbiosis International University

Haridasa Acharya
Symbiosis International University

Abstract

A critical comparison of the Hindi and Marathi languages, and its implications for building a bilingual corpus from an IT perspective, is presented here. We believe that the effort required to build a corpus can be reduced if the similarities between the languages are properly accommodated in its design and development. If the corpus is domain specific, the similarities can be exploited further to arrive at a more practicable design. The paper also discusses the challenges that the dissimilarities between the two languages pose for building the corpus. As a first step towards such a design, we provide formal definitions of the basic concepts in terms that IT designers can readily understand. Results from two elementary experiments are used to gain insight into the complexities involved in the design of a bilingual corpus. Hypotheses are stated and supported that could help in understanding the complexities involved in the design and development of a bilingual corpus.

1. Introduction

1.1. Human language is a most exciting and demanding puzzle

Theoretical CL (Computational Linguistics) takes up issues in theoretical linguistics and cognitive science. Formal theories have reached a degree of complexity that can only be managed by employing computers [22]. Models simulating aspects of the human language faculty can be implemented as computer programs, which in turn constitute the basis for the evaluation and further development of the theories. Natural language applications built to run on computers are of tremendous importance. The first requirement when writing such applications is the support of a good corpus. A corpus is representative of a language and is useful for any linguistic analysis [16]. Building a corpus, or parallel corpora, is a task with both linguistic and information theoretical aspects. By applications we mean, in today's scenario, that practically every application is expected to run off-line on office computers or client machines at the end-user level, with possible web support.

1.2 Bilingual Applications

If one is concerned with two natural languages at a time, a bilingual corpus is what is needed. Computational linguists have defined various characteristics of a good bilingual corpus [12][24]. However, from the point of view of using computers as cognitive artifacts in applications, we need to identify the IT-related aspects. Storage formats, an appropriate choice of databases, and a proper choice of tagging and aligning tools assume tremendous importance while building the corpus [24]. Facilitating easy browsing and providing a proper API, so that software developers can easily write applications that interface naturally with the corpus, are the factors one should be concerned with while building a bilingual corpus.

1.3 Objective of current paper

In this article we compare the languages Hindi and Marathi. The comparison has three parts. The first part formalizes some of the basic concepts, followed by our idea of how the similarities and dissimilarities between the languages can be tackled while building a bilingual corpus. The second part covers the linguistic similarities and dissimilarities, where we base our arguments on what earlier researchers have written. The third and last part reports the results of a very limited experimentation, which support a few hypotheses. We feel that earlier researchers have concentrated more on linguistic aspects, both pure and computational, and have not provided an adequate link to down-to-earth software engineering requirements. It is hoped that this article will give developers a clearer picture of the corpus, which in turn will make it easier to select software environments for related projects.

We prefer the word e-corpus to emphasize that the format is electronic, stored in such a way that it is readily accessible to anyone who is computer literate. It will be easily browsable, with no special browser requirements hindering the process. Developers will be able to build applications easily, with interfaces that use the corpus at the back end. As the title of the paper indicates, the chosen languages are Hindi and Marathi.

2. Basic Definitions and Formal Theoretical Aspects

We first attempt to formalize the basic definitions and concepts, which is the first step towards building an e_corpus.

2.1 Notations and Symbols

Table 1 gives the list of symbols and notations used in this article.

Symbol | Notation | Meaning
F | file | A file is a logical storage unit of a computer, which is a collection of data and information in the form of texts, pictures etc.
P | Paragraph | A paragraph is a suitable sub-division of the contents of a file, defined by paragraph delimiters, which can be specially defined spaces, new line characters and other special characters. A file can always be divided into a finite number of paragraphs; conversely, a file is a union of paragraphs.
S | sentence | A sentence is a set of tokens (words, punctuation marks, numerals, dates etc.) having a meaning in itself, which conveys a statement, question, exclamation or command. In Hindi, a sentence ends in a viraam, question mark or exclamation mark. In Marathi, a sentence ends in a full stop, question mark or exclamation mark.
K | token | A token is the smallest unit of a sentence to be processed, which can be a word, punctuation mark, numeral etc. It is usually enclosed by two blank characters or two punctuation marks. It represents one morphosyntactic unit.
W | word | A word is a sequence of characters/syllables with a defined meaning and a space on either side.
C | character | A character here refers to a syllable, where a syllable is a unit of spoken language.
L | Lemma | A lemma is the canonical form of the token or the token itself. It is the base form of the word, representing all word forms belonging to the same word.
L | Language_type | Language refers to the particular language, i.e. Hindi or Marathi.
t | tag | A tag is a lexical category given to a particular token in a sentence. The lexical category is picked from the standard POS tagset given in Table 2.

Table 1. List of Symbols, Notations with their meanings

2.2 Formal Definitions

We present here the formal definitions that are requisite in building an e-corpus. [Most of the following definitions have been adapted from William H Wilson, 2012, http://www.cse.unsw.edu.au/~billw/nlpdict.html.]

Def 1: Part of speech, POS (Lexical category). A set POS, also called a tagset, = {t1, t2, t3, ...}, where each element corresponds to a role which can be played by a word. It is a linguistic category.

Def 2: An ordered pair (w, t) is called a tagged-word if w is the word and t is its lexical category. Note that the same word may have different tags when it appears in different sentences, depending on the context [8]. A list of tags is given in Table 2.

Def 3: Part-of-speech tagging is the process of labelling each word w in each sentence s with its part-of-speech category t.

We will assume that an e-corpus should always be a tagged corpus to be apt for use in natural language processing applications. Further, we will assume that the tag-sets are unique within an e-corpus.
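Defs 1 to 3 can be made concrete with a small data-structure sketch. The Python below is only an illustration and is not part of the authors' tooling: the names DEFAULT_TAGSET, TaggedWord and tag_sentence are hypothetical, the tag names are the ones listed in Table 2, and the words are placeholders.

```python
from dataclasses import dataclass
from typing import List

# Tag names copied from Table 2 (the default tag-set), 26 tags in total.
DEFAULT_TAGSET = frozenset({
    "NN", "NST", "NNP", "PRP", "DEM", "VM", "VAUX", "JJ", "RB", "PSP",
    "RP", "CC", "WQ", "QF", "QC", "QO", "CL", "INTF", "INJ", "NEG",
    "UT", "SYM", "C", "RDP", "ECH", "UNK",
})


@dataclass(frozen=True)
class TaggedWord:
    """An ordered pair (w, t) as in Def 2 (hypothetical class name)."""
    word: str
    tag: str

    def __post_init__(self) -> None:
        # Reject tags that are not part of the tag-set (Def 1).
        if self.tag not in DEFAULT_TAGSET:
            raise ValueError(f"unknown tag: {self.tag}")


def tag_sentence(words: List[str], tags: List[str]) -> List[TaggedWord]:
    """Pair each word with a given tag (Def 3).

    A real tagger would predict the tags; here they are supplied by hand.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return [TaggedWord(w, t) for w, t in zip(words, tags)]


# Placeholder words; the same word may receive another tag in another context.
sentence = tag_sentence(["w1", "w2", "w3"], ["NN", "VM", "SYM"])
```

Validating the tag at construction time is one way to enforce the assumption that the tag-set is unique within an e-corpus.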
In our work we have used the tagset given in Table 2, taken from the POS Tagset [4] developed at the Language Technology Research Centre at IIIT, Hyderabad. Note that it contains exactly 26 tags, 21 of them basic; these are common to all Indian languages, and this is the complete set.

Table 2: POS Tagset for Indian Languages

Sl. No. | Category | Tag Name | Remarks/Discussion | Examples (Hindi, Marathi)
1.1 | Noun | NN | Common noun. | घयघय हट, घयघय हट, स लन स लन
1.2 | NLoc | NST | Noun denoting spatial and temporal expressions such as location and time. They are also postpositions in certain contexts. | ऊऩय, लय
2 | Proper Noun | NNP | Proper nouns used for manual annotation. | करऩ र करऩ र
3.1 | Pronoun | PRP | A pronoun is a word that substitutes for a noun or a noun phrase. | आऩक, आऩल म, इस म र
3.2 | Demonstrative | DEM | Demonstratives indicate the person or thing being referred to. A separate tag is used to mark the difference between demonstratives and pronouns. A demonstrative points to a particular noun or to a noun it replaces. | उस त र
4 | Verb-finite | VM | A finite verb is used to mark tense and agrees with the subject. If there is one verb in a sentence, it is a finite verb. Finite verbs provide grammatical information on gender, person, number, tense, aspect, mood and voice. | फ र ए, फ रल, ख, ख, स, झ ऩ, य य ड
5 | Verb Aux | VAUX | An auxiliary verb is a verb giving further semantic or syntactic information about the main verb following it. | ह आह
6 | Adjective | JJ | Adjectives are words that describe or modify another person or thing in a sentence. | नळ र नळ र
7 | Adverb | RB | Adverbs are words that modify verbs. | ज, ज य, ध य हऱ ल य
8 | Post position | PSP | A word used to show the relation of a noun or pronoun to some other word in a sentence. It follows the object. | क, क, NA क
9 | Particles | RP | Function words that show grammatical relationships with other words. | ब, ऩण, य
10 | Conjuncts | CC | Conjuncts or conjunctions are words that join parts of a sentence. | आणण औय
11 | Question Words | WQ | Words that put up a question. | क म, क म, कह क, क म, क ठ
12.1 | Quantifiers | QF | Words that tell us how many or how much. They generally precede or modify nouns. | सब, सगऱ, फह, ज स,.थ ड थ ड, कमभ कभ,
12.2 | Cardinal | QC | The cardinal numbers of the language, which quantify and are adjectives referring to quantity. | फ स, एक, ल स, एक, द, न द न, त न
12.3 | Ordinal | QO | Ordinal numbers are words that represent the relative position of an item in an ordered sequence. | ऩहर, ऩहहर, द सय, द सय, सय त सय
12.4 | Classifier | CL | A classifier is a word which accompanies a noun in certain grammatical contexts and generally reflects some kind of conceptual classification of nouns. | NA NA
13 | Intensifier | INTF | Words that intensify adjectives or adverbs. | ख फ, ख ऩ, फह, ज स थ ड, कमभ कभ,.थ ड,
14 | Interjection | INJ | Interjections are words used to exclaim, protest or command. They can sometimes be used by themselves. | अय, ह अय, ह
15 | Negation | NEG | The negative words in a language. | नह न ह
16 | Quotative | UT | A quotative introduces a quote. It is typically a verb, and some Indian languages use it. |
17 | Special Symbol | SYM | Used to tag all the special symbols which cannot be categorised in any other category. | ?, : ;! ?, : ;!.
18 | Compounds | C | Words that are combined together to represent a single word. | र र यक र र यक क मळक ए ऩ ळ
19 | Reduplicative | RDP | Used to mark those words in Indian languages that are repeated consecutively. | छ ट छ ट छ ट छ ट
20 | Echo | ECH | Designed for representing words in Indian languages that have no place in the dictionary and can be called nonsense words. | दल ई ळल ई NA
21 | Unknown | UNK | Used to mark words whose category is not known, which may be loan words or foreign words. |

Def 4: We will refer to the tagset defined in Table 2 as the default tag-set.

Def 5: Storage format is the standardized format used to store the metadata so that it is machine readable and interpretable. There are many standardized encoding formats, such as Text Encoding Initiative (TEI), Translation Memory Exchange (TMX), Corpus Encoding Standard for XML (XCES) [24] and IMS Corpus Workbench. Each format has its own advantages and disadvantages. We feel that the corpus builder's choice of format will not create any incompatibility with the choice of tagging tool, or of alignment tools, at the later stage of actually building the e_corpus.

Def 6: A RAW_resource is always a pair of text files in the corresponding languages L1 and L2, encoded in Unicode, one being the translation of the other for a bilingual corpus. This means that we avoid resources that contain images or sound files, and we assume that the raw resource is in Unicode format. It also means that a resource is input into the corpus only when the builder is satisfied with its translation into the other language.

Def 7: The sum total of the sizes of all RAW_resource files, in number of bytes, is referred to as the size of the corpus. The actual footprint of the total content is far larger than the size of the corpus, as it includes processed files, accompanying software resources etc.

Def 8: The length of a sentence s is always specified as a number of tokens k. The token length and the byte length of a sentence are two different metrics, both of which we need in the analysis; hence a separate definition of byte length is proposed. Tokens are words separated by a token delimiter.

Def 9: The byte length of a sentence is the total number of bytes that constitute the sentence, including white spaces and one byte for the sentence delimiter.
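The two metrics of Def 8 and Def 9 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: tokens are obtained by splitting on white space (a real tokenizer would also separate punctuation), the byte length is measured on the UTF-8 encoding, and the function names are hypothetical. Note that in UTF-8 each Devanagari character, including the danda (U+0964), occupies three bytes, so the "one byte for the sentence delimiter" of Def 9 holds only for single-byte delimiters such as the ASCII full stop.

```python
def token_length(sentence: str) -> int:
    """Length of a sentence in tokens (Def 8), assuming white-space delimiters."""
    return len(sentence.split())


def byte_length(sentence: str, encoding: str = "utf-8") -> int:
    """Length of a sentence in bytes (Def 9), white space and delimiter included."""
    return len(sentence.encode(encoding))


# A short Hindi sentence ("this is a sentence") written with escapes so the
# exact code points are unambiguous; it ends with the danda U+0964.
hindi = "\u092f\u0939 \u090f\u0915 \u0935\u093e\u0915\u094d\u092f \u0939\u0948\u0964"

print(token_length(hindi))  # 4 white-space separated tokens
print(byte_length(hindi))   # 12 Devanagari code points * 3 bytes + 3 spaces = 39
```

The gap between the two numbers is why the paper keeps token length and byte length as separate metrics.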
Choice of the Database System: Use of a database system is a necessity while building a corpus. XML is treated as the default database unless otherwise specified. The reasons for choosing XML as the default are its interoperability with most platforms and programming languages, and its software and hardware independence in the way information is stored. Since Unicode is supported, it is ideal for storing text of any language, and it allows the use of simple text editors for the content part [24]. Most development tools are naturally compatible with XML. The other options are standard relational databases such as MySQL, Oracle etc. The choice of database should not affect the usability of the corpus.

Def 10: A paragraph is a subdivision of the text file of finite length, identified by special delimiters such as spaces, new line characters, tabs etc. Sometimes its length may be indicated in number of sentences.

Def 11: Context_Tags is a predefined set of tags (like the set of POS tags) which can be associated with each paragraph. There can be different classes of tags: linguistic, situational and cultural [19].
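As a rough illustration of how XML storage, paragraphs (Def 10) and Context_Tags (Def 11) might fit together, the sketch below builds one tagged paragraph with Python's standard xml.etree.ElementTree module. The element and attribute names (paragraph, s, tok, lang, context, pos) are hypothetical and do not follow TEI, XCES or any schema named in the paper; the words are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence, Tuple

TaggedSentence = Sequence[Tuple[str, str]]  # (word, POS tag) pairs


def build_paragraph(par_id: str, lang: str, context_tags: List[str],
                    sentences: List[TaggedSentence]) -> ET.Element:
    """Build a <paragraph> element holding POS-tagged sentences.

    All element and attribute names are illustrative assumptions.
    """
    par = ET.Element("paragraph", {
        "id": par_id,
        "lang": lang,
        "context": " ".join(context_tags),  # paragraph-level Context_Tags
    })
    for n, tagged in enumerate(sentences, start=1):
        s = ET.SubElement(par, "s", {"n": str(n)})
        for word, tag in tagged:
            tok = ET.SubElement(s, "tok", {"pos": tag})
            tok.text = word
    return par


par = build_paragraph(
    "p1", "hi", ["situational"],
    [[("w1", "NN"), ("w2", "VM"), ("?", "SYM")]],
)
xml_text = ET.tostring(par, encoding="unicode")  # Unicode string, not bytes
print(xml_text)
```

Because the result is plain Unicode text, it can be inspected in a simple text editor, which is one of the reasons given above for preferring XML.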

Def 12: Context_tagging is the process of tagging each paragraph in a text file with predefined tags.

Def 13: Since we are concerned with a bilingual product, the concept of Direction assumes considerable importance. The e-corpus has a direction, specified as one of the elements of the set {Uni, Bi}. If the choice is Uni, the alignment is L1 to L2 or L2 to L1, which is clearly specified in the definition. The choice Bi means bi-directional and needs no further specification, since it is symmetric. A bidirectional corpus includes both unidirectional corpora as recoverable sub-corpora [12]. The corpus has various sub-corpora that are aligned (text by text, paragraph by paragraph, sentence by sentence, phrase by phrase and word by word) [12].

Having defined the basic components in clear formal terms, we are now in a position to provide a good implementable definition of a monolingual e_corpus and a bilingual e_corpus.

Def 14: Bilingual e_corpus. A bilingual e_corpus is a quadruplet {Lan_Names, RawLang_Files, ProLang_Files, SoftwareToManage} with the following characteristics:

1. It has a specified pair of languages associated with it, Lan_Names = (L1, L2).
2. It consists of a repository of contents held in the containers {RawLang_Files, ProLang_Files}, the combined size of which is called the size of the corpus. RawLang_Files is a collection of resources, which are pairs of text files as defined in Def 6, whereas ProLang_Files are the XML files which provide the exact alignment and tagging of the raw resources.
3. It is intelligent in the sense that adequate interfaces are provided to add, modify and delete information, to perform various linguistic operations, and to browse and extract information for user applications (and many more related functionalities).
4. It has a software component, SoftwareToManage, an integrated package with the utilities required to manage the repository, which also contains proper APIs to facilitate application development in Java/C++.
5. It is noiseless (possible noise is spelling mistakes, incorrect translations, incorrect character encoding and missing words). In short, it has no linguistic inconsistencies.

ProLang_Files is essentially a collection of tagged repositories etc. obtained from RawLang_Files, arranged and structured so that the utilities in SoftwareToManage can work properly.

Def 15: Context. Context is the physical environment in which a word is used [19]. A word can have a different POS tag depending on the context in which it is used [13]. Lexical ambiguity that arises in different situations can be resolved using the contextual information available in the text [13].

3. Linguistic Similarities and their Information Theoretic Implications

Developers face many challenges in building a bilingual corpus for the Hindi and Marathi pair of languages. The basic definitions discussed in Section 2 and the functional requirements specified in the definition of a bilingual corpus are not easy to meet. Through a study of the similarities and dissimilarities of the two languages, one can counter some of the challenges and reduce the complexities.

3.1 Vocabulary

Indian languages share a common origin and are known to have a common vocabulary of around 40 to 80 percent [9]. Hindi and Marathi are one such pair of Indo-Aryan languages, both written in the Devanagari script. They are known to be sister languages and share a significant proportion of their vocabulary [18]. Words that are phonologically and lexically similar are defined as cognates [18][14]. Out of the corpus of 6 million words created by the Central Institute of Indian Languages for Marathi and Hindi, 44.5% are cognates. Sometimes, however, these cognate words have different meanings, which poses a word sense disambiguation problem for developers. These differences are due to the differences between Marathi and Hindi grammatical rules for the construction and placement of verbs. Some words retain similar meanings, while others have become associated with different concepts [18]. The problems arising from bilingualism are reduced when the rate of cognates in the two languages is higher [14]. Some examples of cognates in Hindi and Marathi are given below:

Same origin; same meaning: The word utsuk means curious in both Hindi and Marathi.
Same origin; different meaning: The word shiksha means education in Hindi, while the same word means punishment in Marathi.

Since very few similar words in the two languages have different meanings, the work involved in building lexical resources can be reduced by taking care of these cognates.

From the designer's perspective, we may conclude that a good bilingual corpus faces the problems related to BIGDATA, i.e. the problems of Volume, Variety and Velocity. The common part of the vocabulary and the presence of cognate words will significantly reduce the Volume of the corpus.

3.2 Script & Alphabet Set

The set of symbols of each language is unified into a single collection identified as a single script. This collection of symbols then serves as a reserve from which symbols are taken to write multiple languages. Hindi and Marathi are both written in the Devanagari script, which is a phonetic script. The Devanagari script used for Hindi and Marathi has 12 pure vowels, two additional loan vowels taken from Sanskrit and one loan vowel from English [9][10]. There are 34 pure consonants, 5 traditional conjuncts, 7 loan consonants and 2 traditional signs in the Devanagari script, and each consonant has 14 variations through integration with the 14 vowels, which produces 507 different alphabetical characters [9][11]. Apart from ऱ / which is used only in Marathi, the consonants are identical. In Marathi, variant glyphs are preferred for U+0932 DEVANAGARI LETTER LA and U+0936 DEVANAGARI LETTER SHA [2].

Various committees of the Department of Electronics and the Department of Official Language, Govt. of India, have developed a universal code, the Indian Standard Code for Information Interchange (ISCII). The ISCII code is a superset of all the characters required in the ten Brahmi-based Indian scripts. It is based on the standard ASCII code [1].
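Because both languages share one script, a single script-level check can serve raw files of either language. The sketch below is an illustration only: it tests whether characters fall inside the Devanagari block of the Unicode Standard (U+0900 to U+097F) and looks up the names of the two letters mentioned above with Python's standard unicodedata module. The function names are hypothetical.

```python
import unicodedata

# Devanagari block of the Unicode Standard.
DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F


def is_devanagari(ch: str) -> bool:
    """True if the single character ch lies in the Devanagari block."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


def devanagari_ratio(text: str) -> float:
    """Share of non-space characters that are Devanagari (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(c) for c in chars) / len(chars)


# The two letters for which Marathi prefers variant glyphs.
for cp in (0x0932, 0x0936):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

A ratio of this kind cannot tell Hindi from Marathi, which is the point: the check depends on the script, not on the language.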

Unicode has also encoded the Indian language scripts, based on the Indian national standard, ISCII. The Unicode Standard encodes the Devanagari characters in the same relative positions as in the ISCII-1988 standard. This enables one-to-one mapping between the different scripts of the Indian family [2]. The range of codes for Devanagari in the Unicode Standard Version 7.0 is 0900-097F [2]. Since the script and the alphabet set are the same in both languages, conversion from Unicode to ISCII and vice versa is not language specific, but depends on the script.

Raw files or documents in Hindi or Marathi that are downloaded or procured for inclusion in a repository have been produced by, or passed through, some document editor. In most cases they need to be converted into a plain text resource using a code converter. Font Suvidha [http://www.fontsuvidha.com/] is one such software tool, developed to convert text in Devanagari-script languages such as Hindi, Marathi and many others, written in different fonts, to Unicode and vice versa. The availability of many such converters is a tremendous advantage for a corpus builder. Some tools also have a language detection feature that leaves English text unchanged, so that documents with mixed contents (English, Hindi or Marathi) can be handled easily.

Hypothesis: The commonality of the Devanagari script between the two languages has made the development of such Unicode converters possible. (One Unicode converter can handle RAW files from both languages.)

3.3 Phonology

The phonology of a language is an important feature. Most often, phonetically similar words have similar spellings. Devanagari being a phonetic script, this aspect can be used to match misspelled words or missing/muted words [11]. For example, the Hindi words aaya and gaya rhyme in the same way as aala and gela in Marathi, and have similar meanings as well. In the sentences below, too, the words न ल and ण ल rhyme and have similar spellings.

Hindi: न ल कभ कय
Marathi: ण ल कभ कय.

The ण sound is used more frequently in Marathi.

Due to the phonetic similarity of the different alphabets [7] and the several features and sounds shared across Indian languages [5], an optimal keyboard common to all languages is possible. Various committees of the Department of Electronics and the Department of Official Language, Govt. of India, have been evolving codes and keyboards that could cater to all the Indian scripts.

Hypothesis: Due to the overlap in the vocabulary of Hindi and Marathi, words that have similar pronunciation and rhyme together can be taken directly into the corpus as equivalent words having similar meanings.

This hypothesis is supported by the experiments described in Section 5. Although phonology is more applicable to building a speech corpus, we have listed in Table 3 the differences that need to be taken care of.

Table 3. Difference in pronunciation of certain consonants in Hindi and Marathi.

Consonants | Marathi | Hindi
च, ज, झ and प | Multiple pronunciation | Single pronunciation
ऋ | /ru/ | /ri/ and similar to Sanskrit
T, TH, D, DH, t, th, d, dh | Words ending with these consonants are prolonged | No change in pronunciation

च and ज are dental-alveolar in Marathi only, while they are alveolar in Hindi.

3.4 Grammar

This is one aspect that needs to be studied considerably in order to build a highly accurate bilingual corpus. Table 4 gives a comprehensive list of the similarities and dissimilarities in the grammar of the two languages [20][21].

Hindi is a highly inflected language and requires the modification of a word to represent different grammatical categories such as tense, mood, voice, aspect, person etc. It adds prefixes and suffixes to form words. The inflection of verbs is called conjugation, and the inflection of nouns, adjectives and pronouns is called declension. Hindi uses postpositions (PSP) rather than prepositions for case marking and auxiliaries. In Marathi, postpositions are attached to the preceding word. Marathi also adds suffixes to roots to build words [20][21].

In conversion from Hindi to Marathi, PSPs like क, भ or ह are removed and added as a morphological phenomenon [11] or as grammatical information in the word itself; they are converted to a syntactic feature in Marathi [23]. Hence these Hindi PSPs have no translation equivalent in Marathi and are irrelevant for word alignment. Multiple words in Hindi are converted into a single compound word in Marathi. The mean sentence length of Hindi is 15.95, while that of Marathi is 9.54 [3]. This is attributed to the fact that Marathi forms compound words. The pilot experiment also shows that the sentence length in Hindi is always greater than the sentence length in Marathi. Section 5 gives the exact sentence length statistics of the pilot study.

For example, consider the following sentence pair taken from the test data.

Hindi: क म द ळय य क अन म अ ग भ प र ह?
Marathi: ह ल दन ळय य च म अन म ब ग ऩसय क?

The words ळय य क are converted to ळय य च म in Marathi, and the words अ ग भ form the compound word ब ग. It is also seen that the PSPs क and भ are converted into a syntactic feature in Marathi.
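Since Hindi postpositions have no translation equivalent in Marathi, one possible preprocessing step before word alignment is to set aside the PSP-tagged tokens. The sketch below is an assumption about how this could be done, not the procedure used in the paper: the function name is hypothetical, the words are placeholders (h1 to h10), and the tag sequence is the one reported for the Hindi sample sentence in Table 9.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]  # (word, POS tag)


def drop_postpositions(sentence: List[TaggedWord],
                       psp_tag: str = "PSP") -> List[TaggedWord]:
    """Return the sentence without tokens tagged as postpositions."""
    return [(word, tag) for word, tag in sentence if tag != psp_tag]


# POS tags of the Hindi sample sentence as listed in Table 9.
hindi_tags = ["WQ", "NN", "NN", "PSP", "JJ", "NN", "PSP", "VM", "VAUX", "SYM"]
hindi_sentence = [(f"h{i}", tag) for i, tag in enumerate(hindi_tags, start=1)]

content_words = drop_postpositions(hindi_sentence)
print(len(hindi_sentence), len(content_words))  # prints: 10 8
```

For this sample the filtered Hindi sentence has 8 tokens, the same count that Table 9 reports for the Marathi sentence, which is consistent with the observation that postpositions account for part of the length difference.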

Table 4. Grammatical Categories of Hindi and Marathi

Category | Similarities | Differences
Nouns: Number | Singular, Plural | NIL
Nouns: Gender | Masculine, Feminine | Neuter (Marathi)
Nouns: Case | Direct (nominative), Vocative | In Marathi, genitive, accusative-dative, instrumental, ablative, locative. All cases except vocative are marked by postpositions. In Hindi, the oblique (direct) case is used to mark the subject of sentences and is used to mark postpositions.
Nouns: Articles | NIL | In Hindi, definite and indefinite. In Marathi, no articles.
Adjectives | Adjectives agree with the nouns they modify in number, gender, and case. | In Marathi, adjectives do not inflect unless they end in long /a/.
Verbs: Person | 1st, 2nd, 3rd | In Hindi, 2nd honorific
Verbs: Number | Singular, Plural | NIL
Verbs: Tense | Past, Present, Future | NIL
Verbs: Aspect | Imperfective, Perfective | NIL
Verbs: Mood | Indicative, imperative, optative | Subjunctive, conditional (Marathi)
Verbs: Forms | | Hindi verbs occur in the following forms: root, imperfect stem, perfect stem, and infinitive. The stems agree with nouns in gender and number.
Word Order | Subject-Object-Verb. Modifiers precede the nouns they modify. | In Marathi, indirect objects precede direct objects.

The common script and the phonetic similarities together will reduce the Variety part of the BIGDATA problem faced by developers.

3.5 Other Aspects

Other challenges in processing Hindi and Marathi are the length of sentences, lexical ambiguity, the ordering of words etc. Hindi and Marathi sentences do not have the same length: a Marathi sentence is shorter than a Hindi sentence [3]. Depending on the context of usage, a word may have different POS tags, which results in ambiguity. Indian languages are morphologically rich and allow the order of words in a sentence to be changed. Due to this free word order, the alignment of equivalent words is challenging [4]. Although both Hindi and Marathi follow the Subject-Object-Verb order, actual usage still shows different word orders. According to the inter-language comparison study in [17], the distance between Hindi and Marathi is very small. There is a close correspondence between Hindi and Marathi, and they have largely similar structural properties [15][6]. Due to this structural similarity, the Marathi Wordnet can be developed by borrowing relations from the Hindi Wordnet [15][6].

4. Important Statistical Parameters

The word types define the number of distinct words in a corpus, which is also a measure of the vocabulary of the language. Table 5 gives the five most frequently used words and syllables in the Hindi corpus and their percentage distribution [3].

Table 5. Some statistical measures

Frequently used words in Hindi | Percentage | Top five syllables in Hindi | Percentage
Ke | 3.59 | ra | 5.27
he | 3.08 | ka | 3.60
mem | 2.79 | na | 2.84
ki | 2.355 | sa | 2.80
Se | 1.70 | pa | 2.17

Table 6 gives the number of words required to cover a certain percentage of the corpus [3]. There is a drastic difference between Hindi and Marathi in the number of words required to cover a given percentage of the corpus. As can be seen from the table, Marathi has a larger vocabulary than Hindi. This can be attributed to the fact that postpositions such as ke, he, mem, ki and se are the words that occur most often in Hindi, whereas in Marathi they are converted to a syntactic or grammatical feature, which gives Marathi more word variations and a larger vocabulary. Table 7 gives a comparative list of syllable and word statistics for Hindi and Marathi.
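The coverage figures of the kind reported in Table 6 can be computed from a token list by accumulating the frequencies of the most frequent word types. The sketch below shows one way to do this; it is not the procedure of [3], the function name is hypothetical, and the toy token list is invented purely for illustration.

```python
from collections import Counter
from typing import Iterable


def types_needed_for_coverage(tokens: Iterable[str], fraction: float) -> int:
    """Number of most frequent word types needed to cover `fraction`
    (between 0 and 1) of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    target = fraction * total
    covered = 0
    # Walk through the word types from most to least frequent.
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target:
            return n
    return len(counts)


# Toy corpus: "a" occurs 5 times, "b" 3 times, "c" twice.
toy = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
print(types_needed_for_coverage(toy, 0.5))  # 1, since "a" alone covers 50%
print(types_needed_for_coverage(toy, 0.6))  # 2
print(types_needed_for_coverage(toy, 0.9))  # 3
```

A language in which a few function words dominate reaches a given coverage with few types, while a language that folds such words into inflected forms needs many more types for the same coverage.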

Table 6. Number of words required to cover a certain percentage of the corpus.

% of Corpus | Hindi | Marathi
10% | 4 | 17
20% | 10 | 67
30% | 26 | 213
40% | 77 | 548
50% | 199 | 1247
60% | 486 | 2882
70% | 1158 | 6922
80% | 2874 | 18874

Measure | Hindi | Marathi
Corpus Size (in no. of words) | 2986063 | 1872345
Word Types | 127241 | 210578
Syllable types | 3994 | 3757
Average no. of syllables in a word | 2.23 | 2.97
Syllable Mode | 2 | 3
Most frequent syllable | Ra | wa
Bigram syllable types | 65697 | 69023
Most frequent bisyllable | ka:ra | A:he
Most frequent trisyllable | u:na:ki, a:pa:ne, a:pa:ni, i:sa:ke, u:sa:ki | mha:nu:na, mha:na:je, A:pa:lyA, A:he:wa, ka:ru:na
Maximum Word Length | 20 | 23
Average Word Length | 4.695 | 6.33
Total Sentences | 171604 | 187373
Mean Sentence Length | 15.95 | 9.54

Table 7. A comparative list of syllable and words in Hindi and Marathi

Syllable patterns are very important in the study of languages. Table 8 shows the trisyllable patterns and their distribution in Hindi and Marathi. These syllable frequencies are used to extract unique patterns from the corpus. For example, a pattern with a high frequency in Hindi is compared with its occurrence in Marathi. It has been observed that a pattern with a high occurrence in Hindi has a very low occurrence in Marathi. Such syllable patterns are therefore characteristic of Hindi. Some patterns are unique to a particular language and can hence be used to identify that language.

Table 8. The trisyllable pattern and its distribution in Hindi and Marathi

Trisyllable Pattern in Hindi | Hindi | Marathi
ka:ra:ne | 10642 | 29
a:pa:ne | 8824 | 6
sa:ma:ya | 5152 | 29
u:sa:ke | 5057 | 1
ka:ra:we | 4995 | 643

5. Results of some Related Experiments

Tagging and aligning are two basic techniques used to provide interpretable structure to the contents of a corpus [13]. We have run a few trials on selected taggers and aligners, and the results are reported here. A critical look at the outputs helps in understanding the complexity of the whole process and also suggests how the similarity between the languages can help in reducing it. The data was collected from the medical domain. A set of 90 sentences of varying length was taken, with sentence length ranging from 3 to 28 tokens.

Experiment 1: Experiment on Tagging
Tools: Shallow Parser for Hindi and Marathi developed by IIIT Hyderabad.
Input: Raw file containing 90 sentences in Unicode format

Tagged Output: Parsed sentences in a text file. A sample output selected from the parsed file is shown in Table 9.

Discussions: The following observations were made:
The postpositions of Hindi do not exist in Marathi.
The word order remains almost the same, with few changes.
Postpositions have no translation equivalents in Marathi, as they are converted into a grammatical feature.
The lemmas (root words) are the same for words that are common to both languages, as shown in Table 9.
If the first few syllables of a Hindi word and a Marathi word are similar, then their root words are translation equivalents of each other, like ळय य and ळय य च म.

Table 9. Sample output of Experiment 1.

Hindi
Sentence              क म द द ळय य क अन म अ ग भ प र ह?
Sentence with lemma   क म द द ळय य क अन म अ ग भ प र ह?
No. of tokens         10
POS tags              WQ NN NN PSP JJ NN PSP VM VAUX SYM

Marathi
Sentence              ह ल दन ळय य च म अन म ब ग ऩसय क?
Sentence with lemma   ह ल दन ळय य अन म ब ग ऩसय क?
No. of tokens         8
POS tags              DEM NN NN JJ NN VM WQ SYM

Experiment 2: Experiment on Alignment
Tool: GIZA++
Input: Parallel corpus
Aligned output: Bilingual corpus aligned at word level. A sample output is shown in Table 10.

Discussions: With Marathi as source and Hindi as target, the word alignment procedure matches 83% of the Marathi words with some Hindi word; the remaining 17% are not aligned with any word and are therefore null. Of these, 76% are correctly aligned pairs, in which the Marathi word is matched with the corresponding Hindi word. With Hindi as source and Marathi as target, the procedure matches 85% of the Hindi words with some Marathi word; the remaining 15% are not aligned with any word and are therefore null. Of these, 73% are correctly aligned pairs, in which the Hindi word is matched with the corresponding Marathi word.

Mean sentence length of Hindi = 1041/90 = 11.57

Mean sentence length of Marathi = 811/90 = 9.01
Average distance between the lengths of sentences = 2.75

25% of the Hindi words are also present in Marathi and 33% of the Marathi words are also present in Hindi, which indicates the commonality of vocabulary. The mean difference in sentence length is 2.75, which means that a Hindi sentence is longer than the corresponding Marathi sentence by approximately 2.75 words. This supports the assumption that a long sentence is translated into a long one, while a short sentence is translated into a short one.

The following is the output of GIZA++. Table 10 shows the serial number assigned to each word by the tool.

क म ({ 1 }) द द ({ 2 }) ळय य ({ 3 }) क ({ }) अन म ({ 4 }) अ ग ({ 5 }) भ ({ }) प र ({ 6 7 }) ह ({ }) ? ({ 8 })

Table 10. Sample output of Experiment No. 2

Sl. no.   Source sentence (Hindi)   Target sentence (Marathi)
1         क म                       ह
2         द द                       ल दन
3         ळय य                      ळय य च म
4         क                         अन म
5         अन म                      ब ग
6         अ ग                       ऩसय
7         भ                         क
8         प र                       ?
9         ह
10        ?

In the output of GIZA++, each Hindi word is followed by the positions of the Marathi words to which it is aligned. For example, क म ({ 1 }) means that the word क म is aligned to the first word of the Marathi sentence, and द द ({ 2 }) means that the word द द in the Hindi sentence is aligned to the 2nd word of the Marathi sentence. If no equivalent is found, the word is assigned null. Of the 10 Hindi words, 7 have been aligned to some Marathi word; the remaining 3 are not aligned and are shown with empty brackets. Of the 7 aligned words, 5 are correctly aligned. Four of these are one-to-one mappings, while the word प र ({ 6 7 }) is aligned to the 6th and 7th words of the Marathi sentence and is hence an example of one-to-many mapping.

Result of alignment on GIZA++ with Marathi as source and Hindi as target sentence

The experiment was repeated with Marathi as the source sentence and Hindi as the target sentence.

ह ({ 1 }) ल दन ({ 2 }) ळय य च म ({ 3 }) अन म ({ 5 }) ब ग ({ 7 8 }) ऩसय ({ 6 }) क ({ 9 }) ? ({ 10 })

All the Marathi words in the source sentence have been aligned to some Hindi word; there is no null assignment in this case.
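The counts read off the two sample outputs can also be obtained mechanically. The following is a minimal sketch in Python, assuming each aligned token is printed as a word followed by ({ positions }) as in the samples above. The function names are hypothetical, and the placeholder tokens h1..h9 and m1..m7 stand in for the Devanagari words; only the bracketed positions are taken from the samples. Whether an aligned pair is correct still has to be judged by hand.

```python
import re

# One aligned token looks like: word ({ 6 7 })
# The numbers are 1-based positions in the other sentence;
# empty braces mean the word received a null alignment.
TOKEN_RE = re.compile(r"(\S.*?)\s*\(\{([\d\s]*)\}\)")


def parse_alignment(line):
    """Return a list of (word, [positions]) pairs from one alignment line."""
    return [(word, [int(p) for p in positions.split()])
            for word, positions in TOKEN_RE.findall(line)]


def summarise(pairs):
    """Count null, one-to-one and one-to-many alignments."""
    summary = {"words": len(pairs), "null": 0,
               "one_to_one": 0, "one_to_many": 0}
    for _, positions in pairs:
        if not positions:
            summary["null"] += 1
        elif len(positions) == 1:
            summary["one_to_one"] += 1
        else:
            summary["one_to_many"] += 1
    summary["aligned"] = summary["words"] - summary["null"]
    return summary


# Placeholder tokens (assumption); positions copied from the sample outputs.
HINDI_TO_MARATHI = ("h1 ({ 1 }) h2 ({ 2 }) h3 ({ 3 }) h4 ({ }) h5 ({ 4 }) "
                    "h6 ({ 5 }) h7 ({ }) h8 ({ 6 7 }) h9 ({ }) ? ({ 8 })")
MARATHI_TO_HINDI = ("m1 ({ 1 }) m2 ({ 2 }) m3 ({ 3 }) m4 ({ 5 }) "
                    "m5 ({ 7 8 }) m6 ({ 6 }) m7 ({ 9 }) ? ({ 10 })")

if __name__ == "__main__":
    for name, line in [("Hindi source", HINDI_TO_MARATHI),
                       ("Marathi source", MARATHI_TO_HINDI)]:
        print(name, summarise(parse_alignment(line)))
```

For the Hindi source line the summary reports 10 words, 7 aligned and 3 null; for the Marathi source line it reports 8 words and no null assignment, in agreement with the description above.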

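The assumption that long sentences are translated into long ones can be checked roughly against the per-set totals reported in Table 11. A minimal sketch in Python follows; the dictionary layout and the function names are illustrative assumptions, and only the sentence and word counts are taken from the table.

```python
# Totals copied from Table 11: set -> (sentences, Hindi words, Marathi words).
TABLE_11_TOTALS = {
    "I": (24, 435, 328),
    "II": (36, 425, 337),
    "III": (30, 181, 146),
}


def mean_lengths(totals):
    """Map each set to (mean Hindi length, mean Marathi length) in words."""
    return {name: (hindi / n, marathi / n)
            for name, (n, hindi, marathi) in totals.items()}


def pooled(totals):
    """Sum sentences, Hindi words and Marathi words over all sets."""
    rows = list(totals.values())
    return tuple(sum(row[i] for row in rows) for i in range(3))


if __name__ == "__main__":
    for name, (hindi, marathi) in mean_lengths(TABLE_11_TOTALS).items():
        print(f"Set {name}: Hindi {hindi:.2f}, Marathi {marathi:.2f}, "
              f"ratio {hindi / marathi:.2f}")
    n, hindi, marathi = pooled(TABLE_11_TOTALS)
    print(f"All: {n} sentences, Hindi {hindi / n:.2f}, "
          f"Marathi {marathi / n:.2f}")
```

In all three sets the mean Hindi length exceeds the mean Marathi length, and both means fall together from Set I to Set III, which is what the assumption predicts.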
A summary of the statistics of Experiment No. 1 and Experiment No. 2 is given in Table 11.

Table 11. Summarized results of Experiment No. 1 and Experiment No. 2

ns = no. of sentences; ls = length of sentence (no. of words); Tw = total words; C = categories created by GIZA++; V = vocabulary (distinct words); cp = correctly aligned pairs; cw = common words; nu = null assignments; p:q = Hindi:Marathi alignment ratio; L = language (H = Hindi, M = Marathi)

                                                  Alignment ratio
Set  L  ns  ls     Tw    C    V    cp   cw   nu   1:2  1:3  1:4  1:5  1:6
I    H  24  16-28  435   100  224  112  92   169  35   7    1    0    0
     M  24  9-20   328   100  242
II   H  36  9-14   425   100  250  168  101  149  27   7    1    1    1
     M  36  7-13   337   100  251
III  H  30  1-8    181   100  150  56   52   41   2    1    0    0    0
     M  30  3-6    146   100  132
All  H  90  3-20   1041  100  488  465  121  358  64   14   5    1    0
     M  90  3-20   811   100  526
All  M  90  3-20   811   100  526  616  171  55   60   37   11   4    1
     H  90  3-20   1041  100  488

6. CONCLUSIONS AND DISCUSSIONS

In this article an attempt has been made to formalize the concept of a bilingual corpus with appropriate definitions in terms of Information Technology, so that the concepts are better understood by a developer and hence become easier to implement. Since our focus is on a Hindi and Marathi bilingual e_corpus, a study of the similarities between the two languages has been presented with a view to reducing the complexity of building the bilingual corpus. The various hypotheses stated in Sec. 3 should help a developer. Results of a few experiments in tagging and aligning are presented as evidence for some of the observations made earlier, which further strengthens our belief that a corpus designer can exploit the similarities to advantage.

References

[1] Anonymous, Script Grammar for Marathi language. Technical Report. Technology development for Indian languages Programme of DIT, Govt. of India in association with CDAC. Ver 1.4-2.

[2] Julie D. Allen. 2012. The Unicode Standard / the Unicode Consortium Version 6.2. Technical Report. Published in Mountain View, CA. ISBN 978-1-936213-07-8. September 2012.
[3] Akshar Bharati, Prakash Rao, Rajeev Sangal and S.M. Bendre. 2002. Basic Statistical analysis of corpus and cross comparison among corpora. In Proceedings of 2002 International Conference on Natural Language Processing, Mumbai, India. (2002).
[4] Akshar Bharati, Rajeev Sangal, Dipti Mishra Sharma and Lakshmi Bai. 2006. AnnCorra: Annotating Corpora Guidelines For POS And Chunk Annotation For Indian Languages. Language Technologies Research Centre, Technical Report, IIIT, Hyderabad, 2006.
[5] Peri Bhaskararao. 2011. Salient Phonetic features of Indian languages in speech technology. Sadhana Vol. 36, Part 5, pp. 587-599. http://dx.doi.org/10.1007/s12046-011-0039-z. (October 2011).
[6] Pushpak Bhattacharya, Debasri Chakrabarti and Vaijayanthi M. Sarma. 2006. Complex Predicates in Indian language Wordnets. Language Resources and Evaluation, Vol. 40, pp. 331-355. (2006).
[7] Sandeep Chaware and Srikantha Rao. 2011. Rule based phonetic matching approach for Hindi and Marathi. Computer Science and Engineering: An International Journal (CSEIJ), Vol. 1, No. 3. DOI: http:// 10.5121 /cseij 2011 (August 2011).
[8] Niladri Sekhar Dash. 2008. Corpus Linguistics: An Introduction. India: Pearson Education-Longman Publishing Co., pp. 208, ISBN: 81-317-1603-1, 2008.
[9] M.L. Dhore, S.K. Dixit and R.M. Dhore. 2012a. Hindi and Marathi to English NE Transliteration Tool using Phonology and Stress Analysis. Proceedings of 24th International Conference on Computational Linguistics: Demonstration Papers at IIT Bombay, pages 111-118 (2012).
[10] M.L. Dhore, S.K. Dixit and R.M. Dhore. 2012b. Issues in Hindi to English and Marathi to English Machine transliteration of Named Entities. International Journal of Computer Applications, Vol. 51, No. 14 (August 2012).
[11] M.L. Dhore, R.M. Dhore and P.H. Rathod. 2013. Transliteration by Orthography or Phonology for Hindi and Marathi to English: Case Study. International Journal of Natural Language Computing, Vol. 2, No. 5 (October 2013). DOI: 10.5121/ijnlc.2013.2501.
[12] A. Frankenberg-Garcia. 2009. Compiling and Using a Parallel corpus for research in translation. International Journal of Translation, Vol. 21(1), pp. 57-71 (2009).
[13] Nisheeth Joshi, Hemant Darbari and Iti Mathur. 2013. HMM Based POS tagger for Indian languages. Jan Zizka (Eds): CCSIT, SIPP, AISC, PDCTA 2013, pp. 341-349, 2013. CS & IT-CSCP 2013, DOI: 10.5121/csit.2013.3639 (2013).
[14] Rujvi Kamat, Manisha Ghate, Tamar H. Gollan, Rachel Meyer, Florin Vaida, Robert K. Heaton, Scott Letendre, Donald Franklin, Terry Alexander, Igor Grant, Sanjay Mehendale and Thomas D. Marcotte. 2012. Effects of Marathi-Hindi bilingualism on Neuropsychological performance. Journal of International Neuropsychological Society, Vol. 18, Issue 02, pp. 305-313, March 2012. http://dx.doi.org/10.1017/s1355617711001731.
[15] J. Ramanand, Akshay Ukey, Brahm Kiran Singh and Pushpak Bhattacharyya. 2007. Mapping and Structural analysis of Multilingual Wordnets. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 30(1). (March 2007).
[16] Shikar Kr. Sharma, Himadri Barali, Ambeshwar Gogoi, Ratul Ch. Deka and Anup Kr. Burman. 2012. A structured approach for building Assamese corpus: Insights, Applications and Challenges. In Proceedings of the 10th Workshop on Asian Language Resources, COLING 2012, pages 21-28.
[17] Anil Kumar Singh and Harshit Surana. 2007a. Can corpus based measures be used for comparative study of languages. In Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp. 40-47, Prague. (June 2007).
[18] Anil Kumar Singh and Harshit Surana. 2007b. Study of Cognates among South Asian languages for the purpose of Building Lexical Resources. Journal of Language Technology. Dept. of IT, Govt. of India. 2007.
[19] Lichao Song. 2010. The role of context in Discourse Analysis. Journal of Language teaching and Research, Vol. 1, No. 6, pp. 876-879. doi: 10.4304/jltr.1.6.876-879 (November 2010).
[20] Irene Thompson. 2014. About World languages: Hindi. http://aboutworldlanguages.com/hindi. (July 2014).
[21] Irene Thompson. 2014. About World languages: Marathi. http://aboutworldlanguages.com/marathi. (December 2014).
[22] Hans Uszkoreit. 2000. What is Computational Linguistics. http://coli.unisaarland.de/~hansu/what_is_cl.html
[23] Christopher C. Yang and Kar Wing Li. 2003. Automatic construction of English/Chinese parallel corpora. Journal of the American Society for Information Science and Technology, Vol. 54, Issue 8, pp. 730-742. http://dx.doi.org/10.1002/asi.10261 (June 2003).
[24] Johann Gamper and Paolo Dongilli. Primary data encoding of a bilingual corpus. In Proceedings of the 11th Annual Meeting of the GLDV, Frankfurt a/m, Germany, July 1999.