Information Theoretical Complexities in Developing a Bilingual Corpus: Critical comparison Hindi and Marathi


Sonal Khosla, Symbiosis International University
Haridasa Acharya, Symbiosis International University

Abstract

A critical comparison of the Hindi and Marathi languages, and its implications for building a bilingual corpus from an IT perspective, is presented here. We strongly believe that the effort required to build a corpus can be reduced if the similarities between the languages are properly accommodated in the design and development of the corpus. If the corpus is domain specific, the similarities can be exploited further to arrive at a more practicable design and to reduce the effort. The paper also discusses the challenges faced in building the corpus due to the dissimilarities of the two languages. As a first step towards such a design, we have attempted to provide formal definitions of the basic concepts in terms that IT designers can understand better. Results from two elementary experiments have been used to gain insight into the complexities involved in the design of a bilingual corpus. Hypotheses have been laid down and established which could help in understanding the complexities involved in the design and development of a bilingual corpus.

1. Introduction

1.1. Human language is a most exciting and demanding puzzle

Theoretical CL (Computational Linguistics) takes up issues in theoretical linguistics and cognitive science. Formal theories have reached a degree of complexity that can only be managed by employing computers [22]. Models simulating aspects of the human language faculty can be implemented as computer programs, which in turn constitute the basis for the evaluation and further development of the theories. Natural language applications built to run on computers have tremendous importance. The first thing required when natural language applications are written is the support of a good corpus.

A corpus is a representative sample of a language and is useful for linguistic analysis [16]. Building a corpus, or parallel corpora, is a task with both linguistic and information theoretical aspects to be taken care of. When we say applications, in today's scenario we expect practically every application to run off-line on office computers or client machines at the end user level, with possible web support.

1.2 Bilingual Applications

If one is concerned with two natural languages at a time, then a bilingual corpus is what is needed. Computational linguists have defined various characteristics of a good bilingual corpus [12][24]. However, from the point of view of using computers as the cognitive artifacts in applications, we need to identify the IT related aspects. Storage formats, the appropriate choice of databases, and the proper choice of tagging and aligning tools assume tremendous importance while building the corpus [24]. Facilitating easy browsing and providing a proper API, so that software developers can easily write applications which interface naturally with the corpus, are the factors one should be concerned with while building a bilingual corpus.

1.3 Objective of current paper

In this article we propose to compare the languages Hindi and Marathi. The comparison has three important parts. The first part is a formalization of some of the basic concepts, followed by our idea of how the similarities and dissimilarities between the languages can be tackled while building a bilingual corpus. The second part covers the linguistic similarities and dissimilarities, where we base our arguments on what earlier researchers have written. The third and last part presents the results of a very limited experimentation, which support a few hypotheses. We feel that earlier researchers have concentrated more on linguistic aspects, both pure and computational in nature, and have not provided an adequate link to down-to-earth software engineering requirements. It is hoped that this article will help developers get a clearer picture of the corpus, which in turn will make the selection of software environments to support related projects easier.

We have preferred to use the word e-corpus to emphasize the fact that the format will be electronic, stored in such a way that it is readily accessible to anyone who is computer literate. It will be easily browsable without requiring any special browser specifications that would hinder the process. Developers will also be able to easily create applications with interfaces which use the corpus at the back-end. As the title of the paper indicates, the chosen languages are Hindi and Marathi.

2. Basic Definitions and Formal Theoretical Aspects

Firstly we attempt the formalization of the basic definitions and concepts, which is the first step towards building an e_corpus.

2.1 Notations and Symbols

Table 1 gives the list of symbols and notations that are used in this article.

Symbol | Notation | Meaning
F | file | A file is a logical storage unit of a computer, which is a collection of data and information in the form of texts, pictures etc.
P | paragraph | A paragraph is a suitable sub-division of the contents of a file, defined by paragraph delimiters which can be specially defined spaces, new line characters and other special characters. A file can always be divided into a finite number of paragraphs; conversely, a file is a union of paragraphs.
S | sentence | A sentence is a set of tokens (words, punctuation marks, numerals, dates etc.) having a meaning in itself, which conveys a statement, question, exclamation or command. In Hindi, a sentence ends in a viraam, question mark or exclamation mark. In Marathi, a sentence ends in a full stop, question mark or exclamation mark.
K | token | A token is the smallest unit of a sentence to be processed, which can be a word, punctuation mark, numeral etc. It is usually enclosed by two blank characters or two punctuation marks. It represents one morphosyntactic unit.
W | word | A word is a sequence of characters/syllables with a defined meaning and a space on either side.
C | character | A character here refers to a syllable, where a syllable is a unit of spoken language.
L | lemma | A lemma is the canonical form of the token or the token itself. It is the base form of the word, representing all word forms belonging to the same word.
L | language_type | Language refers to the language in particular, i.e. Hindi or Marathi.
t | tag | A tag is a lexical category given to a particular token in a sentence. The lexical category is picked from the standard POS tagset given in Table 2.

Table 1. List of Symbols, Notations with their meanings

Formal Definitions

We present here the formal definitions that are requisite in building an e-corpus. [Most of the following definitions have been adapted from William H. Wilson, 2012.]

Def 1: Part of speech, POS (Lexical category). A set POS, also called a tagset, = {t1, t2, t3, ...}, where each element corresponds to a role which can be played by a word. It is a linguistic category.

Def 2: An ordered pair {w, t} is called a tagged-word if w is the word and t is its lexical category. It should be remembered that the same word may have different tags when it appears in different sentences, depending on the context [8]. A list of tags is given in Table 2.

Def 3: Part-of-speech tagging is the process of labelling each word w in each sentence s with its part of speech category t.

We will assume that an e-corpus should always be a tagged corpus to be apt for use in natural language processing applications. Further, we will assume that the tagsets are unique within an e-corpus.
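Defs 1 to 3 can be illustrated with a minimal sketch. The tag names are a subset of those listed in Table 2; the class name `TaggedWord`, the function `tag_sentence` and the placeholder tokens `w1` to `w10` are assumptions made only for this illustration and are not part of the original work.

```python
from typing import NamedTuple

# Subset of the tag names listed in Table 2 (assumption: enough for this example).
TAGSET = {"NN", "NST", "NNP", "PRP", "DEM", "VM", "VAUX", "JJ", "RB",
          "PSP", "RP", "CC", "WQ", "SYM", "UNK"}


class TaggedWord(NamedTuple):
    """A tagged-word in the sense of Def 2: a word and its lexical category."""
    word: str
    tag: str


def tag_sentence(tokens, tags):
    """Pair each token with its tag (Def 3), rejecting tags outside the tagset."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    for t in tags:
        if t not in TAGSET:
            raise ValueError(f"unknown tag: {t}")
    return [TaggedWord(w, t) for w, t in zip(tokens, tags)]


# Placeholder tokens with a ten-tag sequence of the kind a tagger might emit.
tokens = [f"w{i}" for i in range(1, 11)]
tags = "WQ NN NN PSP JJ NN PSP VM VAUX SYM".split()
tagged = tag_sentence(tokens, tags)
```

Because a tagged-word is a pair rather than a property of the word itself, the same word can appear with different tags in different sentences, which is the point made in Def 2.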
In our work we have used the tagset given in Table 2, taken from the POS Tagset [4] developed at the Language Technology Research Centre at IIIT, Hyderabad. Note that there are exactly 26 tags, 21 of them basic; they are common to all Indian languages and form the complete set.

Table 2: POS Tagset for Indian Languages

Sl. No. | Category | Tag Name | Remarks/Discussion | Examples (Hindi, Marathi)
1.1 | Noun | NN | Common noun. | घयघय हट, घयघय हट, स लन स लन
1.2 | NLoc | NST | Noun denoting spatial and temporal expressions such as location and time. They are also postpositions in certain contexts. | ऊऩय, लय
2 | Proper Noun | NNP | Proper nouns, used for manual annotation. | करऩ र करऩ र
3.1 | Pronoun | PRP | A pronoun is a word that substitutes for a noun or a noun phrase. | आऩक, आऩल म, इस म र
3.2 | Demonstrative | DEM | Demonstratives indicate the person or thing being referred to. It is used as a separate tag to mark the difference between demonstratives and pronouns. It points to a particular noun or to a noun it replaces. | उस त र
4 | Verb-finite | VM | A finite verb is used to mark tense and is used in agreement with the subject. If there is one verb in a sentence, it is a finite verb. They provide grammatical information of gender, person, number, tense, aspect, mood and voice. | फ र ए, फ रल, ख, ख, स, झ ऩ, य य ड
5 | Verb Aux | VAUX | An auxiliary verb is a verb giving further semantic or syntactic information of the main verb following it. | ह आह
6 | Adjective | JJ | Adjectives are words that describe or modify another person or thing in a sentence. | नळ र नळ र
7 | Adverb | RB | Adverbs are words that modify verbs. | ज, ज य, ध य हऱ ल य
8 | Post position | PSP | It is a word that is used to show the relation of a noun or pronoun to some other word in a sentence. It follows the object. | क, क, NA क
9 | Particles | RP | These are function words that show grammatical relationships with other words. | ब, ऩण, य
10 | Conjuncts | CC | Conjuncts or conjunctions are words that join parts of a sentence. | आणण औय
11 | Question Words | WQ | These are the words that put up a question. | क म, क म, कह क, क म, क ठ
12.1 | Quantifiers | QF | These are the words that tell us how many or how much. They generally precede or modify nouns. | सब, सगऱ, फह, ज स,.थ ड थ ड, कमभ कभ,
12.2 | Cardinal | QC | These words are the cardinal numbers in the language, which quantify and are adjectives referring to quantity. | फ स, एक, ल स, एक, द, न द न, त न
12.3 | Ordinal | QO | Ordinal numbers are words that represent the relative position of an item in an ordered sequence. | ऩहर, ऩहहर, द सय, द सय, सय त सय
12.4 | Classifier | CL | A classifier is a word which accompanies a noun in certain grammatical contexts and generally reflects some kind of conceptual classification of nouns. | NA, NA
13 | Intensifier | INTF | These are the words that intensify adjectives or adverbs. | ख फ, ख ऩ, फह, ज स थ ड, कमभ कभ,.थ ड,
14 | Interjection | INJ | Interjections are words that are used to exclaim, protest or command. They can sometimes be used by themselves. | अय, ह अय, ह
15 | Negation | NEG | These are the negative words in a language. | नह न ह
16 | Quotative | UT | A quotative introduces a quote. It is typically a verb, and some Indian languages use it. |
17 | Special Symbol | SYM | It is used to tag all the special symbols which cannot be categorised in any other category. | ?, : ;! ?, : ;!.
18 | Compounds | C | These are the words that are combined together to represent a single word. | र र यक र र यक क मळक ए ऩ ळ
19 | Reduplicative | RDP | It is used to mark those words in Indian languages that are repeated consecutively. | छ ट छ ट छ ट छ ट
20 | Echo | ECH | This category is designed for representing words in Indian languages that do not have any place in the dictionary and can be called nonsense words. | दल ई ळल ई NA
21 | Unknown | UNK | This category is used to mark the words whose category is not known, which may be loan words or foreign words. |

Def 4: We will refer to the tagset defined in Table 2 as the default tagset.

Def 5: Storage format is the standardized format that is used to store the metadata so that it is machine readable and interpretable. There are many standardized formats for encoding, such as:

Text Encoding Initiative (TEI)
Translation Memory Exchange (TMX)
Corpus Encoding Standard for XML (XCES) [24]
IMS Corpus Workbench

Each of the formats has its own advantages and disadvantages. We feel that the choice of format by the corpus builder will not create any incompatibility with the choice of tagging tool, or of alignment tools at a later stage, when the e_corpus is actually built.

Def 6: A RAW_resource is always a pair of text files in the corresponding languages, namely L1 and L2, using Unicode, one being a translation of the other for a bilingual corpus. This means we are avoiding the use of resources which could contain images or sound files, and are also assuming that the raw resource is in Unicode format. Further, this means that a resource is input into the corpus only when the builder is satisfied with its translation into the other language.

Def 7: The sum total of the sizes of all RAW_resource files, in number of bytes, is referred to as the size of the corpus. Obviously the actual footprint of the total content would be far larger than the size of the corpus, as it would include processed files, accompanying software resources etc.

Def 8: The length of a sentence s is always specified as its number of tokens, k. Tokens are words separated by a token delimiter. Token length and byte length of a sentence are two different metrics, both of which we will need in the analysis. Hence a separate definition of byte length is proposed.

Def 9: The byte length of a sentence is the total number of bytes that constitute the sentence, which includes white spaces and one byte for the sentence delimiter.
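The two metrics of Defs 8 and 9 can be contrasted with a short sketch. It assumes UTF-8 as the storage encoding and a simple tokenizer that splits on whitespace and treats the sentence delimiter as its own token; the function names and the sample sentence are illustrative assumptions, not taken from the experiments.

```python
import re

# Assumed sentence delimiters: danda (viraam, U+0964), full stop, '?' and '!'.
TOKEN = re.compile(r"[^\s\u0964.?!]+|[\u0964.?!]")


def tokenize(sentence):
    """Split on whitespace and keep each sentence delimiter as a separate token."""
    return TOKEN.findall(sentence)


def token_length(sentence):
    """Length of a sentence in tokens (Def 8)."""
    return len(tokenize(sentence))


def byte_length(sentence):
    """Length of a sentence in bytes when stored as UTF-8 (assumed encoding)."""
    return len(sentence.encode("utf-8"))


sample = "यह एक वाक्य है।"  # illustrative sentence, not from the test data
lengths = (token_length(sample), byte_length(sample))
```

Note that Def 9 counts one byte for the sentence delimiter. Under UTF-8 every code point in the Devanagari block, including the danda, occupies three bytes, so the sketch simply reports the encoded length; a fixed single-byte encoding such as ISCII would give different figures.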
Choice of the Database System: Use of a database system is a necessity while building a corpus. XML is treated as the default database unless otherwise specified. The reasons for choosing XML as the default database system are its interoperability with most platforms and programming languages, and its software and hardware independence in the way information is stored. Since Unicode is supported, it is ideal for storing text of any language, and it allows the use of simple text editors for the content part [24]. Most development tools have natural compatibility with XML. The other options could be any of the standard relational databases like MySQL, Oracle etc. The choice of database should not affect the usability of the corpus.

Def 10: A paragraph is a subdivision of the text file of finite length, identified by special delimiters like spaces, new line characters, tabs etc. Sometimes it may be indicated in number of sentences.

Def 11: Context_Tags is a set of predefined tags (like the set of POS tags) which can be associated with each of the paragraphs. There can be different classes of tags: linguistic, situational and cultural [19].
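As an illustration of XML as the default store, the following minimal sketch writes one aligned sentence pair with a context tag using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`corpus`, `pair`, `s`, `context`, `lang`, `direction`) are hypothetical and do not follow TEI, TMX or XCES; a real e_corpus would adopt one of those schemas.

```python
import xml.etree.ElementTree as ET


def build_pair(pair_id, hindi, marathi, context_tags=()):
    """Return an XML element holding one aligned Hindi/Marathi sentence pair."""
    pair = ET.Element("pair", id=str(pair_id))
    for tag in context_tags:
        ET.SubElement(pair, "context").text = tag
    ET.SubElement(pair, "s", lang="hi").text = hindi
    ET.SubElement(pair, "s", lang="mr").text = marathi
    return pair


# Hypothetical root element recording the language pair and the direction.
corpus = ET.Element("corpus", l1="hi", l2="mr", direction="Bi")
corpus.append(build_pair(1, "hindi sentence", "marathi sentence", ["medical"]))

# encoding="unicode" returns a str, so the text can be saved as plain UTF-8.
xml_text = ET.tostring(corpus, encoding="unicode")
```

Because the result is plain Unicode text, it can be inspected in a simple text editor, which is the practical advantage argued for above.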

Def 12: Context_tagging is the process of tagging each paragraph in a text file with predefined tags.

Def 13: Since we are concerned with a bilingual product, the concept of direction assumes considerable importance. The e-corpus will have a direction specified as one of the elements of the set {Uni, Bi}. If the choice is Uni, then the alignment will be L1 to L2 or L2 to L1, which again will be clearly specified in the definition. The choice Bi means bi-directional and needs no further specification, since it is symmetric. A bidirectional corpus includes both unidirectional corpora as recoverable sub-corpora [12]. The corpus will have various sub-corpora that will be aligned (text by text, paragraph by paragraph, sentence by sentence, phrase by phrase and word by word) [12].

Having defined the basic components in clear formal terms, we are now in a position to provide a good implementable definition of a monolingual e_corpus and a bilingual e_corpus.

Def 14: Bilingual e_corpus. A bilingual e_corpus is a quadruplet {Lan_Names, RawLang_Files, ProLang_Files, SoftwareToManage} with the following characteristics:

1. It has a specified pair of languages associated with it: Lan_Names = (L1, L2).
2. It consists of a repository of contents held in the containers {RawLang_Files, ProLang_Files}, the combined size of which is called the size of the corpus. RawLang_Files is a collection of resources, which are basically pairs of text files as defined in Def 6, whereas ProLang_Files are the XML files which provide the exact alignment and tagging of the raw resources.
3. It is intelligent in the sense that adequate interfaces are provided to add, modify and delete information, to perform various linguistic operations, and to browse and extract information for user applications (and many more related functionalities).
4. It has a software component, SoftwareToManage, an integrated package with the utilities required to manage the repository, which also contains proper APIs to facilitate application development in Java/C.
5. It is noiseless (possible noise is spelling mistakes, incorrect translations, incorrect character encoding, missing words). In short, it will have no linguistic inconsistencies.

ProLang_Files is essentially a collection of tagged repositories etc. obtained from RawLang_Files, arranged and structured in such a way as to allow the utilities in SoftwareToManage to work properly.

Def 15: Context. Context is the physical environment in which a word is used [19]. A word can have a different POS tag based on the context in which it is used [13]. Lexical ambiguity that arises in different situations can be resolved using the contextual information available in the text [13].

3. Linguistic Similarities and their Information Theoretic Implications

Developers face many challenges in building a bilingual corpus for the Hindi and Marathi pair of languages. The basic definitions discussed in Section 2 and the functional requirements specified in the definition of a bilingual corpus are not easy to meet. Through a study of the similarities and dissimilarities of the pair of languages, one can possibly counter some of the challenges and reduce the complexities.

3.1 Vocabulary

Indian languages share a common origin and are known to have a common vocabulary of around 40 to 80 percent [9]. Hindi and Marathi are one such pair of Indo-Aryan languages, both written in the Devanagari script. They are known to be sister languages and have a significant proportion of common vocabulary [18]. Words that are phonologically and lexically similar are defined as cognates [18][14]. Out of the corpus of 6 million words created by the Central Institute of Indian Languages for Marathi and Hindi, 44.5% are cognates. Sometimes, however, these cognate words have different meanings, posing a problem of word sense disambiguation for developers. These differences are due to the difference between the Marathi and Hindi grammatical rules for the construction of verbs and their placement. Some words retain similar meanings, while others have become associated with different concepts [18]. The problems arising due to bilingualism are reduced when the proportion of cognates in the two languages is higher [14]. Some examples of cognates in Hindi and Marathi are given below:

Same origin; same meaning: The word utsuk means curious in both Hindi and Marathi.
Same origin; different meaning: The word shiksha means education in Hindi, while the same word means punishment in Marathi.

Since very few similar words in the two languages have different meanings, the work involved in building lexical resources can be reduced by taking care of these cognates.

From the designer's perspective we may conclude that a good bilingual corpus faces the problems related to BIGDATA, i.e. the problems of Volume, Variety and Velocity. The common part of the vocabulary and the presence of cognate words will certainly reduce the Volume of the corpus significantly.

3.2 Script & Alphabet Set

The set of symbols of each language is unified into a single collection identified as a single script. These collections of symbols and scripts then serve as a reserve from which symbols are taken to write multiple languages. Hindi and Marathi both use the Devanagari script for writing, which is a phonetic script. The Devanagari script used for Hindi and Marathi has 12 pure vowels, two additional loan vowels taken from Sanskrit and one loan vowel from English [9][10]. There are 34 pure consonants, 5 traditional conjuncts, 7 loan consonants and 2 traditional signs in the Devanagari script, and each consonant has 14 variations through integration of the 14 vowels, which produces 507 different alphabetical characters [9][11]. Apart from ऱ, which is used only in Marathi, the consonants are identical. In Marathi, variant glyphs are preferred for U+0932 DEVANAGARI LETTER LA and U+0936 DEVANAGARI LETTER SHA [2].

The different committees of the Department of Electronics and the Department of Official Language, Govt. of India, have developed a universal code, the Indian Standard Code for Information Interchange (ISCII). The ISCII code is a superset of all the characters required in the ten Brahmi based Indian scripts. It is based on the standard ASCII code [1].
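The code points mentioned above can be inspected with Python's standard `unicodedata` module. The sketch below looks up the official character names and uses the name prefix to test whether a piece of text is written in Devanagari; the helper names `script_of` and `is_devanagari_text` are assumptions for illustration, and testing the name prefix is a simplification rather than a full script-detection method.

```python
import unicodedata


def script_of(ch):
    """Return the first word of the character's Unicode name, e.g. 'DEVANAGARI'."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_devanagari_text(text):
    """True if text has letters and every letter's name starts with DEVANAGARI."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(script_of(ch) == "DEVANAGARI" for ch in letters)


name_la = unicodedata.name("\u0932")   # 'DEVANAGARI LETTER LA'
name_sha = unicodedata.name("\u0936")  # 'DEVANAGARI LETTER SHA'
```

Since the check depends only on the script and not on the language, the same function accepts Hindi and Marathi input alike, which is the property a shared code converter relies on.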

Unicode has also encoded the Indian language scripts and is based on the Indian national standard, ISCII. The Unicode Standard has encoded the Devanagari characters in the same relative positions as in the ISCII-1988 standard. This enables one-to-one mapping between different scripts in the Indian family [2]. The Range of codes for Devanagari in The Unicode Standard Version 7.0 is F [2]. Since the script and the alphabet set are similar in both languages, conversion from Unicode to ISCII and vice versa is not language specific, but depends on the script.

Whenever we download or procure raw files or documents in Hindi or Marathi for inclusion in a repository, they have been produced by or passed through some document editor. In most cases they need to be converted into a plain text resource using a code converter. Font Suvidha is a one-of-its-kind software tool developed to convert text in Devanagari-script languages like Hindi, Marathi and many others, written in different fonts, to Unicode and vice versa. The availability of many such converters is a tremendous advantage for a corpus builder. Some tools also have a language detection feature that leaves English text unchanged, so that documents with mixed contents (English, Hindi or Marathi) can be handled easily.

Hypothesis: The commonality of the Devanagari script between the two languages has made the development of such Unicode converters possible. (One Unicode converter can handle RAW files from both languages.)

3.3 Phonology

Phonology is an important feature of a language. Most often, phonetically similar words have similar spellings. Devanagari being a phonetic script, this aspect can be used to match misspelled words or missing/muted words [11]. For example, the Hindi words aaya and gaya rhyme with the Marathi words aala and gela and have similar meanings as well. Even in the example sentences below, the words न ल and ण ल rhyme and have similar spellings.

Hindi: न ल कभ कय
Marathi: ण ल कभ कय.

The ण sound is more frequently used in Marathi.

Due to the phonetic similarity of different alphabets [7] and the several features and sounds shared across Indian languages [5], an optimal keyboard common to all languages is possible. The different committees of the Department of Electronics and the Department of Official Language, Govt. of India, have been evolving different codes and keyboards which could cater to all the Indian scripts.

Hypothesis: Due to the overlap in the vocabulary of Hindi and Marathi, words which have similar pronunciation and which rhyme can be taken directly into the corpus as equivalent words having similar meanings.

This hypothesis is supported by the experiments described in Section 5. Although the phonology aspect is more applicable in building a speech corpus, we have attempted to list in Table 3 the differences which need to be taken care of.

Table 3. Difference in pronunciation of certain consonants in Hindi and Marathi.

Consonants | Marathi | Hindi
च, ज, झ and प | Multiple pronunciation | Single pronunciation
ऋ | /ru/ | /ri/ and similar to Sanskrit
T, TH, D, DH, t, th, d, dh | Words ending with these consonants are prolonged | No change in pronunciation

च and ज are dental-alveolar in Marathi only, while these are alveolar in Hindi.

3.4 Grammar

This is one aspect that needs to be studied considerably to build a highly accurate bilingual corpus. Table 4 gives a comprehensive list of the similarities and dissimilarities in grammar between the two languages [20][21].

Hindi is a highly inflected language and requires the modification of a word to represent different grammatical categories such as tense, mood, voice, aspect, person etc. It adds prefixes and suffixes to form words. The inflection of verbs is called conjugation, and the inflection of nouns, adjectives and pronouns is called declension. Hindi uses postpositions (PSP) rather than prepositions for case marking, and auxiliaries. In Marathi, postpositions are attached to the preceding word. Marathi also adds suffixes to roots to build words [20][21]. When converting from Hindi to Marathi, PSPs like क, भ or ह are removed and added as a morphological phenomenon [11] or as grammatical information in the word itself; they are converted to a syntactic feature in Marathi [23]. Hence these Hindi PSPs do not find any translation equivalent in Marathi and are irrelevant for word alignment. Multiple words in Hindi are converted into a single compound word in Marathi. Mean sentence length of Hindi is while that of Marathi is 9.54 [3]. This is attributed to the fact that Marathi forms compound words. The pilot experiment conducted also shows that the sentence length in Hindi is always greater than the sentence length in Marathi. Section 5 gives the exact statistics of the sentence lengths in the pilot study.

For example, consider the following sentence pair taken from the test data.

Hindi: क म द ळय य क अन म अ ग भ प र ह?
Marathi: ह ल दन ळय य च म अन म ब ग ऩसय क?

The words ळय य क are converted to ळय य च म in Marathi, and the words अ ग भ form the compound word ब ग. It is also seen that the PSPs क and भ are converted into a syntactic feature in Marathi.

Table 4. Grammatical Categories of Hindi and Marathi

Category | Feature | Similarities | Differences
Nouns | Number | Singular, Plural | NIL
Nouns | Gender | Masculine, Feminine | Neuter (Marathi)
Nouns | Case | Direct (nominative), Vocative | In Marathi, genitive, accusative-dative, instrumental, ablative, locative. All cases except vocative are marked by postpositions. In Hindi, oblique (direct) case is used to mark subject of sentences and is used to mark postpositions.
Nouns | Articles | NIL | In Hindi, definite and indefinite. In Marathi, no articles.
Adjectives | | Adjectives agree with the nouns they modify in number, gender, and case. | In Marathi, adjectives do not inflect unless they end in long /a/.
Verbs | Person | 1st, 2nd, 3rd | In Hindi, 2nd honorific
Verbs | Number | Singular, Plural | NIL
Verbs | Tense | Past, Present, Future | NIL
Verbs | Aspect | Imperfective, Perfective | NIL
Verbs | Mood | Indicative, imperative, optative | Subjunctive, conditional (Marathi)
Verbs | Forms | Hindi verbs occur in the following forms: root, imperfect stem, perfect stem, and infinitive. The stems agree with nouns in gender and number. |
Word Order | | Subject-Object-Verb. Modifiers precede the nouns they modify. | In Marathi, indirect objects precede direct objects.

The common script and the phonetic similarities together will certainly reduce the Variety part of the BIGDATA problem faced by developers.

3.5 Other Aspects

Other challenges in processing Hindi and Marathi are the length of sentences, lexical ambiguity, the ordering of words etc. Hindi and Marathi sentences do not have the same length: a Marathi sentence is shorter than the corresponding Hindi sentence [3]. Due to differences in the usage context, a word may receive different POS tags, resulting in ambiguity. Indian languages are morphologically rich and allow the ordering of words in a sentence to change. Due to this free word order, the alignment of equivalent words is challenging [4]. Although both Hindi and Marathi follow the Subject-Object-Verb order, actual usage shows different word orders. As shown by the inter-language comparison study in [17], the distance between Hindi and Marathi is very small. There is a close correspondence between Hindi and Marathi, and they have largely similar structural properties [15][6]. Due to this structural similarity, the Marathi Wordnet can be developed through relation borrowing from the Hindi Wordnet [15][6].

4. Important Statistical Parameters

The word types are the distinct words in a corpus; their number is also a measure of the vocabulary of the language. Table 5 gives the top five frequently used words and syllables in the Hindi corpus and their percentage distribution [3].

Table 5. Some statistical measures

Frequently used words in Hindi | Percentage | Top five syllables in Hindi | Percentage
Ke | 3.59 | ra | 5.27
he | 3.08 | ka | 3.60
mem | 2.79 | na | 2.84
ki | | sa | 2.80
Se | 1.70 | pa | 2.17

Table 6 gives the number of words required to cover a certain percentage of the corpus [3]. There is a drastic difference between Hindi and Marathi in the number of words required to cover a given percentage of the corpus. As can be seen from the table, Marathi has a larger vocabulary than Hindi. This can be attributed to the fact that postpositions like ke, he, mem, ki and se are the words that occur most often in Hindi, but are converted to a syntactic or grammatical feature in Marathi, which therefore has more word variations and a larger vocabulary. Table 7 gives a comparative list of syllable and word statistics for Hindi and Marathi.
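The coverage measure behind Table 6 can be sketched as follows: count word frequencies, then add up the most frequent word types until the requested share of all tokens is reached. This is only an assumed reading of how such a table is computed; the function name and the toy token list are illustrative and are not the data of [3].

```python
from collections import Counter


def types_needed_for_coverage(tokens, fraction):
    """Number of most frequent word types whose counts reach `fraction` of all tokens."""
    counts = Counter(tokens)
    target = fraction * len(tokens)
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target:
            return n
    return len(counts)


# Toy corpus: two very frequent function words plus two rare words.
toy = ["ke"] * 5 + ["he"] * 3 + ["a", "b"]
coverage = {f: types_needed_for_coverage(toy, f) for f in (0.5, 0.8, 0.9, 1.0)}
```

When a language folds frequent function words into inflected forms, fewer tokens fall on a handful of very frequent types, so more types are needed for the same coverage. This is the effect the text attributes to Marathi.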

Table 6. Number of words required to cover a certain percentage of the corpus. (Columns: % of Corpus, Hindi, Marathi.)

Measure | Hindi | Marathi
Corpus Size (in no. of words) | |
Word Types | |
Syllable types | |
Average no of syllables in a word | |
Syllable Mode | 2 | 3
Most frequent syllable | Ra | wa
Bigram syllable types | |
Most frequent bisyllable | ka : ra | A : he
Most frequent trisyllable | u:na:ki, a:pa:ne, a:pa:ni, i:sa:ke, u:sa:ki | mha:nu:na, mha:na:je, A:pa:lyA, A:he:wa, ka:ru:na
Maximum Word Length | |
Average Word Length | |
Total Sentences | |
Mean Sentence Length | |

Table 7. A comparative list of syllable and words in Hindi and Marathi

The syllable patterns are very important in the study of languages. Table 8 shows the trisyllable patterns and their distribution in Hindi and Marathi. These syllable frequencies are used to extract unique patterns from the corpus. For example, a pattern with a high frequency in Hindi is compared with its occurrence in Marathi. It has been observed that a pattern which has a high occurrence in Hindi has a very low occurrence in Marathi. Such syllable patterns are therefore unique to Hindi. Some patterns are unique to a particular language and hence can be used to identify that language.

Table 8. The trisyllable pattern and its distribution in Hindi and Marathi. (Columns: Trisyllable Pattern in Hindi, Hindi, Marathi.)

Trisyllable patterns: ka:ra:ne, a:pa:ne, sa:ma:ya, u:sa:ke, ka:ra:we

5. Results of some Related Experiments

Tagging and aligning are two basic techniques used in providing interpretable structure to the contents of a corpus [13]. We have run a few trials on selected taggers and aligners. The results of the experiments are reported here. A critical look at the outputs helps in understanding the complexity of the whole process and also suggests how the similarity between the languages can help in reducing the complexity. The data was collected from the medical domain. A set of 90 sentences of varying length was taken. Sentence length ranges from 3 to 28 tokens per sentence.

Experiment 1: Experiment on Tagging

Tools: Shallow Parser for Hindi and Marathi developed by IIIT Hyderabad.
Input: Raw file containing 90 sentences in Unicode format

14 Tagged Output : Parsed sentences in a text file. A sample output selected from the Parsed file is shown in Table 9. Discussions : The following observations were made: The Postpositions in Hindi do not exist in Marathi Almost the word order remains the same with few changes. Postpositions do not have any translation equivalents in Marathi as these gets converted into a grammatical feature. The lemma (root) words are same for words which are common in both the languages as shown in the above table. If the starting few syllables of two words in Hindi and Marathi are similar then their root words are translation equivalents of each other like ळय य and ळय य च म. Table 9. Sample output of Experiment 1. क म द द ळय य क अन म अ ग भ Hindi प र ह? Sentence क म द द ळय य क अन म अ ग भ प र with lemma ह? No. of Tokens 10 POS Tags WQ NN NN PSP JJ NN PSP VM VAUX SYM ह ल दन ळय य च म अन म ब ग ऩसय Marathi क? Sentence ह ल दन ळय य अन म ब ग ऩसय with lemma क? POS Tags DEM NN NN JJ NN VM WQ SYM 8 Experiment 2 : Experiment on Alignment Tool : GIZA ++ Input : Parallel corpus Aligned output : Bilingual corpus aligned at word level. A sample output has been shown in Table 10. Discussions : With Source as Marathi and Target as Hindi the word alignment procedure matches 83% of the Marathi words with some hindi word, 17% are not aligned with any word and is hence null. Out of these, 76% are correctly aligned pairs where the marathi word is correctly matched with the corresponding hindi word. With Source as Hindi and Target as Marathi, the word alignment procedure matches 85% of the Marathi words with some hindi word, 15 % are not aligned with any word and is hence null. Out of these, 73% are correctly aligned pairs where the marathi word is correctly matched with the corresponding hindi word. Mean sentence length of Hindi = 1041/90 =

Mean sentence length of Marathi = 811/90 ≈ 9.01
Average distance between the lengths of sentences = […]

[…]% of the Hindi words are also present in Marathi and 33% of the Marathi words are also present in Hindi, which indicates the commonality of vocabulary. The mean difference between sentence lengths is 2.75, which means that a Hindi sentence is longer than its Marathi counterpart by approximately 2.75 words. This supports the assumption that a long sentence gets translated into a long one, while a short sentence gets translated into a short one.

The following is the output of GIZA++. Table 10 shows the serial number assigned to each word by the tool.

क म ({ 1 }) द ({ 2 }) ळय य ({ 3 }) क ({ }) अन म ({ 4 }) अ ग ({ 5 }) भ ({ }) प र ({ 6 7 }) ह ({ }) ? ({ 8 })

Table 10. Sample output of Experiment No. 2
Sl. No. of words: […]
Source sentence (Hindi): क म द द ळय य क अन म अ ग भ प र ह?
Target sentence (Marathi): ह ल दन ळय य च म अन म ब ग ऩसय क?

In the output of GIZA++, each Hindi word is followed by the positions of the Marathi words to which it is aligned. For example, क म ({ 1 }) means that the word क म is aligned to the first word of the Marathi sentence, and द द ({ 2 }) means that the word द द of the Hindi sentence is aligned to the 2nd word of the Marathi sentence. If no equivalent is found, the word is assigned null. Out of the 10 Hindi words, 7 have been aligned to some Marathi word, and the remaining 3 are not aligned and are shown with empty brackets. Out of the 7 aligned words, 5 are correctly aligned. Four of these are one-to-one mappings, while the word प र ({ 6 7 }) is aligned to the 6th and 7th words of the Marathi sentence and is hence an example of one-to-many mapping.

Result of alignment on GIZA++ with Marathi as source and Hindi as target sentence

The experiment was repeated with Marathi as the source sentence and Hindi as the target sentence.

ह ({ 1 }) ल दन ({ 2 }) ळय य च म ({ 3 }) अन म ({ 5 }) ब ग ({ 7 8 }) ऩसय ({ 6 }) क ({ 9 }) ? ({ 10 })

All the Marathi words in the source sentence have been aligned to some Hindi word. There is no null assignment in this case.
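The counts quoted for the sample alignment (aligned words, null assignments, one-to-many mappings) can be reproduced mechanically from an alignment line. The following is only a minimal sketch, not the procedure used in the experiments: it assumes the alignment is available as a single string in the `token ({ positions })` layout shown above, the function names `parse_alignment` and `summarize` are hypothetical, and the tokens `h1` … `h9` are ASCII placeholders standing in for the Hindi words of the sample sentence.

```python
import re

# One "token ({ i j ... })" group; the token itself may contain spaces.
PAIR = re.compile(r"\s*(.+?)\s*\(\{([\d\s]*)\}\)")

def parse_alignment(line):
    """Return a list of (source_token, [target positions]) pairs."""
    return [(tok, [int(p) for p in pos.split()])
            for tok, pos in PAIR.findall(line)]

def summarize(pairs):
    """Count aligned, null and one-to-many source tokens."""
    aligned = [p for p in pairs if p[1]]
    return {
        "source_tokens": len(pairs),
        "aligned": len(aligned),
        "null": len(pairs) - len(aligned),
        "one_to_many": sum(1 for _, pos in aligned if len(pos) > 1),
    }

# Placeholder tokens mirroring the Hindi-to-Marathi sample alignment pattern.
sample = ("h1 ({ 1 }) h2 ({ 2 }) h3 ({ 3 }) h4 ({ }) h5 ({ 4 }) "
          "h6 ({ 5 }) h7 ({ }) h8 ({ 6 7 }) h9 ({ }) ? ({ 8 })")
stats = summarize(parse_alignment(sample))
print(stats)
```

For the placeholder line this gives 10 source tokens, 7 aligned, 3 null and 1 one-to-many mapping, matching the counts discussed above. Whether an aligned pair is correct still has to be judged manually, since the alignment line alone carries no gold standard.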

A summary of the statistics in Experiment No. 1 and Experiment No. 2 is given in Table 11.

Table 11. Summarized results of Experiment No. 1 and Experiment No. 2
Legend: ns = no. of sentences; ls = length of sentence (no. of words); Tw = total words; C = categories created by GIZA++; V = vocabulary (distinct words); cp = correctly aligned pairs; cw = common words; nu = null assignments; p:q = Hindi:Marathi alignment ratio; H = Hindi; M = Marathi.
Columns: Set No, L, ns, ls, Tw, C, V, cp, cw, nu, Alignment Ratio (1:2, 1:3, 1:4, 1:5, 1:6)
Rows: I (H, M); II (H, M); III (H, M); All (H, M); All (M, H)
[…]

CONCLUSIONS AND DISCUSSIONS

In this article an attempt has been made to formalize the concept of a bilingual corpus, with appropriate definitions in terms of Information Technology, so that the concepts are better understood by a developer and hence become easier to implement. Since our focus is on a Hindi and Marathi bilingual e-corpus, a study of the similarities between the two languages has been presented with a view to reducing the complexity of the bilingual corpus. The various hypotheses stated in Sec 3 should help a developer. Results of a few experiments in tagging and aligning are presented as evidence for some of the observations made earlier, which further strengthens our belief that a corpus designer can exploit the similarities to their advantage.

References

[1] Anonymous, Script Grammar for Marathi language. Technical Report. Technology development for Indian languages Programme of DIT, Govt. of India in association with CDAC. Ver

[2] Julie D. Allen The Unicode Standard / the Unicode Consortium Version 6.2. Technical Report. Published in Mountain View, CA. ISBN September
[3] Akshar Bharati, Prakash Rao, Rajeev Sangal and S.M. Bendre Basic Statistical analysis of corpus and cross comparison among corpora. In Proceedings of 2002 International Conference on Natural Language Processing, Mumbai, India. (2002).
[4] Akshar Bharati, Rajeev Sangal, Dipti Mishra Sharma and Lakshmi Bai AnnCorra: Annotating Corpora Guidelines For POS And Chunk Annotation For Indian Languages. Language Technologies Research Centre, Technical Report, IIIT, Hyderabad,
[5] Peri Bhaskararao Salient Phonetic features of Indian languages in speech technology. Sadhana Vol. 36, Part 5, pp dx.doi.org/ / s z. (October 2011).
[6] Pushpak Bhattacharya, Debasri Chakrabarti and Vaijayanthi M. Sarma Complex Predicates in Indian language Wordnets. Language Resources and Evaluation, Vol. 40. pp (2006).
[7] Sandeep Chaware and Srikantha Rao Rule based phonetic matching approach for Hindi and Marathi. Computer Science and Engineering: An International Journal (CSEIJ), Vol. 1, No. 3. DOI : /cseij 2011 (August 2011).
[8] Niladri Sekhar Dash Corpus Linguistics: An Introduction. India: Pearson Education-Longman Publishing Co., pp. 208, ISBN: ,
[9] M.L. Dhore, S.K. Dixit and R.M. Dhore. 2012a. Hindi and Marathi to English NE Transliteration Tool using Phonology and Stress Analysis. Proceedings of 24th International Conference on Computational Linguistics: Demonstration Papers at IIT Bombay, pages (2012).
[10] M.L. Dhore, S.K. Dixit and R.M. Dhore. 2012b. Issues in Hindi to English and Marathi to English Machine transliteration of Named Entities. International Journal of Computer Applications, Vol. 51, No. 14 (August 2012).
[11] M.L. Dhore, R.M. Dhore and P.H. Rathod Transliteration by Orthography or Phonology for Hindi and Marathi to English: Case Study. International Journal of Natural Language Computing, Vol. 2, No. 5 (October 2013). DOI : /ijnlc
[12] A. Frankenberg-Garcia Compiling and Using a Parallel corpus for research in translation. International Journal of Translation, Vol. 21(1), pp. 57-71, (2009).
[13] Nisheeth Joshi, Hemant Darbari and Iti Mathur HMM Based POS tagger for Indian languages. Jan Zizka (Eds): CCSIT, SIPP, AISC, PDCTA 2013, pp , CS & IT-CSCP 2013, DOI : /csit (2013).
[14] Rujvi Kamat, Manisha Ghate, Tamar H. Gollan, Rachel Meyer, Florin Vaida, Robert K. Heaton, Scott Letendre, Donald Franklin, Terry Alexander, Igor Grant, Sanjay Mehendale and Thomas D. Marcotte Effects of Marathi-Hindi bilingualism on Neuropsychological performance. Journal of International Neuropsychological Society, Vol. 18, Issue 02, pp , March,
[15] J. Ramanand, Akshay Ukey, Brahm Kiran Singh and Pushpak Bhattacharyya Mapping and Structural analysis of Multilingual Wordnets. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 30(1). (March 2007).
[16] Shikar Kr. Sharma, Himadri Barali, Ambeshwar Gogoi, Ratul Ch. Deka and Anup Kr. Burman A structured approach for building Assamese corpus: Insights Applications and Challenges. In Proceedings of the 10th Workshop on Asian Language Resources, COLING 2012, pages ,
[17] Anil Kumar Singh and Harshit Surana 2007a. Can corpus based measures be used for comparative study of languages. In Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp 40-47, Prague. (June 2007).
[18] Anil Kumar Singh and Harshit Surana 2007b. Study of Cognates among South Asian languages for the purpose of Building Lexical Resources. Journal of Language Technology. Dept. of IT, Govt. of India
[19] Lichao Song The role of context in Discourse Analysis. Journal of Language teaching and Research, Vol. 1, No. 6, pp doi: / jltr (November 2010).
[20] Irene Thompson About World languages: Hindi. aboutworldlanguages.com/hindi. (July 2014).
[21] Irene Thompson About World languages: Marathi. aboutworldlanguages.com/marathi. (December 2014).
[22] Hans Uszkoreit What is Computational Linguistics. coli.unisaarland.de/~hansu/what_is_cl.html
[23] Christopher C. Yang and Kar Wing Li Automatic construction of English/Chinese parallel corpora. Journal of the American Society for Information Science and Technology. Vol. 54, Issue 8, p.p dx.doi.org/ / asi (June 2003).
[24] Johann Gamper and Paolo Dongilli, Primary data encoding of a bilingual corpus, In Proceedings of the 11th Annual Meeting of the GLDV, Frankfurt a/m, Germany, July,


More information
