A Hackathon for Classical Tibetan


To cite this version: Orna Almogi, Lena Dankin, Nachum Dershowitz, Lior Wolf. A Hackathon for Classical Tibetan. Episciences.org, 2019, Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages. <hal v3>

HAL Id: hal
Submitted on 30 Dec 2018
Public Domain

Orna Almogi 1, Lena Dankin 2*, Nachum Dershowitz 2,3, Lior Wolf 2

1 Universität Hamburg, Germany
2 Tel Aviv University, Israel
3 Institut d'Études Avancées de Paris, France
* Corresponding author: Lena Dankin, lenadank@tau.ac.il

Abstract

We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter.

Keywords

Tibetan; Buddhist studies; hackathon; stemming; segmentation; intertextual alignment; text classification.

I INTRODUCTION

In February 2016, a group of four Tibetologists (from the University of Hamburg), one digital humanities scholar (from Europe), and twelve computer scientists (from Israel and Europe) got together in Kibbutz Lotan in the Arava region of Israel with the stated goal of developing algorithmic methods for advancing Tibetan Buddhist textual studies. Participants were either recruited by the organizers or responded to an announcement on several mailing lists (see Figure 1). Most of the computer scientists had a background in machine learning, and a few of them also had experience with natural language processing (NLP) research, but none had prior experience with Tibetan texts.

The computer scientist organizers were quite familiar with programming workshops and contests and thought that the challenges presented by Tibetan texts would pose an ideal opportunity to explore the hackathon format. A hackathon is a short and intense event where computer scientists collaborate to develop software. For that purpose, it was essential to recruit as many software developers as possible. Some of the recruited students had participated in other hackathons.
The plan was to have a focused hacking event, with specific goals to work towards, goals that had been provided by the Tibetan scholars. The six-hour drive down from Tel Aviv (including a stop to admire desert flora) afforded an opportunity for everyone to get to know each other. The back seats of the van were piled high with computer equipment and the kibbutz was to provide the necessary fast internet connection. The isolation of the kibbutz created an intense working environment and encouraged long hours; the stark natural beauty of the location contributed to a shared sense of tranquility of purpose.

Several hackathons have been conducted for the purpose of developing tools for digital humanities, before and after ours. In June 2015, the eTRAP team organized a hackathon for text reuse. Twenty-three participants from fifteen different institutes worked on the detection of textual reuse across data in different languages and from different genres, using the TRACER tool [Büchler, 2013; Büchler et al., 2014]. More recently, in May 2017, another hackathon took place in Helsinki, which brought together historians, linguists, psychologists, and computer scientists to work on four different tasks, including the analysis of the written media with the political elite. In November 2017, the National Library of Israel hosted a 24-hour hackathon dedicated to developing state-of-the-art tools and applications for national cultural treasures, using an IIIF server for its large collection of images.

The two main tasks that confronted our group that week (February 14-18) were (1) to develop algorithms for finding intertextual parallels that are only approximately the same, and (2) to experiment with algorithmic classification methods for identifying authorship and style. In both cases, the concern was centered on language issues specific to Tibetan.

After a quick lesson in Tibetan, the Buddhist canon, and modern Tibetan encoding conventions for the benefit of the less knowledgeable, the group split into four loose teams, devoted to the following goals: (A) dataset preparation; (B) language tool development; (C) intertextual alignment; and (D) text classification. We describe each of these efforts in turn in the sections that follow. Each team consisted of a few computer scientists, chosen on the basis of individual background, experience, and interests, plus a Tibetan scholar who provided annotated data sets and analyzed results. Twice a day we held synchronization round-ups, where each team briefed everyone about its progress, discussed its next steps, and raised problems it had stumbled across.
II PRELIMINARIES

Tibetan is a monosyllabic language (Tibetan morphemes normally consist of one syllable) belonging to the Tibeto-Burman branch of the Sino-Tibetan family. The language is ergative, with a plethora of (usually monosyllabic) grammatical particles, which are often omitted. Occasionally, the same syllable can be written using one of several orthographic variations, for example, sogs and stsogs. In the case of verbs, the syllable has various inflectional forms that are often homophones, a fact that can result in variant readings due to scribal errors or lack of standardization. An example of such inflectional forms is sgrub, bsgrubs, bsgrub, sgrubs (present, past, future, and imperative, respectively), all of which are homophones. The intransitive form of the verb offers even more inflectional forms that yield homophones with their transitive counterparts: ʼgrub and grub (present/future and past, respectively). See [Beyer, 1992] for details about the language.

The Tibetan Buddhist canon consists of two parts: the Kangyur (bkaʼ ʼgyur), which commonly comprises 108 volumes containing what is believed by tradition to be the Word of the Buddha, texts that were mostly translated directly from the Sanskrit original (with some from other languages and others indirectly via Chinese); and the Tengyur (bstan ʼgyur), commonly comprising about 210 volumes consisting of canonical commentaries, treatises, and various kinds of manuals that were written in the seventh to thirteenth centuries and likewise mostly translated from Sanskrit, with some works from other languages and a few originally written in Tibetan. Overall, this corpus contains 77 million occurrences (tokens) of 81,000 different syllable types. The average transcribed syllable length is 3.5 and the average number of syllables in a single document is

III HACKATHON TASKS

A. Dataset preparation

A prerequisite for the main goals of the hackathon was data with which to work, that is, texts to compare and classify. For this, we took Tibetan Buddhist texts obtained from various sources. These included the Tibetan Buddhist canon in digital form (we used a modified form of the ACIP files of the Kangyur and Tengyur provided by Paul Hackett of Columbia University) and several sets of autochthonous Tibetan Buddhist texts by various authors (compiled by Eric Werner of Universität Hamburg).

In addition, it was necessary to prepare test suites with manually prepared gold standard answers, so that the performance of algorithms for finding parallel passages and for classifying texts could be measured. The passages were selected from various sources, particularly from (a) two doxographical texts (ʼgrub mthaʼ), the gzhung lugs rnam byed by Phywa pa Chos kyi seng ge and the ʼGrub mthaʼ mdzod by Klong chen pa Dri med ʼod zer, the latter including borrowed passages from the former [Werner, 2014], and (b) Rong zom Chos kyi bzang poʼs (11th c.) collected writings, which feature numerous cases of parallel passages.

These Tibetan works were provided in textual form, transcribed according to the Wylie convention [Wylie, 1959]. In this system, Tibetan is transliterated into Latin characters without diacritics; thus various Tibetan letters are represented by two or three Latin consonants. The decision to work with transliterated texts was made partly because they were the ones available at the time, but also because the computer scientists could not read Tibetan script, so this transliteration made it possible for them to progress quickly without the need to acquire a new alphabet. The texts had to be cleaned by removing sigla and by standardizing punctuation.

B. Language tools

Since syllables having the same base form may take many different surface forms, stemming is a crucial stage in almost every text-processing task one would like to perform in Tibetan, as in many other languages. So, to support present and future analysis of Tibetan texts, developing a stemmer was one of the first orders of business.

Usually, in Indo-European and Semitic languages, stemming is performed on the word level. However, in Tibetan, in which multisyllabic words are not separated by spaces or other marks, a syllable-based stemming mechanism is required even in order to segment the text into lexical items. Stemming is not the same as (grammatical) lemmatization, and the stemming process can result in a stem that is not itself a lexical entry in a dictionary. Moreover, unlike in Indo-European languages, stemming of Tibetan is mostly relevant to verbs and verbal nouns (which are common in the language). Despite being inaccurate in some cases, stemming (for Tibetan, as for other languages) can improve tasks such as word segmentation and the detection of intertextual parallels [Klein et al., 2014].

Even for Tibetan words consisting of more than one syllable, stemming each substantial syllable (i.e. excluding grammatical particles) makes sense, since all the inflections are embedded at the syllable level. For instance, the words brtag dbyad (analysis) and brtags dpyad (analyzed) are stemmed to rtog dpyod (to analyze, analysis).

The stemmer we developed is a rule-based application that works in the following manner. First, the syllable is divided into a sequence of Tibetan letters. This stage is required because the Wylie transliteration scheme represents some Tibetan letters by more than one character (e.g. zh, tsh). There is, fortunately, no ambiguity in the process of segmentation into Tibetan letters. By design, the transliteration ensures that whenever a sequence of two or three characters represents a single letter, it cannot also be interpreted in context as a sequence of distinct Tibetan letters.

For the analysis of the Tibetan syllable we used an octuple (8-component) scheme. Each Tibetan syllable should contain one core letter and one vowel. The other positions (subscript, superscript, coda, prescript, postscript, and appended particle) are not obligatory. Each position contains a single letter, except for that of the appended particle, which can be any of six syllables. We define the stem of a syllable as consisting of the core letter or stacked letter (which, in turn, consists of the core letter and a superscript or a subscript, or both), the vowel (syllabic contractions contain two vowels at most), and the coda (if extant). Syllables can be considered stemmically identical if these components agree, despite additions or omissions of a prescript and/or a postscript.

The final stage of the stemming is normalization, since there are groups of Tibetan letters that can be replaced one with another without changing the basic meaning of the syllable (in inflectional forms). Since the goal is to group all syllables that are ultimately stemmically identical into one and the same stem, we normalized all tuples according to an elaborate set of rules.

The stemmer, as described, extracts the information encoded in each Wylie-transliterated syllable and makes it explicit. An important task, given two syllables, is to evaluate their stemmic similarity.
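The first and last steps of this description (letter segmentation and ignoring prescripts and postscripts) can be illustrated with a short Python sketch. It is a deliberately simplified heuristic written for this illustration, not the authors' rule set: the function names, the letter tables, and the rule for spotting a prescript are all assumptions, and the sketch performs no normalization and does not handle appended particles or rare stacks such as grwa.

```python
import re

# Wylie sequences that stand for a single Tibetan letter are tried first
# (longest first), so "tsh" is never read as "ts" + "h".
_LETTER = re.compile(r"tsh|kh|ng|ch|ny|th|ph|ts|dz|zh|sh|[a-zʼ']")

VOWELS = set("aeiou")
PRESCRIPTS = {"g", "d", "b", "m", "'", "ʼ"}   # assumed prescript letters
SUBSCRIPTS = {"y", "r", "l", "w"}             # assumed subscript letters
POSTSCRIPTS = {"s", "d"}                      # assumed postscript letters


def wylie_letters(syllable):
    """Split a Wylie-transliterated syllable into Tibetan letters."""
    return _LETTER.findall(syllable.lower())


def rough_stem(syllable):
    """Return (stack, vowel, coda), dropping a prescript and a postscript.

    Heuristic (an assumption of this sketch): the first onset letter is a
    prescript if it is a possible prescript and is followed either by two
    or more letters or by a letter that cannot be a subscript.
    """
    letters = wylie_letters(syllable)
    v = next((i for i, x in enumerate(letters) if x in VOWELS), None)
    if v is None:                      # no vowel found: return letters as-is
        return tuple(letters)
    onset, vowel, tail = letters[:v], letters[v], letters[v + 1:]
    if (len(onset) >= 2 and onset[0] in PRESCRIPTS
            and (len(onset) > 2 or onset[1] not in SUBSCRIPTS)):
        onset = onset[1:]              # drop the prescript
    if len(tail) == 2 and tail[1] in POSTSCRIPTS:
        tail = tail[:1]                # drop the postscript
    return ("".join(onset), vowel, "".join(tail))


if __name__ == "__main__":
    for form in ["sgrub", "bsgrubs", "bsgrub", "sgrubs", "ʼgrub", "grub"]:
        print(form, rough_stem(form))
```

Under these assumptions the four transitive forms quoted earlier (sgrub, bsgrubs, bsgrub, sgrubs) collapse to one tuple, and ʼgrub and grub collapse to another, which is the kind of grouping the stemmer is meant to achieve before normalization.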
Some substitutions can be considered silent or synonymous; others change the meaning completely; and there is a continuous spectrum in between. Metric learning algorithms were used to assess the relative importance of each substitution.

Another important language task is word segmentation, that is, grouping syllables into words (lexical units). Since no spaces or special characters are used to mark word boundaries, the reader has to rely on language models to detect the word boundaries. Unlike for the stemming task, we had recourse to an annotated corpus for the segmentation task, that is, a word-segmented corpus, with which it was possible to train a supervised model. The training data that was used, consisting of 37,000 sentences, was obtained from the Tibetan in Digital Communication project.

The approach taken at the hackathon was based on a flavor of recurrent neural networks (RNNs) called long short-term memory (LSTM) [Hochreiter & Schmidhuber, 1997]. LSTMs have been used in the past for word segmentation of Chinese text [Chen et al., 2015]. The tuple representation of syllables was used for this purpose; see details in [Almogi et al., 2016]. Several LSTM setups were compared. In addition, a more traditional algorithm, the conditional random field (CRF), was applied to the data; it yielded a lower F1 score than the best LSTM configuration. This technique was previously applied to Tibetan script in [Liu et al., 2011].

It bears noting that our efforts to train a word2vec model [Mikolov et al., 2013] to represent Tibetan syllables did not result in a solid representation, in the sense that pairs of vectors with high (cosine) similarity did not usually represent synonyms. For that reason, the vector representation that was developed for the stemmer was also essential for the word segmentation task.
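To make concrete how an F1 score can be used to compare segmenters, here is a minimal Python sketch that scores a predicted segmentation against a gold one at the word level. The scoring convention (a predicted word counts as correct only if both of its boundaries match the gold segmentation), the function names, and the toy sentence are assumptions of this sketch; the source does not state exactly how its scores were computed.

```python
def word_spans(words):
    """Map a segmentation (a list of words, each a list of syllables)
    to the set of (start, end) syllable offsets of its words."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold, predicted):
    """Word-level F1: harmonic mean of precision and recall over word spans."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example: the predicted segmentation wrongly splits the last word.
    gold = [["sangs", "rgyas"], ["kyi"], ["bstan", "pa"]]
    predicted = [["sangs", "rgyas"], ["kyi"], ["bstan"], ["pa"]]
    print(round(segmentation_f1(gold, predicted), 3))
```

In the toy example two of the four predicted words match two of the three gold words, so precision is 1/2, recall is 2/3, and F1 is 4/7.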

Both the stemmer and word segmenter have been made publicly available. Additional details may be found in [Almogi et al., 2016].

C. Intertextual alignment

The primary goal of the hackathon was to develop and compare tools for finding parallel passages between Tibetan texts that are the result of either acknowledged citations (with or without attributions) or borrowing (i.e. with no acknowledgement whatsoever). Generally, for determining the history of composition or relative chronology of a text, passages need not match precisely. That is, in addition to the fact that orthographical differences or omission/addition of grammatical particles are of no great significance, cited or borrowed passages are often not reproduced verbatim, but are slightly paraphrased or shortened, or both. For determining the identity of persons involved in the composition of the text and its transmission (that is, the author, translator, scribe, or editor), the precision of the match is of greater significance, and even variation in orthography or omission/addition of grammatical particles may be relevant. In this regard, however, textual scholars take into consideration that texts were often copied and edited and that through these processes changes could have been introduced into the text, either deliberately (particularly in terms of standardization of orthography and verb inflection, employment of particles, and even substitutions of terminology in cases of archaism) or unintentionally.

Broadly speaking, there are two cases of interest: (a) an approximate alignment of what could be considered to be exactly the same text, that is, an alignment that allows variants that are considered accidental or non-substantial (that is, variations regarding omission/addition or different forms of the same grammatical particles, orthography, inflectional forms in the case of verbs, archaism vs. standardization, and the like), and (b) an approximate alignment of passages that contain the same text but in modified form of some sort, that is, an alignment that allows substantial variants in addition to the non-substantial ones (omission/addition of a substantial syllable, replacement of a substantial syllable by a completely different one, omission/addition of a string of syllables, occurrence of the same syllables in a different order, and the like). Substantial variants can also occur when a (more or less) exact citation or borrowing was intended: they may have been introduced intentionally by the author himself or by scribes and editors during the process of transmission, or they may have crept in unintentionally during composition and copying. To address this problem, a limited number of substantial variants must be admitted in case (a) as well.

Three algorithms competed with one another on this task during the hackathon.

1. One algorithm was TRACER [Büchler, 2013; Büchler et al., 2014], based on the bag-of-words representation method. TRACER is a general text reuse detection algorithm with a seven-level architecture. Each step is configurable and can be optimized for specific text reuse tasks and corpora. The steps include preprocessing, featuring, selection, scoring, and post-processing. This approach is called feature-based linking: only text-reuse units with shared features are compared, as opposed to comparing the full text of passages, all against all. Passages are compared by comparing the words they contain, ignoring word order.

2. Another method was based on Agents for Actors (AfA) [Küster, 2013], a digital humanities framework for distributed microservices for text analysis. AfA was originally developed for the purpose of identifying allusions to Shakespearean passages in transcriptions of dialogues in films (hence "actors" in its name). This algorithm compares passages both on the letter and the word level, and therefore catches variations at the orthographic and formulation levels, respectively. While its primary use is to identify references and allusions in texts, in the hackathon the algorithm was tested to see how well it can also serve to identify parallel passages for very different types of texts in an unrelated language.

3. The third approach was based on an adaptation to our problem of the method of [Barsky et al., 2008], designed for matching DNA subsequences, as described in [Klein et al., 2014]. This algorithm looks for all-against-all approximate matches (within some given threshold of difference between passages) by rephrasing the problem as finding maximal paths in a matching graph. That method was modified during the hackathon to work with syllable stems as the basic building block, rather than the individual character level used before. This change improved both the run time and the quality of the results. Since, on average, a syllable has 4 characters, the speedup was two orders of magnitude. As for the results, p@10 ("precision at ten", the fraction of the top ten results that are of relevance) increased from 0.67 to 1, and p@20 also improved from its earlier value of 0.37. The improvement was due to the fact that with character-wise alignment syllables can share many letters but have no semantic similarity; see [Labenski et al., 2016; Labenski, 2016].

An infrastructure subteam, in addition to keeping everything up and running, parallelized the implementation of the third algorithm to run on a Sparc cluster of computers located at Tel Aviv University. This is necessary for the ultimate goal, considering the large size of the corpus.
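The precision-at-k measure quoted for the third approach can be stated in a few lines of Python. This is a generic sketch of the standard definition given in the text (the fraction of the top k results that are relevant); the function name and the toy relevance judgments are assumptions made for illustration, not data from the hackathon.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked results that are relevant.

    `relevance` is a list of booleans (or 0/1 values), one per retrieved
    result, ordered from the highest-ranked result downwards.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(bool(r) for r in relevance[:k]) / k


if __name__ == "__main__":
    # Hypothetical judgments for twenty ranked candidate parallels.
    judged = [True] * 8 + [False, True] + [False] * 10
    print(precision_at_k(judged, 10), precision_at_k(judged, 20))
```

A ranking whose first ten hits are all relevant has p@10 equal to 1 regardless of what follows, which is why p@20 is reported alongside it.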
The idea behind the parallelization is simple: divide the texts into overlapping chunks; then run the original algorithm on all chunks in parallel; finally, piece all the results together.

All three algorithms were tested on a test set that was designed during the hackathon. The two doxographical texts mentioned above, which are known to contain many shared passages, were chosen, and 24 pairs of parallel passages were manually annotated. Out of the 24 pairs, the TRACER algorithm retrieved 13 pairs, the AfA algorithm retrieved 12 pairs, and the APBT algorithm (the third approach) retrieved 16.

By finding cited or borrowed passages within the corpora of Indo-Tibetan (i.e. translated) and Tibetan (i.e. autochthonous) Buddhist literature, several research questions can be better addressed: determining the history of composition of individual texts; determining the relative chronology of groups of texts; determining the intellectual scholarly milieu in which the texts emerged; and determining the intellectual history behind the texts (viz. terminology and concepts).

After identifying parallel passages, one can assess the frequencies of letter/syllable/word replacements in the aligned passages of selected texts or text groups. This can help answer further research questions, such as: determining editorial policies and processes, for example standardization of orthography and standardization of the employment of grammatical particles (i.e. according to the so-called sandhi rules); and identifying processes of revision of translated texts.

D. Text classification

The second major task that was addressed at the hackathon was the question of author profiling. While the extent to which the issue of authorship can be addressed in the case of translated texts is yet to be looked into carefully, some general research questions related to authorship fall under the purview of machine classification. These include the following: (a) distinguishing between translated texts and autochthonous texts; (b) identifying the period in which a text was composed, viz. Old Tibetan (7th to 11th c.), Classical Tibetan I (11th to 14th c.), or Classical Tibetan II (15th to 20th c.); (c) determining whether a translated canonical work belongs to the early period of translation (snga ʼgyur) or the new period (phyi ʼgyur); (d) in the case of autochthonous literature, differentiating between the so-called revealed texts (texts that are portrayed as having been transmitted supernaturally) and composed texts; and (e) identifying an author's intellectual milieu (e.g. affiliation with a particular school of thought).

A series of experiments was performed on scriptures and treatises, early and late, translated and autochthonous texts. We tried several methods, including bag-of-words features and a perceptron classifier trained with stochastic gradient descent, with features similar to those of [Volansky et al., 2015], mainly: mean syllable length; mean sentence length; frequency of verbal prefixes and function words; frequency of foreign (Sanskrit) words; and type-to-token ratio. For authorship detection, we first used an automatic word segmenter and then used n-gram frequency and bag-of-words as features. Such a method was shown to be useful in [Koppel et al., 2008]. We did not advance further in this task, due to a shortage of time.
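Several of the features listed above are simple corpus statistics, and a short Python sketch may help show what such a feature vector could look like. Everything specific here is an assumption made for illustration: the function name, the use of "/" as the sentence delimiter in the transliteration, whitespace-separated syllables, and the small set of particles standing in for function words. It is not the feature extractor used at the hackathon.

```python
def style_features(text, function_syllables=frozenset(
        {"kyi", "gyi", "gi", "la", "du", "ni"})):
    """Compute a few stylometric features from transliterated text.

    Assumes syllables are separated by whitespace and sentences by "/".
    """
    sentences = [s.split() for s in text.split("/") if s.strip()]
    syllables = [syl for sentence in sentences for syl in sentence]
    n = len(syllables)
    if n == 0:
        raise ValueError("text contains no syllables")
    return {
        "mean_syllable_length": sum(len(s) for s in syllables) / n,
        "mean_sentence_length": n / len(sentences),
        "type_token_ratio": len(set(syllables)) / n,
        "function_syllable_rate":
            sum(s in function_syllables for s in syllables) / n,
    }


if __name__ == "__main__":
    sample = "sangs rgyas kyi bstan pa / chos kyi sku /"
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

A vector of this kind, computed per text, could then be passed to any linear classifier such as the perceptron mentioned above; features that need lexical resources (verbal prefixes, Sanskrit loanwords) are omitted from the sketch.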
Both parts of the canon were employed as training data to determine features that are peculiar to the Kangyur, the corpus containing scriptures, on the one hand, and the Tengyur, the corpus containing treatises, commentaries, manuals, and the like, on the other. Numerous autochthonous texts, including the entire collected writings of Rong zom Chos kyi bzang po, the entire collected writings of Shākya mchog ldan, several works by Sa skya paṇḍi ta Kun dgaʼ rgyal mtshan, and several texts by Tsong kha pa Blo bzang grags pa, were tested against the translated canonical texts in order to determine features of translated versus autochthonous works.

In addition, selected individual texts were tested. For example, Sa skya paṇḍi taʼs Tshad ma rigs gter was compared with Dharmakīrtiʼs (7th c.) Pramāṇavārttika in Tibetan translation, which enabled a comparison of autochthonous versus translated work on similar topics. The Mañjuśrīnāmasaṅgīti commentary ascribed to Rong zom pa (and at the same time included in the Tengyur as an Indian work in Tibetan translation) was compared with the canon in its entirety, as was the Tengyur alone with other works by Rong zom pa and additional autochthonous works, which provided a comparison of works whose origin has been considered doubtful with translated and autochthonous literature. The classification results are undergoing analysis by the Tibetan scholars.

Figure 1. Poster announcement of the hackathon.

IV CONCLUSION

The intense hackathon format proved to be quite exhilarating. Towards evening, each group reported on the day's accomplishments and vicissitudes. No single task was actually brought to completion on site, but the saplings were planted, and the ideas and prototype tools have continued to grow and develop in the ensuing weeks.

Based on our experience, we would recommend such a hackathon format for other well-defined interdisciplinary efforts in the computational humanities. It pays to come well-prepared to the event with clear goals and clean test data. And it is crucial to allocate resources for bringing the products and results of the hackathon to a stable and useful state after the event. As a matter of fact, the authors held a second hackathon one year later (February 2017) on a kibbutz in the Galilee, again for the development of tools for Tibetan Buddhist texts, but this time concentrating on manuscripts and computer-vision aspects.

Acknowledgements

We thank the staff at Kibbutz Lotan and all the hackathon participants (listed below). This research was supported in part by a grant (#I ) from the German-Israeli Foundation for Scientific Research and Development, and by the Khyentse Center for Tibetan Buddhist Textual Scholarship, Universität Hamburg, thanks to a grant by the Khyentse Foundation. N.D.'s research benefitted from a fellowship at the Paris Institute for Advanced Studies (France), with the financial support of the French state, managed by the French National Research Agency's Investissements d'avenir program (ANR-11-LABX Labex RFIEA+).

Hackathon participants: Orna Almogi, Kfir Bar, Marco Büchler, Lena Dankin, Nachum Dershowitz, Daniel Hershcovich, Yair Hoffman, Marc W. Küster, Daniel Labenski, Peter Naftaliev, Dimitri Pauls, Elad Shaked, Nadav Steiner, Lior Uzan, Dorji Wangchuk, Eric Werner, and Lior Wolf.

Participating institutions: Tel Aviv University (School of Computer Science); Universität Hamburg (Khyentse Center for Tibetan Buddhist Textual Scholarship, Department for Indian and Tibetan Studies); Georg-August-Universität Göttingen (Göttingen Centre for Digital Humanities).
References

Orna Almogi, Lena Dankin, Nachum Dershowitz, Yair Hoffman, Dimitri Pauls, Dorji Wangchuk, and Lior Wolf, Stemming and segmentation for classical Tibetan, in: Revised Selected Papers of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Konya, Turkey (April 2016), Part I, A. Gelbukh, ed., Lecture Notes in Computer Science, vol. 9623, Springer-Verlag, Switzerland.

Marina Barsky, Ulrike Stege, Alex Thomo, and Chris Upton, A graph approach to the threshold all-against-all substring matching problem, ACM Journal of Experimental Algorithmics 12, Article 1.10, 2008.

Stephan V. Beyer, The Classical Tibetan Language, SUNY Press, Albany, NY, 1992.

Marco Büchler, Informationstechnische Aspekte des Historical Text Re-use, Ph.D. thesis, Fakultät für Mathematik und Informatik, Universität Leipzig, Germany, March 2013.

Marco Büchler, Greta Franzini, Emily Franzini, and Maria Moritz, Scaling historical text re-use, in: Proceedings of the IEEE International Conference on Big Data 2014 (IEEE BigData 2014), October 2014.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang, Long short-term memory neural networks for Chinese word segmentation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, September 2015.

Sepp Hochreiter and Jürgen Schmidhuber, Long short-term memory, Neural Computation 9(8), November 1997.

Benjamin Klein, Nachum Dershowitz, Lior Wolf, Orna Almogi, and Dorji Wangchuk, Finding inexact quotations within a Tibetan Buddhist corpus, in: Digital Humanities (DH) 2014, Lausanne, Switzerland, July 2014.

Moshe Koppel, Jonathan Schler, and Eran Messeri, Authorship attribution in law enforcement scenarios, in: Security Informatics and Terrorism - Patrolling the Web, P. Cantor and B. Shapira (Eds), IOS Press NATO Series, 2008.

Marc W. Küster, Agents for Actors: A Digital Humanities framework for distributed microservices for text linking and visualization, in: Digital Humanities (DH) 2013, University of Nebraska-Lincoln, July 2013.

Daniel Labenski, Finding Inter-textual Relations in Historical Texts, M.Sc. thesis, School of Computer Science, Tel Aviv University, Israel, 2016.

Daniel Labenski, Elad Shaked, Orna Almogi, Lena Dankin, Nachum Dershowitz, and Lior Wolf, Intertextuality in Tibetan texts (Abstract), in: Israeli Seminar on Computational Linguistics (ISCOL), Haifa, Israel, May 2016.

Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He, Tibetan word segmentation as syllable tagging using conditional random field, in: Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 2011), 2011.

Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean, Efficient estimation of word representations in vector space, arXiv [cs.CL], 2013.

Vered Volansky, Noam Ordan, and Shuly Wintner, On the features of translationese, Digital Scholarship in the Humanities 30(1), April 2015.

Eric Werner, Phywa-pa Chos-kyi-seng-ge's depiction of Mahāyāna philosophy: A critical edition and annotated translation of the chapters on Yogācāra and Mādhyamaka philosophy from the gzhung lugs rnam byed, a doxography of the twelfth century, M.A. thesis, University of Hamburg, Germany, 2014.

Turrell V. Wylie, A standard system of Tibetan transcription, Harvard Journal of Asiatic Studies 22, December 1959.


More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

Georgetown University at TREC 2017 Dynamic Domain Track

Georgetown University at TREC 2017 Dynamic Domain Track Georgetown University at TREC 2017 Dynamic Domain Track Zhiwen Tang Georgetown University zt79@georgetown.edu Grace Hui Yang Georgetown University huiyang@cs.georgetown.edu Abstract TREC Dynamic Domain

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

arxiv: v1 [cs.cl] 2 Apr 2017

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

More information

Mandarin Lexical Tone Recognition: The Gating Paradigm

Mandarin Lexical Tone Recognition: The Gating Paradigm Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Literature and the Language Arts Experiencing Literature

Literature and the Language Arts Experiencing Literature Correlation of Literature and the Language Arts Experiencing Literature Grade 9 2 nd edition to the Nebraska Reading/Writing Standards EMC/Paradigm Publishing 875 Montreal Way St. Paul, Minnesota 55102

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

User Profile Modelling for Digital Resource Management Systems

User Profile Modelling for Digital Resource Management Systems User Profile Modelling for Digital Resource Management Systems Daouda Sawadogo, Ronan Champagnat, Pascal Estraillier To cite this version: Daouda Sawadogo, Ronan Champagnat, Pascal Estraillier. User Profile

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Dickinson ISD ELAR Year at a Glance 3rd Grade- 1st Nine Weeks

Dickinson ISD ELAR Year at a Glance 3rd Grade- 1st Nine Weeks 3rd Grade- 1st Nine Weeks R3.8 understand, make inferences and draw conclusions about the structure and elements of fiction and provide evidence from text to support their understand R3.8A sequence and

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

The College Board Redesigned SAT Grade 12

The College Board Redesigned SAT Grade 12 A Correlation of, 2017 To the Redesigned SAT Introduction This document demonstrates how myperspectives English Language Arts meets the Reading, Writing and Language and Essay Domains of Redesigned SAT.

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas Exploiting Distance Learning Methods and Multimediaenhanced instructional content to support IT Curricula in Greek Technological Educational Institutes P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou,

More information

ELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading

ELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS

METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar

More information

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets

More information

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important

More information

Smart Grids Simulation with MECSYCO

Smart Grids Simulation with MECSYCO Smart Grids Simulation with MECSYCO Julien Vaubourg, Yannick Presse, Benjamin Camus, Christine Bourjot, Laurent Ciarletta, Vincent Chevrier, Jean-Philippe Tavella, Hugo Morais, Boris Deneuville, Olivier

More information

Students concept images of inverse functions

Students concept images of inverse functions Students concept images of inverse functions Sinéad Breen, Niclas Larson, Ann O Shea, Kerstin Pettersson To cite this version: Sinéad Breen, Niclas Larson, Ann O Shea, Kerstin Pettersson. Students concept

More information

Finding Translations in Scanned Book Collections

Finding Translations in Scanned Book Collections Finding Translations in Scanned Book Collections Ismet Zeki Yalniz Dept. of Computer Science University of Massachusetts Amherst, MA, 01003 zeki@cs.umass.edu R. Manmatha Dept. of Computer Science University

More information

The IDN Variant Issues Project: A Study of Issues Related to the Delegation of IDN Variant TLDs. 20 April 2011

The IDN Variant Issues Project: A Study of Issues Related to the Delegation of IDN Variant TLDs. 20 April 2011 The IDN Variant Issues Project: A Study of Issues Related to the Delegation of IDN Variant TLDs 20 April 2011 Project Proposal updated based on comments received during the Public Comment period held from

More information

The taming of the data:

The taming of the data: The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Specification of a multilevel model for an individualized didactic planning: case of learning to read

Specification of a multilevel model for an individualized didactic planning: case of learning to read Specification of a multilevel model for an individualized didactic planning: case of learning to read Sofiane Aouag To cite this version: Sofiane Aouag. Specification of a multilevel model for an individualized

More information

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1] 6.1 Type-token ratio

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1] 6.1 Type-token ratio Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Strategies for Solving Fraction Tasks and Their Link to Algebraic Thinking

Strategies for Solving Fraction Tasks and Their Link to Algebraic Thinking Strategies for Solving Fraction Tasks and Their Link to Algebraic Thinking Catherine Pearn The University of Melbourne Max Stephens The University of Melbourne

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona Parallel Evaluation in Stratal OT * Adam Baker University of Arizona tabaker@u.arizona.edu 1.0. Introduction The model of Stratal OT presented by Kiparsky (forthcoming), has not and will not prove uncontroversial

More information

Postprint.

Postprint. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CLEF 2013 Conference and Labs of the Evaluation Forum Information Access Evaluation meets Multilinguality, Multimodality,

More information

GERMAN STUDIES (GRMN)

GERMAN STUDIES (GRMN) Bucknell University 1 GERMAN STUDIES (GRMN) Faculty Professors: Katherine M. Faull, Peter Keitel (Director) Associate Professors: Bastian Heinsohn, Helen G. Morris-Keitel (Chair) German Studies provides

More information

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique Hiromi Ishizaki 1, Susan C. Herring 2, Yasuhiro Takishima 1 1 KDDI R&D Laboratories, Inc. 2 Indiana University

More information

Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10)

Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10) Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Nebraska Reading/Writing Standards (Grade 10) 12.1 Reading The standards for grade 1 presume that basic skills in reading have

More information

Data Fusion Models in WSNs: Comparison and Analysis

Data Fusion Models in WSNs: Comparison and Analysis Proceedings of 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1) Data Fusion s in WSNs: Comparison and Analysis Marwah M Almasri, and Khaled M Elleithy, Senior Member,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Master Program: Strategic Management. Master s Thesis a roadmap to success. Innsbruck University School of Management

Master Program: Strategic Management. Master s Thesis a roadmap to success. Innsbruck University School of Management Master Program: Strategic Management Department of Strategic Management, Marketing & Tourism Innsbruck University School of Management Master s Thesis a roadmap to success Index Objectives... 1 Topics...

More information

Promoting open access to research results

Promoting open access to research results Vol. 9, No 1, 2014 www.swiss-academies.ch Promoting open access to research results Position paper issued by the Swiss Academy of Medical Sciences Information on the preparation of this position paper

More information

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

A Graph Based Authorship Identification Approach

A Graph Based Authorship Identification Approach A Graph Based Authorship Identification Approach Notebook for PAN at CLEF 2015 Helena Gómez-Adorno 1, Grigori Sidorov 1, David Pinto 2, and Ilia Markov 1 1 Center for Computing Research, Instituto Politécnico

More information

Prentice Hall Literature: Timeless Voices, Timeless Themes Gold 2000 Correlated to Nebraska Reading/Writing Standards, (Grade 9)

Prentice Hall Literature: Timeless Voices, Timeless Themes Gold 2000 Correlated to Nebraska Reading/Writing Standards, (Grade 9) Nebraska Reading/Writing Standards, (Grade 9) 12.1 Reading The standards for grade 1 presume that basic skills in reading have been taught before grade 4 and that students are independent readers. For

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

Applications of memory-based natural language processing

Applications of memory-based natural language processing Applications of memory-based natural language processing Antal van den Bosch and Roser Morante ILK Research Group Tilburg University Prague, June 24, 2007 Current ILK members Principal investigator: Antal

More information

Program Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading

Program Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,

More information

CS 100: Principles of Computing

CS 100: Principles of Computing CS 100: Principles of Computing Kevin Molloy August 29, 2017 1 Basic Course Information 1.1 Prerequisites: None 1.2 General Education Fulfills Mason Core requirement in Information Technology (ALL). 1.3

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Arabic Orthography vs. Arabic OCR

Arabic Orthography vs. Arabic OCR Arabic Orthography vs. Arabic OCR Rich Heritage Challenging A Much Needed Technology Mohamed Attia Having consistently been spoken since more than 2000 years and on, Arabic is doubtlessly the oldest among

More information

Firms and Markets Saturdays Summer I 2014

Firms and Markets Saturdays Summer I 2014 PRELIMINARY DRAFT VERSION. SUBJECT TO CHANGE. Firms and Markets Saturdays Summer I 2014 Professor Thomas Pugel Office: Room 11-53 KMC E-mail: tpugel@stern.nyu.edu Tel: 212-998-0918 Fax: 212-995-4212 This

More information

PH.D. IN COMPUTER SCIENCE PROGRAM (POST M.S.)

PH.D. IN COMPUTER SCIENCE PROGRAM (POST M.S.) PH.D. IN COMPUTER SCIENCE PROGRAM (POST M.S.) OVERVIEW ADMISSION REQUIREMENTS PROGRAM REQUIREMENTS OVERVIEW FOR THE PH.D. IN COMPUTER SCIENCE Overview The doctoral program is designed for those students

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny By the End of Year 8 All Essential words lists 1-7 290 words Commonly Misspelt Words-55 working out more complex, irregular, and/or ambiguous words by using strategies such as inferring the unknown from

More information

What the National Curriculum requires in reading at Y5 and Y6

What the National Curriculum requires in reading at Y5 and Y6 What the National Curriculum requires in reading at Y5 and Y6 Word reading apply their growing knowledge of root words, prefixes and suffixes (morphology and etymology), as listed in Appendix 1 of the

More information

NCEO Technical Report 27

NCEO Technical Report 27 Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

HISTORY COURSE WORK GUIDE 1. LECTURES, TUTORIALS AND ASSESSMENT 2. GRADES/MARKS SCHEDULE

HISTORY COURSE WORK GUIDE 1. LECTURES, TUTORIALS AND ASSESSMENT 2. GRADES/MARKS SCHEDULE HISTORY COURSE WORK GUIDE 1. LECTURES, TUTORIALS AND ASSESSMENT Lectures and Tutorials Students studying History learn by reading, listening, thinking, discussing and writing. Undergraduate courses normally

More information

Generating Test Cases From Use Cases

Generating Test Cases From Use Cases 1 of 13 1/10/2007 10:41 AM Generating Test Cases From Use Cases by Jim Heumann Requirements Management Evangelist Rational Software pdf (155 K) In many organizations, software testing accounts for 30 to

More information

Learning and Retaining New Vocabularies: The Case of Monolingual and Bilingual Dictionaries

Learning and Retaining New Vocabularies: The Case of Monolingual and Bilingual Dictionaries Learning and Retaining New Vocabularies: The Case of Monolingual and Bilingual Dictionaries Mohsen Mobaraki Assistant Professor, University of Birjand, Iran mmobaraki@birjand.ac.ir *Amin Saed Lecturer,

More information

Florida Reading Endorsement Alignment Matrix Competency 1

Florida Reading Endorsement Alignment Matrix Competency 1 Florida Reading Endorsement Alignment Matrix Competency 1 Reading Endorsement Guiding Principle: Teachers will understand and teach reading as an ongoing strategic process resulting in students comprehending

More information

School Inspection in Hesse/Germany

School Inspection in Hesse/Germany Hessisches Kultusministerium School Inspection in Hesse/Germany Contents 1. Introduction...2 2. School inspection as a Procedure for Quality Assurance and Quality Enhancement...2 3. The Hessian framework

More information

Senior Stenographer / Senior Typist Series (including equivalent Secretary titles)

Senior Stenographer / Senior Typist Series (including equivalent Secretary titles) New York State Department of Civil Service Committed to Innovation, Quality, and Excellence A Guide to the Written Test for the Senior Stenographer / Senior Typist Series (including equivalent Secretary

More information

Reading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5-

Reading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5- New York Grade 7 Core Performance Indicators Grades 7 8: common to all four ELA standards Throughout grades 7 and 8, students demonstrate the following core performance indicators in the key ideas of reading,

More information

Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

The development of a new learner s dictionary for Modern Standard Arabic: the linguistic corpus approach

The development of a new learner s dictionary for Modern Standard Arabic: the linguistic corpus approach BILINGUAL LEARNERS DICTIONARIES The development of a new learner s dictionary for Modern Standard Arabic: the linguistic corpus approach Mark VAN MOL, Leuven, Belgium Abstract This paper reports on the

More information
