English to Tamil Statistical Machine Translation and Alignment Using HMM
|
|
- Sarah Marshall
- 6 years ago
- Views:
Transcription
1 RECENT ADVANCES in NETWORING, VLSI and SIGNAL PROCESSING English to Tamil Statistical Machine Translation and Alignment Using HMM S.VETRIVEL, DIANA BABY Computer Science and Engineering arunya University arunya Nagar, Coimbatore, Tamilnadu INDIA Abstract: - This paper describes English to Tamil statistical machine translation and its alignment using Hidden Markov Model (HMM).Statistical machine translation is a part of natural language processing and is based on probability distribution. Machine translation is a sub-field of computational linguistics that uses computer software to translate text in one natural language to another language. Alignment is one of the major challenges in machine translation.hidden markov model (HMM) based alignment described in this paper is more accurate, avoids invalid alignments and improves translation quality. HMM uses bigram translation probabilities for keeping word context in the language model which produce close to error-free output that reads fluently in the target language. ey-words: - Bigram translation probability, Hidden Markov model, phrase alignment, Statistical Machine Translation, translation model, word alignment. 1 Introduction Language is the main form of human communication. Translation is essential for co-operation among communities that speaks different languages. Machine Translation refers to the use of computers to automate the task of translation between human languages. A human language system can be considered as a system of arbitrary symbols and meanings of these symbols are defined and adopted by the users of that language for the purpose of effective exchanging of information. The translation process converts a text in one human language to another which preserves not only the meaning, but also the form, effect and style. There are some countries in which more than one language is spoken but there is not enough human translators are available. So a scheme for automatic translation between two languages is very desirable for social and political interactions. This paper concerned with the analysis, design and building of a model for English to Tamil Statistical Machine Translation (SMT) system. One of the central modeling problems in statistical machine translation (SMT) is alignment between parallel texts. The duty of alignment methodology is to identify translation equivalence between sentences, words and phrases within sentences. This paper deals with the hidden Markov models (HMMs) which is used for automatic alignment of words and phrases in parallel text. Parameters of a statistical word alignment model are estimated from parallel text and the model is used for word alignment with in the same text used in estimation. Short sequences of words form phrase pairs, which align to each other are extracted from the word-aligned parallel text for use in translation. Phrase-based SMT performances are influenced by the quality and quantity of the word-aligned parallel text. HMMs are potentially an attractive alternative to other models used for word alignment and phrase alignment of parallel text [1, 2]. The paper proceeds as follows. In section two English to Tamil translation is explained. The HMM and alignment methodology is formally presented in section three. The final conclusion is described in section four. 2 English to Tamil Translation Translation requires extensive linguistic knowledge in both the source and target language. The linguistic knowledge of a language includes the knowledge of its phonology, morphology, syntax, semantics, pragmatics and discourse. Translation also requires a comparable knowledge of grammatical and various other correspondences between the source and target language. In addition to this, a basic knowledge in the subject matter of the sentence with general knowledge and common sense are all essential for a good translation. Finally, knowledge of the customs and culture of speakers of both languages helps translators to select the best among alternatives. The phases of translations and other specifications about Tamil language are explained in the following sections. ISSN: ISBN:
2 RECENT ADVANCES in NETWORING, VLSI and SIGNAL PROCESSING 2.1 Phases of Translation Implementation of statistical machine translation split into two main phases called training phase and translation phase. In training phase a statistical model of translation is built, using a corpus of texts in both the source and target language (English and Tamil). The training phase is split into three parts: (i) (ii) (iii) Document collection, from where the corpus of texts from which the statistical model will be inferred. Building the translation model from the source language to the target language. Building the language model for the target language. Input Sentence Translation Model translators etc. The better models of translational equivalence are built empirically. Computational linguists use machine learning techniques to induce them from bitexts that is pairs of texts that are translations of each other. Computers should be able to figure out which expressions are translationally equivalent [3]. Language Model: Language model plays an important role in statistical machine translation. It is the key knowledge source to determine the right word order of the translation. The nominal task of the language model is to guide the search (decoding) procedure towards grammatical output. Standard n-gram based language model predicts the next word based on the immediate left context. It work well, are easy to train, require no manual annotation and are well understood. Use of language model can improve the translation qualities [4]. The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. The idea of the heuristic search is to consider partial sentences and partial alignments and maintaining a stack of particularly promising candidates. It s this phase which is actually used directly by the end user the training phase all happens offline, beforehand. Bag of possible words Seek improvement by trying other combinations 2.2 Translation Architecture Three types of translation architectures are used in MT systems. They are Transformer (Direct), Transfer and Inter- Lingua architectures. The MT system considered here is based on transformer architecture [4]. Language Model Most Probable Translation Source Text in English English Parser: Uses Dictionary and small Grammar to Produce English Structure Fig.1 Flow of Implementation Document Collection: There are many possible ways to build the statistical model of translation. One way is to represent them as a file containing several million URLs, where each URL pointing to an English language text which is used to build a language model. The file also contains URLs pointing to pairs of translated texts in English and in Tamil, which is used to build the translation model [3, 4]. Translation Model: Translation models are used to describe the mathematical relationship between two or more languages. A good translation model is a key to many translingual applications like machine translation. Other applications include cross-language information retrieval, computer-assisted language learning, and various tools for Target Text in Tamil Fig.2 Translation Architecture English to Tamil Transformer: English to Tamil Transformation Rules 2.3 Basic Linguistic Specifications of Tamil Being an agglutinative language, Tamil words are the combination of several morphemes. A Tamil word consists ISSN: ISBN:
3 RECENT ADVANCES in NETWORING, VLSI and SIGNAL PROCESSING of a root combined with other grammatical accretions. The concept of case in languages refers to the phenomenon of expressing reciprocal relations of nouns by means of caseterminators such as post-positions or auxiliary words. In English these types of relations are accommodated using prepositions such as in, on, at, by, with etc. In Tamil, singular and plural forms of nouns have the same form of case-terminators. Tamil uses the crude root of the verb and Tamil verbs usually carry tenses. English verbs are pre-modified by auxiliaries to accommodate tense, aspect, voice and number of the sentence. The Tamil verb should, in addition to all the above functions, carry information about gender. All this information is represented in the Tamil verb by different grammatical formatives suffixed to it in a pre-defined order. The gender information of a verb may be derived from its terminator. The passive voice of a transitive verb in Tamil is formed by combining the verb with the auxiliary verb padu. In English, the negative concept is introduced by the use of conjugate not adverbially. There is no such word in Tamil, though the word illai (no) is used sometimes. The tense of the Tamil negative verb is indeterminate in point of time and is therefore determined by the context [6]. the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but output dependent on the state is visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states [7]. 3.1 Architecture of Hidden Markov Model The diagram below shows the general architecture of an instantiated HMM. Each oval shape represents a random variable that can adopt any number of values. The random variable x (t) is the hidden state at time t (x (t) {x1, x2, x3}). The random variable y (t) is the observation at time t (y (t) {y1, y2, y3, y4}). The arrows in the diagram denote conditional dependencies. The conditional probability distribution of the hidden variable x (t) at time t, given the value of the hidden variable x (t 1), depends only on the value of the hidden variable x (t 1): the values at time t 2 and before have no influence. This is called the Markov property. Similarly, the value of the observed variable y (t) only depends on the value of the hidden variable x (t) (both at time t) [7]. 2.4 Word Combination Rules in Tamil Tamil word combination rules ensure the euphonic and natural composition of the adjacent words and inflectional and derivational processes. When combining two Tamil words (or affixes), the resultant word depends on the boundary syllables of components. There are three types of changes possible: insertion of a new letter, transmutation of letters and natural composition [6]. x(t-1) y(t-1) x(t) y(t) x(t+1) y(t+1) 2.5 Tamil Information Interchange Code ASCII is a standard representation widely used for information interchange within computers. ASCII is not sufficient to encode the letters of foreign alphabets such as Tamil. Therefore a Standard Code for Information Interchange in Tamil (SCIIT) is used. The SCIIT codes for the vowels a to au are assigned the codes 1 to 12. The consonants are assigned multiples of 20 in their order. The space is given the code zero. The code for a vowelconsonant is the addition of the codes of the corresponding vowel and consonant. The SCIIT code preserves Tamil alphabetical order [6]. 3 HMM Alignment HMM is used for word and phrase alignment of parallel text. A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. In a regular Markov model, Fig. 3 HMM Architecture 3.2 Variables Used for Alignment The model variables contain e=e 1 I t=t 1 J source sentence of I words (English) target sentence of J words (Tamil) The target language word sequence is an intermediate sequence of target language phrases. The variable-length word sequences in the target language are called phrases. u 1 phrase count variable that is target language is segmented into phrases target sentence of N phrases. ISSN: ISBN:
4 RECENT ADVANCES in NETWORING, VLSI and SIGNAL PROCESSING k a 1 E ak h 1 if h k =0, If h k =1, phrase length of kth phrase alignment process aligns target phrases to source word. word in source sentence position a k hallucination sequence then u k is aligned to null. then uk is aligned to s ak. Word-to-Phrase Translation: The translation of words to phrases is given as P(u 1 a 1,h 1, 1,,J,e)=(u k E ak,,h k, k) (4) So that target phrases are conditionally independent given their alignment to individual source words. Specialized translation tables can be maintained for hallucinated phrases to allow their statistics to differ from phrases that arise from direct translation of specific source words. Word context within the target language phrase via bigram translation probabilities [5]. There hidden variables in HMM are: a= ( 1, a 1, h 1, ) (1) 3.3 Phrase Segmentation, Alignment and Translation Models The modeling objective is to define a conditional distribution P (t, a e) over the alignments of the source (English) and target sentence (Tamil).It can be calculated using the equation: 3.4 Viterbi Algorithm Viterbi algorithm is one of the algorithms in HMM which is used for alignment process. Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences which can be solved efficiently by the viterbi algorithm [8]. q1 a12 q2 a23 q3 P(u 1,, a 1, h 1, 1 e) = P( J,e) P(a 1, 1, h 1 a21,j,e) P(u 1 a 1, h 1, 1,,J,e) (2) b1 b2 b3 Phrase Count Distribution: P ( J, e) specifies the distribution over the number of phrases in the target sentence given the source sentence and the number of words in the target sentence. Single parameter distribution P ( J, e) = P ( J, I) α η 1 controls the segmentation of the target sentence into phrases. Larger values of η favor target sentence segmentations with many short phrases. Word-to-Phrase Alignment Distribution: The alignment is modeled as a Markov process that specifies the lengths of phrases and the alignment of each to one of the source word positions. P (a 1, 1,h 1, J, e) = k, k,h k a k-1, k-1,h k-1,, J, e) (3) The word-to-phrase alignment (a k ) is a Markov process over the source sentence word indices, as in word-to-word HMM alignment. It is formulated with a dependency on the hallucination variable so that target phrases can be inserted without disrupting the Markov dependencies of phrases aligned to non- NULL source words. o1 o2 o3 o=observation output,q = state probability b = output probability, a= transition probability Fig.4 Working of Algorithm The steps of viterbi algorithm are as follows: Step1: Initialization Assume initial probability and emission probability and then calculate the probability of first state using this assumption. Step2: Induction Calculate the probabilities of other states except start state. Step 3: Backtracking Find the most likely path that produces highest probability. ISSN: ISBN:
5 RECENT ADVANCES in NETWORING, VLSI and SIGNAL PROCESSING 4 Conclusion This paper deals with English to Tamil statistical machine translation and its alignment methodology based on computationally practical hidden Markov models. Statistical machine translation is based on probability and it produces more accurate result than other types and statistical alignment models improves the translation quality.hmm based statistical alignment model is more powerful. References: [1] F. Och, C. Tillman, and H. Ney, Improved alignment models for statistical machine translation, in Proc. Joint Conf. Empirical Methods Natural Lang. Process. Very Large Corpora, College Park, MD, pp , [2] P. oehn,f. Och, and D. Marcu, Statistical phrasebased translation, in Proc. HLT-NAACL, pp , [3] Hongfei Jiang, Muyun Yang, Tiejun Zhao, Sheng Li and Bo Wang A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages ,Suntec, Singapore, 4 August [4]Matt Post and Daniel Gildea Parsers as language models for statistical machine translation, Department of Computer Science University of Rochester, [5] Y. Deng and W. Byrne, MTT: An alignment toolkit for statistical machine translation, presented at the HLT- NAACL Demonstrations Program, Jun [6] R.Ravi and S.ailasam Computer Vision of Single to Multi-Language Translation using Statistical Machine Translation, TIFAC-CORE, alasalingam University, Tamilnadu. [7] M. Ostendorf,V. Digalakis, and O.imball, From HMMs to segment models: A unified view of stochastic modeling for speech recognition,ieee Trans. Acoustics, Speech, Signal Process., vol. ASSP-4, no. 5, pp , Sep [8] G.D. Brushe, Robert.E. Mahony and John. B. Moore, A forward backward algorithm for ml state and sequence estimation, International Symposium on Signal Processing and its applications, ISSPA, Gold Coast, Australia, August, ISSN: ISBN:
Parsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationLanguage Model and Grammar Extraction Variation in Machine Translation
Language Model and Grammar Extraction Variation in Machine Translation Vladimir Eidelman, Chris Dyer, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationBooks Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny
By the End of Year 8 All Essential words lists 1-7 290 words Commonly Misspelt Words-55 working out more complex, irregular, and/or ambiguous words by using strategies such as inferring the unknown from
More informationDerivational and Inflectional Morphemes in Pak-Pak Language
Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes
More informationTarget Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data
Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se
More informationWhat the National Curriculum requires in reading at Y5 and Y6
What the National Curriculum requires in reading at Y5 and Y6 Word reading apply their growing knowledge of root words, prefixes and suffixes (morphology and etymology), as listed in Appendix 1 of the
More informationNoisy SMS Machine Translation in Low-Density Languages
Noisy SMS Machine Translation in Low-Density Languages Vladimir Eidelman, Kristy Hollingshead, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department of
More informationCS 598 Natural Language Processing
CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More information1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature
1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details
More informationBAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass
BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,
More informationELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading
ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix
More informationConstructing Parallel Corpus from Movie Subtitles
Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing
More informationThe MSR-NRC-SRI MT System for NIST Open Machine Translation 2008 Evaluation
The MSR-NRC-SRI MT System for NIST Open Machine Translation 2008 Evaluation AUTHORS AND AFFILIATIONS MSR: Xiaodong He, Jianfeng Gao, Chris Quirk, Patrick Nguyen, Arul Menezes, Robert Moore, Kristina Toutanova,
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More informationEli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology
ISCA Archive SUBJECTIVE EVALUATION FOR HMM-BASED SPEECH-TO-LIP MOVEMENT SYNTHESIS Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano Graduate School of Information Science, Nara Institute of Science & Technology
More informationFirst Grade Curriculum Highlights: In alignment with the Common Core Standards
First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features
More informationProgram Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading
Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,
More informationFlorida Reading Endorsement Alignment Matrix Competency 1
Florida Reading Endorsement Alignment Matrix Competency 1 Reading Endorsement Guiding Principle: Teachers will understand and teach reading as an ongoing strategic process resulting in students comprehending
More informationA Neural Network GUI Tested on Text-To-Phoneme Mapping
A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis
More informationModule 12. Machine Learning. Version 2 CSE IIT, Kharagpur
Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should
More informationDeveloping Grammar in Context
Developing Grammar in Context intermediate with answers Mark Nettle and Diana Hopkins PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE The Pitt Building, Trumpington Street, Cambridge, United
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationOpportunities for Writing Title Key Stage 1 Key Stage 2 Narrative
English Teaching Cycle The English curriculum at Wardley CE Primary is based upon the National Curriculum. Our English is taught through a text based curriculum as we believe this is the best way to develop
More informationMachine Learning from Garden Path Sentences: The Application of Computational Linguistics
Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,
More informationThe NICT Translation System for IWSLT 2012
The NICT Translation System for IWSLT 2012 Andrew Finch Ohnmar Htun Eiichiro Sumita Multilingual Translation Group MASTAR Project National Institute of Information and Communications Technology Kyoto,
More informationHoughton Mifflin Reading Correlation to the Common Core Standards for English Language Arts (Grade1)
Houghton Mifflin Reading Correlation to the Standards for English Language Arts (Grade1) 8.3 JOHNNY APPLESEED Biography TARGET SKILLS: 8.3 Johnny Appleseed Phonemic Awareness Phonics Comprehension Vocabulary
More informationScienceDirect. Malayalam question answering system
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationProblems of the Arabic OCR: New Attitudes
Problems of the Arabic OCR: New Attitudes Prof. O.Redkin, Dr. O.Bernikova Department of Asian and African Studies, St. Petersburg State University, St Petersburg, Russia Abstract - This paper reviews existing
More informationEnhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities
Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion
More informationA Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many
Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.
More informationTaught Throughout the Year Foundational Skills Reading Writing Language RF.1.2 Demonstrate understanding of spoken words,
First Grade Standards These are the standards for what is taught in first grade. It is the expectation that these skills will be reinforced after they have been taught. Taught Throughout the Year Foundational
More informationLongest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for
More informationBANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.
More informationEvaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment
Evaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment Akiko Sakamoto, Kazuhiko Abe, Kazuo Sumita and Satoshi Kamatani Knowledge Media Laboratory,
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationCorrespondence between the DRDP (2015) and the California Preschool Learning Foundations. Foundations (PLF) in Language and Literacy
1 Desired Results Developmental Profile (2015) [DRDP (2015)] Correspondence to California Foundations: Language and Development (LLD) and the Foundations (PLF) The Language and Development (LLD) domain
More informationDetecting English-French Cognates Using Orthographic Edit Distance
Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National
More informationLanguage Independent Passage Retrieval for Question Answering
Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University
More informationDeveloping a TT-MCTAG for German with an RCG-based Parser
Developing a TT-MCTAG for German with an RCG-based Parser Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert University of Tübingen, Germany CNRS-LORIA, France LREC 2008,
More informationESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly
ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly Inflected Languages Classical Approaches to Tagging The slides are posted on the web. The url is http://chss.montclair.edu/~feldmana/esslli10/.
More informationThe Acquisition of Person and Number Morphology Within the Verbal Domain in Early Greek
Vol. 4 (2012) 15-25 University of Reading ISSN 2040-3461 LANGUAGE STUDIES WORKING PAPERS Editors: C. Ciarlo and D.S. Giannoni The Acquisition of Person and Number Morphology Within the Verbal Domain in
More informationMemory-based grammatical error correction
Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,
More informationLinguistic Variation across Sports Category of Press Reportage from British Newspapers: a Diachronic Multidimensional Analysis
International Journal of Arts Humanities and Social Sciences (IJAHSS) Volume 1 Issue 1 ǁ August 216. www.ijahss.com Linguistic Variation across Sports Category of Press Reportage from British Newspapers:
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationCLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction
CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets
More informationAQUA: An Ontology-Driven Question Answering System
AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
More informationLecture 10: Reinforcement Learning
Lecture 1: Reinforcement Learning Cognitive Systems II - Machine Learning SS 25 Part III: Learning Programs and Strategies Q Learning, Dynamic Programming Lecture 1: Reinforcement Learning p. Motivation
More informationPREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES
PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,
More informationMULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY
MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationENGBG1 ENGBL1 Campus Linguistics. Meeting 2. Chapter 7 (Morphology) and chapter 9 (Syntax) Pia Sundqvist
Meeting 2 Chapter 7 (Morphology) and chapter 9 (Syntax) Today s agenda Repetition of meeting 1 Mini-lecture on morphology Seminar on chapter 7, worksheet Mini-lecture on syntax Seminar on chapter 9, worksheet
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationModeling function word errors in DNN-HMM based LVCSR systems
Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationLanguage Acquisition by Identical vs. Fraternal SLI Twins * Karin Stromswold & Jay I. Rifkin
Stromswold & Rifkin, Language Acquisition by MZ & DZ SLI Twins (SRCLD, 1996) 1 Language Acquisition by Identical vs. Fraternal SLI Twins * Karin Stromswold & Jay I. Rifkin Dept. of Psychology & Ctr. for
More informationAUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders
More informationAdvanced Grammar in Use
Advanced Grammar in Use A self-study reference and practice book for advanced learners of English Third Edition with answers and CD-ROM cambridge university press cambridge, new york, melbourne, madrid,
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationCoast Academies Writing Framework Step 4. 1 of 7
1 KPI Spell further homophones. 2 3 Objective Spell words that are often misspelt (English Appendix 1) KPI Place the possessive apostrophe accurately in words with regular plurals: e.g. girls, boys and
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationParallel Evaluation in Stratal OT * Adam Baker University of Arizona
Parallel Evaluation in Stratal OT * Adam Baker University of Arizona tabaker@u.arizona.edu 1.0. Introduction The model of Stratal OT presented by Kiparsky (forthcoming), has not and will not prove uncontroversial
More informationUnvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition
Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese
More informationhave to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,
A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994
More informationDEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS
DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS Julia Tmshkina Centre for Text Techitology, North-West University, 253 Potchefstroom, South Africa 2025770@puk.ac.za
More informationExploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data
Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Maja Popović and Hermann Ney Lehrstuhl für Informatik VI, Computer
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationSoftware Maintenance
1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationRole of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation
Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,
More informationProject in the framework of the AIM-WEST project Annotation of MWEs for translation
Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment
More informationProcedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing
Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova
More informationSTUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH
STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160
More informationGrammars & Parsing, Part 1:
Grammars & Parsing, Part 1: Rules, representations, and transformations- oh my! Sentence VP The teacher Verb gave the lecture 2015-02-12 CS 562/662: Natural Language Processing Game plan for today: Review
More informationConsiderations for Aligning Early Grades Curriculum with the Common Core
Considerations for Aligning Early Grades Curriculum with the Common Core Diane Schilder, EdD and Melissa Dahlin, MA May 2013 INFORMATION REQUEST This state s department of education requested assistance
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationHuman Emotion Recognition From Speech
RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati
More informationAutoregressive product of multi-frame predictions can improve the accuracy of hybrid models
Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Navdeep Jaitly 1, Vincent Vanhoucke 2, Geoffrey Hinton 1,2 1 University of Toronto 2 Google Inc. ndjaitly@cs.toronto.edu,
More informationGenerating Test Cases From Use Cases
1 of 13 1/10/2007 10:41 AM Generating Test Cases From Use Cases by Jim Heumann Requirements Management Evangelist Rational Software pdf (155 K) In many organizations, software testing accounts for 30 to
More informationRANKING AND UNRANKING LEFT SZILARD LANGUAGES. Erkki Mäkinen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF TAMPERE REPORT A ER E P S I M S
N S ER E P S I M TA S UN A I S I T VER RANKING AND UNRANKING LEFT SZILARD LANGUAGES Erkki Mäkinen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF TAMPERE REPORT A-1997-2 UNIVERSITY OF TAMPERE DEPARTMENT OF
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationPerformance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database
Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized
More informationLING 329 : MORPHOLOGY
LING 329 : MORPHOLOGY TTh 10:30 11:50 AM, Physics 121 Course Syllabus Spring 2013 Matt Pearson Office: Vollum 313 Email: pearsonm@reed.edu Phone: 7618 (off campus: 503-517-7618) Office hrs: Mon 1:30 2:30,
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM
Proceedings of 28 ISFA 28 International Symposium on Flexible Automation Atlanta, GA, USA June 23-26, 28 ISFA28U_12 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Amit Gil, Helman Stern, Yael Edan, and
More informationExtracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models
Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),
More informationPrediction of Maximal Projection for Semantic Role Labeling
Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba
More informationSome Principles of Automated Natural Language Information Extraction
Some Principles of Automated Natural Language Information Extraction Gregers Koch Department of Computer Science, Copenhagen University DIKU, Universitetsparken 1, DK-2100 Copenhagen, Denmark Abstract
More informationApplications of memory-based natural language processing
Applications of memory-based natural language processing Antal van den Bosch and Roser Morante ILK Research Group Tilburg University Prague, June 24, 2007 Current ILK members Principal investigator: Antal
More informationSpeech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines
Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,
More informationCalifornia Department of Education English Language Development Standards for Grade 8
Section 1: Goal, Critical Principles, and Overview Goal: English learners read, analyze, interpret, and create a variety of literary and informational text types. They develop an understanding of how language
More informationUsing SAM Central With iread
Using SAM Central With iread January 1, 2016 For use with iread version 1.2 or later, SAM Central, and Student Achievement Manager version 2.4 or later PDF0868 (PDF) Houghton Mifflin Harcourt Publishing
More information1. Introduction. 2. The OMBI database editor
OMBI bilingual lexical resources: Arabic-Dutch / Dutch-Arabic Carole Tiberius, Anna Aalstein, Instituut voor Nederlandse Lexicologie Jan Hoogland, Nederlands Instituut in Marokko (NIMAR) In this paper
More informationEvolution of Symbolisation in Chimpanzees and Neural Nets
Evolution of Symbolisation in Chimpanzees and Neural Nets Angelo Cangelosi Centre for Neural and Adaptive Systems University of Plymouth (UK) a.cangelosi@plymouth.ac.uk Introduction Animal communication
More information