Automatic Machine Translation in Broadcast News Domain
Alexandre Gusmão
L2F/INESC-ID Lisboa, Rua Alves Redol, 9, Lisboa, Portugal
{ajag}@l2f.inesc-id.pt

Abstract. This paper describes the automatic translation system from Portuguese into English for the broadcast news domain, developed at L2F (Laboratório de Língua Falada), INESC-ID. It gives a brief introduction to the state of the art in automatic translation and describes the tools used to build the system and all the experiments carried out. A table at the end of the paper summarizes these experiments and the evolution of the BLEU values obtained.

1 Introduction

The purpose of this work is the development of an automatic translation system from Portuguese into English, from speech to text, for broadcast news. Such a system has several advantages; the main ones are that it allows foreign listeners to understand the news better and that it helps people with hearing problems.

Several approaches to building a machine translation system were studied, such as IBM models, syntax-based models and phrase-based models. This paper focuses on systems based on statistical methods. Statistical methods require a significant quantity of parallel text in the subject domain, and the lack of parallel texts in the field of broadcast news is one of the major problems addressed here.

The paper presents the experiments carried out towards building a translation system from Portuguese into English in the field of broadcast news, the problems faced and the solutions adopted, as well as the results obtained in each experiment.
2 State of the Art

Various approaches to machine translation have been proposed. Initially, the rule-based approaches gave the best results; however, they are not considered in detail here. The work of Brown revived interest in statistical methods, which are studied in more detail in this paper.

Rule-based translation systems[1] need plenty of linguistic information, and it is very difficult to write rules which cover the whole language. These systems can be classified as direct, transfer and interlingua systems, as illustrated in Figure 1.

Fig. 1. Vauquois Triangle

Statistical translation systems[2] use probability distributions instead of rigorous linguistic rules. They offer several advantages: probabilities are easy to work with, since they can be added and multiplied depending on the task; algorithms exist which learn to estimate the probability values automatically, without human intervention; and, contrary to the non-statistical approach, these systems do not require the manual development of linguistic rules.
However, these systems also have disadvantages, such as the difficulty of adapting a system to a different subject domain and the fact that they do not take the syntactic information of the phrases into account.

In the statistical approach, the probability distribution $P(F \mid E)$ over all possible pairs (F, E) is considered, and the translation with the highest probability is selected, that is,

$$\hat{E} = \arg\max_{E} P(F \mid E) \cdot P(E) \qquad (1)$$

where $P(F \mid E)$ is the translation model and $P(E)$ is the fluency model.

The translation models assign a probability to each alignment between the input phrase and the output phrase. An alignment[3] is simply a set of connections between the source phrase and the target phrase, in which each word of the target phrase is connected to only one word of the source phrase.

The IBM Models[4] are among the models used for that purpose. There are five IBM Models. Model 1 considers all possible connections between the source phrase and the target phrase; in Model 2 the word order in the phrases also influences the probability value. Models 3, 4 and 5 consider aspects such as word fertility (the number of target-phrase words connected to a source-phrase word), the identity (translation) of the words in the target language and, finally, the position occupied by each word in the target phrase.

In phrase-based systems[5] (where a phrase is a word sequence), translation involves segmenting the input into phrases, translating each phrase into the target language and reordering the translated phrases so that they form valid phrases of that language. These systems are trained on parallel aligned texts, using word-based translation models to align each phrase pair in the corpus at the word level. Due to the extreme difficulty of the search task, efficient algorithms are needed, such as A* or dynamic-programming beam search.
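The decision rule in Equation (1) can be illustrated with a minimal sketch. The probability tables below are made-up toy values, and the function name `best_translation` is hypothetical; a real system estimates both models from corpora and searches a far larger candidate space.

```python
import math


def best_translation(source, candidates, translation_model, language_model):
    """Return the candidate E that maximizes log P(F|E) + log P(E)."""
    def score(candidate):
        # Work in log space so that products of small probabilities stay stable.
        return (math.log(translation_model[(source, candidate)])
                + math.log(language_model[candidate]))
    return max(candidates, key=score)


# Toy values for illustration only (assumptions, not taken from the paper).
source = "bom dia"
candidates = ["good morning", "good day", "morning good"]
translation_model = {  # P(F | E)
    (source, "good morning"): 0.5,
    (source, "good day"): 0.4,
    (source, "morning good"): 0.5,
}
language_model = {  # P(E)
    "good morning": 0.02,
    "good day": 0.01,
    "morning good": 0.0001,
}

best = best_translation(source, candidates, translation_model, language_model)
print(best)
```

In this toy setting the translation model alone cannot separate "good morning" from "morning good"; the fluency model is what penalizes the badly ordered candidate.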
Due to the good results which these models have achieved, it was decided to build a phrase-based translation system.

Syntax-based systems[6] aim to introduce structural aspects of the language, and therefore employ operations such as word reordering, insertion of additional words and, finally, translation. Figure 2 illustrates an example of these operations.

Fig. 2. Reordering, introduction and translation operations

Translation quality can be evaluated by humans or by automatic metrics; the latter were chosen for this work. Some of the best-known automatic evaluation metrics are WER (Word Error Rate)[7], PER (Position-independent Word Error Rate)[7], NIST (National Institute of Standards and Technology)[8] and BLEU (Bilingual Evaluation Understudy)[9]. BLEU is the metric mainly used to evaluate the translation system built here. It compares the number of words shared between the candidate phrases and the reference phrases and is based on n-gram matching: instead of checking only whether each word of the translated phrase is found in the reference phrase, it checks whether sequences of words (up to 4 words) are found in both.

A speech translation system is built from two systems in sequence, one that recognizes speech into text and another that translates the text into the desired language. The Laboratório de Língua Falada (L2F) has several speech recognition systems, among them Audimus, a recognizer for the Portuguese language that also targets broadcast news. Among the existing speech translation systems, the one developed by the European project TC-STAR stands out as the most similar to the system developed here, because it includes an extensive vocabulary (more than […] words) and because its subject domain (European Parliament sessions) is close to broadcast news.

The broadcast news translation system was built by taking advantage of the Audimus speech recognizer and developing a phrase-based translation system adapted as well as possible to the subject domain.
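The n-gram matching behind BLEU, as described in this section, can be sketched as follows. This is a simplified, single-reference, sentence-level variant written for illustration; the standard metric is computed over a whole test corpus and may use several references, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precisions (n = 1..max_n) and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Each candidate n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


score = bleu("the cat sat on the mat", "the cat sat on a mat")
print(score)
```

Because the score is a geometric mean, a candidate with no matching 4-gram gets a score of zero in this unsmoothed sketch, which is one reason BLEU is normally reported at corpus level.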
3 The Translation System

This chapter describes the main stages of the translation system used in this paper: creation of the language model, standardization of the training corpus, training of the system, phrase-table filtering, tuning and evaluation.

Language models are frequently used in speech recognition and translation, where they try to predict the next word in a sequence of words. When a phrase is given to the translator, several possible translations are formed, and the job of the language model is to assign a weight to each of them.

The training corpus has to be standardized so that the vocabulary used to train the system is compatible with the vocabulary coming from the speech recognizer. Typical standardization tasks are expanding abbreviations to their full form and writing out Roman numerals, decimal numbers, dates, currency symbols, etc.

After the language model is built and the training corpus standardized, the system is trained, for example with the Moses tool. Alignments are obtained between the words of the two languages and a probability is assigned to each alignment. All these alignments are stored in a phrase-table.

Since the training vocabulary is based on the European Parliament sessions, the phrase-table has to be filtered so that it contains only words belonging to the broadcast news domain. The system is then faster and focused on the broadcast news area.

In a later stage the system is tuned: different values are tried for the features used by the Moses tool, over successive translation iterations, until the BLEU value obtained converges to a final value.
Tuning is one of the most time-consuming stages of building the translation system, and also one of the most important. It requires a development corpus based exclusively on the subject domain of the task, in this case broadcast news.

Finally, once the translation system is built, it has to be evaluated. Automatic evaluation metrics such as BLEU are used for this purpose: the higher the value, the better the quality of the translations produced by the system.

Some specific tools were used to build the translation system; the most important ones and their purpose are described in the remainder of this chapter.
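The phrase-table filtering step described in this chapter can be sketched as below. The sketch assumes a plain-text phrase-table whose fields are separated by `|||` with the source phrase first; the function name `filter_phrase_table`, the sample entries and their scores are hypothetical, and the filtering actually applied in the system may differ.

```python
def filter_phrase_table(lines, vocabulary):
    """Keep only entries whose source-side words all occur in `vocabulary`."""
    kept = []
    for line in lines:
        # Assumed layout: "source phrase ||| target phrase ||| scores"
        source_words = line.split("|||")[0].split()
        if source_words and all(word in vocabulary for word in source_words):
            kept.append(line)
    return kept


# Hypothetical entries and an in-domain word list, for illustration only.
phrase_table = [
    "o governo ||| the government ||| 0.5 0.4",
    "o parlamento europeu ||| the european parliament ||| 0.6 0.3",
]
in_domain_words = {"o", "governo"}

filtered = filter_phrase_table(phrase_table, in_domain_words)
print(filtered)
```

Dropping entries that can never match the in-domain vocabulary shrinks the table the decoder has to load, which is where the speed gain mentioned above comes from.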
SRILM[10] is a toolkit for building and applying statistical language models, intended mainly for speech recognition. It supports both the training and the testing of statistical language models. SRILM consists of a set of C++ libraries and a set of scripts which facilitate the execution of the tool's tasks.

The Moses[11] toolkit was used to train the system. Moses is a statistical machine translation system which allows translation models to be trained automatically for any language pair. Its components cover the training of language models, the training of translation models, system tuning and the evaluation of translated phrases. One of the main components used by Moses is GIZA++[12], which is used to train the translation model and to obtain the alignments mentioned above.

The corpora were the most problematic aspect of building the translation system. Had the translator been based on the European Parliament sessions, there would be no major problem, since these sessions are translated into various languages and more than enough parallel text exists to train a translation system in that context. However, the planned system targets broadcast news, and a study of the corpora in that field showed that they contain various words which have no translation in the European Parliament corpora. The lack of broadcast news corpora for training is therefore one of the main problems to face. The adopted solution was to train the system on the European Parliament corpus, while the development and test sets were based on broadcast news.
The development and test corpora were built from the euronews website, from which a total of 914 phrases was obtained: 457 written in Portuguese and the other 457 being their English translations.

4 Text Translation and Evaluation

This chapter describes the experiments conducted to create a text-to-text translator for the Portuguese-English language pair in the field of broadcast news.

The first translator developed was based on the European Parliament sessions and obtained a BLEU score of 0.3531 with tokenized, lowercased vocabulary (Condition A) and 0.3445 under the conditions of the Europarl system (Condition B). When the phrase-table of this system was filtered to contain only words from the broadcast news domain, the BLEU value was 0.2699 in Condition A and 0.2643 in Condition B, due to this filtering and to the fact that the test corpus used to evaluate the system was from the European Parliament domain.

The next step was the definition of a baseline system. First, the training corpus was standardized, and the phrase-table produced during model training was filtered so that it contained only phrases whose words were found in the word list of the development corpus. The training corpus still belonged to the European Parliament domain, while the language model and the development and test sets were based on broadcast news corpora. The baseline system obtained a BLEU value of 0.1705 in Condition A and 0.1650 in Condition B.

4.1 Experiment 1

In the first adaptation of the baseline system, the development and test corpora were revised, since many of the phrase pairs were comparable but not direct translations. Only a few adjustments were made to the Portuguese phrases. The same system with these revised corpora obtained a BLEU value of 0.4776 in Condition A and 0.4722 in Condition B.

4.2 Experiment 2

In this experiment a language model was built using the interpolation technique. Two corpora from different domains were used, one of broadcast news and the other of newspaper text. Perplexity was used to evaluate the resulting language models. The broadcast news language model, with […] words, had a perplexity of 154.453. The newspaper language model, with […] words, had a perplexity of 132.607. Interpolating the two models gave a final language model with […] words and a perplexity of 112.[…].

4.3 Experiment 3

In this experiment the interpolated language model was used in the translation system developed so far, which obtained a BLEU value of 0.4861 in Condition A and 0.4799 in Condition B.
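The interpolation and perplexity evaluation used in Experiments 2 and 3 can be sketched with toy unigram models. All probabilities, the held-out word sequence and the weight `lam` below are assumptions made for illustration; the models in the experiments are n-gram models built with a language modeling toolkit, and the interpolation weight is normally chosen on held-out data.

```python
import math


def interpolate(model_a, model_b, lam):
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    vocabulary = set(model_a) | set(model_b)
    return {w: lam * model_a.get(w, 0.0) + (1.0 - lam) * model_b.get(w, 0.0)
            for w in vocabulary}


def perplexity(model, tokens):
    """Perplexity = exp of the average negative log-probability per token."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Toy unigram models (hypothetical values).
broadcast = {"news": 0.7, "today": 0.2, "market": 0.1}
newspaper = {"news": 0.1, "today": 0.2, "market": 0.7}
mixed = interpolate(broadcast, newspaper, lam=0.5)

held_out = ["news", "market", "today", "news", "market"]
for name, model in [("broadcast", broadcast), ("newspaper", newspaper), ("mixed", mixed)]:
    print(name, perplexity(model, held_out))
```

On this toy held-out sequence the interpolated model has a lower perplexity than either component, which mirrors the trend reported for the interpolated language model in Experiment 2.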
4.4 Experiment 4

All experiments confirmed that recasing the phrases produced by the decoder slightly decreased the BLEU value. It was therefore decided to refine the way the recasing tool was trained, by joining two corpora from different domains, one of broadcast news and the other of newspaper text. Since newspaper text only approximates broadcast news transcriptions and does not belong to the same domain, the BLEU value did not increase; on the contrary, it decreased to […].

One more experiment was then made to look for a solution with positive results. The translation system was trained on a training corpus in which the English words were not all lowercased, so that the system itself learns to capitalize the appropriate words. The resulting BLEU value, 0.[…], was not satisfactory.

4.5 Experiment 5

As a last attempt to obtain a better BLEU value, some post-translation processing was used: the set of 1000 hypotheses produced by the translation system for each phrase is re-evaluated using some new features. The new features are the following:

- Difference in the number of words between the Portuguese phrase and the English hypothesis;
- POS (Part of Speech): rules for correspondence between Portuguese and English, and penalization patterns in English.

For the word-count feature, several experiments were made using combinations of the number of words in the Portuguese phrase, the number of words in the English phrase and the difference between the two. All combinations gave satisfactory BLEU values, but the difference in the number of words stood out, with the system obtaining a BLEU value of 0.5055.

The POS feature comprises two concepts. The first is the similarity between the POS tags in the two languages: matching tags are counted and each phrase receives a score according to the number of equivalences found.
The other concept is the use of penalization patterns: whenever the system finds a pattern classified as a penalization pattern in the English phrase, a penalty is applied to that phrase. With the POS feature, the translation system obtained a BLEU value of 0.4967. Finally, using both the POS feature and the difference in the number of words between the Portuguese and English phrases, the system obtained a final BLEU value of 0.[…].

4.6 Point of Comparison

To compare the translation system with another one, all phrases of the test corpus were translated with the translation system provided by the Google search engine. The BLEU value obtained by that system was 0.4102, which is far below the value obtained by the system created in this paper.

5 Conclusion and Future Work

To improve translation quality in future work, some approaches can be developed for OOV (out-of-vocabulary) words. For example, a dictionary can be used containing all such words and their translations. In this case, however, about […] words in Portuguese and […] in English do not exist in the European Parliament corpus, which makes it impractical to enter all these words and their translations by hand. Another solution is to use a website from which all the verb forms of the verbs contained in the training corpus can be obtained. Yet another alternative is to copy words which appear identically in the Portuguese and English training corpus, that is, words which have no translation, such as proper names.

The main conclusions drawn from this paper point to the use of:

- A training corpus from the same domain as the target task (in this case it was not possible to use a broadcast news corpus);
- A language model interpolated with another from a similar context (here, a model from the newspaper text domain);
- A clean development corpus, that is, one with correctly translated phrases;
- Some post-translation processing, using the features described in the previous chapter, which can maximize the system's BLEU value by choosing the best phrase among the N possible.

The combination of all these factors resulted in an automatic translation system for the broadcast news context with a BLEU value of 0.5088.

The corpus used to train the translation system was always based on the European Parliament sessions, since there are not sufficient resources available for the broadcast news context.
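The post-translation processing from Experiment 5, which picks the best phrase among the N hypotheses, can be sketched with the word-count difference feature alone. How the decoder score and the new feature were actually combined is not stated in the paper, so the linear combination, the weight `length_weight`, the function name `rerank` and the example scores are all assumptions.

```python
def rerank(source, hypotheses, length_weight=0.5):
    """Pick the best hypothesis from (translation, decoder_score) pairs.

    Higher scores are better; each hypothesis is penalized by the absolute
    difference in word count between the source and the translation.
    """
    source_length = len(source.split())

    def rescored(item):
        translation, decoder_score = item
        length_difference = abs(source_length - len(translation.split()))
        return decoder_score - length_weight * length_difference

    return max(hypotheses, key=rescored)[0]


# Hypothetical 2-best list; a real list would hold up to 1000 hypotheses.
source = "o governo aprovou a nova lei"
n_best = [
    ("the government approved", -2.0),
    ("the government approved the new law", -2.3),
]

best = rerank(source, n_best)
print(best)
```

Here the decoder alone prefers the truncated first hypothesis, while the length feature promotes the complete translation. A POS-based feature could be added to `rescored` in the same additive way.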
For the language model, two corpora of different domains were interpolated, one of broadcast news[13] and the other of newspaper text[14]. The development and test corpora were always based on the broadcast news domain. However, these corpora were corrected so that the system could produce better translations and consequently obtain a better BLEU value. Table 1 lists the types of corpora used and their descriptions for the broadcast news domain.
| Type of corpus | Description |
|---|---|
| Language model | Set of phrases, based on broadcast news and newspapers, written only in the destination language. |
| Training corpus | Parallel corpus, based on European Parliament. |
| Development corpus | Parallel corpus, based on broadcast news. |
| Test corpus | Parallel corpus, based on broadcast news. |

Table 1. Corpus and description.

For a better understanding of the experiments carried out, Table 2 lists all of them, with a brief description and the BLEU values obtained.

| Experiment | Description | BLEU (Condition A) | BLEU (Condition B) |
|---|---|---|---|
| Experiment 1 | Baseline system, language model based on broadcast news, training corpus based on European Parliament, tuning and test corpora changed | 0.4776 | 0.4722 |
| Experiment 2 | Language model interpolation | - | - |
| Experiment 3 | Translation system with an interpolated language model | 0.4861 | 0.4799 |
| Experiment 4 | Enhancement of the training corpus of the recase system with newspaper texts | […] | […] |
| | Automatic capitalization system | […] | […] |
| Experiment 5 | Reprocessing of the obtained translations (new features) | […] | […] |

Table 2. Experiments

References

1. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall (2000)
2. Ney, H.: One decade of statistical machine translation: 1996-2005. In: Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen, Germany (2005)
3. Knight, K.: Translation with finite-state devices. CA (2006)
4. Brown, P.: The mathematics of machine translation: Parameter estimation. In: Computational Linguistics. Volume 19. (1993)
5. Koehn, P.: Introduction to statistical machine translation (2005)
6. Marcu, D.: SPMT: Statistical machine translation with syntactified target language phrases. Language Weaver Inc., 4640 Admiralty Way, Suite 1210, Marina del Rey, CA (2006)
7. Ueffing, N., Ney, H.: Bayes decision rules and confidence measures for statistical machine translation. Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen University, Ahornstrasse 55, 52056 Aachen, Germany (2004)
8. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. (2002)
9. Papineni, K.: BLEU: a method for automatic evaluation of machine translation. IBM Research Report (2001)
10. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. International Conference on Spoken Language Processing. Volume 2., Denver, CO (September 2002)
11. Koehn, P.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, Association for Computational Linguistics (June 2007)
12. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Volume 29. (2003)
13. MacIntyre, R.: LDC catalog number LDC98T31. (1998)
14. Graff, D.: LDC catalog number LDC95T21, ISBN […] (1995)
Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition
More informationModule 12. Machine Learning. Version 2 CSE IIT, Kharagpur
Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should
More informationThe RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017
The RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017 Jan-Thorsten Peter, Andreas Guta, Tamer Alkhouli, Parnia Bahar, Jan Rosendahl, Nick Rossenbach, Miguel
More informationContent Language Objectives (CLOs) August 2012, H. Butts & G. De Anda
Content Language Objectives (CLOs) Outcomes Identify the evolution of the CLO Identify the components of the CLO Understand how the CLO helps provide all students the opportunity to access the rigor of
More informationCommon Core Standards Alignment Chart Grade 5
Common Core Standards Alignment Chart Grade 5 Units 5.OA.1 5.OA.2 5.OA.3 5.NBT.1 5.NBT.2 5.NBT.3 5.NBT.4 5.NBT.5 5.NBT.6 5.NBT.7 5.NF.1 5.NF.2 5.NF.3 5.NF.4 5.NF.5 5.NF.6 5.NF.7 5.MD.1 5.MD.2 5.MD.3 5.MD.4
More informationThe taming of the data:
The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data
More informationSwitchboard Language Model Improvement with Conversational Data from Gigaword
Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword