Effect of additional in-domain parallel corpora in biomedical statistical machine translation
Antonio Jimeno-Yepes 1,3 and Aurélie Névéol 2,3
1 NICTA Victoria Research Lab, Melbourne VIC 3010, Australia
2 LIMSI-CNRS, BP 133, Orsay Cedex, France
3 National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
antonio.jimeno@gmail.com, neveol@limsi.fr

Abstract. Most institutional and research information in the biomedical domain is available only as English text. This is a limitation for non-native English speakers and individuals with low English proficiency. Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we introduced a method to automatically develop corpora for training and evaluating statistical machine translation systems. This method was intended to work with MEDLINE, and so was limited to the resources available from the journals indexed in MEDLINE that contain information in more than one language. In the current work, we have added in-domain corpora obtained from the UMLS Metathesaurus and a corpus obtained from the European Medicines Agency. Preliminary results indicate that adding in-domain corpora to our previously developed set slightly improves the translation performance. Most of the improvement is observed when the additional in-domain corpora are used to improve word alignment.

1 Introduction

Most institutional and research information in the biomedical domain is available only as English text. This is a significant limitation for non-English speakers, even in countries where English is an official language, such as the United States or Australia. It renders available biomedical information effectively inaccessible to the large number of non-native English speakers and individuals with low English proficiency. Advances in statistical machine translation (SMT) might improve this situation.
Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we presented a method to automatically develop a corpus for training and evaluating a biomedical SMT system [1] for the language pairs English (EN)/French (FR) and English/Spanish (ES). This method was intended to work on specific MEDLINE® records available from certain journals with information in multiple languages on the journal website. In this work, we present preliminary results of experiments adding more in-domain corpora from various sources to the MEDLINE corpus. We have extended this corpus with other biomedical resources, namely the Unified Medical Language System® (UMLS) and a corpus developed by the European Medicines Agency (EMA). We show that adding in-domain resources improves the performance of word alignment.

2 Methods

We have used several in-domain corpora from different resources to train an SMT system. We first present the MEDLINE corpus, which was used in the initial system and serves as the baseline corpus. Then, we present the UMLS and the EMEA sets. Finally, we present the SMT software used for training and for evaluation on the MEDLINE set.

2.1 MEDLINE corpus

MEDLINE currently indexes about 5,200 journals in the biomedical domain. Although most of them publish articles in English, 22% of the articles indexed in MEDLINE were written in a language other than English. For the available citations, the DOI points to the journal article, which in some cases contains the abstract both in English and in the original language. We used the corpus described in [1]. This was built using a Python script (available upon request from the authors) to obtain a corpus of MEDLINE titles and abstracts, extending the corpus used in [2].

2.2 UMLS data set

The UMLS [3] provides a large resource of knowledge and tools to create, process, retrieve, integrate and/or aggregate biomedical and health data. The UMLS has three main components: the Metathesaurus®, a compendium of biomedical and health terminological resources under a common representation, which contains lexical items for each concept, relations among concepts and, depending on the concept, one or more definitions; the Semantic Network, which categorizes Metathesaurus concepts into semantic types; and the SPECIALIST Lexicon, which contains the lexical information required for natural language processing and covers commonly occurring English words and biomedical vocabulary.
Each concept is assigned a concept unique identifier (CUI), to which a set of synonyms is linked. The synonyms denote alternative ways to represent the concept, for instance, in text. Some UMLS sources (e.g. MeSH, SNOMED) contain entries in different languages, available from the MRCONSO table. We have processed MRCONSO (from UMLS2012AA) to extract terms in EN that can potentially be paired with their FR/ES counterpart. We paired terms in EN and FR/ES that had all of the following in common: CUI, vocabulary ID and term type (e.g. primary preferred term). Since we target high precision in the selection of term pairs, we removed from this list the entries in which at least one of the terms contained one of the following symbols: / : -., ) ( [ ] > We also removed entries with the abbreviation NOS (Not Otherwise Specified) and any entry in which at least one of the terms was in all capital letters.

As can be seen from Table 1, the UMLS lexicon contains entries with exact term translations (e.g. Warthin Tumor/Tumeur de Warthin) as well as entries with synonyms (e.g. Warthin Tumor/Cystadénolymphome), reflecting differences in term variation between languages.

English          French
Adenolymphoma    Adénolymphome
Warthin Tumor    Cystadénolymphome
Warthin Tumor    Cystadénolymphome Papillaire
Warthin Tumor    Tumeur de Warthin

English          Spanish
Adenolymphoma    Adenolinfoma
Warthin Tumor    Adenocistoma Papilar Linfomatoso
Warthin Tumor    Adenolinfoma
Warthin Tumor    Cistadenoma Linfomatoso Papilar
Warthin Tumor    Tumor de Warthin

Table 1. C EN/FR and EN/ES example

2.3 EMEA data set

The European Medicines Agency (EMA) is a European agency with a role similar to that of the United States Food and Drug Administration (FDA). Its mission is to harmonize national medicine regulatory bodies. Since national bodies use their original languages, this resource can be used to develop parallel corpora for SMT. EMEA [4] is a parallel corpus about medicinal products from EMA, available in 22 official European languages, although not all documents are available in all languages. In total, there are about 1,500 documents for most languages. The documents are PDF files that are converted to text using pdftotext; sentences are then identified and aligned. In total, there are 1,092,568 sentences for the English/French language pair and 1,098,333 for the English/Spanish language pair. Table 2 shows aligned example sentences for the three languages.

2.4 Translation software

We have used the Moses [5] toolkit for statistical machine translation (SMT). Moses is a state-of-the-art open-source phrase-based SMT system.
The experiments with Moses involved three steps: training, tuning and testing. Support packages SRILM [6] and GIZA++ [7] were installed per the standard Model setup. During the training step, Moses learns word-to-word translation and distortion models based on IBM Models 1-5 [8]. These models are used to build a phrase table and a reordering model. During the tuning step, weights for the translation, reordering and language models are learned.

English: Abilify is a medicine containing the active substance aripiprazole.
French:  Abilify est un médicament qui contient le principe actif aripiprazole.
Spanish: Abilify es un medicamento que contiene el principio activo aripiprazol.

Table 2. Example sentence in EN/FR/ES from EMEA

2.5 Data set preparation

Table 3 shows the final size of the corpus and the selection used in the experiments. The title and abstract sentences were selected from the MEDLINE corpus. The UMLS term pairs were used only during the training step, since only term mappings between language pairs are available. Since the EMEA data set is much larger than the MEDLINE corpus, we have used 200k sentence pairs for the training step and 30k sentence pairs for the tuning step.

French       Training   Tuning   Testing
Titles       458,543    57,317   57,317
Abstracts    17,351     17,365   28,881
UMLS         109,
EMEA         200,000    30,000   -

Spanish      Training   Tuning   Testing
Titles       198,512    24,814   24,814
Abstracts    5,403      5,418    7,772
UMLS         449,
EMEA         200,000    30,000   -

Table 3. Translation corpus

Training set                           Test set         EtF   FtE   EtS   StE
Titles                                 Titles
                                       Abs sentences
Titles + Abstract Sentences            Titles
                                       Abs sentences
Titles + Abstract Sentences + UMLS     Titles
                                       Abs sentences
Titles + Abstract Sentences + EMEA     Titles
                                       Abs sentences

[The BLEU scores for each cell are missing from this transcription.]

Table 4. Translation results. EtF (English to French), FtE (French to English), EtS (English to Spanish), StE (Spanish to English)
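The selection steps described in Sections 2.2 and 2.5 can be illustrated with a minimal Python sketch. This is not the authors' script: the function names, the fixed random seed, the choice of random disjoint sampling for the EMEA subsets, and the reading of "all capital letters" as Python's `str.isupper()` are assumptions made for illustration. The symbol list is copied from the text as it appears in this transcription.

```python
import random

# Symbols listed in Section 2.2, as they appear in the text.
EXCLUDED_SYMBOLS = set("/:-.,)([]>")


def keep_term_pair(source_term, target_term):
    """Return True if a UMLS term pair passes the high-precision filters."""
    for term in (source_term, target_term):
        # Reject terms that contain any of the excluded symbols.
        if any(ch in EXCLUDED_SYMBOLS for ch in term):
            return False
        # Reject terms carrying the abbreviation NOS (Not Otherwise Specified).
        # Assumption: NOS is matched as a whitespace-delimited token.
        if "NOS" in term.split():
            return False
        # Assumption: "all capital letters" means every cased character is upper case.
        if term.isupper():
            return False
    return True


def split_train_tune(sentence_pairs, n_train, n_tune, seed=0):
    """Draw disjoint training and tuning samples from a larger parallel corpus.

    Assumption: the subsets are drawn at random without overlap; the paper
    only states their sizes (200k for training, 30k for tuning).
    """
    if n_train + n_tune > len(sentence_pairs):
        raise ValueError("corpus is smaller than the requested samples")
    rng = random.Random(seed)
    sample = rng.sample(sentence_pairs, n_train + n_tune)
    return sample[:n_train], sample[n_train:]
```

With these definitions, `keep_term_pair("Warthin Tumor", "Tumeur de Warthin")` keeps the pair from Table 1, while a pair containing a parenthesis or the token NOS is dropped. Calling `split_train_tune(pairs, 200_000, 30_000)` on a list of EMEA sentence pairs would give subsets of the sizes reported in Table 3.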
3 Results

We have trained SMT models using different combinations of corpora, as specified in Table 4. The test set comprises MEDLINE title and abstract sentences. Table 4 presents the translation performance of the different models, evaluated using BLEU scores [9]. We find that, while the translation performance on abstract sentences improves when the UMLS set is used, the EMEA corpus has an inconsistent impact on translation performance.

4 Discussion

In this work, we used a variety of in-domain corpora to train biomedical SMT systems. We expected the UMLS lexicon to contribute to word alignment and the EMEA corpus to contribute to the sentence structure generated by the trained SMT model. Table 4 shows that the translation of abstract sentences improves with the UMLS vocabulary, but this improvement is not observed for title sentences. This might be due to less vocabulary variety in titles than in abstract sentences. The fact that some entries in the UMLS corpus are synonyms instead of direct translations might also be an impediment in the alignment phase, especially for multi-word terms.

Unigram and bigram precision scores are high overall (not shown; p1 > 60, p2 > 35 for EN/ES). The use of the UMLS and EMEA corpora has a higher impact on unigrams and bigrams than on higher-order n-grams. Also, precision values increase more when using the UMLS corpus than when using EMEA.

The EMEA corpus is a different genre of biomedical text compared to MEDLINE citations. It seems that language usage in EMEA is sufficiently different from MEDLINE that it does not always result in a better model. Our results are in line with similar results on biomedical [10] and non-biomedical [11, 12] data sets, in which in-domain and out-of-domain corpora helped to improve the word alignment probabilities.

5 Conclusions and Future Work

We have introduced and reused additional methods to obtain in-domain parallel corpora to train an SMT system for the biomedical domain.
The combination of these in-domain corpora improves word alignment, while using these corpora for tuning the model seems to decrease the translation performance. We would like to further evaluate how the size of the corpora used in the tuning step affects the performance of the translation model. In addition, we would like to investigate the contribution of out-of-domain corpora (e.g. Europarl [13]) to both word alignment and model tuning. The current evaluation has been performed on MEDLINE records and journal abstracts. In future work, we would like to extend this evaluation to EMEA sentences and UMLS records, which might contribute to the development of these resources. Finally, the current work is focused on two language pairs. The techniques used in this work are not language dependent and can be extended easily to other languages.
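The unigram and bigram precision scores (p1, p2) mentioned in the Discussion are the clipped n-gram precisions that BLEU [9] combines. As a minimal sketch of how such a score is computed, assuming whitespace-tokenized sentences and a single reference per candidate, the following Python function returns the precision as a fraction between 0 and 1 (the values quoted in the Discussion appear to be on a 0-100 scale). The function names and the toy sentences in the usage note are illustrative and are not taken from the paper's data or tooling.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidates, references, n):
    """Corpus-level clipped n-gram precision, one reference per candidate.

    Each candidate n-gram is credited at most as many times as it occurs
    in the corresponding reference (the "clipping" used by BLEU).
    """
    matched = 0
    total = 0
    for cand, ref in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        matched += sum(min(count, ref_counts[gram])
                       for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total if total else 0.0
```

For example, with the made-up candidate `"el tumor de Warthin es benigno".split()` and reference `"el tumor de Warthin es un tumor benigno".split()`, all six candidate unigrams occur in the reference, while four of the five candidate bigrams do, so the bigram precision is lower than the unigram precision. Higher-order n-grams are harder to match still, which is why added vocabulary resources tend to show up first in p1 and p2.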
6 Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.

References

1. Jimeno-Yepes, A., Prieur-Gaston, E., Névéol, A.: Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text. BMC Bioinformatics (in press) (2013)
2. Wu, C., Xia, F., Deleger, L., Solti, I.: Statistical machine translation for biomedical text: are we there yet? In: AMIA Annual Symposium Proceedings. Volume 2011, American Medical Informatics Association (2011)
3. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database Issue) (2004)
4. Tiedemann, J.: News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R., eds.: Recent Advances in Natural Language Processing. Volume V. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria (2009)
5. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics. Volume 45. (2007)
6. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing. Volume 2. (2002)
7. Och, F., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1) (2003)
8. Brown, P., Pietra, V., Pietra, S., Mercer, R.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2) (1993)
9. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics (2002)
10. Eck, M., Vogel, S., Waibel, A.: Improving statistical machine translation in the medical domain using the unified medical language system. In: Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics (2004)
11. Duh, K., Sudoh, K., Tsukada, H.: Analysis of translation model adaptation in statistical machine translation. In: Proceedings of the International Workshop on Spoken Language Translation (IWSLT 10), Paris, France. (2010)
12. Haddow, B., Koehn, P.: Analysing the effect of out-of-domain data on SMT systems. In: Proceedings of the Seventh Workshop on Statistical Machine Translation. (2012)
13. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT Summit. Volume 5. (2005)
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationA process by any other name
January 05, 2016 Roger Tregear A process by any other name thoughts on the conflicted use of process language What s in a name? That which we call a rose By any other name would smell as sweet. William
More informationGRADUATE STUDENT HANDBOOK Master of Science Programs in Biostatistics
2017-2018 GRADUATE STUDENT HANDBOOK Master of Science Programs in Biostatistics Entrance requirements, program descriptions, degree requirements and other program policies for Biostatistics Master s Programs
More informationLanguage Independent Passage Retrieval for Question Answering
Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University
More informationThe taming of the data:
The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data
More informationCOVER SHEET. This is the author version of article published as:
COVER SHEET This is the author version of article published as: Sivapalan, Siva and Cregan, Peter (2005) Value of online resources for learning by distance education. CAL-laborate 14:pp. 23-27. Copyright
More informationProceedings Chapter. Reference. Combining pre-editing and post-editing to improve SMT of user-generated content. GERLACH, Johanna, et al.
Proceedings Chapter Combining pre-editing and post-editing to improve SMT of user-generated content GERLACH, Johanna, et al. Abstract The poor quality of user-generated content (UGC) found in forums hinders
More informationSpecification of the Verity Learning Companion and Self-Assessment Tool
Specification of the Verity Learning Companion and Self-Assessment Tool Sergiu Dascalu* Daniela Saru** Ryan Simpson* Justin Bradley* Eva Sarwar* Joohoon Oh* * Department of Computer Science ** Dept. of
More informationA hybrid approach to translate Moroccan Arabic dialect
A hybrid approach to translate Moroccan Arabic dialect Ridouane Tachicart Mohammadia school of Engineers Mohamed Vth Agdal University, Rabat, Morocco tachicart@gmail.com Karim Bouzoubaa Mohammadia school
More informationInteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial ISSN:
Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial ISSN: 1137-3601 revista@aepia.org Asociación Española para la Inteligencia Artificial España Lucena, Diego Jesus de; Bastos Pereira,
More informationPre-editing by Forum Users: a Case Study
Pre-editing by Forum Users: a Case Study Pierrette Bouillon 1, Liliana Gaspar 2, Johanna Gerlach 1, Victoria Porro 1, Johann Roturier 2 1 Université de Genève FTI/TIM - 40 bvd Du Pont-d Arve, CH-1211 Genève
More informationBMC Medical Informatics and Decision Making 2012, 12:33
BMC Medical Informatics and Decision Making This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationMACAQ : A Multi Annotated Corpus to study how we adapt Answers to various Questions
MACAQ : A Multi Annotated Corpus to study how we adapt Answers to various Questions Anne Garcia-Fernandez, Sophie Rosset, Anne Vilnat LIMSI - CNRS F-91403 Orsay Cedex {annegf, rosset, vilnat}@limsi.fr
More informationProcedia - Social and Behavioral Sciences 154 ( 2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 154 ( 2014 ) 263 267 THE XXV ANNUAL INTERNATIONAL ACADEMIC CONFERENCE, LANGUAGE AND CULTURE, 20-22 October
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationEnglish-German Medical Dictionary And Phrasebook By A.H. Zemback
English-German Medical Dictionary And Phrasebook By A.H. Zemback If you are searching for a ebook English-German Medical Dictionary and Phrasebook by A.H. Zemback in pdf form, then you've come to loyal
More informationText-mining the Estonian National Electronic Health Record
Text-mining the Estonian National Electronic Health Record Raul Sirel rsirel@ut.ee 13.11.2015 Outline Electronic Health Records & Text Mining De-identifying the Texts Resolving the Abbreviations Terminology
More information1.2 Interpretive Communication: Students will demonstrate comprehension of content from authentic audio and visual resources.
Course French I Grade 9-12 Unit of Study Unit 1 - Bonjour tout le monde! & les Passe-temps Unit Type(s) x Topical Skills-based Thematic Pacing 20 weeks Overarching Standards: 1.1 Interpersonal Communication:
More informationSPANISH FOR MASTERY 3 PDF
SPANISH FOR MASTERY 3 PDF ==> Download: SPANISH FOR MASTERY 3 PDF SPANISH FOR MASTERY 3 PDF - Are you searching for Spanish For Mastery 3 Books? Now, you will be happy that at this time Spanish For Mastery
More informationModeling full form lexica for Arabic
Modeling full form lexica for Arabic Susanne Alt Amine Akrout Atilf-CNRS Laurent Romary Loria-CNRS Objectives Presentation of the current standardization activity in the domain of lexical data modeling
More informationPRAAT ON THE WEB AN UPGRADE OF PRAAT FOR SEMI-AUTOMATIC SPEECH ANNOTATION
PRAAT ON THE WEB AN UPGRADE OF PRAAT FOR SEMI-AUTOMATIC SPEECH ANNOTATION SUMMARY 1. Motivation 2. Praat Software & Format 3. Extended Praat 4. Prosody Tagger 5. Demo 6. Conclusions What s the story behind?
More informationProgram Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading
Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,
More informationEnglish Language and Applied Linguistics. Module Descriptions 2017/18
English Language and Applied Linguistics Module Descriptions 2017/18 Level I (i.e. 2 nd Yr.) Modules Please be aware that all modules are subject to availability. If you have any questions about the modules,
More informationCompositional Semantics
Compositional Semantics CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Words, bag of words Sequences Trees Meaning Representing Meaning An important goal of NLP/AI: convert natural language
More informationWest Windsor-Plainsboro Regional School District French Grade 7
West Windsor-Plainsboro Regional School District French Grade 7 Page 1 of 10 Content Area: World Language Course & Grade Level: French, Grade 7 Unit 1: La rentrée Summary and Rationale As they return to
More informationTIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE. Pierre Foy
TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE Pierre Foy TIMSS Advanced 2015 orks User Guide for the International Database Pierre Foy Contributors: Victoria A.S. Centurino, Kerry E. Cotter,
More informationTask Tolerance of MT Output in Integrated Text Processes
Task Tolerance of MT Output in Integrated Text Processes John S. White, Jennifer B. Doyon, and Susan W. Talbott Litton PRC 1500 PRC Drive McLean, VA 22102, USA {white_john, doyon jennifer, talbott_susan}@prc.com
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationOakland Unified School District English/ Language Arts Course Syllabus
Oakland Unified School District English/ Language Arts Course Syllabus For Secondary Schools The attached course syllabus is a developmental and integrated approach to skill acquisition throughout the
More informationHeritage Korean Stage 6 Syllabus Preliminary and HSC Courses
Heritage Korean Stage 6 Syllabus Preliminary and HSC Courses 2010 Board of Studies NSW for and on behalf of the Crown in right of the State of New South Wales This document contains Material prepared by
More informationA cognitive perspective on pair programming
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2006 Proceedings Americas Conference on Information Systems (AMCIS) December 2006 A cognitive perspective on pair programming Radhika
More information