Effect of additional in-domain parallel corpora in biomedical statistical machine translation

Antonio Jimeno-Yepes 1,3 and Aurélie Névéol 2,3

1 NICTA Victoria Research Lab, Melbourne VIC 3010, Australia
2 LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France
3 National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
antonio.jimeno@gmail.com, neveol@limsi.fr

Abstract. Most institutional and research information in the biomedical domain is available only as English text. This is a limitation for non-native English speakers and individuals with low English proficiency. Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we introduced a method to automatically develop corpora for training and evaluating statistical machine translation systems. This method was intended to work with MEDLINE, so it was limited to the resources available from the journals indexed in MEDLINE containing information in more than one language. In the current work, we have added in-domain corpora obtained from the UMLS Metathesaurus and a corpus obtained from the European Medicines Agency. Preliminary results indicate that adding in-domain corpora to our previously developed set slightly improves the translation performance. Most of the improvement is observed when the additional in-domain corpora are used to improve word alignment.

1 Introduction

Most institutional and research information in the biomedical domain is available only as English text. This is a significant limitation for non-English speakers even in countries where English is an official language, such as the United States or Australia. This renders available biomedical information effectively inaccessible to the high number of non-native English speakers and individuals with low English proficiency. Advances in statistical machine translation (SMT) might improve this situation.
Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we presented a method to automatically develop a corpus for training and evaluating a biomedical SMT system [1] for the language pairs English (EN)/French (FR) and English/Spanish (ES). This method was intended to work on specific MEDLINE® records available from certain journals with information in multiple languages on the journal website. In this work, we present preliminary results of experiments adding more in-domain corpora from various sources to the MEDLINE corpus. We have extended this corpus with other biomedical resources, namely the Unified Medical Language System® (UMLS) and a corpus developed by the European Medicines Agency (EMA). We show that adding in-domain resources improves the performance of word alignment.

2 Methods

We have used several in-domain corpora from different resources to train an SMT system. We first present the MEDLINE corpus, which was used in the initial system and serves as the baseline corpus. Then, we present the UMLS and the EMEA sets. Finally, we present the SMT system used for training and for evaluating the corpora on the MEDLINE set.

2.1 MEDLINE corpus

MEDLINE currently indexes about 5,200 journals in the biomedical domain. Although most of them publish articles in English, 22% of the articles indexed in MEDLINE were written in a language other than English. From the available citations, the DOI points to the journal article, which in some cases contains the abstract in English and in the original language. We used the corpus described in [1]. This was built using a Python script (available upon request from the authors) to obtain a corpus of MEDLINE titles and abstracts, extending the corpus used in [2].

2.2 UMLS data set

The UMLS [3] provides a large resource of knowledge and tools to create, process, retrieve, integrate and/or aggregate biomedical and health data. The UMLS has three main components: the Metathesaurus®, a compendium of biomedical and health terminological resources under a common representation, which contains lexical items for each concept, relations among concepts and possibly one or more definitions depending on the concept; the Semantic Network, which provides a categorization of Metathesaurus concepts into semantic types; and the SPECIALIST Lexicon, which contains the lexical information required for natural language processing and covers commonly occurring English words and biomedical vocabulary.
Concepts are assigned a unique identifier (CUI) linked to a set of synonyms, which denote alternative ways to represent the concept, for instance, in text. Some UMLS sources (e.g. MeSH, SNOMED) contain entries in different languages, available from the MRCONSO table. We have processed MRCONSO (from UMLS2012AA) to extract terms in EN that can potentially be paired with their FR/ES counterpart. We paired terms in EN and FR/ES that had all of the following in common: CUI, vocabulary ID, term type (e.g. primary preferred term). From this list of terms, since we target high precision in the selection of term pairs, we removed entries in which at least one of the terms contained one of the following symbols: / : - . , ) ( [ ] > We also removed entries with the abbreviation NOS (Not Otherwise Specified) and any entry in which at least one of the terms was in all capital letters. As can be seen from Table 1, the UMLS lexicon contains entries with exact term translations (e.g. Warthin Tumor/Tumeur de Warthin) as well as entries with synonyms (e.g. Warthin Tumor/Cystadénolymphome), reflecting differences in term variation between languages.

English          French
Adenolymphoma    Adénolymphome
Warthin Tumor    Cystadénolymphome
Warthin Tumor    Cystadénolymphome Papillaire
Warthin Tumor    Tumeur de Warthin

English          Spanish
Adenolymphoma    Adenolinfoma
Warthin Tumor    Adenocistoma Papilar Linfomatoso
Warthin Tumor    Adenolinfoma
Warthin Tumor    Cistadenoma Linfomatoso Papilar
Warthin Tumor    Tumor de Warthin

Table 1. C0001429 - EN/FR and EN/ES example

2.3 EMEA data set

The EMA (European Medicines Agency, http://www.ema.europa.eu/ema) is a European agency with a role similar to that of the United States Food and Drug Administration (FDA). Its mission is to harmonize national medicine regulatory bodies. Since national bodies use their original languages, this resource can be used to develop parallel corpora for SMT. EMEA [4] is a parallel corpus about medicinal products from EMA available in 22 European official languages, even though not all the documents are available in all languages. In total, there are about 1,500 documents for most languages. The documents are PDF files that are converted to text using pdftotext, and the identified sentences are aligned. In total, there are 1,092,568 sentences for the English/French language pair and 1,098,333 for the English/Spanish language pair. Table 2 shows aligned example sentences for the three languages.

2.4 Translation software

We have used the Moses [5] toolkit for statistical machine translation (SMT). Moses is a state-of-the-art open-source phrase-based SMT system.
The experiments with Moses involved three steps: training, tuning and testing. Support packages SRILM [6] and GIZA++ [7] were installed per the standard setup. During the training step, Moses learns word-to-word translation and distortion models based on IBM Models 1-5 [8]. These models are used to build a phrase table and a reordering model. During the tuning step, weights for the translation, reordering and language models are learned.

English    Abilify is a medicine containing the active substance aripiprazole.
French     Abilify est un médicament qui contient le principe actif aripiprazole.
Spanish    Abilify es un medicamento que contiene el principio activo aripiprazol.

Table 2. Example sentence in EN/FR/ES from EMEA

2.5 Data set preparation

Table 3 shows the final size of the corpus and the selection used in the experiments. The titles and abstract sentences were selected from the MEDLINE corpus. The UMLS term pairs were used only during the training step since only term mappings between the two languages are available. Since the EMEA data set is much larger than the MEDLINE corpus, we have used 200k sentence pairs for the training step and 30k sentence pairs for the tuning step.

French       Training    Tuning   Testing
Titles        458,543    57,317    57,317
Abstracts      17,351    17,365    28,881
UMLS          109,073         -         -
EMEA          200,000    30,000         -

Spanish      Training    Tuning   Testing
Titles        198,512    24,814    24,814
Abstracts       5,403     5,418     7,772
UMLS          449,101         -         -
EMEA          200,000    30,000         -

Table 3. Translation corpus

Training set                          Test set         EtF     FtE     EtS     StE
Titles                                Titles           47.39   47.93   49.93   50.63
                                      Abs sentences    19.29   21.12   25.00   25.59
Titles + Abstract Sentences           Titles           47.01   48.05   49.82   50.58
                                      Abs sentences    24.25   25.78   29.98   30.40
Titles + Abstract Sentences + UMLS    Titles           46.65   48.23   49.64   50.59
                                      Abs sentences    25.01   26.26   30.83   30.72
Titles + Abstract Sentences + EMEA    Titles           46.46   48.30   49.06   50.19
                                      Abs sentences    25.06   26.02   30.64   30.20

Table 4. Translation results. EtF (English to French), FtE (French to English), EtS (English to Spanish), StE (Spanish to English)
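The term-pair selection described in Section 2.2 can be illustrated with a minimal sketch. It is not the authors' script: the record layout (CUI, vocabulary ID, term type, language, term) is a simplified, hypothetical stand-in for MRCONSO rows, the vocabulary IDs and term types in the sample are invented for illustration, and the exact handling of NOS and capitalization is an assumption.

```python
from collections import defaultdict
from itertools import product

# Symbols listed in Section 2.2; a term containing any of them is rejected.
FORBIDDEN = set("/:-.,)([]>")

def is_clean(term):
    """Apply the three exclusion rules to a single term (assumed reading)."""
    if any(ch in FORBIDDEN for ch in term):
        return False
    if "NOS" in term.split():      # abbreviation "Not Otherwise Specified"
        return False
    if term.isupper():             # term written in all capital letters
        return False
    return True

def pair_terms(records, src="EN", tgt="FR"):
    """Pair src/tgt terms that share CUI, vocabulary ID and term type."""
    groups = defaultdict(lambda: defaultdict(list))
    for cui, vocab_id, term_type, lang, term in records:
        groups[(cui, vocab_id, term_type)][lang].append(term)
    pairs = []
    for by_lang in groups.values():
        for s, t in product(by_lang.get(src, []), by_lang.get(tgt, [])):
            # An entry is dropped if at least one side fails the filters.
            if is_clean(s) and is_clean(t):
                pairs.append((s, t))
    return pairs

# Hypothetical records: "V1"/"V2", "PT"/"SY" are made-up metadata values.
records = [
    ("C0001429", "V1", "PT", "EN", "Warthin Tumor"),
    ("C0001429", "V1", "PT", "FR", "Tumeur de Warthin"),
    ("C0001429", "V1", "SY", "EN", "Adenolymphoma"),
    ("C0001429", "V1", "SY", "FR", "Adénolymphome"),
    ("C0001429", "V1", "SY", "FR", "ADENOLYMPHOME NOS"),
    ("C0001429", "V2", "PT", "EN", "Tumor, Warthin"),
]
pairs = pair_terms(records)
```

Grouping on the full (CUI, vocabulary ID, term type) key before pairing keeps the selection conservative, which matches the stated goal of high precision; every surviving combination within a group is emitted, so synonyms can still be paired with each other as in Table 1.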

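The scores in Table 4 are BLEU scores [9], which are built from modified (clipped) n-gram precisions. The sketch below shows only that building block for a single sentence and a single reference; it omits the brevity penalty and the geometric mean over n-gram orders, and it is not the scoring script used in the experiments. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "Abilify is a medicine containing the active substance aripiprazole".split()
candidate = "Abilify is a medicine containing aripiprazole".split()

p1 = modified_precision(candidate, reference, 1)  # unigram precision
p2 = modified_precision(candidate, reference, 2)  # bigram precision
```

In this toy case every candidate unigram occurs in the reference, while one of the five candidate bigrams ("containing aripiprazole") does not, so the bigram precision is lower than the unigram precision. Multiplying by 100 gives values on the same scale as the precision figures quoted in the Discussion.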
3 Results

We have trained SMT models using different combinations of corpora, as specified in Table 4. The test set comprises MEDLINE title and abstract sentences. Table 4 presents the translation performance of the different models, evaluated using BLEU scores [9]. We find that while the translation performance on abstract sentences improves when using the UMLS set, the impact of the EMEA corpus on translation performance is inconsistent.

4 Discussion

In this work, we used a variety of in-domain corpora to train biomedical SMT systems. We expected the UMLS lexicon to contribute to word alignment and the EMEA corpus to contribute to the sentence structure generated by the trained SMT model. Table 4 shows that the translation of abstract sentences improves with the UMLS vocabulary, but this improvement is not reflected in titles. This might be due to less vocabulary variety in titles as compared with abstract sentences. The fact that some entries in the UMLS corpus are synonyms instead of direct translations might also be an impediment in the alignment phase, especially for multi-word terms.

Unigram and bigram precision scores (not shown; p1 > 60, p2 > 35 for EN/ES) show good overall performance. The use of the UMLS and EMEA corpora has a higher impact on unigrams and bigrams than on higher-order n-grams. Also, precision values increase more when using the UMLS corpus than when using EMEA.

The EMEA corpus is a different genre of biomedical text compared to MEDLINE citations. It seems that the language usage in EMEA is sufficiently different from MEDLINE that it does not always result in a better model. These results are in line with similar results on biomedical [10] and non-biomedical [11, 12] data sets, in which in-domain and out-of-domain corpora helped to improve the word alignment probabilities.

5 Conclusions and Future Work

We have introduced and reused additional methods to obtain in-domain parallel corpora to train an SMT system for the biomedical domain.
The combination of these in-domain corpora improves word alignment, while using these corpora for tuning the model seems to decrease the translation performance. We would like to further evaluate how the size of the corpora used in the tuning step affects translation performance. In addition, we would like to research the contribution of out-of-domain corpora (e.g. Europarl [13]) to both word alignment and model tuning. The current evaluation has been performed on MEDLINE records and journal abstracts. In future work, we would like to extend this evaluation to EMEA sentences and UMLS records, which might contribute to the development of these resources. Finally, the current work is focused on two language pairs. The techniques used in this work are not language dependent and can easily be extended to other languages.

6 Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.

References

1. Jimeno-Yepes, A., Prieur-Gaston, E., Névéol, A.: Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text. BMC Bioinformatics (in press) (2013)
2. Wu, C., Xia, F., Deleger, L., Solti, I.: Statistical machine translation for biomedical text: are we there yet? In: AMIA Annual Symposium Proceedings. Volume 2011., American Medical Informatics Association (2011) 1290
3. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32(Database Issue) (2004) D267
4. Tiedemann, J.: News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R., eds.: Recent Advances in Natural Language Processing. Volume V. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria (2009) 237-248
5. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Annual meeting-association for computational linguistics. Volume 45. (2007) 2
6. Stolcke, A.: SRILM-an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing. Volume 2. (2002) 901-904
7. Och, F., Ney, H.: A systematic comparison of various statistical alignment models. Computational linguistics 29(1) (2003) 19-51
8. Brown, P., Pietra, V., Pietra, S., Mercer, R.: The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2) (1993) 263-311
9. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics (2002) 311-318
10. Eck, M., Vogel, S., Waibel, A.: Improving statistical machine translation in the medical domain using the unified medical language system. In: Proceedings of the 20th international conference on Computational Linguistics, Association for Computational Linguistics (2004) 792
11. Duh, K., Sudoh, K., Tsukada, H.: Analysis of translation model adaptation in statistical machine translation. In: Proceedings of the International Workshop on Spoken Language Translation (IWSLT 10), Paris, France. (2010)
12. Haddow, B., Koehn, P.: Analysing the effect of out-of-domain data on smt systems. In: Proceedings of the Seventh Workshop on Statistical Machine Translation. (2012) 422-432
13. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT summit. Volume 5. (2005)