Effect of additional in-domain parallel corpora in biomedical statistical machine translation

Antonio Jimeno-Yepes 1,3 and Aurélie Névéol 2,3
1 NICTA Victoria Research Lab, Melbourne VIC 3010, Australia
2 LIMSI-CNRS, BP 133, Orsay Cedex, France
3 National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
antonio.jimeno@gmail.com, neveol@limsi.fr

Abstract. Most institutional and research information in the biomedical domain is available only as English text. This is a limitation for non-native English speakers and individuals with low English proficiency. Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we introduced a method to automatically develop corpora for training and evaluating statistical machine translation systems. This method was intended to work with MEDLINE, so it was limited to the resources available from the journals indexed in MEDLINE that contain information in more than one language. In the current work, we have added in-domain corpora obtained from the UMLS Metathesaurus and a corpus obtained from the European Medicines Agency. Preliminary results indicate that adding in-domain corpora to our previously developed set slightly improves translation performance. Most of the improvement is observed when the additional in-domain corpora are used to improve word alignment.

1 Introduction

Most institutional and research information in the biomedical domain is available only as English text. This is a significant limitation for non-English speakers, even in countries where English is an official language, such as the United States or Australia. It renders available biomedical information effectively inaccessible to the large number of non-native English speakers and individuals with low English proficiency. Advances in statistical machine translation (SMT) might improve this situation.
Unfortunately, obtaining parallel corpora to train a statistical machine translation system is difficult. In previous work, we presented a method to automatically develop a corpus for training and evaluating a biomedical SMT system [1] for the language pairs English (EN)/French (FR) and English/Spanish (ES). This method was intended to work on specific MEDLINE® records available from certain journals with information in multiple languages on the journal website. In this work, we present preliminary results of experiments adding more in-domain corpora from various sources to the MEDLINE corpus. We have extended this corpus with other biomedical resources, namely the Unified Medical Language System® (UMLS) and a corpus developed by the European Medicines Agency (EMA). We show that adding in-domain resources improves the performance of word alignment.

2 Methods

We have used several in-domain corpora from different resources to train an SMT system. We first present the MEDLINE corpus that we used in the initial system and that serves as the baseline corpus. Then, we present the UMLS and the EMEA sets. Finally, we present the SMT system used to train on these corpora and to evaluate them on the MEDLINE set.

2.1 MEDLINE corpus

MEDLINE currently indexes about 5,200 journals in the biomedical domain. Although most of them publish articles in English, 22% of the articles indexed in MEDLINE were written in a language other than English. From the available citations, the DOI points to the journal article, which in some cases contains the abstract in English and in the original language. We used the corpus described in [1]. This was built using a Python script (available upon request from the authors) to obtain a corpus of MEDLINE titles and abstracts, extending the corpus used in [2].

2.2 UMLS data set

The UMLS [3] provides a large resource of knowledge and tools to create, process, retrieve, integrate and/or aggregate biomedical and health data. The UMLS has three main components: the Metathesaurus®, a compendium of biomedical and health terminological resources under a common representation, which contains lexical items for each of the concepts, relations among them and possibly one or more definitions depending on the concept; the Semantic Network, which provides a categorization of Metathesaurus concepts into semantic types; and the SPECIALIST lexicon, which contains lexical information required for natural language processing and covers commonly occurring English words and biomedical vocabulary.
Each concept is assigned a unique identifier (CUI), which is linked to a set of synonyms that denote alternative ways to represent the concept, for instance, in text. Some UMLS sources (e.g. MeSH, SNOMED) contain entries in different languages, available from the MRCONSO table. We have processed MRCONSO (from UMLS2012AA) to extract terms in EN that can potentially be paired with their FR/ES counterpart. We paired terms in EN and FR/ES that had all of the following in common: CUI, vocabulary ID, term type (e.g. primary preferred term). From this list of terms, since we target high precision in the selection of term pairs, we removed entries in which at least one of the terms contained one of the following symbols: / : - . , ) ( [ ] > We also removed entries with the abbreviation NOS (Not Otherwise Specified) and any entry in which at least one of the terms was in all capital letters. As can be seen from Table 1, the UMLS lexicon contains entries with exact term translations (e.g. Warthin Tumor/Tumeur de Warthin) as well as entries with synonyms (e.g. Warthin Tumor/Cystadénolymphome), reflecting differences in term variation between languages.

English          French
Adenolymphoma    Adénolymphome
Warthin Tumor    Cystadénolymphome
Warthin Tumor    Cystadénolymphome Papillaire
Warthin Tumor    Tumeur de Warthin

English          Spanish
Adenolymphoma    Adenolinfoma
Warthin Tumor    Adenocistoma Papilar Linfomatoso
Warthin Tumor    Adenolinfoma
Warthin Tumor    Cistadenoma Linfomatoso Papilar
Warthin Tumor    Tumor de Warthin

Table 1. C EN/FR and EN/ES example

2.3 EMEA data set

The EMA (European Medicines Agency) is a European agency with a role similar to that of the United States Food and Drug Administration (FDA). Its mission is to harmonize national medicine regulatory bodies. Since national bodies use their original languages, this resource can be used to develop parallel corpora for SMT. EMEA [4] is a parallel corpus about medicinal products from EMA, available in 22 official European languages, although not all documents are available in all languages. In total, there are about 1,500 documents for most languages. The documents are PDF files that are converted to text using pdftotext; sentences are then identified and aligned. In total, there are 1,092,568 sentences for the English/French language pair and 1,098,333 for the English/Spanish language pair. Table 2 shows aligned example sentences for the three languages.

2.4 Translation software

We have used the Moses [5] toolkit for Statistical Machine Translation (SMT). Moses is a state-of-the-art open-source phrase-based SMT system.
The experiments with Moses involved three steps: training, tuning and testing. Support packages SRILM [6] and GIZA++ [7] were installed per the standard Model setup. During the training step, Moses learns word-to-word translation and distortion models based on IBM Models 1-5 [8]. These models are used to build a phrase table and a reordering model. During the tuning step, weights for the translation, reordering and language models are learned.

English    Abilify is a medicine containing the active substance aripiprazole.
French     Abilify est un médicament qui contient le principe actif aripiprazole.
Spanish    Abilify es un medicamento que contiene el principio activo aripiprazol.

Table 2. Example sentence in EN/FR/ES from EMEA

2.5 Data set preparation

Table 3 shows the final size of the corpus and the selection used in the experiments. The titles and abstract sentences were selected from the MEDLINE corpus. The UMLS term pairs were used only during the training step, since only term mappings for each language pair are available. Since the EMEA data set is much larger than the MEDLINE corpus, we have used 200k sentence pairs for the training step and 30k sentences for the tuning step.

French       Training    Tuning    Testing
Titles       458,543     57,317    57,317
Abstracts    17,351      17,365    28,881
UMLS         109,
EMEA         200,000     30,000    -

Spanish      Training    Tuning    Testing
Titles       198,512     24,814    24,814
Abstracts    5,403       5,418     7,772
UMLS         449,
EMEA         200,000     30,000    -

Table 3. Translation corpus

Training set                          Test set         EtF    FtE    EtS    StE
Titles                                Titles
                                      Abs sentences
Titles + Abstract Sentences           Titles
                                      Abs sentences
Titles + Abstract Sentences + UMLS    Titles
                                      Abs sentences
Titles + Abstract Sentences + EMEA    Titles
                                      Abs sentences

Table 4. Translation results. EtF (English to French), FtE (French to English), EtS (English to Spanish), StE (Spanish to English)
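The UMLS term pairing and filtering described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' script: the five-field tuple layout is a simplified, hypothetical view of MRCONSO, the language codes and sample rows are illustrative, and the set of excluded symbols is taken from the list in Section 2.2 as it reads in this copy.

```python
import re
from collections import defaultdict

# Symbols that disqualify a term, following the list in Section 2.2
# (an assumption wherever the source text is ambiguous).
EXCLUDED_SYMBOLS = set("/:-.,)([]>")


def is_clean(term):
    """Return True if a term passes the high-precision filters."""
    if any(ch in EXCLUDED_SYMBOLS for ch in term):
        return False
    # Drop terms carrying the abbreviation NOS (Not Otherwise Specified).
    if re.search(r"\bNOS\b", term):
        return False
    # Drop terms written entirely in capital letters.
    letters = [ch for ch in term if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return False
    return True


def pair_terms(rows, source_lang="ENG", target_lang="FRE"):
    """Pair terms that share CUI, vocabulary ID and term type.

    `rows` is an iterable of (cui, language, vocabulary, term_type, term)
    tuples, a hypothetical simplification of the MRCONSO table.
    """
    by_key = defaultdict(lambda: {source_lang: [], target_lang: []})
    for cui, lang, vocab, term_type, term in rows:
        if lang in (source_lang, target_lang):
            by_key[(cui, vocab, term_type)][lang].append(term)

    pairs = []
    for group in by_key.values():
        for src in group[source_lang]:
            for tgt in group[target_lang]:
                # Keep a pair only if both sides pass every filter.
                if is_clean(src) and is_clean(tgt):
                    pairs.append((src, tgt))
    return pairs
```

Calling `pair_terms(rows)` on such tuples returns a list of (English, French) term pairs; passing `target_lang="SPA"` would do the same for Spanish under the same assumptions. A group that contains several synonyms on either side yields every combination, which matches the mix of exact translations and synonym pairs shown in Table 1.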

3 Results

We have trained SMT models using different combinations of corpora, as specified in Table 4. The test set comprises MEDLINE title and abstract sentences. Table 4 presents the translation performance of the different models, evaluated using BLEU scores [9]. We find that adding the UMLS set improves translation performance on abstract sentences, whereas adding the EMEA corpus has an inconsistent effect on translation performance.

4 Discussion

In this work, we used a variety of in-domain corpora to train biomedical SMT systems. We expected the UMLS lexicon to contribute to word alignment and the EMEA corpus to contribute to the sentence structure generated by the trained SMT model. Table 4 shows that the translation of abstract sentences improves with the UMLS vocabulary, but this improvement is not reflected in title sentences. This might be due to less vocabulary variety in titles than in abstract sentences. The fact that some entries in the UMLS corpus are synonyms instead of direct translations might also be an impediment in the alignment phase, especially for multi-word terms. Unigram and bigram precision scores (not shown; p1 > 60, p2 > 35 for EN/ES) are good overall. The use of the UMLS and EMEA corpora has a higher impact on unigrams and bigrams than on higher-order n-grams. Also, precision values increase more when using the UMLS corpus than when using EMEA. The EMEA corpus is a different genre of biomedical text compared to MEDLINE citations. It seems that language usage in EMEA is sufficiently different from MEDLINE that it does not always result in a better model. Results are in line with similar results on biomedical [10] and non-biomedical [11, 12] data sets, in which in-domain and out-of-domain corpora helped to improve the word alignment probabilities.

5 Conclusions and Future Work

We have introduced and reused additional methods to obtain in-domain parallel corpora to train an SMT system for the biomedical domain.
The combination of these in-domain corpora improves word alignment, while using these corpora for tuning the model seems to decrease translation performance. We would like to further evaluate different corpus sizes for the tuning step and their effect on the performance of the tuned translation model. In addition, we would like to investigate the contribution of out-of-domain corpora (e.g. Europarl [13]) to both word alignment and model tuning. The current evaluation has been performed on MEDLINE records and journal abstracts. In future work, we would like to extend this evaluation to EMEA sentences and UMLS records, which might contribute to the development of these resources. Finally, the current work focuses on two language pairs. The techniques used in this work are not language dependent and can be extended easily to other languages.
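For reference, the unigram and bigram precision scores mentioned in the Discussion correspond to the clipped (modified) n-gram precisions that BLEU [9] combines. The following Python sketch shows that quantity for a single candidate sentence and a single reference. It is an illustration only: the function names are hypothetical, the example sentences are invented, and BLEU itself aggregates counts over the whole test corpus and applies a brevity penalty, both of which are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of one candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Invented example sentences, tokenized on whitespace.
candidate = "the the the cat".split()
reference = "the cat sat".split()
p1 = modified_precision(candidate, reference, 1)  # 2 of 4 unigrams credited
p2 = modified_precision(candidate, reference, 2)  # 1 of 3 bigrams credited
```

Clipping is what keeps a candidate from scoring well by repeating a frequent word: the three occurrences of "the" above are credited only once because the reference contains it once.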

6 Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.

References

1. Jimeno-Yepes, A., Prieur-Gaston, E., Névéol, A.: Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text. BMC Bioinformatics (in press) (2013)
2. Wu, C., Xia, F., Deleger, L., Solti, I.: Statistical machine translation for biomedical text: are we there yet? In: AMIA Annual Symposium Proceedings. Volume 2011. American Medical Informatics Association (2011)
3. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database Issue) (2004)
4. Tiedemann, J.: News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R., eds.: Recent Advances in Natural Language Processing. Volume V. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria (2009)
5. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics. Volume 45. (2007)
6. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing. Volume 2. (2002)
7. Och, F., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1) (2003)
8. Brown, P., Pietra, V., Pietra, S., Mercer, R.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2) (1993)
9. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics (2002)
10. Eck, M., Vogel, S., Waibel, A.: Improving statistical machine translation in the medical domain using the unified medical language system. In: Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics (2004)
11. Duh, K., Sudoh, K., Tsukada, H.: Analysis of translation model adaptation in statistical machine translation. In: Proceedings of the International Workshop on Spoken Language Translation (IWSLT 10), Paris, France. (2010)
12. Haddow, B., Koehn, P.: Analysing the effect of out-of-domain data on SMT systems. In: Proceedings of the Seventh Workshop on Statistical Machine Translation. (2012)
13. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT Summit. Volume 5. (2005)
