Automatic Machine Translation in Broadcast News Domain


Alexandre Gusmão
L2F/INESC-ID Lisboa, Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
{ajag}@l2f.inesc-id.pt

Abstract. This paper describes the automatic translation system, from Portuguese into English, in the broadcast news domain, developed at L2F, the Laboratório de Língua Falada of INESC-ID. It gives a brief introduction to the state of the art in automatic translation and describes the tools used during the construction of this system and all the experiments carried out. At the end of the paper, a table summarizes these experiments, as well as the evolution of the obtained BLEU scores.

1 Introduction

The purpose of this work is the development of an automatic translation system from Portuguese into English, from speech to text, for broadcast news. Building such a system has various inherent advantages, the main one being that it allows foreign listeners to understand the news better; it also helps people with hearing impairments. Several approaches to building a machine translation system were studied, such as IBM models, syntax-based models and phrase-based models; systems based on statistical methods are the ones predominantly discussed in this paper. For statistical methods to be usable, a significant quantity of parallel text in the relevant subject domain is needed. The lack of parallel texts in the field of broadcast news was one of the major problems to be addressed during this work. This paper presents the experiments carried out towards building a translation system from Portuguese into English in the field of broadcast news, the problems faced and the solutions adopted, as well as the results obtained from every experiment.

2 State of the Art

Various approaches to machine translation have been pursued; initially, the rule-based ones achieved the best results. However, that approach is not considered in detail in this work. Later, through the work of Brown, interest in the study of statistical methods was revived, and those deserve a more detailed study throughout this paper. Rule-based translation systems[1] require plenty of linguistic information, and it is very difficult to write rules that cover the whole language. These systems can be classified as direct, transfer and interlingua systems; Figure 1 illustrates them.

Fig. 1. Vauquois Triangle

Statistical translation systems[2], instead of using rigid linguistic rules, use probability distributions. These systems offer several advantages: probabilities are easy to handle (depending on the task to be performed, they can be added and multiplied); there are algorithms that automatically learn to estimate the probability values, without human intervention; and, contrary to the non-statistical approach, these systems do not require the manual development of linguistic rules.

However, these systems have some disadvantages as well, such as the difficulty of adapting a system to different subject domains and the fact that they do not take into account the syntactic information of the sentences. In the statistical approach, the probability distribution P(F|E) over all possible sentence pairs (F, E) is considered, and the translation with the highest probability is selected, that is,

    Ê = argmax_E P(F|E) · P(E)    (1)

where P(F|E) stands for the translation model and P(E) stands for the fluency (language) model. The translation models assign a probability to each alignment between the input sentence and the output sentence. An alignment[3] is simply a set of connections between the source sentence and the target sentence, where each word of the target sentence is connected to exactly one word of the source sentence. The IBM models[4] are among the models used for this purpose. There are five IBM models: model 1 considers all possible connections between the source sentence and the target sentence, and in model 2 the word order in the sentences also influences the probability value. Models 3, 4 and 5 consider aspects such as word fertility (the number of words of the target sentence connected to a word of the source sentence), the identity (translation) of words in the target language and, finally, the position occupied by each word in the target sentence. As for phrase-based systems[5] (phrases here being word sequences), phrase translation includes operations such as phrase segmentation, translation of each phrase into the target language, and reordering so as to form sentences of that language. These phrase-based systems are trained on aligned parallel texts, employing word-based translation models to align each phrase pair in the corpus at the word level. Due to the extreme difficulty of the search task, it is necessary to use efficient algorithms, such as A* or beam search with dynamic programming.
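Decision rule (1) can be illustrated with a toy rescoring step. The hypotheses and log-probabilities below are invented for illustration; they are not the output of any real translation or language model.

```python
# Toy illustration of the noisy-channel rule Ê = argmax_E P(F|E) · P(E):
# each English hypothesis E carries a translation-model score log P(F|E)
# and a language-model score log P(E); the hypothesis with the best sum wins.
# All numbers here are invented for illustration.
hypotheses = {
    "the minister spoke today": (-4.1, -6.2),   # (log P(F|E), log P(E))
    "the minister speaks today": (-4.6, -5.9),
    "minister the spoke today":  (-4.0, -9.7),  # disfluent: the LM penalizes it
}

def best_translation(hyps):
    # argmax over log P(F|E) + log P(E); logs turn the product into a sum
    return max(hyps, key=lambda e: hyps[e][0] + hyps[e][1])

print(best_translation(hypotheses))  # "the minister spoke today"
```

Note how the third hypothesis has the best translation-model score but is rejected because the fluency model assigns it a very low probability.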
Due to the good results these models have achieved, it was decided to build a phrase-based translation system. Syntax-based systems[6] aim at introducing structural aspects of the language, which is why they employ operations such as word reordering, insertion of additional words and, finally, translation. Figure 2 illustrates an example of these operations.

Fig. 2. Reordering, introduction and translation operations

The evaluation of a translation system can be done by human judges or by automatic metrics; the latter were chosen to determine the quality of the system developed in this work. Some of the best-known automatic evaluation metrics are WER (Word Error Rate)[7], PER (Position-independent word Error Rate)[7], NIST (National Institute of Standards and Technology)[8] and BLEU (Bilingual Evaluation Understudy)[9]. The last of these is the metric mainly used to evaluate the quality of the translation system built here. It is a measure that compares the number of words shared between the candidate sentences and the reference sentences and is based on n-gram matching: instead of checking whether each word of the translated sentence is found in the reference sentence, it checks whether sequences of words (up to 4 words) are found in both sentences.

As for speech translation, a system of this kind is built from two systems in sequence: one for speech recognition into text and another for translating the text into the desired language. At the Laboratório de Língua Falada (L2F) there are several speech recognition systems, namely Audimus, a system for the recognition of Portuguese, also used for broadcast news. Among the existing speech translation systems, the one developed by the European project TC-STAR (http://www.tc-star.org/) deserves mention as the most similar to the system developed in this work, because it includes an extensive vocabulary (more than 60,000 words) and because its subject domain (European Parliament sessions) approximates broadcast news. For the development of a translation system for broadcast news, advantage was taken of the speech recognition system Audimus already mentioned, and a phrase-based translation system was built, adapted as well as possible to the subject domain.
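The n-gram matching idea behind BLEU can be sketched in a few lines. This is a simplified, single-reference version with crude smoothing, not the official scorer used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    # Modified n-gram precision for n = 1..4, clipped by reference counts,
    # combined geometrically and scaled by a brevity penalty, as in BLEU.
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, len(cand) - n + 1)
        log_prec += math.log(clipped / total or 1e-9) / max_n  # smooth zeros
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(log_prec)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

For identical candidate and reference sentences the score is 1.0; any missing n-gram or length mismatch lowers it.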

3 The Translation System

This chapter describes the main stages of the translation system used in this work, namely the creation of the language model, the standardization of the training corpus, the training of the system, phrase-table filtering, tuning and the evaluation of the system. The language model, frequently used in speech recognition and translation, tries to predict the next word in a sequence of words. When a sentence is given to the translator, several possible translations of that sentence are generated, and the job of the language model is to assign a weight to each of these translations. As for the standardization of the training corpus, this task is necessary so that the corpus vocabulary used in training the system is compatible with the vocabulary coming from the speech recognizer. Typical standardization tasks are the expansion of abbreviations to their full form, as well as of Roman numerals, decimal numbers, dates, currency symbols, etc. After the building of the language model and the standardization of the training corpus comes the training of the system, through, for example, the Moses tool. Alignments are obtained between the words of the two languages and probabilities are assigned to each of these alignments. All these alignments are kept in a phrase-table. Since the vocabulary used to train the system is based on the European Parliament sessions, it is necessary to filter the phrase-table so that it contains only words belonging to the broadcast news domain. In this way the system becomes faster and focused on the broadcast news area. In a later stage, system tuning is carried out: different values of the feature weights used by the Moses tool are combined over successive translation iterations until the obtained BLEU value converges to a final value.
This is one of the stages of the building process that takes the most time, and also one of the most important. To perform this task it is essential to use a development corpus based exclusively on the subject domain of the task in question, in this case broadcast news. Finally, after the translation system is built, it must be evaluated. For this, automatic evaluation metrics are used, for example BLEU: the higher its value, the better the quality of the translations produced by the system. Some specific tools were used to build the translation system; they are discussed below, together with their purpose.
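The corpus standardization step described above can be sketched as a small substitution pass. The abbreviation and digit tables here are tiny illustrative samples, not the actual mappings used in the system:

```python
import re

# Toy standardization pass: lowercase, expand a few abbreviations and spell
# out isolated digits, so that the training vocabulary matches the output of
# the speech recognizer. All mapping entries are illustrative samples.
ABBREV = {"dr.": "doutor", "sr.": "senhor", "etc.": "etcetera"}
DIGITS = {"0": "zero", "1": "um", "2": "dois", "3": "três"}

def normalize(text):
    text = text.lower()
    for abbr, full in ABBREV.items():
        text = text.replace(abbr, full)
    # spell out digits that stand alone as words
    return re.sub(r"\b(\d)\b", lambda m: DIGITS.get(m.group(1), m.group(1)), text)

print(normalize("O Dr. Silva chega às 2 horas"))  # o doutor silva chega às dois horas
```

A production pass would also handle Roman numerals, dates, decimal numbers and currency symbols, as listed above.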

SRILM[10] is a tool for creating and applying statistical language models, mainly for use in speech recognition; it allows both the training and the testing of a statistical language model. SRILM consists of a set of C++ libraries and a set of scripts that facilitate its tasks. To train the system, the Moses tool[11] was used. Moses is a statistical machine translation system that allows the training of translation models for any language pair in an automated way. It has a set of components for training language models, training translation models, tuning the system and evaluating translations. One of the main components of Moses is GIZA++[12], the tool used to train the translation model and obtain the alignments referred to previously. Concerning the corpora used in building the translation system, this was the most problematic aspect. If the translator were based on the European Parliament sessions, there would be no major problems, since these sessions are translated into various languages and therefore a more than sufficient quantity of parallel text exists to train a translation system in that context. However, the planned translation system targets broadcast news, and a survey of the corpora in that field showed that many of its words have no translation in the European Parliament corpora. Thus, the lack of corpora in the broadcast news domain to train the system was one of the main problems to face. The adopted solution was to train the system on the European Parliament corpus, while the development and test sets were based on broadcast news.
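The phrase-table filtering step can be sketched as follows. Moses stores phrase pairs as "source ||| target ||| scores" text lines; the entries and vocabulary below are invented examples:

```python
# Sketch of the phrase-table filtering step: keep only entries whose source
# phrase is fully covered by the in-domain (broadcast news) word list, which
# shrinks the table and speeds up decoding. Entries and vocabulary are invented.

def filter_phrase_table(lines, domain_vocab):
    kept = []
    for line in lines:
        source_phrase = line.split("|||")[0].strip()
        if all(word in domain_vocab for word in source_phrase.split()):
            kept.append(line)
    return kept

table = [
    "o presidente ||| the president ||| 0.7",
    "a sessão plenária ||| the plenary session ||| 0.4",
]
vocab = {"o", "a", "presidente", "governo"}
print(filter_phrase_table(table, vocab))  # only the first entry survives
```

The same idea underlies Moses' own filtering scripts, applied here with a word list drawn from the broadcast news development corpus.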
The corpora were built from the Euronews website (http://www.euronews.net), from which a total of 914 sentences was obtained; 457 are written in Portuguese and the other half are the English translations of those sentences.

4 Text Translation and Evaluation

This chapter covers the various experiments conducted in creating a text-to-text translator for the Portuguese-English language pair in the broadcast news domain. The first translator developed was based on the European Parliament sessions; this system obtained a BLEU score of 0.3531 (vocabulary with tokens and in lower case, Condition A) and 0.3445 under the conditions of the Europarl system (Condition B). In this system the phrase-table was then filtered so as to contain only words from the broadcast news domain. Due to this filtering, and also to the fact that the test corpus used to evaluate the system was from the European Parliament domain, the BLEU score obtained was

of 0.2699 under Condition A and 0.2643 under Condition B. The next step was the definition of a baseline system. First, standardization of the training corpus was performed, and the phrase-table formed during model training was filtered so that it contained only phrases whose words occur in the word list of the development corpus. The training corpus of the system continued to belong to the European Parliament domain, while for the language model and the development and test sets, corpora based on the broadcast news domain were used. In the end, the baseline system obtained a BLEU score of 0.1705 under Condition A and 0.1650 under Condition B.

4.1 Experiment 1

As a first change with respect to the baseline system, it was decided to revise the development and test corpora, since many of the sentence pairs were merely comparable rather than direct translations. Only a few adjustments were made to the Portuguese sentences and, in the end, the same system, with these corrected corpora, obtained a BLEU score of 0.4776 under Condition A and 0.4722 under Condition B.

4.2 Experiment 2

In this experiment a language model was built by interpolation. Two corpora from different domains were used, one of broadcast news and the other of newspaper text. The perplexity metric was used to evaluate the resulting language model. For broadcast news, a model with 4,825,719 words was built, with a perplexity of 154.453. For the newspapers, a model with 14,462,901 words was built, with a perplexity of 132.607. After interpolating both models, a final language model with 16,407,968 words and a perplexity of 112.894 was obtained.
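Linear interpolation of language models can be illustrated with toy unigram models. The counts, the 0.5 weight and the held-out tokens below are invented; a real setup would use SRILM's interpolation over full n-gram models:

```python
import math

def unigram(counts):
    # Maximum-likelihood unigram model from raw counts
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

bn = unigram({"notícias": 6, "hoje": 3, "governo": 1})    # broadcast news model
news = unigram({"jornal": 5, "governo": 4, "hoje": 1})    # newspaper model

def perplexity(tokens, lam):
    # Linear interpolation: P(w) = lam * P_bn(w) + (1 - lam) * P_news(w),
    # with a tiny floor for unseen words
    logp = sum(math.log(lam * bn.get(w, 1e-6) + (1 - lam) * news.get(w, 1e-6))
               for w in tokens)
    return math.exp(-logp / len(tokens))

held_out = ["hoje", "governo", "hoje"]
print(round(perplexity(held_out, 0.5), 2))
```

On mixed-domain held-out text the interpolated model achieves a lower perplexity than either component alone, which is exactly the effect exploited in this experiment.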
4.3 Experiment 3

In this experiment, the interpolated language model was used in the translation system developed so far, which then obtained a BLEU score of 0.4861 under Condition A and 0.4799 under Condition B.

4.4 Experiment 4

In all experiments it was confirmed that, when recasing of the sentences produced by the decoder was performed, the BLEU score decreased slightly. It was therefore decided to refine the way this tool was trained, joining two corpora of different domains, one connected to the broadcast news and the other

to the newspapers. Since newspaper texts can only approximate broadcast news transcriptions, without belonging to that domain, the BLEU score did not increase; on the contrary, it decreased to 0.4790. One more experiment was then made, in an attempt to find a solution with positive results: the translation system was trained with a training corpus in which not all English words were in lower case, so that the system would learn to capitalize the appropriate words. In the end, the BLEU score was not satisfactory, reaching only 0.4071.

4.5 Experiment 5

As a last experiment to improve the BLEU score, some post-translation processing was used: a set of 1000 hypotheses for each sentence, produced by the translation system, is re-evaluated using some new features. The new features are the following: the difference in the number of words between the Portuguese sentence and the English hypothesis; and POS (part-of-speech) information, using correspondence rules between the Portuguese-English language pair and penalization patterns in English. In regard to the word-count difference feature, several experiments were made combining the number of words of the Portuguese sentence, the number of words of the English sentence and the difference between them. All combinations gave satisfactory BLEU scores, but the difference in the number of words stood out, with the system obtaining a BLEU score of 0.5055. As for the POS feature, two concepts are involved. The first is the calculation of similarities between the POS tags in both languages: the matching tags between the two languages are counted and a score is assigned to each sentence according to the number of equivalences found between them.
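The reranking of the n-best list can be sketched with the word-count difference feature alone. The hypotheses, scores and penalty weight below are invented for illustration:

```python
# Sketch of n-best reranking with the word-count difference feature: each
# hypothesis' base score is penalized by how far its length strays from the
# source sentence length. Scores and the weight 0.5 are invented.

def rerank(source, nbest, penalty_weight=0.5):
    src_len = len(source.split())
    def score(item):
        hypothesis, base_score = item
        length_diff = abs(len(hypothesis.split()) - src_len)
        return base_score - penalty_weight * length_diff
    return max(nbest, key=score)[0]

source = "o ministro anunciou hoje as medidas"
nbest = [
    ("the minister announced the measures today", -3.0),
    ("the minister announced today", -2.8),  # better base score but too short
]
print(rerank(source, nbest))  # the longer hypothesis wins after the penalty
```

A POS-based feature would be added the same way, as one more term in the scoring function.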
The other concept is the use of penalization patterns: whenever the system finds a pattern classified as a penalization pattern in the English sentence, a penalty is assigned to the sentence in question. With the contribution of the POS feature, the translation system obtained a BLEU score of 0.4967. In the end, using both the POS feature and the word-count difference between the Portuguese and the English sentences, the system obtained a final BLEU score of 0.5088.

4.6 Point of Comparison

In order to compare the created translation system with another one, it was decided to translate all sentences of the test corpus through the translation system

provided by the Google search engine. After this task was performed, the BLEU score obtained by this engine was 0.4102, well below the value obtained by the system created in this work.

5 Conclusion and Future Work

To improve translation quality in future work, some approaches can be explored concerning OOVs (out-of-vocabulary words). For example, a dictionary could be used in which all such words and their translations are inserted; in this case, about 14,521 Portuguese words and 11,036 English words do not exist in the European Parliament corpus, making it impractical to transcribe all these words and their translations manually. Another option is the website http://www.verbix.com, from which it is possible to obtain all inflected forms of the verbs contained in the training corpus. Yet another alternative is to copy certain words that appear both in the Portuguese and in the English training corpus, that is, words that have no translation, such as proper names. Among the conclusions drawn from this work, the following stand out, namely the use of: a training corpus belonging to the target domain (in this case it was not possible to use a corpus from the broadcast news domain itself); a language model interpolated with another one from a similar context (here, a model built from newspaper texts); a clean development corpus, that is, one with correctly translated sentences; and some post-translation processing, using the features described in the previous chapter, which can maximize the system's BLEU score by choosing the best sentence among the N possibilities. The combination of all these elements resulted in an automatic translation system for the broadcast news domain with a BLEU score of 0.5088.
The corpus used to train the translation system was always based on the European Parliament sessions, since there are not sufficient resources available for the broadcast news domain. For the corpus used to build the language model, an interpolation between two corpora of different domains was carried out, one of broadcast news[13] and the other of newspapers[14]. The development and test corpora were always based on the broadcast news domain; however, these corpora underwent some corrections so that the system could produce translations of better quality and consequently obtain a better BLEU score. Table 1 lists the types of corpora used and their descriptions for the broadcast news domain.

Type of corpus     | Description
Language model     | Set of sentences, based on broadcast news and newspapers, written only in the target language.
Training corpus    | Parallel corpus, based on the European Parliament.
Development corpus | Parallel corpus, based on broadcast news.
Test corpus        | Parallel corpus, based on broadcast news.

Table 1. Corpora and descriptions.

For a better understanding of all the experiments carried out, Table 2 describes them briefly, together with the respective BLEU scores obtained.

Experiment   | Description                                            | BLEU (Condition A) | BLEU (Condition B)
Experiment 1 | Baseline system; language model based on broadcast     | 0.4776             | 0.4722
             | news; training corpus based on the European
             | Parliament; tuning and test corpora corrected
Experiment 2 | Language model interpolation                           | -                  | -
Experiment 3 | Translation system with an interpolated language model | 0.4861             | 0.4799
Experiment 4 | Enhancement of the training corpus of the recase       | 0.4799             | 0.4790
             | system with newspaper texts
             | Automatic capitalization system                        | -                  | 0.4071
Experiment 5 | Reprocessing of the obtained translations              | -                  | 0.5088
             | (new features)

Table 2. Experiments.

References

1. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall (2000)
2. Ney, H.: One decade of statistical machine translation: 1996-2005. In: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen, Germany (2005)
3. Knight, K.: Translation with finite-state devices (2006)
4. Brown, P.: The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (1993) 263-311

5. Koehn, P.: Introduction to statistical machine translation (2005)
6. Marcu, D.: SPMT: statistical machine translation with syntactified target language phrases. Language Weaver Inc., Marina del Rey, CA (2006)
7. Ueffing, N., Ney, H.: Bayes decision rules and confidence measures for statistical machine translation. Computer Science Department, RWTH Aachen, Germany (2004)
8. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics (2002)
9. Papineni, K.: BLEU: a method for automatic evaluation of machine translation. IBM Research Report (2001)
10. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. International Conference on Spoken Language Processing, Volume 2, Denver, CO (September 2002) 901-904
11. Koehn, P., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume (Demo and Poster Sessions), Prague, Czech Republic (June 2007) 177-180
12. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003) 19-51
13. MacIntyre, R.: LDC catalog number LDC98T31 (1998)
14. Graff, D.: LDC catalog number LDC95T21, ISBN 1-58563-053-5 (1995)