Improvement Issues in English-Thai Speech Translation

Chai Wutiwiwatchai, Thepchai Supnithi, Peerachet Porkaew, Nattanun Thatphithakkul
Human Language Technology Laboratory, National Electronics and Computer Technology Center
112 Pahonyothin Rd., Klong-luang, Pathumthani, 12120 Thailand
{chai.wut, thepchai.sup, peerachet.por, nattanun.tha}@nectec.or.th

Abstract

The first English-Thai speech translation web service has been developed at NECTEC, Thailand. The service is based on the STML standard protocol provided under the ASTAR consortium. In parallel, several research efforts have been conducted to improve system performance, including an exploration of better Thai word segmentation algorithms, translation modeling using word reordering within noun phrases, and Thai named entity detection using an existing English named entity tagging tool. This article reports the status of the speech translation web service development, summarizes these research efforts, and discusses our research path.

1 Introduction

Speech translation is an innovative technology that allows people to converse with each other while speaking their own languages. This technology is useful in present-day society, where communication takes place among people from most regions of the world. To achieve effective systems, international collaborative efforts have been established. The Negotiating through spoken language in e-commerce (NESPOLE!) project (Taddei et al., 2002) brought together a number of European countries to develop an e-commerce service using multilingual speech translation systems. Recently, the Asian Speech Translation Advanced Research (ASTAR) consortium 1 was set up as a collaboration among many Asian countries to jointly develop infrastructure for building web-service-based speech translation systems.
In Thailand, the National Electronics and Computer Technology Center (NECTEC) has been investigating this research and development area since 2007 through the ASTAR consortium. The aim is to build the components and resources required for English-Thai speech translation, focusing on the travel domain. A stand-alone prototype, in which the major components, namely automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS), were simply connected, was shown in 2007 (Wutiwiwatchai, 2007). At present, the engine has been ported to a web service via an infrastructure provided under the consortium. However, a number of tasks remain to complete and improve the system, such as enlarging the training resources, improving the statistical machine translator (SMT) with language-specific knowledge, treating named entities, handling non-native English speakers, and improving the quality and variety of synthesized speech.

1 http://www.slc.atr.jp/astar/

This article aims to summarize our current focus in improving the baseline system. In parallel with the web service implementation, an English-Thai parallel text corpus was prepared for improving the ASR language models and the SMT translation models. As the Thai language is written without word boundary markers, several word segmentation algorithms have been investigated as a basic tool for several development and implementation steps, e.g. text preparation for language modeling and text pre-processing for SMT and TTS. A combinatory categorial grammar (CCG) based parser has been introduced for Thai text analysis (Boonkwan and Supnithi, 2008). The CCG parser is used to delimit noun phrases (NPs), whose internal word order can then be additionally constrained in the SMT process. In the travel domain, named entities (NEs), especially place and organization names, occur frequently.
To accelerate the annotation process for parallel corpus development, an automatic NE tagging tool available for English is used to help tag Thai NEs automatically.
The next section gives an overview of our English-Thai speech translation web service, followed by the improvement issues mentioned above. Section 4 discusses and concludes this article.

2 English-Thai Speech Translation Web Service

Figure 1 illustrates the overall structure of our English-Thai speech translation web service.

Figure 1. The architecture of the English-Thai speech translation web service (components shown: English/Thai ASR, English-Thai SMT, English/Thai TTS, and the STML web service center)

Table 1. Summary of basic tools and configuration used in each service component

Component | Tool | Configuration
English ASR | SPHINX 2 | CMU acoustic model and n-gram language model trained by 100K English sentences from BTEC
Thai ASR | ISPEECH | Acoustic model trained by LOTUS and NECTEC-ATR and n-gram language model trained by 100K Thai sentences from BTEC
English-Thai SMT | MOSES | Translation and language models trained by 160K sentence pairs from BTEC and dictionaries
English TTS | Microsoft SAPI 3 | Embedded Speech API provided under Microsoft Windows
Thai TTS | VAJA | Variable-length unit selection on 14-hour female speech corpus

2 CMU SPHINX, http://cmusphinx.sourceforge.net/
3 Microsoft SAPI, http://www.microsoft.com/speech

The web service follows a standard protocol provided under the ASTAR consortium called Speech Translation Marked-up Language (STML version 1.0). As shown in Figure 1, the STML web service center manages data exchange among clients and the requested services. Currently, the services available include English/Thai ASR, English-to-Thai and Thai-to-English SMT, and English/Thai TTS 4. Table 1 summarizes the basic tools and configuration of each component.

The English ASR is adopted from the SPHINX speech recognition engine provided by Carnegie Mellon University, which also provides a US English acoustic model. The language model is an n-gram model trained on a combination of two text sources: one generated from a handcrafted regular grammar and the other taken from the English side of the Basic Travel Expressions Corpus (BTEC) provided by ATR (NICT), Japan.
The Thai ASR consists of the ISPEECH Thai speech recognizer, built at NECTEC on an architecture similar to that of the JULIUS 5 speech recognizer from Kyoto University. ISPEECH uses a general Thai acoustic model trained on a number of training sets, e.g. the LOTUS corpus (Kasuriya et al., 2002) and the NECTEC-ATR corpus (Kasuriya et al., 2003), with multiple noise signals added for the sake of robustness. The Thai language model is trained on the Thai side (the Thai translations) of the BTEC corpus.

The MOSES tool (Koehn et al., 2007) is used to build the English-Thai SMT engine. It is trained on a combined set of two parallel text corpora: more than 100,000 sample sentence pairs from Thai-English dictionaries and 100,000 sentence pairs from the English-Thai BTEC corpus. Since Thai script has no word boundaries, automatic word segmentation is applied before the conventional SMT process. The same word segmentation tool must also be applied to the Thai part of the training parallel text so that the automatic word alignment tool available in the SMT builder can be run as usual.

VAJA is a general Thai TTS engine based on variable-length unit selection with a 14-hour female speech corpus. English words written in the Roman alphabet can be synthesized by using a look-up dictionary that maps English words to Thai pronunciations. For English TTS, the free TTS engine provided under the Microsoft Speech API (SAPI) is adopted.

4 At the time of writing this article, the Thai ASR and TTS are under construction.
5 JULIUS, http://julius.sourceforge.jp/en_index.php
All the above components have been developed in a client/server framework, which allows clients to access the engines via the STML-based web service format.

3 Improvement Issues

The speech translation web service is considered a baseline system in which most components are built on conventional tools. Many issues remain to be improved. This section describes three improvement issues that were recently investigated: word segmentation, word reordering in noun phrases (NP), and named entity (NE) tagging.

3.1 Thai word segmentation

A common problem in languages without word boundary markers is segmenting text into word sequences. The difficulty of Thai language processing comes not only from the missing syllable, word, and sentence boundary markers, but also from the ambiguity that arises when the script is mixed with loan words and named entities written in Thai. Word segmentation in Thai has been widely researched for decades. However, the problem is still far from solved, owing to the lack of large training data and standard evaluation sets. Recently, a project named Benchmarks for Enhancing the Standard of Thai language processing (BEST) 6 has been initiated. As a result, a guideline for Thai word segmentation as well as an official large annotated text corpus for training and evaluation is now available.

NECTEC's traditional word segmentation tool is based on word/word-class n-grams, where the part-of-speech of a word represents its word class. Recently, three new algorithms have been evaluated using the ORCHID corpus (Charoenporn et al., 1997) and the BEST corpus. The first and second algorithms are based on support vector machines (SVM) and conditional random fields (CRF), which have proven effective for many tasks. These algorithms decide at every character whether a word boundary should be placed right after that character.
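As a rough illustration of this character-level framing, the sketch below converts a segmented word sequence into per-character boundary labels (the form a classifier such as an SVM or CRF would be trained to predict) and converts predicted labels back into words. The function names and the 0/1 label encoding are assumptions made for illustration; they are not taken from the system described here, and no classifier is included.

```python
def words_to_boundary_labels(words):
    """Turn a list of words into (characters, labels).

    labels[i] is 1 if a word boundary follows characters[i], else 0.
    The 0/1 encoding is an assumption made for this sketch.
    """
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == len(word) - 1 else 0)
    return chars, labels


def labels_to_words(chars, labels):
    """Rebuild words from characters and per-character boundary labels."""
    words, current = [], []
    for ch, label in zip(chars, labels):
        current.append(ch)
        if label == 1:
            words.append("".join(current))
            current = []
    if current:  # trailing characters with no closing boundary
        words.append("".join(current))
    return words


# Example: the two Thai words "ไป" and "ตลาด" are 2 and 4 code points long.
chars, labels = words_to_boundary_labels(["ไป", "ตลาด"])
# labels == [0, 1, 0, 0, 0, 1]
restored = labels_to_words(chars, labels)
# restored == ["ไป", "ตลาด"]
```

A trained tagger would replace the gold labels with predicted ones; the decoding step in `labels_to_words` stays the same.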
Character-level features, such as the character type and the possibility of the character occurring at the beginning or the end of a word, are potentially useful for word boundary prediction. Another innovative algorithm applies the SMT decoder to word segmentation. This is done by simply viewing a string of single characters as the source language and the string of corresponding word units as the target language, so that a phrase-based SMT decoder can be adopted.

6 BEST project, http://www.hlt.nectec.or.th/best

Table 2 shows the comparative results of the three algorithms: SMT, SVM, and CRF. Currently, the SMT decoder is used in the SMT preprocessor of our speech translation web service because of its simplicity. The CRF-based approach is now integrated in the VAJA speech synthesizer and will be integrated in the SMT pre-processing step in the near future.

Table 2. Comparative results of three Thai word segmentation algorithms

Algorithm | F-measure
SMT | 75.5
SVM | 90.7
CRF | 95.4

3.2 Word re-ordering in noun phrases

The state-of-the-art phrase-based SMT engine, trained on the large parallel text corpus mentioned in Section 2, achieved a BLEU score as high as 40.1% in 10-fold cross validation. Although this conventional system has yielded a promising result, its limitations lie in the large training database required and the difficulty of capturing complex translation phenomena such as word reordering. Word order differences in English-Thai translation occur frequently within noun phrases (NP). Figure 2 shows an example of word alignment in a sentence pair. In building English-Thai SMT with the phrase-based approach, the difference in NP word order between the two languages becomes one of the major issues to be solved. To overcome the problem, we introduced a set of word reordering rules within NPs, as shown in Table 3, where "drop" means deletion.
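The rules in Table 3 can be read as simple rewrites over the tagged tokens of an NP. The following is a minimal sketch under several assumptions: the tag names (`ADJ`, `DET`, `ART`, `NOUN`), the restriction to two-token NPs, and the representation of NP spans as index pairs are all hypothetical, and NP detection itself (done by the CCG parser in this work) is taken as given.

```python
# Rewrite rules keyed by the tag pattern of a two-token English NP.
# Tag names are hypothetical; the three patterns follow Table 3.
RULES = {
    ("ADJ", "NOUN"): lambda first, noun: [noun, first],  # Adjective Noun -> Noun Adjective
    ("DET", "NOUN"): lambda first, noun: [noun, first],  # Determiner Noun -> Noun Determiner
    ("ART", "NOUN"): lambda first, noun: [noun],         # Article Noun -> (drop) Noun
}


def reorder_np(np_tokens):
    """Reorder one NP given as a list of (word, tag) pairs.

    NPs that match none of the patterns are returned unchanged.
    """
    pattern = tuple(tag for _, tag in np_tokens)
    rule = RULES.get(pattern)
    if rule is None:
        return list(np_tokens)
    return rule(*np_tokens)


def reorder_sentence(tokens, np_spans):
    """Apply reorder_np to each NP span in a sentence.

    np_spans holds non-overlapping (start, end) index pairs, end exclusive.
    """
    output, position = [], 0
    for start, end in sorted(np_spans):
        output.extend(tokens[position:start])
        output.extend(reorder_np(tokens[start:end]))
        position = end
    output.extend(tokens[position:])
    return output


sentence = [("I", "PRON"), ("want", "VERB"), ("a", "ART"), ("ticket", "NOUN")]
reordered = reorder_sentence(sentence, [(2, 4)])
# reordered == [("I", "PRON"), ("want", "VERB"), ("ticket", "NOUN")]
```

The same function would be applied to both the training and the test English sentences, so that the translation model always sees the reordered word order.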
The English sentence in each training sentence pair was first processed by our recently proposed combinatory categorial grammar (CCG) based syntactic parser to detect NPs. Words within each NP were reordered using the above rules, and the resulting English sentences together with their associated Thai sentences were used to train the phrase-based SMT.

To evaluate the proposed model, a training set containing approximately 160,000 sentence pairs, mostly from the English-Thai BTEC corpus, was used to train the SMT. A set of 2,310 sentences randomly selected from other sources was prepared for testing. The test sentences were also parsed and reordered before being translated by the trained translation model. An objective evaluation showed that the proposed model achieved a 0.4 BLEU score improvement over the conventional model without word reordering. A subjective test revealed that, with the proposed model, 75% of the test sentences were judged to have relatively higher translation quality, whereas only 13% were considered worse.

Figure 2. An example of word alignment in an English-Thai sentence pair

Table 3. Reordering rules within the NP

English NP structure | Thai reordering
Adjective Noun | Noun Adjective
Determiner Noun | Noun Determiner
Article Noun | (drop) Noun

3.3 Named entity tagging

With the parallel corpus developed, another important issue is building an automatic named entity (NE) tagging tool. NE words such as person, location, and organization names generally form an open set, to which new words are often added. A common way to handle such NE words is to assign them to a word class, e.g. PER denotes the class of person names. Any new name can then be added to the system's dictionary with only a small effect on the class-based language model. NE words can also be treated specially in the translation process. For example, transliteration can be acceptable for translating NE words that are not found in the system's dictionary. In our speech translation implementation, automatic NE tagging is helpful for annotating NE words in the parallel text corpus. Some English NE taggers, such as the Stanford named entity recognizer 7, are applicable to the English-to-Thai translation side. However, at present, there is no NE tagger available for Thai.

7 Stanford named entity recognizer, http://nlp.stanford.edu/software/crf-ner.shtml

The problem thus becomes how to tag Thai NE words in the parallel corpus, given English NE words tagged automatically by an English NE tagging tool. The first solution is that, for an English sentence containing NEs, all words in the associated Thai sentence are romanized automatically using our romanization tool (Tarsaku et al., 2001).
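As a rough illustration of this romanize-and-match idea, the sketch below tags a Thai word with the class of an English NE whenever the romanized Thai word equals the English NE word. Everything here is an assumption for illustration: the romanizer is passed in as a plain function (a hand-written lookup table stands in for the actual romanization tool), the class label `LOC` is hypothetical, and case-insensitive exact matching is only one possible reading of "matched".

```python
def tag_thai_ne(english_nes, thai_words, romanize):
    """Tag Thai words whose romanized form equals an English NE word.

    english_nes: list of (english_word, ne_class) pairs from an English tagger.
    thai_words:  list of segmented Thai words from the paired sentence.
    romanize:    function mapping a Thai word to a Roman-alphabet string.
    Returns a list of (thai_word, ne_class) pairs.
    """
    lookup = {word.lower(): ne_class for word, ne_class in english_nes}
    tagged = []
    for thai_word in thai_words:
        roman = romanize(thai_word).lower()
        if roman in lookup:
            tagged.append((thai_word, lookup[roman]))
    return tagged


# Toy stand-in for a romanization tool. The spellings are hand-picked so that
# the match succeeds; a real romanizer may produce different spellings.
TOY_ROMANIZATION = {"ผม": "phom", "จะ": "cha", "ไป": "pai", "โตเกียว": "tokyo"}


def toy_romanize(word):
    return TOY_ROMANIZATION.get(word, "")


result = tag_thai_ne([("Tokyo", "LOC")], ["ผม", "จะ", "ไป", "โตเกียว"], toy_romanize)
# result == [("โตเกียว", "LOC")]
```

Because romanized spellings rarely coincide exactly with English spellings, a practical matcher would probably need some tolerance (for example, an edit-distance threshold); that choice is not specified in the text and is left open here.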
A Thai NE word can then be detected if its romanized form matches the English NE word. Figure 3 illustrates the overall process. Note that this idea relies on the fact that the parallel corpus was created by translating English sentences, so that most NE words were transliterated into Thai script.

Figure 3. The process of Thai NE tagging by matching between romanized Thai words and automatically tagged English NE words (flow: English sentences, English NE tagging, English NE words; Thai sentences, word segmentation and romanization, romanized Thai sentences; NE word matching; recognized Thai NE words)

Table 4. Analysis and detection results of NE tagging using matching of transliterated words

Aspect | Result
Total no. of test sentences | 1,000
Total no. of English NE words | 116
No. of transliterated NE words | 107
English NE tagging F-measure, person name | 88%
English NE tagging F-measure, location name | 77%
English NE tagging F-measure, organization name | 100%
Thai NE tagging F-measure, person name | 83%
Thai NE tagging F-measure, location name | 77%
Thai NE tagging F-measure, organization name | 100%

Table 4 presents some analyses and results from a preliminary test on a part of the English-Thai BTEC corpus. This test shows that as many as 87.9% of English NE words are translated into Thai by transliteration. Therefore, as shown in the last two rows of the table, tagging Thai NE words by matching the romanized version of Thai words to the detected English NE words is quite promising. Though a minority, the other Thai NE words, which are not transliterated from their associated English ones, can be detected by other approaches. One possible approach is to search over the word-alignment result produced during SMT training.

4 Conclusion and Future Aspects

In this article, our progress in developing the English-Thai speech translation web service was described. The system is currently available for translating from English to Thai in the travel domain, while translation from Thai back to English will be ready in the near future. This baseline system has opened up a number of research issues for system improvement. The already developed English-Thai BTEC corpus has been a good source for further studies, as summarized in this article. It should be noted that, apart from the research issues mentioned, other work related to this project has also been conducted, e.g. the development of bilingual TTS to support synthesizing Thai text mixed with English words, and the handling of out-of-vocabulary (OOV) words in Thai speech recognition. These research efforts can also contribute to improving the overall performance of the speech translation system.

Finally, the use of the ASTAR standard protocol for speech translation web services, STML, allows all members of the consortium to share their services across countries. Our future plan is thus to include other languages in our speech translation service, such as Indonesian, Malay, Vietnamese, Chinese, Hindi, Korean, and Japanese.

Acknowledgments

The authors would like to thank ATR and NICT for supporting the development of the English-Thai BTEC corpus, which has become a fruitful resource for our speech translation project.

References

Boonkwan, P., Supnithi, T. 2008. Memory-inductive categorial grammar: an approach to gap resolution in analytic-language translation. Proc. International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad.

Charoenporn, T., Sornlertlamvanich, V., Isahara, H. 1997. Building a large Thai text corpus part-of-speech tagged corpus: ORCHID. Proc. Natural Language Processing Pacific Rim Symposium (NLPRS).

Kasuriya, S., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2002. Thai speech recognition by acoustic models mapped from Japanese. Proc. Joint International Conference of SNLP and O-COCOSDA, pp. 211-216.

Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2003. NECTEC-ATR Thai speech corpus. Proc. International Conference on Speech Databases and Assessments (O-COCOSDA 2003).

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL 2007), Demonstration session, Prague, Czech Republic.

Taddei, L., Costantini, E., Lavie, A. 2002. The NESPOLE! Multimodal Interface for Cross-lingual Communication - Experience and Lessons Learned. Proc. International Conference on Multimodal Interfaces (ICMI 2002), Pittsburgh, USA, pp. 14-16.

Tarsaku, P., Sornlertlamvanich, V., Thongprasirt, R. 2001. Thai grapheme-to-phoneme using probabilistic GLR parser. Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1057-1060.

Wutiwiwatchai, C. 2007. Toward Thai-English speech translation. Proc. International Symposium on Universal Communications (ISUC 2007), Japan, pp. 225-227.