Improvement Issues in English-Thai Speech Translation


Chai Wutiwiwatchai, Thepchai Supnithi, Peerachet Porkaew, Nattanun Thatphithakkul
Human Language Technology Laboratory, National Electronics and Computer Technology Center
112 Pahonyothin Rd., Klong-luang, Pathumthani, 12120 Thailand
{chai.wut, thepchai.sup, peerachet.por, nattanun.tha}@nectec.or.th

Abstract

The first English-Thai speech translation web service has been developed at NECTEC, Thailand. The service is based on the STML standard protocol provided under the ASTAR consortium. In parallel, several research works have been conducted to improve the system performance, including an exploration of better Thai word segmentation algorithms, translation modeling using word reordering within noun phrases, and Thai named entity detection using an existing English named entity tagging tool. This article reports the status of the speech translation web service development, summarizes these research works, and discusses our research path.

1 Introduction

Speech translation is an innovative technology that allows people to converse with each other while speaking in their own languages. This technology is useful in present-day society, where communication takes place among people from most regions of the world. To achieve effective systems, international collaborative efforts have been established. The Negotiating through Spoken Language in E-commerce (NESPOLE!) project (Taddei et al., 2002) gathered a number of European countries to develop an e-commerce service using multilingual speech translation systems. Recently, the Asian Speech Translation Advanced Research (ASTAR) consortium [1] was set up as a collaboration among many Asian countries to jointly develop infrastructure for building web-service-based speech translation systems.
[1] http://www.slc.atr.jp/astar/

In Thailand, the National Electronics and Computer Technology Center (NECTEC) has been investigating this research and development area since 2007 through the ASTAR consortium. The aim is to build the components and resources required for English-Thai speech translation, focusing on the travel domain. A stand-alone prototype in which the major components, automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS), were simply connected was shown in 2007 (Wutiwiwatchai, 2007). At present, the engine has been ported to be available as a web service via an infrastructure provided under the consortium. However, a number of tasks remain to complete and improve the system, such as enlarging the training resources, improving the statistical machine translator (SMT) with language-specific knowledge, treating named entities, handling non-native speakers, and improving the quality and variety of synthesized speech.

This article summarizes our current focus in improving the baseline system. In parallel to the web service implementation, an English-Thai parallel text corpus was prepared for improving ASR language models and SMT translation models. As the Thai language is written without word boundary markers, several word segmentation algorithms have been investigated as a basic tool for some development and implementation steps, e.g. text preparation for language modeling and text pre-processing for SMT and TTS. A combinatory categorial grammar (CCG) based parser has been introduced to Thai text analysis (Boonkwan and Supnithi, 2008). The CCG parser is useful for delimiting noun phrases (NPs), within which the word order can be additionally constrained in the SMT process. In the travel domain, named entities (NEs), especially place and organization names, occur frequently.
To accelerate the annotation process for parallel corpus development, an automatic NE tagging tool available for English is helpful for the automatic tagging of Thai NEs.

The next section gives an overview of our English-Thai speech translation web service, followed by the improvement issues mentioned above. Section 4 discusses and concludes this article.

2 English-Thai Speech Translation Web Service

Figure 1 illustrates the overall structure of our English-Thai speech translation web service.

[Figure 1. The architecture of the English-Thai speech translation web service. Components shown: STML web service center, English/Thai ASR, English-Thai SMT, English/Thai TTS.]

Table 1. Summary of basic tools and configuration used in each service component

Component | Tool | Configuration
English ASR | SPHINX [2] | CMU acoustic model and n-gram language model trained on 100K English sentences from BTEC
Thai ASR | ISPEECH | Acoustic model trained on LOTUS and NECTEC-ATR, and n-gram language model trained on 100K Thai sentences from BTEC
English-Thai SMT | MOSES | Translation and language models trained on 160K sentence pairs from BTEC and dictionaries
English TTS | Microsoft SAPI [3] | Embedded Speech API provided under Microsoft Windows
Thai TTS | VAJA | Variable-length unit selection on a 14-hour female speech corpus

[2] CMU SPHINX, http://cmusphinx.sourceforge.net/
[3] Microsoft SAPI, http://www.microsoft.com/speech

The web service follows a standard protocol provided under the ASTAR consortium called Speech Translation Marked-up Language (STML version 1.0). As shown in Figure 1, the STML web service center manages data exchange among clients and requested services. Currently, the services available include English/Thai ASR, English-to-Thai and Thai-to-English SMT, and English/Thai TTS [4]. Table 1 summarizes the basic tools and configuration of each component.

The English ASR has been adopted from the SPHINX speech recognition engine provided by Carnegie Mellon University, which also provides a US English acoustic model. The language model is based on n-grams trained on a combination of text data, one part generated from a handcrafted regular grammar and the other taken from the English side of the Basic Travel Expressions Corpus (BTEC) provided by ATR (NICT), Japan.
The Thai ASR consists of the ISPEECH Thai speech recognizer built at NECTEC, based on an architecture similar to that of the JULIUS [5] speech recognizer from Kyoto University. ISPEECH incorporates a general Thai acoustic model trained on a number of training sets, e.g. the LOTUS corpus (Kasuriya et al., 2002) and the NECTEC-ATR corpus (Kasuriya et al., 2003), with multiple noise signals added for the sake of robustness. The Thai language model is trained on the Thai side of the BTEC corpus.

The MOSES tool (Koehn et al., 2007) is used to build the English-Thai SMT engine. It is trained on a combined set of two parallel text corpora: more than 100,000 sample sentence pairs from Thai-English dictionaries and 100,000 sentence pairs from the English-Thai BTEC corpus. Since the Thai script has no word boundaries, automatic word segmentation is applied before the conventional SMT process. The same word segmentation tool must also be applied to the Thai part of the training parallel text so that the automatic word alignment tool available in the SMT builder can be run as usual.

VAJA is a general Thai TTS engine based on variable-length unit selection with a 14-hour female speech corpus. English words written in the Roman alphabet can be synthesized by using a look-up dictionary that maps English words to Thai pronunciations. For English TTS, the free English TTS engine provided under the Microsoft Speech API (SAPI) is adopted.

[4] At the time of writing this article, the Thai ASR and English TTS are under construction.
[5] JULIUS, http://julius.sourceforge.jp/en_index.php

All the above components have been developed in a client/server framework, which allows users to access the engines via the STML-based web service format.

3 Improvement Issues

The speech translation web service is considered a baseline system in which most components are built on conventional tools. Many issues remain to be improved. This section describes three improvement issues recently investigated: word segmentation, word reordering in noun phrases (NPs), and named entity (NE) tagging.

3.1 Thai word segmentation

A common problem in languages without word boundary markers is segmenting text into word sequences. The difficulty of Thai language processing comes not only from the missing syllable, word, and sentence boundary markers, but also from the ambiguity that arises when the script is mixed with loan words and named entities written in Thai. Word segmentation in Thai has been widely researched for decades. However, it is still far from solved due to the lack of large training data and standard evaluation sets. Recently, a project named Benchmarks for Enhancing the Standard of Thai language processing (BEST) [6] has been initiated. As a result, a guideline for Thai word segmentation as well as an official large annotated text corpus for training and evaluation is available.

NECTEC's traditional word segmentation tool is based on a word/word-class n-gram, where the part-of-speech of a word represents its word class. Recently, three new algorithms have been evaluated using the ORCHID corpus (Charoenporn et al., 1997) and the BEST corpus. The first and second algorithms are based on support vector machines (SVM) and conditional random fields (CRF), which have proven effective for many tasks. These algorithms decide at every character whether a word boundary should be placed right after that character.
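As an illustration of this character-level formulation, the following minimal Python sketch converts a segmented sentence into per-character boundary labels and back. The function names and the 0/1 label encoding are assumptions made for this example, not the interface of the SVM or CRF implementations evaluated here.

```python
def words_to_labels(words):
    """Turn a list of words into (text, labels).

    labels[i] is 1 if a word boundary follows text[i], else 0.
    Characters are Unicode code points, so Thai vowel and tone marks
    are counted as separate characters (an assumption of this sketch).
    """
    text = "".join(words)
    labels = []
    for word in words:
        if not word:
            continue  # empty strings contribute no characters
        labels.extend([0] * (len(word) - 1) + [1])
    return text, labels


def labels_to_words(text, labels):
    """Rebuild the word list from a text and its boundary labels."""
    words, start = [], 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # Trailing characters without a closing boundary form a last word.
        words.append(text[start:])
    return words


if __name__ == "__main__":
    segmented = ["ฉัน", "รัก", "คุณ"]  # three words of three code points each
    text, labels = words_to_labels(segmented)
    print(labels)
    print(labels_to_words(text, labels))
```

Training data for a boundary classifier can be derived in this way from any word-segmented corpus, and the inverse function turns predicted labels back into words.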
Character-level features are potentially useful for word boundary prediction, such as the character type and the possibility for the character to occur at the beginning or the end of a word. Another innovative algorithm applies the SMT decoder to word segmentation. This is done by simply viewing a string of single characters as the source language and the string of corresponding word units as the target language, so that the phrase-based SMT decoder can be adopted. Table 2 shows the comparative results of the three algorithms: SMT, SVM, and CRF. Currently, the SMT decoder is used in the SMT preprocessor of our speech translation web service due to its simplicity. The CRF-based approach is now integrated in the VAJA speech synthesizer and will be integrated in the pre-processing step of the SMT in the near future.

[6] BEST project, http://www.hlt.nectec.or.th/best

Table 2. Comparative results of three Thai word segmentation algorithms

Algorithm | F-measure (%)
SMT | 75.5
SVM | 90.7
CRF | 95.4

3.2 Word re-ordering in noun phrases

The state-of-the-art phrase-based SMT engine, trained on the large parallel text corpus mentioned in Section 2, achieved a BLEU score as high as 40.1% in 10-fold cross-validation. Although this conventional system has yielded a promising result, its limitations lie in the large training database required and the difficulty of capturing complex translation phenomena such as word reordering. Differences in word order between English and Thai occur frequently within noun phrases (NPs). Figure 2 shows an example of word alignment in a sentence pair. In building English-Thai SMT with the phrase-based approach, the difference in word order within NPs between the two languages becomes one of the major issues to be solved. To overcome the problem, we introduced a set of word reordering rules within NPs, as shown in Table 3, where "drop" means deletion.
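The rules of Table 3 (adjective-noun and determiner-noun pairs are swapped, and an article before a noun is dropped) can be sketched in Python as below. The tag names, the token format, and the restriction to two-word NPs are assumptions made for illustration; in the actual system the NP spans come from the CCG-based parser, which is not reproduced here.

```python
# Reordering rules for two-word English NPs, keyed by (first tag, second tag).
# The tag names are hypothetical and chosen only for this sketch.
RULES = {
    ("ADJ", "NOUN"): "swap",        # Adjective Noun   -> Noun Adjective
    ("DET", "NOUN"): "swap",        # Determiner Noun  -> Noun Determiner
    ("ART", "NOUN"): "drop_first",  # Article Noun     -> (drop) Noun
}


def reorder_np(np_tokens):
    """Reorder one NP given as a list of (word, tag) pairs."""
    if len(np_tokens) != 2:
        return list(np_tokens)  # longer NPs are outside this sketch
    first, second = np_tokens
    action = RULES.get((first[1], second[1]))
    if action == "swap":
        return [second, first]
    if action == "drop_first":
        return [second]
    return list(np_tokens)


def reorder_sentence(tokens, np_spans):
    """Apply reorder_np to each NP span of a tagged sentence.

    tokens:   list of (word, tag) pairs
    np_spans: sorted, non-overlapping (start, end) index pairs, end exclusive
    """
    out, pos = [], 0
    for start, end in np_spans:
        out.extend(tokens[pos:start])             # copy words before the NP
        out.extend(reorder_np(tokens[start:end]))
        pos = end
    out.extend(tokens[pos:])                      # copy words after the last NP
    return out


if __name__ == "__main__":
    sentence = [("this", "DET"), ("hotel", "NOUN"), ("is", "VERB"),
                ("near", "PREP"), ("the", "ART"), ("market", "NOUN")]
    reordered = reorder_sentence(sentence, [(0, 2), (4, 6)])
    print(" ".join(word for word, _ in reordered))
```

Applying the same function to both training and test sentences keeps the source side of the translation model consistent with the input it sees at translation time.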
The English sentence in each training sentence pair was first parsed to detect NPs using our recently proposed combinatory categorial grammar (CCG) based syntactic parser. Words within each NP were reordered using the above rules, and the resulting English sentences together with their associated Thai sentences were used to train the phrase-based SMT. To evaluate the proposed model, a training set containing approximately 160,000 sentence pairs, mostly from the English-Thai BTEC corpus, was used to train the SMT. A set of 2,310 sentences randomly selected from other sources was prepared for testing. The test sentences were also parsed and reordered before being translated by
the trained translation model. Objective measurement showed that the proposed model achieved a 0.4 BLEU score improvement over the conventional model without word reordering. A subjective test revealed that, using the proposed model, 75% of the test sentences were graded as having relatively higher translation quality, whereas only 13% were considered worse.

[Figure 2. An example of word alignment in an English-Thai sentence pair.]

Table 3. Reordering rules within the NP

English NP structure | Thai reordering
Adjective Noun | Noun Adjective
Determiner Noun | Noun Determiner
Article Noun | (drop) Noun

3.3 Named entity tagging

With the parallel corpus developed, another important issue is building an automatic named entity (NE) tagging tool. NE words such as person, location, and organization names generally form an open set into which new words are often introduced. A common way to handle such NE words is to define them as a word class, e.g. PER denoting the class of person names. Any new name can then be included in the system's dictionary with only a small effect on the class-based language model. NE words can also be treated specially in the translation process. For example, transliteration can be acceptable in translating NE words when the words are not found in the system's dictionary.

In our speech translation implementation, automatic NE tagging is helpful for annotating NE words in the parallel text corpus. Some English NE taggers, such as the Stanford named entity recognizer [7], are applicable to the English-to-Thai translation side. However, at present, there is no NE tagger available for Thai. Our problem becomes how to tag Thai NE words in the parallel corpus, given English NE words tagged automatically by an English NE tagging tool. The first solution is that, for an English sentence containing NEs, all words in the associated Thai sentence are romanized automatically by using our romanization tool (Tarsaku et al., 2001). A Thai NE word can then be detected if its romanized version matches the English NE word. Figure 3 illustrates the overall process. Note that this idea relies on the fact that the parallel corpus was created by translating English sentences, and thus most NE words were transliterated into Thai script.

[7] Stanford named entity recognizer, http://nlp.stanford.edu/software/crf-ner.shtml

[Figure 3. The process of Thai NE tagging by matching between romanized Thai words and automatically tagged English NE words. Boxes shown: English sentences, NE tagging, English NE words; Thai sentences, word segmentation & romanization, romanized Thai sentences; NE word matching, recognized Thai NE words.]

Table 4. Analysis and detection results of NE tagging using matching of transliterated words

Aspect | Result
Total no. of test sentences | 1,000
Total no. of NE words | 116
No. of transliterated NE words | 107
English NE tagging F-measure: Person name | 88%
English NE tagging F-measure: Location name | 77%
English NE tagging F-measure: Organization name | 100%
Thai NE tagging F-measure: Person name | 83%
Thai NE tagging F-measure: Location name | 77%
Thai NE tagging F-measure: Organization name | 100%

Table 4 presents some analyses and results from a preliminary test on a part of the English-Thai BTEC corpus. This test shows that as many as 87.9% of English NE words are translated to Thai on the basis of transliteration. Therefore, as shown in the last two rows of the table, tagging Thai NE words by matching the romanized version of Thai words to the detected English NE
words is quite promising. Though a minority, the other Thai NE words that are not transliterated from their associated English ones can be detected by other approaches. One possible approach is to search over the word-alignment result produced during SMT training.

4 Conclusion and Future Aspects

This article described our progress in developing the English-Thai speech translation web service. The system is currently available for translating from English to Thai in the travel domain, while translating back from Thai to English will be ready in the near future. This baseline system has opened a number of research issues for system improvement. The already developed English-Thai BTEC corpus has been a good source for further studies, as summarized in this article. Apart from the research issues mentioned, some other works related to this project have also been conducted, e.g. the development of bilingual TTS to support synthesizing Thai text mixed with English words, and the handling of out-of-vocabulary (OOV) words in Thai speech recognition. These research works can also contribute to improving the overall performance of the speech translation system.

Finally, the use of the ASTAR standard protocol for speech translation web services, STML, allows all members of the consortium to share their services across countries. It is thus our future perspective to include other languages in our speech translation service, such as Indonesian, Malay, Vietnamese, Chinese, Hindi, Korean, and Japanese.

Acknowledgments

The authors would like to thank ATR and NICT for supporting the development of the English-Thai BTEC corpus, which has become a fruitful resource for our speech translation project.

References

Boonkwan, P., Supnithi, T. 2008. Memory-inductive categorial grammar: an approach to gap resolution in analytic-language translation. Proc. International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad.
Charoenporn, T., Sornlertlamvanich, V., Isahara, H. 1997. Building a large Thai text corpus part-of-speech tagged corpus: ORCHID. Proc. Natural Language Processing Pacific Rim Symposium (NLPRS).
Kasuriya, S., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2002. Thai speech recognition by acoustic models mapped from Japanese. Proc. Joint International Conference of SNLP and O-COCOSDA, pp. 211-216.
Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2003. NECTEC-ATR Thai speech corpus. Proc. International Conference on Speech Databases and Assessments (O-COCOSDA 2003).
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL 2007), Demonstration session, Prague, Czech Republic.
Taddei, L., Costantini, E., Lavie, A. 2002. The NESPOLE! Multimodal Interface for Cross-lingual Communication - Experience and Lessons Learned. Proc. International Conference on Multimodal Interfaces (ICMI 2002), Pittsburgh, USA, pp. 14-16.
Tarsaku, P., Sornlertlamvanich, V., Thongprasirt, R. 2001. Thai grapheme-to-phoneme using probabilistic GLR parser. Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1057-1060.
Wutiwiwatchai, C. 2007. Toward Thai-English speech translation. Proc. International Symposium on Universal Communications (ISUC 2007), Japan, pp. 225-227.