Improvement Issues in English-Thai Speech Translation
Chai Wutiwiwatchai, Thepchai Supnithi, Peerachet Porkaew, Nattanun Thatphithakkul
Human Language Technology Laboratory, National Electronics and Computer Technology Center, 112 Pahonyothin Rd., Klong-luang, Pathumthani, Thailand
{chai.wut, thepchai.sup, peerachet.por,

Abstract

The first English-Thai speech translation web service has been developed at NECTEC, Thailand. The service is based on the STML standard protocol provided under the ASTAR consortium. In parallel, several research works have been conducted to improve system performance, including an exploration of better Thai word segmentation algorithms, translation modeling using word reordering within noun phrases, and Thai named entity detection using an existing English named entity tagging tool. This article reports the status of the speech translation web service development, summarizes the mentioned research works, and discusses our research path.

1 Introduction

Speech translation is a technology that allows people to converse with each other while speaking in their own languages. It is useful in present-day society, where communication takes place among people from most regions of the world. To achieve effective systems, international collaborative efforts have been established. The Negotiating through spoken language in e-commerce (NESPOLE!) project (Taddei et al., 2002) brought together a number of European countries to develop an e-commerce service using multilingual speech translation systems. Recently, the Asian Speech Translation Advanced Research (ASTAR) consortium was set up as a collaboration among many Asian countries to jointly develop infrastructure for building web service based speech translation systems. In Thailand, the National Electronics and Computer Technology Center (NECTEC) has worked in this research and development area since 2007 through the ASTAR consortium.
The aim is to build the components and resources required for English-Thai speech translation, focusing on the travel domain. A stand-alone prototype in which the major components, automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS), were simply connected was shown in 2007 (Wutiwiwatchai, 2007). At present, the engine has been ported to be available as a web service via an infrastructure provided under the consortium. However, a number of tasks remain to complete and improve the system, such as enlarging the training resources, improving the statistical machine translator (SMT) with language-specific knowledge, treating named entities, handling non-native speakers speaking English, and improving the quality and variety of synthesized speech.

This article summarizes our current focus in improving the baseline system. In parallel to the web service implementation, an English-Thai parallel text corpus was prepared for improving ASR language models and SMT translation models. As the Thai language is written without word boundary markers, several word segmentation algorithms have been investigated as a basic tool for several development and implementation steps, e.g. text preparation for language modeling and text pre-processing for SMT and TTS. A combinatory categorial grammar (CCG) based parser has been introduced for Thai text analysis (Boonkwan and Supnithi, 2008). The CCG parser is useful for delimiting noun phrases (NP), inside which the word order can be additionally constrained during the SMT process. In the travel domain, named entities (NE), especially place and organization names, occur frequently. To accelerate the annotation process for parallel corpus development, an automatic NE tagging tool available for English is helpful for automatic tagging of Thai NEs.
The next section gives an overview of our English-Thai speech translation web service, followed by the improvement issues mentioned above. Section 4 discusses and concludes this article.

2 English-Thai Speech Translation Web Service

Figure 1 illustrates the overall structure of our English-Thai speech translation web service.

Figure 1. The architecture of the English-Thai speech translation web service (components: English/Thai ASR, English-Thai SMT, and English/Thai TTS, connected through the STML web service center)

Table 1. Summary of basic tools and configuration used in each service component
Component | Tool | Configuration
English ASR | SPHINX [2] | CMU acoustic model and n-gram language model trained on 100K English sentences from BTEC
Thai ASR | ISPEECH | Acoustic model trained on LOTUS and NECTEC-ATR, and n-gram language model trained on 100K Thai sentences from BTEC
English-Thai SMT | MOSES | Translation and language models trained on 160K sentence pairs from BTEC and dictionaries
English TTS | Microsoft SAPI [3] | Embedded Speech API provided under Microsoft Windows
Thai TTS | VAJA | Variable-length unit selection on a 14-hour female speech corpus

[2] CMU SPHINX
[3] Microsoft SAPI

The web service follows a standard protocol provided under the ASTAR consortium called Speech Translation Marked-up Language (STML version 1.0). As shown in Figure 1, the STML web service center manages data exchange between clients and the requested services. Currently, the available services include English/Thai ASR, English-to-Thai and Thai-to-English SMT, and English/Thai TTS [4]. Table 1 summarizes the basic tools and configuration of each component. The English ASR has been adopted from the SPHINX speech recognition engine provided by Carnegie Mellon University, which also provides a US English acoustic model. The language model is based on n-grams trained on a combination of text data, one part generated from a handcrafted regular grammar and the other taken from the English side of the Basic Travel Expressions Corpus (BTEC) provided by ATR (NICT), Japan.
The Thai ASR consists of the ISPEECH Thai speech recognizer built at NECTEC, based on an architecture similar to that of the JULIUS [5] speech recognizer from Kyoto University. ISPEECH incorporates a general Thai acoustic model trained on a number of training sets, e.g. the LOTUS corpus (Kasuriya et al., 2002) and the NECTEC-ATR corpus (Kasuriya et al., 2003), with multiple noise signals added for the sake of robustness. The Thai language model is trained on the Thai translation of the BTEC corpus.

The MOSES tool (Koehn et al., 2007) is used to build the English-Thai SMT engine. It is trained on a combined set of two parallel text corpora: more than 100,000 sample sentence pairs from Thai-English dictionaries and 100,000 sentence pairs from the English-Thai BTEC corpus. Since the Thai script has no word boundaries, automatic word segmentation is applied before the conventional SMT process. The same word segmentation tool must also be applied to the Thai part of the training parallel text so that the automatic word alignment tool available in the SMT builder can be run as usual.

VAJA is a general Thai TTS engine based on variable-length unit selection with a 14-hour female speech corpus. English words written in the Roman alphabet can be synthesized by using a look-up dictionary that maps English words to Thai pronunciations. For English TTS, the free TTS engine provided under the Microsoft Speech API (SAPI) is adopted.

[4] At the time of writing this article, the Thai ASR and TTS are under construction.
[5] JULIUS
All the above components have been developed in a client/server framework, which allows users to access the engines via the STML-based web service format.

3 Improvement Issues

The speech translation web service is considered a baseline system in which most components are built on conventional tools. Many issues remain to be improved. This section describes three improvement issues that have recently been investigated: word segmentation, word reordering in noun phrases (NP), and named entity (NE) tagging.

3.1 Thai word segmentation

A common problem in languages without word boundary markers is segmenting text into word sequences. The difficulty of Thai language processing comes not only from the missing syllable, word, and sentence boundary markers, but also from the ambiguity that arises when the script is mixed with loan words and named entities written in Thai. Word segmentation in Thai has been widely researched for decades. However, it is still far from solved due to the lack of large training data and standard evaluation sets. Recently, a project named Benchmarks for Enhancing the Standard of Thai language processing (BEST) [6] has been initiated. As a result, a guideline for Thai word segmentation as well as an official large annotated text corpus for training and evaluation are now available.

NECTEC's traditional word segmentation tool is based on word/word-class n-grams, where the part-of-speech of a word represents its word class. Recently, three new algorithms have been evaluated using the ORCHID corpus (Charoenporn et al., 1997) and the BEST corpus. The first and second algorithms are based on support vector machines (SVM) and conditional random fields (CRF), which have proven effective for many tasks. These algorithms decide at every character whether a word boundary is placed right after that character.
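The character-level formulation above can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the function names `words_to_labels` and `labels_to_words` are hypothetical, and a "character" is assumed here to be a single Unicode code point (so Thai vowel and tone marks count as separate characters). The sketch only shows how a segmented sentence maps to one boundary label per character, which is the kind of target an SVM or CRF classifier would be trained to predict.

```python
from typing import List


def words_to_labels(words: List[str]) -> List[int]:
    """Return one label per character: 1 if a word boundary follows it, else 0."""
    labels: List[int] = []
    for word in words:
        if not word:
            continue  # ignore empty strings so they do not create spurious boundaries
        labels.extend([0] * (len(word) - 1) + [1])
    return labels


def labels_to_words(text: str, labels: List[int]) -> List[str]:
    """Rebuild the word sequence from raw text and per-character boundary labels."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    words: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])  # trailing characters without a final boundary
    return words


# Toy example: a manually segmented Thai sentence.
segmented = ["ฉัน", "รัก", "ภาษา", "ไทย"]
labels = words_to_labels(segmented)
restored = labels_to_words("".join(segmented), labels)
```

In a full system the predicted labels would come from the classifier; `labels_to_words` then turns them back into a word sequence for the SMT or TTS pre-processor.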
Character-level features are informative for word boundary prediction, such as the character type and the likelihood that the character occurs at the beginning or the end of a word. The third algorithm applies the SMT decoder to word segmentation. This is done by simply viewing a string of single characters as the source language and the string of corresponding word units as the target language, so that the phrase-based SMT decoder can be adopted directly. Table 2 compares the three algorithms, SMT, SVM, and CRF. Currently, the SMT decoder is used in the SMT pre-processor of our speech translation web service due to its simplicity. The CRF-based approach is now integrated in the VAJA speech synthesizer and will be integrated in the pre-processing step of the SMT in the near future.

[6] BEST project

Table 2. Comparative results of three Thai word segmentation algorithms
Algorithm | F-measure
SMT | 75.5
SVM | 90.7
CRF |

3.2 Word re-ordering in noun phrases

The state-of-the-art phrase-based SMT engine, trained on the large parallel text corpus mentioned in Section 2, achieved a BLEU score as high as 40.1% in 10-fold cross validation. Although this conventional system has yielded a promising result, its limitations lie in the large training database required and the difficulty of capturing complex translation phenomena such as word reordering. Differences in word order in English-Thai translation occur frequently within noun phrases (NP). Figure 2 shows an example of word alignment in a sentence pair. When building English-Thai SMT with the phrase-based approach, the difference in NP word order between the two languages becomes one of the major issues to be solved. To overcome the problem, we introduced a set of word reordering rules within NPs, as shown in Table 3, where "drop" means deletion. NPs in the English sentence of each training sentence pair were first detected using our recently proposed combinatory categorial grammar (CCG) based syntactic parser.
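The three rules of Table 3 can be sketched as a small lookup table. This is an illustrative sketch, not the authors' code: the tag names (`ADJ`, `DET`, `ART`, `NOUN`), the function names, and the representation of NP spans as (start, end) index pairs are all assumptions, and NP detection itself (done by the CCG parser in the paper) is taken as given. NPs that do not match one of the two-word patterns are left unchanged, since the source does not describe how longer NPs are handled.

```python
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, tag); the tag set is hypothetical

# Table 3 as a mapping from an English NP tag pattern to the output order.
# Indices refer to positions in the matched NP; an omitted index means "drop".
RULES: Dict[Tuple[str, ...], Tuple[int, ...]] = {
    ("ADJ", "NOUN"): (1, 0),  # Adjective Noun   -> Noun Adjective
    ("DET", "NOUN"): (1, 0),  # Determiner Noun  -> Noun Determiner
    ("ART", "NOUN"): (1,),    # Article Noun     -> (drop) Noun
}


def reorder_np(np_tokens: Sequence[Token]) -> List[Token]:
    """Reorder one NP if its tag sequence matches a rule; otherwise keep it."""
    tags = tuple(tag for _, tag in np_tokens)
    order = RULES.get(tags)
    if order is None:
        return list(np_tokens)
    return [np_tokens[i] for i in order]


def reorder_sentence(tokens: Sequence[Token],
                     np_spans: Sequence[Tuple[int, int]]) -> List[Token]:
    """Apply NP reordering to non-overlapping [start, end) spans of a sentence."""
    out: List[Token] = []
    pos = 0
    for start, end in sorted(np_spans):
        out.extend(tokens[pos:start])
        out.extend(reorder_np(tokens[start:end]))
        pos = end
    out.extend(tokens[pos:])
    return out


sentence = [("I", "PRON"), ("want", "VERB"), ("this", "DET"), ("room", "NOUN")]
reordered = reorder_sentence(sentence, [(2, 4)])
reordered_words = [word for word, _ in reordered]
```

Keeping the rules in a table rather than in branching code makes it easy to add further patterns without touching the reordering logic.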
Words within each English NP were reordered using the above rules, and the resulting English sentences together with their associated Thai sentences were used to train the phrase-based SMT. To evaluate the proposed model, a training set of approximately 160,000 sentence pairs, mostly from the English-Thai BTEC corpus, was used to train the SMT. A test set of 2,310 English sentences randomly selected from other sources was prepared. The test sentences were also parsed and reordered before being translated by
the trained translation model. Objective measurement showed that the proposed model achieved a 0.4 BLEU score improvement over the conventional model without word reordering. A subjective test revealed that, with the proposed model, 75% of the test sentences were judged to have better translation quality, whereas only 13% were considered worse.

Figure 2. An example of word alignment in an English-Thai sentence pair

Table 3. Reordering rules within the English NP
English NP structure | Thai reordering
Adjective Noun | Noun Adjective
Determiner Noun | Noun Determiner
Article Noun | (drop) Noun

3.3 Named entity tagging

With the parallel corpus developed, another important issue is building an automatic named entity (NE) tagging tool. NE words such as person, location, and organization names generally form an open set to which new words are often added. A common way to handle such NE words is to assign them to a word class, e.g. PER denoting the class of person names. Any new name can then be included in the system's dictionary with little effect on the class-based language model. NE words can also be treated specially in the translation process. For example, transliteration can be acceptable when translating NE words that are not found in the system's dictionary.

In our speech translation implementation, automatic NE tagging is helpful for annotating NE words in the parallel text corpus. Some English NE taggers, such as the Stanford named entity recognizer [7], are applicable to the English-to-Thai translation side. However, at present, there is no NE tagger available for Thai. Our problem therefore becomes how to tag Thai NE words in the parallel corpus given English NE words tagged automatically by an English NE tagging tool. The first solution is that, for an English sentence containing an NE, all words in the associated Thai sentence are romanized automatically using our romanization tool (Tarsaku et al., 2001).

[7] Stanford named entity recognizer
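As a rough illustration of this romanize-then-match idea, the sketch below romanizes each segmented Thai word and compares it with the English NE words produced by an English tagger. Several parts are assumptions rather than details from the paper: the romanizer is a hypothetical stand-in (a toy lookup table instead of the actual romanization tool), the function and parameter names are invented for the example, and the fuzzy comparison with a similarity threshold via Python's `difflib.SequenceMatcher` is only one possible reading of "matched".

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so 'Lon Don' matches 'london'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def tag_thai_nes(english_nes: Dict[str, str],
                 thai_words: List[str],
                 romanize: Callable[[str], str],
                 threshold: float = 0.8) -> Dict[str, str]:
    """Map Thai words to NE labels when their romanization resembles an English NE.

    english_nes: English NE word -> label (e.g. 'LOC'), from an English tagger.
    romanize: hypothetical Thai-to-Roman conversion function.
    threshold: assumed similarity cut-off; not a value from the paper.
    """
    tagged: Dict[str, str] = {}
    for thai in thai_words:
        roman = normalize(romanize(thai))
        if not roman:
            continue
        best_label, best_score = None, 0.0
        for ne_word, label in english_nes.items():
            score = SequenceMatcher(None, roman, normalize(ne_word)).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            tagged[thai] = best_label
    return tagged


# Toy romanization table standing in for a real romanization tool.
TOY_ROMAN = {"ฉัน": "chan", "ไป": "pai", "ลอนดอน": "london"}

result = tag_thai_nes(
    english_nes={"London": "LOC"},
    thai_words=["ฉัน", "ไป", "ลอนดอน"],
    romanize=lambda w: TOY_ROMAN.get(w, ""),
)
```

A threshold below 1.0 tolerates small spelling differences between a romanized Thai word and the English original; the appropriate value would have to be tuned on annotated data.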
Then a Thai NE word can be detected if its romanized version matches the English NE word. Figure 3 illustrates the overall process. Note that this idea relies on the fact that the parallel corpus was created by translating English sentences, and thus most NE words were transliterated into Thai script.

Figure 3. The process of Thai NE tagging by matching between romanized Thai words and automatically tagged English NE words (English sentences pass through NE tagging to give English NE words; Thai sentences pass through word segmentation and romanization to give romanized Thai sentences; NE word matching then yields the recognized Thai NE words)

Table 4. Analysis and detection results of NE tagging using matching of transliterated words
Aspect | Result
Total no. of test sentences | 1,000
Total no. of NE words | 116
No. of transliterated NE words | 107
English NE tagging F-measure, person name | 88%
English NE tagging F-measure, location name | 77%
English NE tagging F-measure, organization name | 100%
Thai NE tagging F-measure, person name | 83%
Thai NE tagging F-measure, location name | 77%
Thai NE tagging F-measure, organization name | 100%

Table 4 presents some analyses and results from a preliminary test on a part of the English-Thai BTEC corpus. This test shows that as many as 87.9% of NE words are translated to Thai on the basis of transliteration. Therefore, as shown in the last rows of the table, tagging Thai NE words by matching the romanized version of Thai words to the detected English NE
words is quite promising. Though a minority, the remaining Thai NE words that are not transliterated from their associated English words can be detected by other approaches. One possible approach is to search over the word-alignment result produced during SMT training.

4 Conclusion and Future Aspects

This article described our progress in developing the English-Thai speech translation web service. The system is currently available for translating from English to Thai in the travel domain, while translation from Thai to English will be ready in the near future. This baseline system has opened up a number of research issues for system improvement. The English-Thai BTEC corpus already developed has been a good source for further studies, as summarized in this article. Apart from the research issues mentioned, other work related to this project has also been conducted, e.g. the development of bilingual TTS to support synthesizing Thai text mixed with English words, and the handling of out-of-vocabulary (OOV) words in Thai speech recognition. These research works can also contribute to improving the overall performance of the speech translation system.

Finally, the use of STML, the ASTAR standard protocol for speech translation web services, allows all members of the consortium to share their services across countries. Our future plan is thus to include other languages in our speech translation service, such as Indonesian, Malay, Vietnamese, Chinese, Hindi, Korean, and Japanese.

Acknowledgments

The authors would like to thank ATR and NICT for supporting the development of the English-Thai BTEC corpus, which has become a fruitful resource for our speech translation project.

References

Boonkwan, P., Supnithi, T. 2008. Memory-inductive categorial grammar: an approach to gap resolution in analytic-language translation. Proc. International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad.

Charoenporn, T., Sornlertlamvanich, V., Isahara, H. 1997. Building a large Thai text corpus part-of-speech tagged corpus: ORCHID. Proc. Natural Language Processing Pacific Rim Symposium (NLPRS).

Kasuriya, S., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2002. Thai speech recognition by acoustic models mapped from Japanese. Proc. Joint International Conference of SNLP and O-COCOSDA.

Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Jitsuhiro, T., Kikui, G., Sagisaka, Y. 2003. NECTEC-ATR Thai speech corpus. Proc. International Conference on Speech Databases and Assessments (O-COCOSDA 2003).

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL 2007), Demonstration session, Prague, Czech Republic.

Taddei, L., Costantini, E., Lavie, A. 2002. The NESPOLE! Multimodal Interface for Cross-lingual Communication - Experience and Lessons Learned. Proc. International Conference on Multimodal Interfaces (ICMI 2002), Pittsburgh, USA.

Tarsaku, P., Sornlertlamvanich, V., Thongprasirt, R. 2001. Thai grapheme-to-phoneme using probabilistic GLR parser. Proc. European Conference on Speech Communication and Technology (EUROSPEECH).

Wutiwiwatchai, C. 2007. Toward Thai-English speech translation. Proc. International Symposium on Universal Communications (ISUC 2007), Japan.
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationSearch right and thou shalt find... Using Web Queries for Learner Error Detection
Search right and thou shalt find... Using Web Queries for Learner Error Detection Michael Gamon Claudia Leacock Microsoft Research Butler Hill Group One Microsoft Way P.O. Box 935 Redmond, WA 981052, USA
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationThe IDN Variant Issues Project: A Study of Issues Related to the Delegation of IDN Variant TLDs. 20 April 2011
The IDN Variant Issues Project: A Study of Issues Related to the Delegation of IDN Variant TLDs 20 April 2011 Project Proposal updated based on comments received during the Public Comment period held from
More informationPOS tagging of Chinese Buddhist texts using Recurrent Neural Networks
POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important
More informationA study of speaker adaptation for DNN-based speech synthesis
A study of speaker adaptation for DNN-based speech synthesis Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King The Centre for Speech Technology Research (CSTR) University of Edinburgh,
More informationProgram Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading
Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationMulti-Lingual Text Leveling
Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency
More informationA Neural Network GUI Tested on Text-To-Phoneme Mapping
A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis
More informationProblems of the Arabic OCR: New Attitudes
Problems of the Arabic OCR: New Attitudes Prof. O.Redkin, Dr. O.Bernikova Department of Asian and African Studies, St. Petersburg State University, St Petersburg, Russia Abstract - This paper reviews existing
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationPhonological Processing for Urdu Text to Speech System
Phonological Processing for Urdu Text to Speech System Sarmad Hussain Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, B Block, Faisal Town, Lahore,
More informationChunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.
NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationAN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282)
B. PALTRIDGE, DISCOURSE ANALYSIS: AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC. 2012. PP. VI, 282) Review by Glenda Shopen _ This book is a revised edition of the author s 2006 introductory
More informationIterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages
Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer
More informationMultilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities
Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB
More informationCorrespondence between the DRDP (2015) and the California Preschool Learning Foundations. Foundations (PLF) in Language and Literacy
1 Desired Results Developmental Profile (2015) [DRDP (2015)] Correspondence to California Foundations: Language and Development (LLD) and the Foundations (PLF) The Language and Development (LLD) domain
More informationAn Interactive Intelligent Language Tutor Over The Internet
An Interactive Intelligent Language Tutor Over The Internet Trude Heift Linguistics Department and Language Learning Centre Simon Fraser University, B.C. Canada V5A1S6 E-mail: heift@sfu.ca Abstract: This
More informationUnvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition
Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationEye Movements in Speech Technologies: an overview of current research
Eye Movements in Speech Technologies: an overview of current research Mattias Nilsson Department of linguistics and Philology, Uppsala University Box 635, SE-751 26 Uppsala, Sweden Graduate School of Language
More informationWhat Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017
What Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017 Supervised Training of Neural Networks for Language Training Data Training Model this is an example the cat went to
More informationAutomatic Translation of Norwegian Noun Compounds
Automatic Translation of Norwegian Noun Compounds Lars Bungum Department of Informatics University of Oslo larsbun@ifi.uio.no Stephan Oepen Department of Informatics University of Oslo oe@ifi.uio.no Abstract
More informationARNE - A tool for Namend Entity Recognition from Arabic Text
24 ARNE - A tool for Namend Entity Recognition from Arabic Text Carolin Shihadeh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany carolin.shihadeh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg 3 66123
More informationCLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction
CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets
More informationTowards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Grzegorz Chrupa la
Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing Grzegorz Chrupa la A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.)
More informationCharacter Stream Parsing of Mixed-lingual Text
Character Stream Parsing of Mixed-lingual Text Harald Romsdorfer and Beat Pfister Speech Processing Group Computer Engineering and Networks Laboratory ETH Zurich {romsdorfer,pfister}@tik.ee.ethz.ch Abstract
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationThe RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017
The RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017 Jan-Thorsten Peter, Andreas Guta, Tamer Alkhouli, Parnia Bahar, Jan Rosendahl, Nick Rossenbach, Miguel
More informationVocabulary Usage and Intelligibility in Learner Language
Vocabulary Usage and Intelligibility in Learner Language Emi Izumi, 1 Kiyotaka Uchimoto 1 and Hitoshi Isahara 1 1. Introduction In verbal communication, the primary purpose of which is to convey and understand
More informationThe Ups and Downs of Preposition Error Detection in ESL Writing
The Ups and Downs of Preposition Error Detection in ESL Writing Joel R. Tetreault Educational Testing Service 660 Rosedale Road Princeton, NJ, USA JTetreault@ets.org Martin Chodorow Hunter College of CUNY
More informationThe College Board Redesigned SAT Grade 12
A Correlation of, 2017 To the Redesigned SAT Introduction This document demonstrates how myperspectives English Language Arts meets the Reading, Writing and Language and Essay Domains of Redesigned SAT.
More informationEli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology
ISCA Archive SUBJECTIVE EVALUATION FOR HMM-BASED SPEECH-TO-LIP MOVEMENT SYNTHESIS Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano Graduate School of Information Science, Nara Institute of Science & Technology
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading
ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationDifferent Requirements Gathering Techniques and Issues. Javaria Mushtaq
835 Different Requirements Gathering Techniques and Issues Javaria Mushtaq Abstract- Project management is now becoming a very important part of our software industries. To handle projects with success
More informationEdinburgh Research Explorer
Edinburgh Research Explorer Personalising speech-to-speech translation Citation for published version: Dines, J, Liang, H, Saheer, L, Gibson, M, Byrne, W, Oura, K, Tokuda, K, Yamagishi, J, King, S, Wester,
More informationFlorida Reading Endorsement Alignment Matrix Competency 1
Florida Reading Endorsement Alignment Matrix Competency 1 Reading Endorsement Guiding Principle: Teachers will understand and teach reading as an ongoing strategic process resulting in students comprehending
More informationInteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial ISSN:
Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial ISSN: 1137-3601 revista@aepia.org Asociación Española para la Inteligencia Artificial España Lucena, Diego Jesus de; Bastos Pereira,
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationRegression for Sentence-Level MT Evaluation with Pseudo References
Regression for Sentence-Level MT Evaluation with Pseudo References Joshua S. Albrecht and Rebecca Hwa Department of Computer Science University of Pittsburgh {jsa8,hwa}@cs.pitt.edu Abstract Many automatic
More informationDetecting English-French Cognates Using Orthographic Edit Distance
Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National
More information