A Transformation-Based Learning Method on Generating Korean Standard Pronunciation *


Kim Dong-Sung and Chang-Hwa Roh
Department of Linguistics and Cognitive Science, Hankuk University of Foreign Studies
San 89 Wansanri, Mohyunmeon, Yonginsi, Kyunggido, Korea
{dsk202, rayr}@hufs.ac.kr

Abstract. In this paper, we propose a Transformation-Based Learning (TBL) method for generating the Korean standard pronunciation. Previous studies on phonological processing have focused on phonological rule application and finite-state automata (Johnson 1984; Kaplan and Kay 1994; Koskenniemi 1983; Bird 1995). In Korean computational phonology, earlier work has taken a phonological-rule-based approach to pronunciation generation (Lee et al. 2005; Lee 1998). This study instead proposes a corpus-based, data-oriented rule learning method for generating the Korean standard pronunciation. To replace rule-based generation with corpus-based generation, we devised a corpus that aligns input text with its pronunciation counterpart, and we conducted an experiment on generating the standard pronunciation with the TBL algorithm trained on this aligned corpus.

Keywords: Transformation-Based Learning, Computational Phonology, Data-oriented Processing, Corpus-based Learning, Pronunciation Generation

1. Introduction

This paper presents a Transformation-Based Learning (TBL) method for generating Korean standard pronunciation. Previous studies on phonological processing have focused on the computation of phonological rule application and representation with finite-state automata (Johnson 1984; Kaplan and Kay 1994; Koskenniemi 1983; Bird 1995). In Korean computational phonology, earlier work has generated pronunciation from phonological rules (Lee et al. 2005; Lee 1998) 1.
Unlike previous work, this study proposes a standard Korean pronunciation generation method based on corpus-based, data-oriented TBL. The role of computational phonology is to generate a legitimate output counterpart of an underlying phonological input, and phonological rules are involved in this generative process. SPE-style approaches to computational phonology have used ordered rewrite rules or finite-state transducers (Bird 1995; Bird and Ellison 1994; Gildea and Jurafsky 1996; Kaplan and Kay 1994). Those approaches, however, must reduce complicated

* This paper was supported by the Second Brain Korea 21.
Copyright 2007 by Kim Dong-Sung and Chang-Hwa Roh.
1 Anyone can visit the website of Lee et al. (2005) and generate standard pronunciations at http://urimal.cs.pusan.ac.kr.

orderings, because of the huge number of rewrite rules and the orderings among them (Gildea and Jurafsky 1996). Other, differently motivated approaches have proposed data-oriented models that derive legitimate outputs from a pronunciation corpus (Daelemans, Gillis and Durieux 1994; Johnson 1984).

In this study, we use the TBL learning method proposed by Brill (1995). We design a set of templates, i.e., abstract transformations over possible pronunciations. For the experiments, we set up a corpus that aligns text in the Korean standard orthography with text in the Korean standard pronunciation, and we conducted an experiment on generating the standard pronunciation with the TBL algorithm, using this corpus. We also use phonotactic constraints to reduce the complexity of the TBL process: as noted by Hayes and Wilson (forthcoming), constraints stated over phonological features can reduce the complexity of phonotactics, so we set up a list of phonotactic constraints derived from phonological features.

The rest of the paper is composed of three parts: Section 2 introduces the TBL method for phonological operations, Section 3 describes the experiment on Korean pronunciation, and Section 4 discusses the experimental results.

2. TBL Application to Pronunciation Handling

Rule-oriented processing in phonology has been represented with context-sensitive rewrite rules. For example, Korean underlying stops are realized as unreleased voiceless stops in word-final position. The following rule shows this for the voiceless stop /t/:

(1) t → t̚ / _ # 2

The most popular ways of formalizing phonological rules are the two-level formalism of Koskenniemi (1984) and Karttunen (1993), and the finite-state transducer of Kaplan and Kay (1994).
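As an illustration only (not the paper's implementation), a context-sensitive rewrite rule like (1) can be applied with a simple regular-expression substitution; the ASCII rendering "t]" for the unreleased stop and the word-boundary symbol "#" follow the paper's notation, while the function name is ours:

```python
import re

# Rule (1): /t/ is realized as an unreleased stop (written "t]" here)
# in word-final position; "#" marks a word boundary, as in the paper.
def apply_final_unrelease(phonemes: str) -> str:
    # Rewrite "t" to "t]" only when it immediately precedes "#".
    return re.sub(r"t(?=#)", "t]", phonemes)

print(apply_final_unrelease("#mat#"))   # word-final /t/ -> "#mat]#"
print(apply_final_unrelease("#tal#"))   # word-initial /t/ is unchanged
```

A finite-state transducer compiles the same relation, but the lookahead makes the context-sensitivity of the rule explicit in a few lines.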
The basic intuition behind these operations is that a rule rewrites an underlying string as a surface string, which can be implemented as a transducer that reads a lexical input tape and writes to a surface tape. Figure 1 shows an example of this operation using the rule in (1).

Figure 1: Rule-based operation in phonology

A phonological derivation-based method requires a complicated rule-ordering system: a phonological input can be realized as different outputs depending on the rule ordering. Computation based on finite-state transducers is complicated enough that the processing mechanisms vary among researchers; Gildea and Jurafsky (1996) suggest a method to reduce complicated rule ordering. A different method is the data-oriented approach: Daelemans, Gillis and Durieux (1994) propose a stochastic method for assigning stress, a supra-segmental feature, using the information gain obtained from a corpus.

TBL is known for learning the most suitable tagging rules from a corpus. TBL is a data-oriented method: it considers every possible transformation of the tagging, using a limited set

2 For the Korean sound and feature systems, see Figure 4 and Table 1 in the next section.

of transformations. The TBL algorithm needs a small set of templates, i.e., abstracted transformations. A phonological input can be transformed into a phonological output. In Korean, the voiceless stop /t/ varies among [t], [d], and [t̚], depending on its environment. Consider the following templates, which transform the phonological input:

If the preceding phonological environment is #, then /t/ becomes [t].
If the preceding phonological environment is Vowel, then /t/ becomes [d].
If the following phonological environment is Consonant, then /t/ becomes [t̚].
If the following phonological environment is #, then /t/ becomes [t̚].

Figure 2: TBL application to phonological change

The TBL method learns the phonological environment by instantiating the incoming items in the templates: every possible phonological environment in a template is iteratively tested by filling in each specific phonological input. The method transforms an input into an output, following the list in the template. In some sense this approach is similar to the input-output matching of the two-level formalism; however, TBL requires a learning text (corpus). As Brill (1995) notes, a small amount of training data can resolve a large amount of processing data.

Templates in the TBL method list the environments that a phonological change must follow. The environment is conceptually the same as a context window in KeyWord In Context (KWIC). Figure 3 gives an example of a context window.

Figure 3: Context windows in TBL

Phonological features are inter-related with phonotactic constraints. As Hayes and Wilson (forthcoming) argue, phonological features reduce the number of phonotactic constraints. Following this idea, we set up constraints on phonotactics by combining the phonological feature systems, which simplifies the search mechanism of TBL processing.

3. Experiments

For the experiment, we set up a corpus that aligns the spoken data from the Sejong corpus with its standard pronunciation.
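The KWIC-style context window illustrated in Figure 3 can be extracted programmatically. The following sketch is our own simplification (not the authors' code): it pads the phoneme sequence with the boundary symbol "#" and yields an even split of left and right context around each target phoneme:

```python
# Extract a context window of a given size around each phoneme,
# with "#" padding at the edges, as in the KWIC-style windows of
# Figure 3.  This is an illustrative sketch, not the paper's code.
def context_windows(seq, size):
    """Yield (window, target) pairs; the window excludes the target
    itself and is split evenly into left and right context."""
    half = size // 2
    padded = ["#"] * half + list(seq) + ["#"] * half
    for i, target in enumerate(seq):
        j = i + half  # position of the target inside the padded list
        window = padded[j - half:j] + padded[j + 1:j + 1 + half]
        yield tuple(window), target

for window, target in context_windows("tada", 4):
    print(target, window)
```

With size 4, the phoneme [d] in "tada", for example, receives the window ('t', 'a', 'a', '#'): two phonemes of left context and two of right context, padded at the word edge.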
The spoken data contains 14,500 ejeols 3 (approximately 60,000 morphemes), transcribed in the Korean standard orthography. We converted the data into the standard pronunciation using the Korean standard IPA converter of Lee et al. (2005). For instance, (2a) is converted into (2b) with the converter.

3 An ejeol is similar to a bunsetsu in Japanese; it is the term for the chunk between spaces in a sentence. For more information, see Sohn (1999).
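The error-driven TBL loop of Brill (1995), as applied to aligned data of this kind, can be sketched as follows. The data representation, scoring, and rule format (left context, target, right context, replacement) are our simplification, not the authors' implementation:

```python
# A minimal sketch of Brill-style error-driven TBL for pronunciation,
# assuming each training item pairs an input phoneme sequence with its
# gold pronunciation of equal length.  Illustrative only.

def candidate_rules(current, gold):
    """Propose rules from positions where the current guess is wrong."""
    rules = set()
    for cur, gld in zip(current, gold):
        for i, (c, g) in enumerate(zip(cur, gld)):
            if c != g:
                left = cur[i - 1] if i > 0 else "#"
                right = cur[i + 1] if i + 1 < len(cur) else "#"
                rules.add((left, c, right, g))
    return rules

def apply_rule(rule, seq):
    """Apply one (left, target, right, replacement) rule to a sequence."""
    left, target, right, repl = rule
    out = list(seq)
    for i, c in enumerate(seq):
        l = seq[i - 1] if i > 0 else "#"
        r = seq[i + 1] if i + 1 < len(seq) else "#"
        if c == target and l == left and r == right:
            out[i] = repl
    return out

def errors(current, gold):
    return sum(c != g for cur, gld in zip(current, gold)
               for c, g in zip(cur, gld))

def learn_tbl(inputs, gold, max_rules=10):
    """Greedily pick the rule that most reduces errors, then re-apply."""
    current = [list(s) for s in inputs]
    learned = []
    for _ in range(max_rules):
        best, best_score = None, errors(current, gold)
        for rule in candidate_rules(current, gold):
            score = errors([apply_rule(rule, s) for s in current], gold)
            if score < best_score:
                best, best_score = rule, score
        if best is None:
            break
        learned.append(best)
        current = [apply_rule(best, s) for s in current]
    return learned

# Toy data: /t/ -> [d] between vowels, as in the templates of Figure 2.
inputs = [list("ata"), list("atu"), list("tam")]
gold   = [list("ada"), list("adu"), list("tam")]
rules = learn_tbl(inputs, gold)
print(rules)  # learns intervocalic rules such as ('a', 't', 'a', 'd')
```

The learned rule list is ordered: at generation time each rule is applied in turn, exactly as the template transformations of Section 2 are applied to an input.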

(2) a. Na-nun cip-e ka-n-ta.
       I-Top house-Loc go-Asp-End 4
    b. Na-nWn tsi-be ka-n-da 5

Following this process, we gathered the aligned corpus as follows.

(3) Na-nun {N.a-n.W.n} cip-e {ts.i-p.e} kan-ta {k.a.n-d.a}

In (3), the conventions "-" and "." split intra-syllable boundaries and inner syllable structure (onset-rhyme-coda), respectively. The ejeol-initial position is marked with "{" and the ejeol-final position with "}". In total, the standard pronunciation corpus contains 106,478 phonemes, averaging 7 phonemes per ejeol and 1.78 phonemes per morpheme. For phonetic purposes, 19 consonants and 10 vowels are used for Korean pronunciation, as follows:

Figure 4: ARPAbet for Korean pronunciation

Depending on the word position (ejeol-initial or ejeol-final) and the syllable position (onset-rhyme-coda), we gathered 600 different phonemic types for the context window. These types are used to induce the TBL templates. The following is an example of a TBL template with the immediately preceding and immediately following environments.

If the preceding phonological environment is #, then /t/ becomes [t].
If the preceding phonological environment is V, then /t/ becomes [d].
If the following phonological environment is #, then /t/ becomes [t̚].

Figure 5: TBL templates

4 Top: topic marker, Loc: locative marker, Asp: aspect, End: ending.
5 For font and other practical reasons, we adopt the ARPAbet phone-set transcription convention; IPA fonts are troublesome in text processing.
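The aligned-corpus notation of example (3) is simple enough to parse mechanically. The sketch below is our own (the helper name is hypothetical, not from the paper); it recovers, for each ejeol, its orthographic form and its pronunciation as a list of syllables split into onset/rhyme/coda parts:

```python
import re

# Parse the notation of example (3): "{...}" encloses an ejeol's
# pronunciation, "-" separates syllables, and "." separates the
# onset/rhyme/coda parts within a syllable.  Our sketch, not the
# authors' code.
def parse_aligned(line: str):
    """Return a list of (orthographic ejeol, syllable structure) pairs."""
    pairs = []
    for ortho, pron in re.findall(r"(\S+)\s+\{([^}]*)\}", line):
        syllables = [syl.split(".") for syl in pron.split("-")]
        pairs.append((ortho, syllables))
    return pairs

line = "Na-nun {N.a-n.W.n} cip-e {ts.i-p.e} kan-ta {k.a.n-d.a}"
for ortho, syls in parse_aligned(line):
    print(ortho, syls)
```

For instance, the first pair comes out as ("Na-nun", [["N", "a"], ["n", "W", "n"]]): two syllables, the second with onset, rhyme, and coda filled.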

As noted in Figure 3, the phonological environment is similar to a context window. If we enlarge the context window to 4, Figure 5 changes into Figure 6.

If the phonological environment with a 4-item context window is {#, #, _, Vowel(Rhyme)}, then /t/ becomes [t].
If the phonological environment with a 4-item context window is {Vowel(Rhyme), -, _, Vowel(Rhyme)}, then /t/ becomes [d].

Figure 6: Example of a 4-item context window in TBL

We randomly gathered 1,000 ejeols from the Sejong corpus for testing and, using the aligned corpus, converted the test material into its pronunciation. We tested 20-, 10-, 5-, 4-, 3-, and 2-item context windows to see whether the window size makes any difference in accuracy. Brill (1995) suggests that TBL can also reduce the training size needed for tagging, so we additionally varied the training size, increasing it by 1,000 ejeols at a time until we reached the full size of the aligned corpus.

Hayes and Wilson (forthcoming) claim that English phonotactics can be explained with 24 constraints stated over phonological features. Phonotactic constraints can reduce the search space of the TBL templates: because a phonotactic constraint stops the search at ill-formed phonotactics in the templates, wrongly predicted pronunciations are eliminated. In Korean, the 29 phonemes in Figure 4 are subject to constraints on phonotactic placement. A moderate phonological feature set for Korean is given in Table 1.
Table 1: Korean phonological feature system 6

[The table, a feature matrix flattened beyond reliable recovery in this transcription, specifies for the 19 consonants (p, p*, ph, t, t*, th, k, k*, kh, s, s*, ts, ts*, tsh, m, n, G, l, H) the major class features (sonorant, consonantal, syllabic), the manner features (continuant, delayed release, lateral), the place features (coronal, anterior), and the subsidiary features (tense, aspirated); and for the 10 vowels and glides (i, E, W, A, u, o, a, y, w, Y) the major class features plus the tongue-body features (high, low, back) and rounding.]

6 The feature map is from Shin and Cha (2004).

What the feature map in Table 1 specifies is, for example, that a consonant and a [y]-initial diphthong such as [ye] cannot be placed next to each other; this cluster is ruled out by the phonological feature constraint *[+cons][-back,-rnd,-syl][+syl]. The constraint is stated over the feature system of Table 1, and it stops the TBL search mechanism whenever a restricted item is found. We built a restriction list of 20 such constraints. 7

Generally, morphological information is a prerequisite for phonological handling: phonological change depends on morphological information such as irregular verbs, grammatical functions, and word classes. Our assumption on this issue is that larger context windows in TBL include more morphological information, so such information may be replaceable by the size of the context window. We therefore considered two groups of experiments, one with morphological information and one without, and compared the accuracy rates of the two groups.

4. Discussion

We used 20-, 10-, 5-, 4-, 3-, and 2-item context windows in the template to observe the change in precision. This test did not use morphological information; only the aligned corpus was used.

[Bar chart; y-axis: precision rate (70-90); bars for 20-, 10-, 5-, 4-, 3-, and 2-item windows.]

Figure 7: Difference in precision rate without morphological information

The result shows that as the context window becomes larger, the precision rate goes up. Recall that an ejeol contains an average of 7 phonemes and that there are 1.78 phonemes per morpheme. The 10- and 20-item context windows span more than 2 ejeols and show the better precision rates. This suggests that morphological information across ejeol boundaries is captured in the larger windows.
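The pruning by feature-based phonotactic constraints described above can be sketched as follows. The tiny feature table and the dictionary encoding of *[+cons][-back,-rnd,-syl][+syl] are our own illustration, not the paper's 20-constraint list or the full system of Table 1:

```python
# A sketch of feature-based phonotactic filtering: a candidate output
# is rejected as soon as it contains a run of segments matching a
# banned feature pattern such as *[+cons][-back,-rnd,-syl][+syl].
# The toy feature table below is illustrative only.
FEATURES = {
    "t": {"cons": "+", "syl": "-", "back": "-", "rnd": "-"},
    "y": {"cons": "-", "syl": "-", "back": "-", "rnd": "-"},
    "e": {"cons": "-", "syl": "+", "back": "-", "rnd": "-"},
    "a": {"cons": "-", "syl": "+", "back": "+", "rnd": "-"},
}

# The constraint as a sequence of feature bundles that must all match
# in order for a candidate to be rejected.
BANNED = [
    {"cons": "+"},                            # a consonant,
    {"back": "-", "rnd": "-", "syl": "-"},    # followed by the glide [y],
    {"syl": "+"},                             # followed by a vowel
]

def matches(phoneme, bundle):
    return all(FEATURES[phoneme].get(f) == v for f, v in bundle.items())

def well_formed(seq):
    """Reject any sequence containing the banned C + glide + vowel run."""
    for i in range(len(seq) - len(BANNED) + 1):
        if all(matches(seq[i + k], b) for k, b in enumerate(BANNED)):
            return False
    return True

print(well_formed(["t", "y", "e"]))  # banned cluster -> False
print(well_formed(["t", "a"]))       # fine -> True
```

In the TBL loop, a check of this kind discards an ill-formed candidate before it is scored, which shortens the search without changing which well-formed rules can be learned.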
With morphological information, the precision rates of the experiment are as follows.

7 We used hand-written constraints. Handling the full phonological feature system would require a very different computational mechanism: Bird (1995) and Bird and Ellison (1994) present a way to compute such features logically, and the difficulty lies in the sheer complexity of feature systems. Gildea and Jurafsky (1996) use a decision tree to handle feature geometry in phonology, as a way of simplifying the feature systems. We leave this for future study.

[Bar chart; y-axis: precision rate (82-90); bars for 20-, 10-, 5-, 4-, 3-, and 2-item windows.]

Figure 8: Difference in precision rate with morphological information

Morphological information gives the phonological process more to work with; thus there is a rise in the precision rate for the smaller context windows, and morphological information appears to contribute appropriately to phonological processing. The phonotactic constraints, in contrast, yield only a 0.2-0.3% rise in the precision rate, although processing with the phonotactic constraints is faster than processing without them.

Like Brill (1995), who experimented with the size of the learning data, we tested the relationship between the precision rate and the size of the training data. We found that the precision rate is stable with more than 4,000 ejeols of training data.

[Line chart; y-axis: precision rate (76-88); x-axis: data size (1,000 to 15,000).]

Figure 9: Training data size and precision rate

5. Conclusion

In this paper, we have shown that the TBL method can generate the standard Korean pronunciation, using a corpus-based, data-oriented transformation method. We found that larger context windows in TBL carry more morphological information. The importance of this study lies in speech technology: phonological change is a central topic in computational phonology, and pronunciation generation is a prerequisite for speech-related technology. In a text-to-speech system, the pronunciation generation mechanism makes the system more accurate, and in speech recognition, better pronunciation prediction yields better recognition results.

Phonological information is related to morphological encodings (regular vs. irregular, word class, the tag of the previous word, etc.), and such information is essential for phonological processing. In this study, the concept of the context window copes with morphological information, but this idea needs further exploration.

References

Bird, S. 1995. Computational Phonology. Cambridge: Cambridge University Press.
Bird, S. and T. M. Ellison. 1994. One-level phonology. Computational Linguistics, 20(1), 55-90.
Brill, E. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543-565.
Daelemans, W., S. Gillis and G. Durieux. 1994. The acquisition of stress: A data-oriented approach. Computational Linguistics, 20(3), 421-451.
Gildea, D. and D. Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics, 22(4), 497-530.
Hayes, B. and C. Wilson. forthcoming. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry.
Johnson, M. 1984. A discovery procedure for certain phonological rules. Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, pp. 344-347.
Kaplan, R. and M. Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3), 331-378.
Karttunen, L. 1993. Finite-state constraints. In J. Goldsmith, ed., The Last Phonological Rule, pp. 173-194. Chicago: University of Chicago Press.
Karttunen, L. 1998. The proper treatment of optimality in computational phonology. Proceedings of the International Workshop on Finite State Methods in Natural Language Processing, pp. 1-12.
Koskenniemi, K. 1983. Two-level morphology. Ph.D. thesis, Department of General Linguistics, University of Helsinki.
Lee, G. 1998. Design and implementation of vocal sound variation rules for the Korean language.
Journal of Korean Informational Society, 5(3), 851-861.
Lee, E. et al. 2005. IPA converter of Korean standard pronunciation. Proceedings of the Conference of the Korean Cognitive Society, pp. 206-211.
Shin, J. and J. Cha. 2005. Korean Sound System. Seoul: Hanuk-Munwha-Sa.
Sohn, H.-M. 1999. The Korean Language. Cambridge: Cambridge University Press.