CHINESE TIMIT: A TIMIT-LIKE CORPUS OF STANDARD CHINESE


Jiahong Yuan (1), Hongwei Ding (2), Sishi Liao (2), Yuqing Zhan (2), and Mark Liberman (1)

(1) Linguistic Data Consortium, University of Pennsylvania
(2) Institute of Cross-Linguistic Processing and Cognition, Shanghai Jiao Tong University

ABSTRACT

This paper describes an effort to build a TIMIT-like corpus in Standard Chinese, which is part of our Global TIMIT project. Three steps are involved and detailed in the paper: selection of sentences; speaker recruitment and recording; and phonetic segmentation. The corpus consists of 6000 sentences read by 50 speakers (25 females and 25 males). Phonetic segmentation obtained from forced alignment is provided, which has 93.2% agreement (of phone boundaries) within 20 ms compared to manual segmentation on 50 randomly selected sentences. Statistics on the number of tokens and mean duration of phones and tones in the corpus are also reported. Males have shorter phones/tones but more and longer utterance-internal silences than females, demonstrating that males in this dataset speak faster but pause more frequently and longer.

Index Terms: TIMIT, forced alignment, maximum coverage, Standard Chinese

1. INTRODUCTION

Since it was created three decades ago, the TIMIT speech corpus has been widely used in speech science and speech technology development [1-3]. The great success of TIMIT prompted the ongoing effort at the Linguistic Data Consortium to create Global TIMIT, a series of TIMIT-like corpora in a number of languages [4].

The original TIMIT dataset contains a total of 6300 sentence tokens, 10 sentences spoken by each of 630 speakers from eight major dialect regions of the United States. The sentence prompts include 2 dialect Shibboleth sentences (SA), 450 phonetically-compact sentences (SX), and 1890 phonetically-diverse sentences (SI).
The dialect Shibboleth and phonetically-compact sentences were elaborately designed, whereas the phonetically-diverse sentences were selected from existing text sources.

The design of Global TIMIT adopts a scheme different from that of the original TIMIT. Instead of having 630 speakers and 10 sentences per speaker, the new design has 50 speakers and 120 sentences per speaker. This makes the corpus size comparable to the original TIMIT but requires much less time and effort for recruiting and recording. Among the 120 sentences read by a speaker, 20 are Calibration sentences, read by all speakers; 40 are Shared sentences, each read by 10 speakers; and 60 are Unique sentences, each read by only one speaker. The total number of sentence types is, therefore, 20 + 40*(50/10) + 60*50 = 3220. The design is summarized in Table 1.

Table 1: The design of Global TIMIT.

Sentence Type   #Sentences   #Speakers/Sentence   Total   #Sentences/Speaker
Calibration     20           50                   1000    20
Shared          200          10                   2000    40
Unique          3000         1                    3000    60
Total           3220                              6000    120

The creation of a TIMIT-like corpus consists of three steps: design or selection of sentences; speaker recruitment and recording; and phonetic transcription and segmentation. This paper describes our effort to build Chinese TIMIT in these steps.

2. SENTENCE SELECTION

2.1. Candidate sentences

All sentences were selected from the corpus of Chinese Gigaword Fifth Edition [5], which is a comprehensive archive of newswire text data from Chinese news sources. Candidate sentences were selected from the corpus by the following steps:

1. Extract sentences that are characters long, excluding those containing characters that are not on the list of the 3500 most frequently used Chinese characters (现代汉语常用字表);
2. Manually go through the list of extracted sentences in a random order, to remove those with uncommon words (e.g., person or place names) or inappropriate meaning (e.g., politically sensitive viewpoints), and also to segment the sentences into words.

This was done until a pool of 5000 candidate sentences was generated, which contains approximately 6600 unique words and 2200 unique characters. Calibration, Shared, and Unique sentences were selected from the candidate pool using computer algorithms. A pronouncing dictionary was made for sentence selection and phonetic segmentation. The dictionary and the sentence selection procedure are described in the following sections.
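The automatic part of candidate extraction (step 1 above) is a mechanical filter, and a minimal Python sketch may make it concrete. This is an illustration under stated assumptions, not the authors' script: the function name `filter_candidates`, the length bounds `min_len` and `max_len` (left as parameters), and the toy character set are all hypothetical, and sentences are assumed to be plain strings with punctuation already removed.

```python
def filter_candidates(sentences, common_chars, min_len, max_len):
    """Keep sentences whose length lies in [min_len, max_len] and whose
    characters all belong to common_chars (e.g. a frequent-character list)."""
    allowed = set(common_chars)
    kept = []
    for sentence in sentences:
        if not (min_len <= len(sentence) <= max_len):
            continue  # too short or too long
        if all(ch in allowed for ch in sentence):
            kept.append(sentence)
    return kept


# Toy data (hypothetical): a tiny stand-in for the 3500-character list.
toy_common = "我们今天去学校看书"
toy_sentences = ["我们去学校", "我们去鑫学校", "去"]
toy_kept = filter_candidates(toy_sentences, toy_common, min_len=2, max_len=10)
print(toy_kept)
```

In the toy run, the second sentence is dropped because it contains a character outside the allowed set, and the third because it is shorter than `min_len`. The manual screening in step 2 has no counterpart in this sketch.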

2.2. Pronouncing dictionary

The pronouncing dictionary only transcribes the canonical pronunciation of a word as it appears in the dataset. Only a few words have more than one pronunciation, for which all pronunciations were listed. Hanyu Pinyin was used to transcribe the pronunciation, including initials, finals, and tone. A final in Mandarin Chinese may consist of one or more vowels (or vowels and glides, depending on the adopted phonological analysis), with or without a nasal coda. Because /o/ and /uo/ occur in complementary distribution and the acoustic difference between the two finals is negligible [6], they were treated as the same final. /i/ has three pronunciation variants, often transcribed as [ɿ] (when appearing after an alveolar fricative/affricate), [ʅ] (when appearing after a retroflex fricative/affricate), and [i] (in all other contexts). The three variants were treated as different finals: /i/ for [i], /ii/ for [ɿ], and /iii/ for [ʅ]. In total, there were 21 initials and 36 finals. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. The phonetic labels are listed in Table 2.

Table 2: Phonetic labels (in Pinyin).

Initials   b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s
Finals     a, ai, an, ang, ao
           e, ei, en, eng, er
           i, ii, iii, ia, ian, iang, iao, ie, in, ing, iong, iu
           ong, ou
           u, ua, uai, uan, uang, ui, un, uo
           v, van, ve, vn *
Tones      1, 2, 3, 4, 0
Silence    sil

* v represents ü in Pinyin, ii is for [ɿ], and iii is for [ʅ].

2.3. Selecting sentences

Twenty Calibration sentences were selected from the candidate pool to cover the maximum number of (tone-independent) syllable types in the language.
This problem is known to be NP-hard, but it can be approximately solved using greedy approximation [7]:

Greedy Approximation:
1: covered set is empty
2: Repeat
3:    Pick the sentence with the maximum number of syllable types not in the covered set
4:    Add syllable types in the chosen sentence into the covered set
5: Until 20 sentences are selected

As illustrated in Figure 1, we randomized the candidate sentences before the selection, and repeated the procedure 1000 times to obtain 1000 sets of 20 sentences. The set that contains the largest number of tone-independent syllable types was used as the Calibration sentences.

Figure 1: Procedure for selecting Calibration sentences.

Shared sentences were selected to cover the maximum number of tones and (within-word) tonal combinations. We need five sets of Shared sentences: each set has 40 sentences and will be read by 10 speakers. The first 20 sentences were selected to have at least five occurrences for each of the mono- and bi-tones. The second 20 sentences were selected to cover the maximum number of three- and four-tone combinations. The procedure was similar to that used for selecting Calibration sentences.

Unique sentences were randomly selected from the remaining sentences in the candidate pool. 50 sets of 60 sentences were selected, each to be read by one speaker only.

3. SPEAKER RECRUITMENT AND RECORDING

Fifty college students at Shanghai Jiao Tong University, 25 females and 25 males, were recruited to read the sentences. All of them speak Standard Chinese. The criterion for deciding whether a subject speaks Standard Chinese was his/her spoken Mandarin proficiency as assessed by the Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test). There are seven levels of proficiency assessed by the test, which are, from highest to lowest: Class 1 Level 1, Class 1 Level 2, Class 2 Level 1, Class 2 Level 2, Class 3 Level 1, Class 3 Level 2, and Failed.
In order to qualify for teaching K-12, one must pass Class 2 Level 2. The speakers recruited for the experiment all achieved Class 2 Level 1 or better on Putonghua Shuiping Ceshi. The recording was made in a sound-treated recording booth at Shanghai Jiao Tong University, using the SpeechRecorder Software [8]. The sentences were displayed on a computer screen for subjects to read, one at a time, controlled by the person who monitored the recording. A total of 6000 utterances were recorded, 120 utterances for each speaker.
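Returning to the Calibration-sentence selection of the sentence-selection section: the greedy approximation with randomized restarts can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `greedy_cover` and `select_calibration`, the representation of each sentence as a set of tone-independent syllable types, and the toy syllable sets are all hypothetical. Shuffling the candidate order only changes how ties between equally good sentences are broken, which is one plausible reading of "randomized the candidate sentences before the selection".

```python
import random


def greedy_cover(sentences, k, rng):
    """Greedily pick up to k sentences maximizing newly covered syllable types.

    sentences: dict mapping a sentence id to its set of syllable types.
    Returns (chosen ids in pick order, set of covered syllable types).
    """
    remaining = list(sentences)
    rng.shuffle(remaining)  # random order => random tie-breaking in max()
    covered, chosen = set(), []
    for _ in range(min(k, len(remaining))):
        # max() returns the first best item, so the shuffled order matters
        best = max(remaining, key=lambda sid: len(sentences[sid] - covered))
        chosen.append(best)
        covered |= sentences[best]
        remaining.remove(best)
    return chosen, covered


def select_calibration(sentences, k=20, restarts=1000, seed=0):
    """Repeat the greedy pass and keep the run covering the most types."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        chosen, covered = greedy_cover(sentences, k, rng)
        if best is None or len(covered) > len(best[1]):
            best = (chosen, covered)
    return best


# Toy pool (hypothetical syllable sets), picking 2 sentences instead of 20.
toy_pool = {
    "s1": {"ba", "ma"},
    "s2": {"ma", "shi"},
    "s3": {"ba", "ma", "shi", "de"},
    "s4": {"ni"},
}
toy_chosen, toy_covered = select_calibration(toy_pool, k=2, restarts=10)
print(toy_chosen, sorted(toy_covered))
```

In the toy pool, `s3` covers four types on the first pick and `s4` is the only sentence that still adds a new type on the second, so the result does not depend on the shuffle. The same skeleton could be adapted to the Shared sentences by replacing syllable types with tone combinations, although the occurrence-count requirement described for the first 20 Shared sentences would need additional logic.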

4. PHONETIC SEGMENTATION

4.1. Forced Alignment

HMM/GMM-based forced alignment was applied to obtain phonetic segmentation. In prior work [9,10], we demonstrated that employing explicit phone boundary models within the HMM framework could significantly improve forced alignment accuracy for both English and Mandarin Chinese. The phone boundary models were a special 1-state HMM (as shown in Figure 2), in which the state cannot repeat itself:

Figure 2: Special 1-state HMM for phone boundaries with transition probabilities a_01 = a_12 = 1.

Therefore, a boundary can have one and only one state occurrence, i.e., it is aligned with only one frame. The special 1-state phone boundary HMMs were combined with standard monophone HMMs. Given a phonetic transcription, phone boundaries were inserted between phones. For example, "sil i g e sil" becomes "sil sil_i i i_g g g_e e e_sil sil". The boundary states were tied through decision-tree based clustering, similar to triphone state tying developed in speech recognition.

We started with the acoustic models trained on Hub4 Mandarin Broadcast News Speech [11], and retrained the models by combining the Broadcast News Speech data and our recordings (training on the combined data sets gave better results than training on Chinese TIMIT data only). Tone-independent models were employed. The acoustic features were the standard 39 PLP features extracted with a 25 ms Hamming window and a 10 ms frame rate. Initials, monophthong finals (/a, e, i, ii, iii, u, v/), and silence were 3-state HMMs; all other finals (including diphthongs, triphthongs, and nasal-coda finals) were 5-state HMMs. Each state had 2 Gaussian mixture components with diagonal covariance matrices. The system was built using the HTK Toolkit [12].

4.2. Evaluation of segmentation accuracy

To evaluate segmentation accuracy, 50 randomly selected sentences were manually corrected by three of the authors.
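Two small pieces of this section can be illustrated in Python: the insertion of boundary labels between phones, and the scoring of boundary agreement within a tolerance. The sketch below is an illustration only and is not taken from the HTK-based system. The function names `insert_boundaries` and `agreement_within`, the assumption that automatic and manual boundaries are already paired one-to-one and given in seconds, and the toy boundary times are all hypothetical.

```python
def insert_boundaries(phones):
    """Insert a boundary label 'prev_next' between each pair of adjacent phones."""
    out = []
    for prev, nxt in zip(phones, phones[1:]):
        out.append(prev)
        out.append(f"{prev}_{nxt}")
    if phones:
        out.append(phones[-1])  # last phone has no following boundary
    return out


def agreement_within(auto_times, manual_times, tol=0.020):
    """Fraction of paired boundaries whose time difference is at most tol seconds."""
    if not auto_times or len(auto_times) != len(manual_times):
        raise ValueError("need two non-empty, equally long boundary lists")
    hits = sum(1 for a, m in zip(auto_times, manual_times) if abs(a - m) <= tol)
    return hits / len(auto_times)


# Example from the text: "sil i g e sil"
labels = insert_boundaries("sil i g e sil".split())
print(" ".join(labels))

# Toy boundary times in seconds (hypothetical values)
toy_auto = [0.10, 0.25, 0.40]
toy_manual = [0.11, 0.30, 0.41]
toy_rate = agreement_within(toy_auto, toy_manual)
print(round(toy_rate, 3))
```

The first function reproduces the label sequence given in the example above. In the toy comparison, two of the three boundaries differ by 10 ms and one by 50 ms, so two of three fall within the 20 ms tolerance.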
Excluding the boundaries between silence and a stop or an affricate, where the boundary cannot be determined because of the stop closure, there are 1431 boundaries in the 50 sentences. 93.2% of the boundaries (1333 boundaries) agree within 20 ms between forced alignment and manual segmentation, which is on par with state-of-the-art results for automatic phonetic segmentation.

5. STATISTICS OF THE CORPUS

5.1. Statistics of phones

Based on the phonetic segmentation of the corpus, we calculated the total number of occurrences of every phone and its mean duration. The results are listed in Table 3, in which males and females are calculated separately.

Table 3: Number of tokens and mean duration of phones in the corpus.

Phone   #tokens (all)   Male: #   Male: duration (sec.)   Female: #   Female: duration (sec.)
/ei/    1368            686       0.1066                  682         0.119
/er/    291             141       0.1788                  150         0.1896
/ia/    1036            532       0.1394                  504         0.1531
/iao/   1879            942       0.1398                  937         0.1507
/ie/    1915            977       0.1271                  938         0.1388
/iu/    2281            1148      0.1428                  1133        0.

The table also has rows for /b/, /p/, /m/, /f/, /d/, /t/, /n/, /l/, /g/, /k/, /h/, /j/, /q/, /x/, /zh/, /ch/, /sh/, /r/, /z/, /c/, /s/, /a/, /e/, /i/, /ii/, /iii/, /u/, /v/, /ai/, /ao/, /ou/, /ua/, /uai/, /ui/, /uo/, /ve/, /an/, /ang/, /en/, /eng/, /ian/, /iang/, /in/, /ing/, /iong/, /ong/, /uan/, /uang/, /un/, /van/, /vn/, Pause (all), and Pause (Calibration).

Interestingly, we can see from the table that males have a shorter duration across phones than females. A paired-samples t-test shows that the difference is statistically significant (p < 0.001). This result suggests that males speak faster than females. On the other hand, males made more pauses (976 vs. 754) and longer pauses than females in the corpus (utterance-internal silences longer than 50 ms were counted as pauses). Because textual factors such as sentence length and syntactic complexity affect pause production, we also calculated pauses in the Calibration sentences only, to remove the effects of those factors on the difference between males and females (they read the same sentences). The result is listed at the end of Table 3. For the Calibration sentences only, males still made more pauses (179 vs. 147) and longer pauses than females.

5.2. Statistics of tones

The number of tokens and mean duration of tones (entire syllables) are listed in Table 4 and shown in Figure 3. We can see that Tone0 is the shortest, and that Tone1 and Tone2 are longer than Tone3 and Tone4. Again, males have a shorter duration on every tone than females.

Table 4: Number of tokens and mean duration of tones in the corpus.
Columns: Tone; #tokens (all); Male: #, duration (sec.); Female: #, duration (sec.).

Figure 3: Mean duration of tones in the corpus.

6. CONCLUSION

In this paper, we detailed the development of a TIMIT-like corpus in Standard Chinese. A simple analysis of the corpus shows that males speak faster but pause more frequently and longer than females. This result is consistent with our previous investigation of this topic based on telephone conversations and monologue speech [13, 14].
Along with Chinese TIMIT, we have also created an L2 English TIMIT, for which the same 50 speakers read easy sentences selected from the original TIMIT. We plan to extend the effort to L2 Chinese and L1 English, to make a basis for four-way comparison between L1 and L2 and between Chinese and English.

7. REFERENCES

[1] V. Zue, Speech Database Development, Final Technical Report submitted to the Defense Advanced Research Projects Agency (for Contract # C-0341, June June 1987).
[2] V. Zue, S. Seneff, and J. Glass, Speech database development at MIT: TIMIT and beyond, Speech Communication 9(4).
[3] J. Garofolo et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), Linguistic Data Consortium.
[4] N. Chanchaochai, J. Yuan, J. Wright, C. Cieri, and M. Liberman, Global TIMIT: Towards Creating TIMIT-analogous Speech Corpora, manuscript.
[5] R. Parker et al., Chinese Gigaword Fifth Edition (LDC2011T13), Linguistic Data Consortium.
[6] J. Yuan, The spectral dynamics of vowels in Mandarin Chinese, Proceedings of Interspeech 2013.
[7] U. Feige, A Threshold of ln n for Approximating Set Cover, J. of the ACM 45(5).
[8] C. Draxler and K. Jänsch, SpeechRecorder - a Universal Platform Independent Multi-Channel Audio Recording Software, Proceedings of LREC.
[9] J. Yuan, N. Ryant, M. Liberman, A. Stolcke, V. Mitra, and W. Wang, Automatic phonetic segmentation using boundary models, Proceedings of Interspeech 2013.
[10] J. Yuan, N. Ryant, and M. Liberman, Automatic phonetic segmentation in Mandarin Chinese: Boundary models, glottal features and tone, Proceedings of ICASSP 2014.
[11] S. Huang et al., 1997 Mandarin Broadcast News Speech (HUB4-NE) (LDC98S73), Linguistic Data Consortium.
[12] S. Young et al., The HTK Book, Web Download.
[13] J. Yuan, M. Liberman, and C. Cieri, Towards an integrated understanding of speaking rate in conversation, Proceedings of Interspeech 2006.
[14] J. Yuan, X. Xu, W. Lai, and M. Liberman, Pauses and Pause Fillers in Mandarin Monologue Speech: The Effects of Sex and Proficiency, Proceedings of Speech Prosody 2016, 2016.
