An Automatic Syllable Segmentation Method for Mandarin Speech

Runshen Cai

Computer Science & Information Engineering College, Tianjin University of Science and Technology, Tianjin, China
crs@tust.edu.cn

Abstract. An automatic syllable segmentation method for Mandarin speech is proposed. The method uses five features together with the corresponding phonetic transcriptions. First, the speech signal is pre-filtered. Second, the pre-filtered signal is divided into 30 ms segments and the five features of each segment are computed. Finally, syllable segmentation is performed based on the phonetic transcriptions and the computed feature values. The performance of the method has been evaluated on a large speech database, and the method is shown to perform well on both clean and noise-degraded speech.

Keywords: Signal processing, Speech analysis, Mandarin speech, Syllable segmentation

1 Introduction

Syllables have long been regarded as robust units of speech processing, and there is a great need for syllable segmentation in speech research and development. Syllable segmentation is an important means of extracting the structure and content of speech, and it forms a basis for further speech analysis. When building a Mandarin speech database, speech signals must be segmented and labeled. Manual segmentation and labeling, however, is extremely time-consuming and tedious, requiring extensive listening and spectrogram interpretation. Therefore, automatic procedures for segmenting speech into syllables are investigated in this paper.

There are many methods for speech segmentation based on different features, such as the wavelet transform, autocorrelation, short-time energy and short-time zero-crossing rate, MEL frequency [1], and so on. Most procedures follow one of two basic approaches.
One approach does not require any explicit prior information and uses only the acoustic signal; the other exploits explicit information that is known in advance, such as the correct phonetic transcription of the utterance. In this paper, a new automatic segmentation method is presented for Mandarin speech that is labeled with the corresponding phonetic transcriptions, so the transcriptions can be exploited during syllable segmentation. Based on an analysis and comparison of the common methods, the syllable segmentation method proposed in this paper relies on five features, whose thresholds are determined from the corresponding phonetic transcriptions. The five features are:

1. Short-time average energy
2. Short-time zero-crossing rate
3. Product of the previous two features
4. Ratio of the first feature to the second feature
5. Ratio of low-frequency average energy to total average energy

The performance of the proposed method has been evaluated on a Mandarin speech database with good results.

2 The Proposed Syllable Segmentation Method

Fig. 1 gives a flow diagram of the syllable segmentation method: the input speech is pre-filtered and divided into segments; for each segment the short-time energy, the zero-crossing rate, their product and ratio, and (via a 3-level DWT and per-band energies) the ratio of the energy below 500 Hz to the total energy are computed; segmentation is then performed from these feature values and the phonetic transcriptions.

Fig. 1. Flow Diagram of the Method.
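As a rough illustration of the flow in Fig. 1, the framing step and the five per-segment features can be sketched in plain numpy as follows. This is a simplified sketch, not the paper's implementation: the function names are illustrative, the Haar wavelet stands in for the unspecified 3-level DWT of [4], and pre-filtering is omitted.

```python
import numpy as np

FS, FRAME, HOP = 8000, 240, 80   # 8 kHz; 30 ms frames with 20 ms overlap

def split_frames(x):
    """Slice the signal into overlapping 30 ms analysis segments."""
    return [x[i:i + FRAME] for i in range(0, len(x) - FRAME + 1, HOP)]

def segment_features(seg):
    """Compute the five per-segment features described in Section 2.2."""
    E = np.mean(seg ** 2)                          # short-time average energy
    s = np.where(seg >= 0, 1, -1)
    ZCR = np.sum(np.abs(np.diff(s))) // 2          # number of sign changes
    A = E * ZCR                                    # product feature
    B = E / ZCR if ZCR else np.inf                 # ratio feature
    # 3-level dyadic DWT; with Haar filters the final approximation
    # covers roughly 0-500 Hz at fs = 8 kHz.
    approx, E_high = seg.astype(float), 0.0
    for _ in range(3):
        detail = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)
        E_high += np.sum(detail ** 2)
    E_low = np.sum(approx ** 2)
    R = E_low / (E_high + E_low)                   # low-band energy ratio
    return E, ZCR, A, B, R
```

On a low-frequency tone (e.g. 200 Hz, within the voiced fundamental range) R is close to 1, while on a high-frequency tone it is small, which is the behaviour the voiced/unvoiced decision relies on.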
First the input speech signal, sampled at 8 kHz, is denoised by pre-filtering. The signal is then divided into 30 ms segments with 20 ms overlap. After this segmentation stage, the following analysis and feature-extraction steps are applied to each segment. Finally, syllable segmentation is performed from the computed features and the corresponding phonetic transcriptions.

2.1 Pre-filtering

For noisy speech, the noise often has a strong effect and cannot be neglected. To reduce the influence of high-frequency noise on the speech signal and thus improve the segmentation result, noisy speech is pre-filtered with a low-pass filter. According to the range of the pitch frequency of speech, a 5th-order low-pass elliptic filter [2] with a cut-off frequency of 800 Hz is used. Its transfer function is given by Eq. 1:

H(z) = \frac{0.008233 - 0.004879 z^{-1} + 0.007632 z^{-2} + 0.007632 z^{-3} - 0.004879 z^{-4} + 0.008233 z^{-5}}{1 - 3.6868 z^{-1} + 5.8926 z^{-2} - 5.0085 z^{-3} + 2.2518 z^{-4} - 0.4271 z^{-5}}   (1)

2.2 Feature computation

Short-time average energy: the average energy of the i-th speech segment is defined as Eq. 2:

E_i = \frac{1}{N} \sum_{n=0}^{N-1} x_i^2(n)   (2)

It provides a convenient representation of the variation of the amplitude of the speech signal [3]. The average energy of non-speech segments is generally much lower than that of speech segments and, among speech segments, that of unvoiced segments is generally much lower than that of voiced segments. Furthermore, the average energy is lower at syllable boundaries than inside syllables.

Short-time zero-crossing rate (ZCR): for discrete-time signals, a zero-crossing occurs when successive samples have different algebraic signs. The zero-crossing rate is a measure of the frequency content of the signal: unvoiced speech exhibits a higher zero-crossing rate than voiced speech or silence.
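The pre-filter coefficients printed in Eq. 1 can be sanity-checked numerically. The sketch below (in practice scipy.signal.freqz would do this; the helper name is illustrative) evaluates |H(e^{jw})| directly from the coefficients as reconstructed from Eq. 1:

```python
import numpy as np

# Numerator and denominator of the 5th-order elliptic low-pass
# pre-filter of Eq. 1 (800 Hz cut-off at fs = 8 kHz).
b = np.array([0.008233, -0.004879, 0.007632, 0.007632, -0.004879, 0.008233])
a = np.array([1.0, -3.6868, 5.8926, -5.0085, 2.2518, -0.4271])

def magnitude(b, a, f, fs=8000.0):
    """|H(e^{jw})| at frequency f (Hz) for H(z) = B(z)/A(z)."""
    w = 2 * np.pi * f / fs
    z = np.exp(-1j * w * np.arange(len(b)))
    return abs(np.dot(b, z)) / abs(np.dot(a, z))
```

With these coefficients the gain is approximately 1 at DC and the response vanishes at the 4 kHz Nyquist frequency, consistent with a low-pass pre-filter of unity passband gain.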
The sampling frequency of the speech signal also determines the time resolution of the zero-crossing measurements. The zero-crossing rate of the i-th segment is computed as Eq. 3:

ZCR_i = \sum_{n=1}^{N-1} \left| \mathrm{sgn}[x_i(n)] - \mathrm{sgn}[x_i(n-1)] \right|   (3)
where N = 240, corresponding to 30 ms, is the length of the speech segment x_i(n).

Product of ZCR and average energy: to take the ZCR and the average energy into account simultaneously, their product is calculated as Eq. 4:

A_i = E_i \cdot ZCR_i   (4)

Like the energy, A_i is lower at syllable boundaries than inside syllables.

Ratio of energy to ZCR: another parameter combining the two features, the ratio of E_i to ZCR_i, is calculated as Eq. 5:

B_i = E_i / ZCR_i   (5)

B_i is generally much lower for unvoiced segments than for voiced segments.

Ratio of low-frequency average energy to total average energy: each speech segment is decomposed into four bands using a 3-level dyadic DWT [4], and the average energy of each band is computed. In general, an unvoiced segment shows energy concentrated in the high-frequency bands, while a voiced segment shows energy concentrated in the wavelet bands containing its fundamental frequency. Because the fundamental frequency of voiced segments ranges from 50 to 500 Hz, the ratio of the energy below 500 Hz to the total energy is computed and used as the last parameter in voiced/unvoiced decisions. Let E_H be the high-frequency (above 500 Hz) energy of a segment, E_L its low-frequency (below 500 Hz) energy, and E_j the energy in wavelet band j. E_H and E_L are computed as Eqs. 6 and 7:

E_H = \sum_{j=1}^{3} E_j   (6)

E_L = E_4   (7)

The ratio of the energy below 500 Hz to the total energy is then computed as Eq. 8:

R_i = E_L / (E_H + E_L)   (8)

so R_i is this ratio for the i-th segment of the speech.

2.3 Syllable segmentation based on the features and the phonetic transcriptions

The pre-filtered speech segments are processed in sequence from the very beginning. For each segment, the computed features are compared
with the thresholds determined by the corresponding phonetic transcriptions, and the segment's type is obtained. Each segment is assigned one of eight types: non-speech, transition sections between syllables, the first type of consonants, the second type of consonants, vowels, vowel endings, transition sections between consonant and vowel, and transition sections between two vowels [5]. The first two types indicate that the segment lies at a syllable boundary; the remaining six indicate that it lies inside a syllable. The first type of consonants covers all consonants except the sonorants, and the second type covers the sonorants. Vowel endings are the endings of vowels such as n, ng, i, u and so on; pure vowels have no vowel ending. Fig. 2 shows the possible neighbor relationships among the types.

Fig. 2. Possible Neighbor Relationships of the Eight Types.

Each arrow in Fig. 2 represents a possible neighbor relationship, pointing from the previous type to the next type. Each type has its own character, so several features are defined to be tested for each type when a segment's type is being determined. Each syllable consists of several parts, each of which belongs to one of the last six types. We therefore define, for each syllable, an ordered list of types, together with the corresponding thresholds of the testing features for each type in the list. These thresholds may differ between syllables for the same type. The type lists and thresholds were obtained by analyzing a large Mandarin speech database and were further adjusted for better performance. For the first two types, the thresholds of the testing features are defined independently of the syllable. Fig. 3 shows the flow of the syllable segmentation based on the features and the phonetic transcriptions.
Fig. 3. Flow of the Syllable Segmentation based on the Features and the Phonetic Transcriptions.

As Fig. 3 shows, there are four judgments in the method's flow.

Is the current segment the start type of the syllable? To decide this, the features of the segment are compared with the thresholds of the first type in the syllable's type list, which is obtained from the phonetic transcriptions. If the features satisfy the thresholds, the segment is the start type of the syllable; otherwise it is classified as type 1 or 2 by comparison with the thresholds of types 1 and 2.

Does a next type of the syllable exist? This judgment is straightforward: examining the type list of the syllable tells us whether a next type exists.

Is the current segment the next type of the syllable? This judgment is made in the same way as the first one, using the thresholds of the next type in place of the first type.
Is the current segment of type 1 or 2? This judgment determines whether the segment has reached the end of the syllable, again by comparison with the thresholds of types 1 and 2.

As these steps are repeated in turn according to the flow in Fig. 3, the type of each segment is determined, and thus the boundaries of syllables and the boundaries between consonants and vowels are established.

3 Experimental Results

In this section, we evaluate the results of the proposed syllable segmentation method. Fig. 4 shows the performance of the proposed method on a speech signal: the boundaries of syllables, the boundaries between consonants and vowels, and even the boundaries between vowels and vowel endings are determined correctly.

Fig. 4. Syllable Segmentation of Speech Signal gong1cheng2bing1.

The tests were performed on a large database comprising a wide variety of speech recordings from different speakers and utterances. Two female and two male speakers were recorded reading Chinese words and sentences. The signals ranged from 2 to 10 s in length and were organized into 100 speech files, corresponding to a total of 30000 speech segments. The database includes reference files containing phonetic transcriptions and syllable segmentations. The proposed method was tested under varying noise conditions: white Gaussian noise of different intensities was added to the clean speech. The accuracy of the syllable segmentation was evaluated with an objective error measure, Err%, the percentage of erroneous boundaries among all boundaries in the speech signal. Table 1 shows the results for male and female speakers at different SNRs (signal-to-noise ratios). It is clear that our method performs well with both clean and noise-degraded speech.
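Returning to the procedure of Section 2.3, the matching loop of Fig. 3 can be sketched as a small state machine over a per-syllable type list. This is a hypothetical simplification: the type names, thresholds, and two-feature (energy, ZCR) frames below are illustrative only, not the paper's actual lists or values, and only one type advance per frame is attempted.

```python
def label_frames(frames, type_list, thresholds):
    """frames: (energy, zcr) pairs; type_list: expected in-syllable types,
    in order; thresholds[t]: predicate saying whether a frame can be of
    type t. Frames outside the syllable are labelled 'boundary'."""
    labels, k = [], 0
    for f in frames:
        if thresholds["boundary"](f):
            labels.append("boundary")
            continue
        # advance to the next expected type when the current one stops matching
        if not thresholds[type_list[k]](f) and k + 1 < len(type_list):
            k += 1
        labels.append(type_list[k])
    return labels

# Illustrative thresholds: boundaries are low-energy, consonants have high
# ZCR, vowels have high energy and low ZCR.
thresholds = {
    "boundary": lambda f: f[0] < 0.1,
    "consonant": lambda f: f[1] > 20,
    "transition_cv": lambda f: 10 <= f[1] <= 20,
    "vowel": lambda f: f[0] > 0.5 and f[1] < 10,
}
frames = [(0.02, 3), (0.3, 40), (0.4, 15), (0.8, 5), (0.9, 4), (0.03, 2)]
labels = label_frames(frames, ["consonant", "transition_cv", "vowel"], thresholds)
```

Walking this toy syllable yields boundary, consonant, consonant-vowel transition, two vowel frames, and a closing boundary, mirroring the consonant-to-vowel progression of Fig. 2.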
Table 1. Performance of the proposed method.

SNR (dB)          Err%
            Male    Female   Total
Clean       0.65    0.89     0.77
20          0.93    1.33     1.13
10          1.35    1.97     1.66
5           1.98    2.72     2.35

The thresholds and parameters used in the method were adjusted mainly on speech from one male speaker, so the experimental results for male speakers are better than those for female speakers.

4 Conclusions

The problem of syllable segmentation was addressed in this paper. An automatic syllable segmentation method for Mandarin speech has been described. The method is based on the phonetic transcriptions and five features extracted from the input speech: short-time average energy, short-time zero-crossing rate, the product of the previous two features, the ratio of the first feature to the second, and the ratio of low-frequency average energy to total average energy. The method clearly determines the boundaries of syllables and the boundaries between consonants and vowels. Its performance was evaluated under different noise conditions on a large database of speech signals from male and female speakers. The reported results show that the proposed method is robust to additive noise.

Acknowledgments. This work has been supported by the Tianjin Science and Technology Development Foundation of High School (20090805) and the Research Foundation of Tianjin University of Science and Technology (20100204).

References

1. F. Pan, N. Ding: Speech Denoising and Syllable Segmentation Based on Fractal Dimension. In: Proc. 2010 International Conference on Measuring Technology and Mechatronics Automation (ICMTMA 2010), pp. 433--436. IEEE Computer Society (2010)
2. Ru-wei Li, Chang-chun Bao, Hui-jing Dou: Pitch Detection Method for Noisy Speech Signals Based on Pre-Filter and Weighted Wavelet Coefficients. In: International Conference on Signal Processing Proceedings (ICSP 2008), pp. 530--533. IEEE (2008)
3. D. Arifianto: Dual Parameters for Voiced-Unvoiced Speech Signal Determination. In: ICASSP 2007, pp. IV-749--752. IEEE (2007)
4. D. Charalampidis, V. B. Kura: Novel Wavelet-Based Pitch Estimation and Segmentation of Non-Stationary Speech. In: 2005 8th International Conference on Information Fusion, vol. 7, pp. 1--5 (2005)
5. Ming-Tzaw Lin, Ching-Kuen Lee, Chin-Yi Lin: Consonant/Vowel Segmentation for Mandarin Syllable Recognition. Computer Speech and Language 13, 207--222 (1999)