An Automatic Syllable Segmentation Method for Mandarin Speech

Runshen Cai
Computer Science & Information Engineering College, Tianjin University of Science and Technology, Tianjin, China
crs@tust.edu.cn

Abstract. An automatic syllable segmentation method for Mandarin speech is proposed. The method uses five features together with the corresponding phonetic transcriptions. First, the speech signal is pre-filtered. Second, the pre-filtered signal is segmented into 30 ms segments and the five features are computed for each segment. Finally, syllable segmentation is performed based on the phonetic transcriptions and the computed feature values. The performance of the method has been evaluated using a large speech database, and the method is shown to perform well on both clean and noise-degraded speech.

Keywords: Signal processing, Speech analysis, Mandarin speech, Syllable segmentation

1 Introduction

Syllables have long been regarded as robust units of speech processing, and there is a great need for syllable segmentation in speech research and development. Syllable segmentation is an important means of extracting the structure and content of speech, and it forms a basis for further speech analysis. When a Mandarin speech database is built, the speech signals must be segmented and labeled. Manual segmentation and labeling, however, is extremely time-consuming and tiresome: the process requires extensive listening and spectrogram interpretation. Automatic procedures for segmenting speech into syllables are therefore investigated in this paper.

There are many methods for speech segmentation based on different features, such as the wavelet transform, autocorrelation, short-time energy and short-time zero-crossing rate, Mel frequency [1], and so on. Most of these procedures follow one of two basic approaches: the first requires no explicit prior information and uses only acoustic information, while the second exploits explicit information known in advance, such as the correct phonetic transcription of the utterance.

In this paper, a new automatic segmentation method is presented for Mandarin speech that is labeled with the corresponding phonetic transcriptions; the transcriptions are therefore utilized in syllable segmentation.

Based on an analysis and comparison of the common methods, the syllable segmentation method proposed in this paper is built on five features, whose thresholds are determined from the corresponding phonetic transcriptions. The five features used in this paper are:

1. Short-time average energy
2. Short-time zero-crossing rate
3. Product of the previous two features
4. Ratio of the first feature to the second feature
5. Ratio of low-frequency average energy to total average energy

The performance of the proposed method has been evaluated using a Mandarin speech database, with good results.

2 The Proposed Syllable Segmentation Method

Fig. 1 gives a flow-diagram representation of the syllable segmentation method.

[Figure: input speech -> pre-filtering -> segmentation into frames -> feature computation (short-time average energy; short-time zero-crossing rate; product of energy and ZCR; ratio of energy to ZCR; 3-level DWT with per-band energy, giving the ratio of sub-500 Hz energy to total energy) -> segmentation based on the computed feature values and the phonetic transcriptions.]

Fig. 1. Flow Diagram of the Method.

First, the input speech signal, sampled at 8 kHz, is denoised by pre-filtering. The signal is then segmented into 30 ms segments with 20 ms overlap. After this segmentation stage, the analysis and feature-extraction processes described below are applied to each segment. Finally, syllable segmentation is performed based on the computed features and the corresponding phonetic transcriptions.

2.1 Pre-filtering

In noisy speech, the noise often has a strong detrimental effect and cannot be neglected. To reduce the influence of high-frequency noise on the speech signal and thereby improve the syllable segmentation result, noisy speech signals are pre-filtered with a low-pass filter. In accordance with the range of pitch frequencies in speech, a 5th-order low-pass elliptic filter [2] with a cut-off frequency of 800 Hz is used. The transfer function of this filter is given by Eq. 1:

H(z) = \frac{0.008233 - 0.004879z^{-1} + 0.007632z^{-2} + 0.007632z^{-3} - 0.004879z^{-4} + 0.008233z^{-5}}{1 - 3.6868z^{-1} + 5.8926z^{-2} - 5.0085z^{-3} + 2.2518z^{-4} - 0.4271z^{-5}}   (1)
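As an illustration, a filter of this kind can be designed and applied in Python with scipy. This is a minimal sketch: the passband-ripple and stopband-attenuation values (0.5 dB, 40 dB) are assumptions, since the paper specifies only the order, the cut-off frequency, and the coefficients of Eq. 1, which could equally be used directly.

# Minimal sketch of the pre-filtering stage (assumes scipy is available).
# The ripple/attenuation values are illustrative guesses; the paper gives
# only the order (5), the cut-off (800 Hz), and the coefficients of Eq. 1.
import numpy as np
from scipy.signal import ellip, lfilter

FS = 8000  # sampling rate used in the paper (8 kHz)

def prefilter(x: np.ndarray) -> np.ndarray:
    """Low-pass the signal to suppress high-frequency noise."""
    b, a = ellip(5, 0.5, 40.0, 800, btype='low', fs=FS)
    # Alternatively, use the published coefficients of Eq. 1 directly:
    # b = [0.008233, -0.004879, 0.007632, 0.007632, -0.004879, 0.008233]
    # a = [1.0, -3.6868, 5.8926, -5.0085, 2.2518, -0.4271]
    return lfilter(b, a, x)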

2.2 Feature Computation

Short-time average energy: the average energy of the i-th speech signal segment is defined by Eq. 2:

E_i = \frac{1}{N} \sum_{n=0}^{N-1} x_i^2(n)   (2)

where N = 240, corresponding to 30 ms at 8 kHz, denotes the length of the speech segment x_i(n). The average energy provides a convenient representation of the amplitude variations of the speech signal [3]. The average energy of non-speech segments is generally much lower than that of speech segments, and among speech segments, that of unvoiced segments is generally much lower than that of voiced segments. Furthermore, the average energy is lower at a syllable boundary than within the syllable.

Short-time zero-crossing rate (ZCR): for discrete-time signals, a zero-crossing occurs when successive samples have different algebraic signs. The zero-crossing rate is a measure of the frequency content of the signal: unvoiced speech exhibits a higher zero-crossing rate than voiced speech or silence. The sampling frequency of the speech signal also determines the time resolution of the zero-crossing measurements. The zero-crossing rate of the i-th segment is computed as Eq. 3:

ZCR_i = \sum_{n=1}^{N-1} \left| \mathrm{sgn}[x_i(n)] - \mathrm{sgn}[x_i(n-1)] \right|   (3)

Product of ZCR and average energy: so that ZCR and average energy are considered simultaneously, their product is calculated as Eq. 4:

A_i = E_i \cdot ZCR_i   (4)

Like the energy, A_i is lower at a syllable boundary than within the syllable.

Ratio of energy to ZCR: a further parameter that combines ZCR and average energy, the ratio of E_i to ZCR_i, is calculated as Eq. 5:

B_i = E_i / ZCR_i   (5)

B_i of an unvoiced segment is generally much lower than that of a voiced segment.

Ratio of low-frequency average energy to total average energy: each speech segment is decomposed into four bands using a 3-level dyadic DWT [4], and the average energy of each band is computed. In general, an unvoiced segment shows energy concentration in the high-frequency bands, while a voiced segment shows energy concentration in the bands containing its fundamental frequency. Because the fundamental frequency of voiced segments lies in the range 50-500 Hz, the ratio of the sub-500 Hz energy to the total energy is computed and used as the last parameter for voiced/unvoiced decisions. Let E_H be the high-frequency (above 500 Hz) energy of a segment, E_L its low-frequency (below 500 Hz) energy, and E_j the energy in wavelet band j. E_H and E_L are computed as Eqs. 6 and 7:

E_H = \sum_{j=1}^{3} E_j   (6)

E_L = E_4   (7)

The ratio of the sub-500 Hz energy to the total energy is then computed as Eq. 8:

R_i = E_L / (E_H + E_L)   (8)

where R_i denotes the ratio for the i-th segment of the speech.
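A minimal Python sketch of the time-domain features (Eqs. 2-5) follows. The framing constants reflect the paper's 30 ms segments with 20 ms overlap at 8 kHz (i.e. a 10 ms hop); the function names are illustrative, not from the paper.

# Sketch of the time-domain features (Eqs. 2-5); names are illustrative.
import numpy as np

FRAME = 240  # 30 ms at 8 kHz
HOP = 80     # 10 ms hop, i.e. 20 ms overlap between frames

def frames(x: np.ndarray) -> np.ndarray:
    """Split the signal into overlapping 30 ms segments."""
    n = 1 + max(0, (len(x) - FRAME) // HOP)
    return np.stack([x[i * HOP:i * HOP + FRAME] for i in range(n)])

def energy(seg: np.ndarray) -> float:
    return float(np.mean(seg ** 2))                      # Eq. 2

def zcr(seg: np.ndarray) -> float:
    return float(np.sum(np.abs(np.diff(np.sign(seg)))))  # Eq. 3

def features(seg: np.ndarray, eps: float = 1e-12):
    """Return E_i, ZCR_i, A_i (Eq. 4) and B_i (Eq. 5) for one segment."""
    e, z = energy(seg), zcr(seg)
    return e, z, e * z, e / (z + eps)  # eps guards against division by zero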

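The low-band energy ratio of Eqs. 6-8 can be sketched with PyWavelets. The choice of mother wavelet ('db4') is an assumption, as the paper does not name one; with a 3-level dyadic DWT at 8 kHz, the approximation band covers roughly 0-500 Hz and the three detail bands cover roughly 500-4000 Hz.

# Sketch of Eqs. 6-8 using a 3-level dyadic DWT; the 'db4' wavelet is assumed.
import numpy as np
import pywt

def low_band_ratio(seg: np.ndarray) -> float:
    """R_i: ratio of sub-500 Hz energy to total energy (Eq. 8)."""
    # wavedec returns [A3, D3, D2, D1]; at fs = 8 kHz the approximation A3
    # spans roughly 0-500 Hz and the details D3..D1 span 500-4000 Hz.
    coeffs = pywt.wavedec(seg, 'db4', level=3)
    e = [float(np.mean(c ** 2)) for c in coeffs]
    e_low = e[0]         # E_L = E_4 (Eq. 7), the approximation band
    e_high = sum(e[1:])  # E_H = E_1 + E_2 + E_3 (Eq. 6), the detail bands
    return e_low / (e_high + e_low)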
2.3 Syllable Segmentation Based on the Features and the Phonetic Transcriptions

The pre-filtered speech segments are processed in sequence from the very beginning of the signal. For each segment, the computed features are compared with thresholds determined by the corresponding phonetic transcription in order to assign the segment one of eight types: non-speech; transition sections between syllables; the first type of consonants; the second type of consonants; vowels; vowel endings; transition sections between a consonant and a vowel; and transition sections between two vowels [5]. The first two types indicate that the segment lies at a syllable boundary; the remaining six indicate that the segment lies within a syllable. The first type of consonants covers all consonants except the sonorants, and the second type covers the sonorants. Vowel endings are the endings of vowels, such as n, ng, i, u, and so on; pure vowels have no vowel ending. Fig. 2 shows the possible neighbor relationships of the types.

[Figure: state diagram over the eight types - non-speech / transitions between syllables; the first and second types of consonants; vowels; vowel endings; transition sections between two vowels; transition sections between consonant and vowel - with arrows marking the allowed neighbor relationships.]

Fig. 2. Possible Neighbor Relationships of the Eight Types.

Each arrow in Fig. 2 represents a possible neighbor relationship, pointing from the earlier type to the later one. Each type has its own character, so several features are defined to be tested when a segment's type is determined. Each syllable consists of several parts, each of which is one of the last six types. A sequence of types is therefore defined for each distinct syllable, together with the corresponding thresholds of the testing features for each type in the sequence; these thresholds may differ between syllables for the same type. The type sequences and their thresholds were obtained by analyzing a large Mandarin speech database and were further adjusted for better performance. For the first two types, the thresholds of the testing features are defined independently of the syllable; an illustrative sketch of such a type sequence is given below. Fig. 3 shows the flow of the syllable segmentation based on the features and the phonetic transcriptions.
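As an illustration of how a syllable's type sequence and per-type thresholds might be represented, consider the following hypothetical sketch. The type names, threshold fields, and all numeric values are invented for illustration and are not taken from the paper.

# Hypothetical representation of a syllable's type sequence with per-type
# feature thresholds; all concrete values below are invented placeholders.
from dataclasses import dataclass

@dataclass
class TypeSpec:
    name: str      # one of the six within-syllable types
    e_min: float   # lower bound on short-time energy E_i
    zcr_max: float # upper bound on zero-crossing rate ZCR_i
    r_min: float   # lower bound on low-band energy ratio R_i

# Type sequence for the syllable "gong1": consonant (type 1, since /g/ is
# not a sonorant), transition, vowel, vowel ending -- placeholder values.
GONG1 = [
    TypeSpec('consonant_1',           e_min=0.01, zcr_max=60.0, r_min=0.2),
    TypeSpec('consonant_vowel_trans', e_min=0.02, zcr_max=50.0, r_min=0.4),
    TypeSpec('vowel',                 e_min=0.05, zcr_max=30.0, r_min=0.6),
    TypeSpec('vowel_ending',          e_min=0.02, zcr_max=40.0, r_min=0.5),
]

def matches(spec: TypeSpec, e: float, z: float, r: float) -> bool:
    """Does a segment's feature vector satisfy a type's thresholds?"""
    return e >= spec.e_min and z <= spec.zcr_max and r >= spec.r_min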

[Figure: flow chart - pre-filtered speech segments with their features enter; each segment is first tested against the start type of the syllable, then against subsequent types in the syllable's type sequence or against types 1 and 2, advancing to the next segment or to the next type accordingly.]

Fig. 3. Flow of the Syllable Segmentation based on the Features and the Phonetic Transcriptions.

The flow in Fig. 3 contains four decisions:

Is the current segment the start type of the syllable? To decide this, the segment's features are compared with the thresholds of the first type of the syllable, obtained from the phonetic transcription. If the features satisfy the thresholds, the segment is the start type of the syllable; otherwise it should be type 1 or 2, which is verified against the thresholds of types 1 and 2.

Does the next type of the syllable exist? This decision is straightforward: examining the syllable's type sequence shows whether a next type exists.

Is the current segment the next type of the syllable? This decision is made in the same way as the first one, except that the thresholds of the next type are used in place of those of the first type.

Should the current segment be type 1 or 2? This decision determines whether the segment has reached the end of the syllable, and is made by comparing the segment's features with the thresholds of types 1 and 2.

By repeating these steps in turn according to the flow in Fig. 3, each segment is assigned its type, so the boundaries of syllables and the boundaries between consonants and vowels are determined clearly. A minimal sketch of this loop is given below.
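The following sketch walks one utterance's segments against a syllable's type sequence in the spirit of Fig. 3, reusing the illustrative TypeSpec/matches helpers above. The return convention and the boundary predicate are assumptions, not specified in the paper.

# Sketch of the Fig. 3 loop: walk the segments of one utterance against a
# syllable's type sequence, reusing the illustrative helpers defined above.
def segment_syllable(seg_feats, type_seq, is_boundary):
    """seg_feats: list of (E_i, ZCR_i, R_i) tuples, one per segment;
    type_seq: the syllable's type sequence (e.g. GONG1);
    is_boundary: predicate for type 1/2 (non-speech or inter-syllable).
    Returns (start, end) segment indices of the syllable, or None."""
    i = 0
    # Skip leading type-1/2 segments until the syllable's start type matches.
    while i < len(seg_feats) and not matches(type_seq[0], *seg_feats[i]):
        if not is_boundary(*seg_feats[i]):
            return None  # neither a boundary nor the start type: no match
        i += 1
    if i == len(seg_feats):
        return None
    start, t = i, 0
    # Advance through the type sequence; a segment may extend the current
    # type or move on to the next type in the sequence.
    while i < len(seg_feats) and not is_boundary(*seg_feats[i]):
        if matches(type_seq[t], *seg_feats[i]):
            i += 1
        elif t + 1 < len(type_seq) and matches(type_seq[t + 1], *seg_feats[i]):
            t, i = t + 1, i + 1
        else:
            return None  # features fit no expected type: matching fails
    return (start, i - 1)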

3 Experimental Results

In this section, the results of the proposed syllable segmentation method are evaluated. Fig. 4 shows the behavior of the proposed method on a speech signal: the boundaries of syllables, the boundaries between consonants and vowels, and even the boundaries between vowels and vowel endings are determined correctly.

Fig. 4. Syllable Segmentation of Speech Signal gong1cheng2bing1.

The tests were performed using a large database comprising a wide variety of speech recordings from different speakers and utterances. Two female and two male speakers were recorded reading Chinese words and sentences. The signals ranged from 2 to 10 s in length and were organized into 100 speech files, corresponding to a total of 30000 speech segments. The database includes reference files containing the phonetic transcriptions and syllable segmentations.

The proposed method was tested under varying noise conditions by adding white Gaussian noise of different intensities to the clean speech. The accuracy of the resulting syllable segmentation was evaluated with an objective error measure, Err%, defined as the percentage of erroneous boundaries out of the overall number of boundaries in the speech signal.
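For reference, the following is the standard recipe for adding white Gaussian noise at a target SNR; it is a common reconstruction, not code from the paper.

# Standard recipe for adding white Gaussian noise at a target SNR (dB);
# a common reconstruction, not code from the paper.
import numpy as np

def add_wgn(x: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)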

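Likewise, a sketch of the Err% measure; the representation of boundaries as segment indices and the matching tolerance of one segment are assumptions, since the paper does not state when a detected boundary counts as erroneous.

# Sketch of the Err% measure; the +/- 1-segment tolerance is an assumption.
def err_percent(detected, reference, tol=1):
    """Percentage of reference boundaries (segment indices) with no
    detected boundary within +/- tol segments."""
    errors = sum(1 for r in reference
                 if not any(abs(r - d) <= tol for d in detected))
    return 100.0 * errors / len(reference)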
Table 1 shows the results for male and female speakers at different SNRs (signal-to-noise ratios), obtained by adding white Gaussian noise to the clean speech. It is clear that the method performs well on both clean and noise-degraded speech.

Table 1. Performance of the proposed method (Err%).

SNR (dB)   Male   Female   Total
Clean      0.65   0.89     0.77
20         0.93   1.33     1.13
10         1.35   1.97     1.66
5          1.98   2.72     2.35

The thresholds and parameters used in the method were tuned mainly on speech from a single male speaker, which is why the experimental results for male speakers are better than those for female speakers.

4 Conclusions

The problem of syllable segmentation was addressed in this paper, and an automatic syllable segmentation method for Mandarin speech was described. The method is based on the phonetic transcriptions and on five features extracted from the input speech: short-time average energy, short-time zero-crossing rate, the product of these two features, the ratio of the first feature to the second, and the ratio of low-frequency average energy to total average energy. The method determines the boundaries of syllables, as well as the boundaries between consonants and vowels, clearly. Its performance was evaluated under different noise conditions using a large database of speech signals from male and female speakers. The reported results show that the proposed method is robust to additive noise.

Acknowledgments. This work has been supported by the Tianjin Science and Technology Development Foundation of High School (20090805) and the Research Foundation of Tianjin University of Science and Technology (20100204).

References

1. F. Pan, N. Ding: Speech Denoising and Syllable Segmentation Based on Fractal Dimension. In: Proc. 2010 International Conference on Measuring Technology and Mechatronics Automation (ICMTMA 2010), pp. 433-436. IEEE Computer Society (2010)
2. R.-W. Li, C.-C. Bao, H.-J. Dou: Pitch Detection Method for Noisy Speech Signals Based on Pre-Filter and Weighted Wavelet Coefficients. In: Proc. International Conference on Signal Processing (ICSP 2008), pp. 530-533. IEEE (2008)
3. D. Arifianto: Dual Parameters for Voiced-Unvoiced Speech Signal Determination. In: Proc. ICASSP 2007, pp. IV-749-752. IEEE (2007)
4. D. Charalampidis, V. B. Kura: Novel Wavelet-Based Pitch Estimation and Segmentation of Non-Stationary Speech. In: Proc. 8th International Conference on Information Fusion, vol. 7, pp. 1-5 (2005)
5. M.-T. Lin, C.-K. Lee, C.-Y. Lin: Consonant/Vowel Segmentation for Mandarin Syllable Recognition. Computer Speech and Language 13, 207-222 (1999)