A STUDY ON THE EFFECT OF THE NEIGHBOR PHONEMES IN NATURAL SYNTHESIS OF SPEECH


Ceylon Journal of Science (Physical Sciences) 18 (2014) Computer Science

H.M.L.N.K. Herath 1 and J.V. Wijayakulasooriya 2

1 Postgraduate Institute of Science, University of Peradeniya, Sri Lanka.
2 Department of Electronic and Electrical Engineering, Faculty of Engineering, University of Peradeniya, Sri Lanka.
(*Corresponding authors: 1 lakminiherath0@gmail.com, 2 jan@ee.pdn.ac.lk)
(Received: 13 January 2014 / Accepted after revision: 16 June 2014)

ABSTRACT

Natural synthesis of speech requires identifying the minute variations that a phoneme undergoes during reproduction, which are affected by many factors. This paper presents an empirical study of the correlations between consecutive phonemes in a speech signal. The short /a/ phoneme was selected for the study. To examine the effect of neighboring phonemes more clearly, words consisting of three or four phonemes were chosen. The correlations between all possible pairs were then calculated by comparing one cycle of each /a/ sound taken from words starting with the same phoneme. Furthermore, one cycle was taken from each of three places (the start, middle and end of the /a/ phoneme) and the correlations between different pairs were calculated. The correlation values clearly show that the middle phoneme follows the preceding phoneme's energy to build the articulation smoothly, both between the two phonemes and within the /a/ phoneme itself.

© University of Peradeniya 2014

INTRODUCTION

Speech synthesis is the artificial production of human speech. One of the main focus areas in speech synthesis research is reducing the amount of data needed to synthesize speech while maintaining acceptable quality. In the recent past, more emphasis has been given to improving the naturalness of synthesized speech. In this regard, many methods, ranging from low bit rate to high bit rate, have been proposed (Bristow-Johnson, 1996).
However, the holy grail of natural speech synthesis remains a challenging task, particularly for low bit rate applications. There are two main computer-based speech synthesis techniques. The first is concatenative synthesis (wavetable synthesis in music), which stores raw waveforms corresponding to each phoneme in a database called a wavetable and concatenates them according to the phonemes to be synthesized (Holmes and Holmes, 2001; Smith, 2006). Although this method produces more natural speech than models based on mathematical coding, the large capacity needed to store the speech and the high bit rates involved in transmitting it are major concerns. In contrast, mathematical coding techniques such as Linear Predictive Coding (LPC), which is based on Auto Regressive (AR) modeling of speech, significantly reduce the bit rate. However, the speech is then modeled as the response of a Linear Time Invariant (LTI) system to an input excitation signal. The problem with an LTI system is the occurrence of audible discontinuities at phoneme boundaries, which makes synthetic speech sound unnatural.

Time varying nature of phonemes

Speech does not simply consist of a string of target articulations linked by simple movements between them (Ohala, 1993). In fact, the articulation of individual sound segments, or phonemes, is almost always influenced by the articulation of neighboring segments, often to the point of considerable overlap of articulator activities (Ohala, 1993). A phoneme is the smallest contrastive unit in the sound system of a language. Phonemes are combined with other phonemes to form meaningful units such as words or morphemes. Without appropriate transitions between phonemes, the resulting speech sounds unnatural and is hard to understand.

In 1933, Menzerath and Lacerda (Hardcastle and Hewlett, 1999) popularized the term co-articulation. It was coined to denote instances where two successive sounds are articulated together. Many decades of experimental phonetic research have produced a large literature on the topic. The elementary fact highlighted here is that co-articulation is manifested as a temporal overlap between any two channels recruited by different phonemes. In the most basic articulatory model, the Locus theory (Delattre, 1969), each phoneme has a single ideal articulatory target for each contrastive articulator, independent of the neighboring phonemes (Phung et al., 2011). Under the effects of neighboring phonemes, the transition between two phonemes is described as the movement between their two ideal targets. The Kozhevnikov-Chistovich model shows co-articulation within a syllable but not across syllables (Phung et al., 2011). Although many co-articulation models have been proposed, there is still a lack of simple models that are easy to implement in speech applications and that work directly with acoustic data (Phung et al., 2011).

Most mathematical speech synthesis models assume that the changes between phonemes are time invariant. In other words, the parameters of a phoneme do not change with time. A linear system produces its output as a linear combination of its current and previous inputs and its previous outputs (Tatham and Morton, 2005). But the nature of the transition between phonemes is time variant.
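The distinction between the two assumptions can be written out explicitly. The following is the standard textbook form of a linear difference equation, not an equation taken from this study; the symbols x[n] (excitation), y[n] (output), a_k and b_k (coefficients) and the orders p and q are introduced here only for illustration, and the sign in front of the feedback sum is a convention.

```latex
% Time invariant case: the coefficients are fixed for the whole phoneme.
y[n] = \sum_{k=0}^{q} b_k \, x[n-k] \;-\; \sum_{k=1}^{p} a_k \, y[n-k]

% Time variant case: the same structure, but the coefficients depend on n,
% so the system can change within a phoneme and across its boundaries.
y[n] = \sum_{k=0}^{q} b_k[n] \, x[n-k] \;-\; \sum_{k=1}^{p} a_k[n] \, y[n-k]
```

With q = 0 the first form reduces to the all-pole (AR) model that underlies LPC.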
Figures 1 and 2 show how the formant values change from one phoneme to another in time invariant and time variant systems, respectively. If the changes between phonemes were time invariant, the formant contours would be constant throughout the duration of a phoneme, as shown in Figure 1. In natural speech, however, the formants vary from one phoneme to another as well as within a phoneme, as shown in Figure 2. The objective of this study is to find the effect of the neighboring phonemes under this linear time variant behavior by calculating Pearson's correlation between phonemes.

Figure 1: Formant values in time invariant system

Figure 2: Formant values in time variant system

METHOD

Of the nearly forty-four phonemes of the English language, the short /a/ phoneme was studied in this research. Recording phoneme sounds in isolation was infeasible, so words that include the short /a/ sound were selected for recording. To examine the effect of neighboring phonemes more clearly, words consisting of three or four phonemes were chosen. The /a/ phoneme was then extracted from each recorded word. The segmentation of the short /a/ was carried out manually by inspecting the time waveform and listening to the segmented phoneme. Pearson's correlation coefficient (Wikipedia, 2014) was then calculated between all possible pairs of different words by comparing one cycle of each /a/ sound. Pairs of words starting with the same phoneme as well as pairs of words starting with different phonemes were considered. In addition, one cycle was taken from each of three places (the start, middle and end of the /a/ phoneme) and the correlation between different pairs was calculated. A hypothesis test was conducted to assess the significance of the correlation values. Sound processing and the statistical calculations were done using MATLAB.

RESULTS AND DISCUSSION

The correlation values in Table 1 were obtained by comparing short /a/ sounds from words starting with the same phoneme. The same procedure was repeated with other starting phonemes and similar results were obtained. As shown in Table 1, every pair of words starting with the same phoneme has a correlation value greater than 0.75. In the hypothesis tests on Pearson's correlation, all pairs of /a/ phonemes obtained p-values close to 0, which shows that all calculated pairwise correlations are statistically significant.

The same experiment was conducted by changing the first phoneme of the word without changing the last phoneme. According to Figure 3, for /a/ sounds extracted from words that start with different phonemes but end with the same phoneme t, the correlation values are less than [...]. The p-values obtained for these pairwise correlations are also close to 0. This indicates moderate positive correlations between words that start with different phonemes. Several experiments were conducted by changing the last phoneme and similar results were obtained. This shows that the /a/ waveforms of words starting with the same phoneme are more strongly correlated than the /a/ waveforms of words starting with different phonemes.
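The comparison described in the Method section can be sketched in a few lines. This is a minimal illustration in Python rather than the MATLAB code used in the study, and several details are assumptions made here: the pitch period of each segmented /a/ is taken as known, cycles of different lengths are linearly resampled to a common number of points before correlation (the paper does not say how unequal cycle lengths were handled), and the function names `pearson`, `resample`, `take_cycle` and `pairwise_correlations` are hypothetical. The demonstration data are synthetic waveforms, not speech recordings.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dev_x, dev_y))
    var_x = sum(a * a for a in dev_x)
    var_y = sum(b * b for b in dev_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def resample(cycle, points):
    """Linearly interpolate one cycle onto `points` samples (assumed step)."""
    if len(cycle) < 2 or points < 2:
        raise ValueError("need at least two input samples and two output points")
    out = []
    for i in range(points):
        pos = i * (len(cycle) - 1) / (points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(cycle) - 1)
        frac = pos - lo
        out.append(cycle[lo] * (1 - frac) + cycle[hi] * frac)
    return out


def take_cycle(samples, period, position):
    """Cut one cycle of `period` samples at the start, middle or end."""
    last = len(samples) - period
    if last < 0:
        raise ValueError("segment is shorter than one cycle")
    begin = {"start": 0, "middle": last // 2, "end": last}[position]
    return samples[begin:begin + period]


def pairwise_correlations(cycles, points=64):
    """Correlate every pair of named cycles after resampling them."""
    names = sorted(cycles)
    resampled = {name: resample(cycles[name], points) for name in names}
    return {
        (a, b): pearson(resampled[a], resampled[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    def synthetic_vowel(period, n_cycles, skew):
        # Stand-in for a manually segmented /a/; not real speech data.
        return [
            math.sin(2 * math.pi * n / period)
            + skew * math.sin(4 * math.pi * n / period)
            for n in range(period * n_cycles)
        ]

    # word -> (samples, assumed pitch period in samples)
    vowels = {
        "bad": (synthetic_vowel(80, 10, 0.30), 80),
        "bat": (synthetic_vowel(90, 9, 0.35), 90),
        "cat": (synthetic_vowel(85, 9, 0.80), 85),
    }
    for position in ("start", "middle", "end"):
        cycles = {w: take_cycle(s, p, position) for w, (s, p) in vowels.items()}
        for pair, r in pairwise_correlations(cycles).items():
            print(position, pair, round(r, 3))
```

Averaging the values returned by `pairwise_correlations` for each position would give one number per position, in the spirit of the start, middle and end comparison. A significance test is omitted from the sketch because neighboring waveform samples are not independent, so any p-value would depend on assumptions the paper does not state.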
Thus the relationship between the first phoneme of a word and the following phoneme (the vowel) is stronger than the relationship between the middle phoneme (the vowel) and the next phoneme. The short /a/ waveform depends on the previous phoneme. That is, the preceding phoneme has a clear impact on the sound of the following phoneme.

Figure 3: Correlation values of comparing short /a/ sounds which are starting with different phonemes and ending with phoneme t.

According to Figure 4, the correlation values between the /a/ sound of the word Bad and the short /a/ sounds of other words starting with the letter B were more than 0.7. This means that the similarities between the waveforms (one cycle) are greater than 50% (presumably in the sense of the squared correlation, since 0.7² ≈ 0.49). Most of them have correlation values of more than 0.85, which means that the similarity of some waveforms exceeded 75%. But for words starting with different phonemes, the correlation values are less than 0.8, and some are less than 0.5. This means that the relationship between /a/ sounds depends on the preceding phoneme.

Figure 4: Correlation values of comparing Bad /a/ sound with short /a/ sounds, which are starting with letter B and different letters

Table 1: Pearson's correlation values of comparing short /a/ sound words, which start from phoneme B (words compared pairwise: bad, bag, ban, bat, back, band, bank, batch, badge, bask, bang, bash)

Figure 5 shows the average correlation values of different words for three cycles of the /a/ phoneme taken from different places: one cycle near the first letter, one from the middle, and one from the end of the /a/ phoneme.

Figure 5: Average Correlation value of /a/ phoneme of different words extracting the cycles from three different places

When the cycles taken from different places are compared, Figure 5 shows that the average correlation value of the starting cycle was always less than that of the middle cycle, which implies that the front cycles of the /a/ sounds are clearly affected by the previous phoneme. This is because the starting cycle lies within the transition region between the two neighboring phonemes. By the middle cycle the /a/ waveform has stabilized, so the average correlation value is much greater than the previous values. Then, in the transition to the next phoneme, the correlation values vary from word to word, but all of them are less than the middle correlation values. Figure 5 indicates that there is a time variant linear relationship between the neighboring phonemes as well as within the phoneme.

CONCLUSION

The underlying approach is to investigate the effect of correlation between consecutive phonemes in natural synthesis of speech. This study illustrates that when the starting phoneme changes, the correlation values of the following phoneme also change significantly. Therefore, there is a smooth linear time variant transition between consecutive phonemes. In addition, the study points out that the middle phoneme has different correlation values within the phoneme when the start, middle and end waveforms are compared. This shows that there is a smooth variation within the /a/ phoneme itself. Thus, the correlation values clearly show that the middle phoneme follows the preceding phoneme's energy to build the articulation between two phonemes smoothly.
The study concludes that the time variant nature of neighboring phonemes, as well as of the phoneme itself, should be strongly considered when modeling more natural speech in low bit rate models based on mathematical coding.

REFERENCES

Alan O Cinnéide (2008) Linear Prediction: The Technique, Its Solution and Application to Speech. DIT Internal Technical Report.

Bristow-Johnson, R. (1996) Wavetable Synthesis 101, A Fundamental Perspective. In 101st AES Convention (Los Angeles, California), Audio Engineering Society (AES), Preprint No

Delattre, P. (1969) Coarticulation and the Locus Theory. Studia Linguistica 23(1), 1-26.

Holmes, J. and Holmes, W. (2001) Speech Synthesis and Recognition, Second Edition. Taylor & Francis, London, UK. 287

Hardcastle, W. J. and Hewlett, N. (1999) Coarticulation: Theory, Data and Techniques. Cambridge University Press.

Ohala, J. J. (1993) Coarticulation and phonology. University of Alberta and University of California, Berkeley. Language and Speech 36;

Phung, T., Luong, M. C. and Akagi, M. (2012) On the Stability of Spectral Targets under Effects of Coarticulation. International Journal of Computer and Electrical Engineering, Vol. 4, No. 4.

Phung, T., Luong, M. C. and Akagi, M. (2011) An Investigation on Perceptual Line Spectral Frequency (PLP-LSF) Target Stability against the Vowel Neutralization Phenomenon. 3rd International Conference on Signal Acquisition and Processing (ICSAP 2011):

Rabiner, L. and Juang, B. H. (1993) Fundamentals of Speech Recognition. Prentice Hall International, 497

Smith, J. (2006) History and Practice of Digital Sound Synthesis. CCRMA, Stanford University, Lecture notes in AES 2006.

Shannon, M., Zen, H. and Byrne, W. (2013) Autoregressive Models for Statistical Parametric Speech Synthesis. IEEE Transactions on Audio, Speech, and Language Processing, vol. 21 (3);

Tatham, M. and Morton, K. (2005) Developments in Speech Synthesis. John Wiley & Sons Ltd, England, Chapter 4, pg

Taylor, P. (2009) Text-to-Speech Synthesis. Cambridge University Press. (total pages)

Phones, Phonemes, Allophones and Phonological Rules, accessed

nd_dependence, accessed in
