Quarterly Progress and Status Report. VCV-sequencies in a preliminary text-to-speech system for female speech


Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report
VCV-sequencies in a preliminary text-to-speech system for female speech
Karlsson, I. and Neovius, L.
journal: STL-QPSR
volume: 35
number: 1
year: 1994
pages:

[...] been added. The LF source pulse is defined by four parameters: the shape parameters RG, RK and FA, and the excitation energy EE. The noise/voice source interaction is specified by the parameters NA and NM. NA specifies the amount of glottal-flow-modulated noise added to the glottal pulse. NM specifies the degree of glottal flow modulation of the noise source. The higher formants, which were realised as a higher-pole correction in male speech produced with the OVE III model, are now implemented as three separate formants (F5-F7) with constant relative distance. These new features have proven essential for generating natural-sounding female speech. They have also improved the production of emotive synthetic speech (Carlson et al., 1992).

A further extension of the GLOVE synthesiser has been tested in this study. The extension consists of a pole/zero pair (FP,BP/FZ,BZ) that has been included in the vocal branch. The pole is implemented as a formant filter and the zero as an inverse formant filter. Both are controlled by specifying mid-frequency and bandwidth. When the pole/zero pair is not needed, both are placed at the same frequency and given a very low Q-value, so that they cancel each other.

The experiments reported in this study were run on a version of the GLOVE model in a digital signal processor (DSP) simulator implementation (Carlsson & Neovius, 1990). The simulator runs as a separate server in our HP/Apollo computer network, as a part of our experimental dialogue system (Blomberg et al., 1993). It can be accessed directly from the RULSYS speech synthesis development program, using the speech command in, for example, HISYS edit mode. Speech stimuli are generated at close to real-time rate on an HP/Apollo 730. The simulator can easily be transferred to a standalone application with a floating point DSP for real-time performance in a text-to-speech device.
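The cancellation of a parked pole/zero pair can be illustrated with a small sketch. It assumes a Klatt-style second-order digital resonator for the formant filter and its exact inverse for the zero; the GLOVE implementation itself is not described here, so the function names, the sampling rate and the frequencies below are illustrative only.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients (A, B, C) of a Klatt-style second-order resonator (assumed form)."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # unity gain at 0 Hz
    return a, b, c

def formant_filter(x, freq_hz, bw_hz, fs_hz):
    """Pole pair: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

def inverse_formant_filter(x, freq_hz, bw_hz, fs_hz):
    """Zero pair, the exact inverse: y[n] = (x[n] - B*x[n-1] - C*x[n-2]) / A."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    x1 = x2 = 0.0
    out = []
    for xn in x:
        out.append((xn - b * x1 - c * x2) / a)
        x2, x1 = x1, xn
    return out

FS = 16000.0  # hypothetical sampling rate
impulse = [1.0] + [0.0] * 63

# Pole and zero parked at the same frequency with a wide bandwidth (low Q):
# the pair is transparent and the impulse comes back unchanged.
parked = inverse_formant_filter(
    formant_filter(impulse, 1500.0, 1000.0, FS), 1500.0, 1000.0, FS)
max_dev_parked = max(abs(p - i) for p, i in zip(parked, impulse))

# Pole and zero at different frequencies: the pair reshapes the signal.
active = inverse_formant_filter(
    formant_filter(impulse, 1700.0, 150.0, FS), 1300.0, 150.0, FS)
max_dev_active = max(abs(p - i) for p, i in zip(active, impulse))

print(max_dev_parked, max_dev_active)
```

The same inverse formant filter is the building block of inverse filtering: applying it with the mid-frequency and bandwidth of a formant removes that formant from the speech wave.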
Analysis

Detailed analysis of female speech, for example using the inverse filtering technique, has improved our knowledge of voiced speech segments (Karlsson, 1990; Karlsson, 1992). The main objective of inverse filtering is to eliminate the vocal tract filter function and obtain a representation of the glottal flow in speech, but the technique has also proved valuable for detailed descriptions of voice source behaviour and of vocal tract characteristics. Each formant in the speech wave is cancelled with an inverse formant filter that has the corresponding mid-frequency and bandwidth, see Figure 1.

To attain a good inverse filtered voice pulse, it was necessary to cancel extra pole/zero pairs for many consonants and adjoining parts of vowels. These pole/zero pairs can have different origins. In nasals and laterals they are due to the geometry of the vocal tract (Nord, 1976). Voiced fricatives contain pole/zero pairs that have their origins in the cavity behind the frication source. In aspirated articulations, the vocal cords never close completely. Accordingly, the cavities below the glottis are not acoustically isolated from the vocal tract. This creates pole/zero pairs similar to those that occur in leaky or breathy voices (Klatt & Klatt, 1990). For the /h/ shown in Figure 1, the pole/zero pair has the frequencies of 1697 Hz and [...] Hz, respectively.

[Figure 1: Example of inverse filtering of the consonant /h/ in a VCV sequence. Upper window, against time in seconds: speech, inverse filtered speech, and integrated inverse filtered speech (equivalent to glottal air flow). Inverse filter specification: the upper value gives the mid-frequency, the lower value the bandwidth, in Hz. Lower window, against frequency in kHz: spectrum of one fundamental period of the speech wave and spectrum of the same period after inverse filtering. Both spectra are lifted by 6 dB/octave.]

Figure 1. Representation of the formant matching component of the interactive inverse filtering program. The upper window displays the current glottal pulse. The lower window shows the spectrum of the wave before and after inverse filtering. After the initial LPC estimate of poles, the poles (down arrows) and the zeroes (up arrows) are positioned by hand. Mid-frequency is specified along the x-axis and bandwidth along the y-axis.

Analysis data on voice source and vocal tract behaviour for this study were accordingly collected using the inverse filtering technique with an interactive program. Preliminary formant frequencies and bandwidths were calculated automatically using the Linear Prediction autocorrelation method. The formants and bandwidths were then finely tuned by hand, using both time and frequency representations of the speech wave before and after filtering. Zeroes and extra poles in the voice spectrum were also determined. The LF voice source model was then fitted to the inverse filtered voice pulse by hand, see Figure 2. The parameters of the LF-model for consecutive voice pulses were collected in tables together with formant, pole and zero mid-frequency and bandwidth values.

These parameter tables were interpreted and formalised to a small number of data points for each VCV speech segment. The parameters were then used to produce hand-edited synthetic speech stimuli. The quality of these stimuli was modified by interactive editing and listening. Special effort was invested in the parameter transitions.
The parameter data were further reduced to make it feasible to write new rules for the speech synthesis. The results of the proposed rules were continuously checked by generating text-to-speech stimuli.
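The automatic first step of the analysis, a Linear Prediction estimate by the autocorrelation method, might look roughly like the sketch below. It is not the interactive program used in the study: the function names, the prediction order, the sampling rate and the synthetic two-formant test signal are all assumptions made for illustration.

```python
import numpy as np

def lpc_autocorrelation(frame, order, window=True):
    """LPC polynomial [1, a1, ..., ap] by autocorrelation and Levinson-Durbin."""
    x = np.asarray(frame, dtype=float)
    if window:
        x = x * np.hamming(len(x))  # usual tapering for frames of running speech
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants_from_lpc(a, fs_hz):
    """Mid-frequencies and bandwidths (Hz) from the upper-half-plane LPC roots."""
    roots = np.roots(a)
    roots = roots[roots.imag > 0]
    freqs = np.angle(roots) * fs_hz / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs_hz / np.pi
    order = np.argsort(freqs)
    return freqs[order], bws[order]

def pole_pair(freq_hz, bw_hz, fs_hz):
    """Conjugate pole pair for a formant with the given frequency and bandwidth."""
    z = np.exp(-np.pi * bw_hz / fs_hz) * np.exp(2j * np.pi * freq_hz / fs_hz)
    return [z, np.conj(z)]

def all_pole_impulse_response(a, n):
    """Impulse response of 1/A(z), computed sample by sample."""
    y = np.zeros(n)
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * y[i - k]
        y[i] = acc
    return y

FS = 10000.0  # hypothetical sampling rate
# Synthetic test signal with two known formants (illustrative values).
true_a = np.real(np.poly(pole_pair(800.0, 80.0, FS) + pole_pair(2500.0, 120.0, FS)))
frame = all_pole_impulse_response(true_a, 1024)

# The whole decaying response fits in the frame, so no tapering is applied here.
est_a = lpc_autocorrelation(frame, order=4, window=False)
freqs, bws = formants_from_lpc(est_a, FS)
print(freqs, bws)
```

On real speech such estimates are only a starting point, which is why the formants and bandwidths in the study were subsequently tuned by hand.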

[Figure 2: Upper window, against time in seconds: inverse filtered speech wave with the LF-model superimposed. LF-model parameter values: FA = 180 Hz, F0 = 250 Hz, RG = 95%, RK = 45%, EE = 60 dB (not calibrated but of interest when comparing different segments). Lower window, against frequency in kHz: spectrum of one fundamental period of the inverse filtered speech wave and spectrum of one pulse of the LF-model. Both spectra are lifted by 6 dB/octave.]

Figure 2. Representation of the glottal flow pulse derivative matching component of the interactive inverse filtering program. The upper window displays the glottal source after the current setting of the inverse filter, overlaid by the best approximation of the LF-pulse. The lower window displays the spectrum of these pulses.

Results

The analysis data used for the present experiment are taken from an ongoing and only partly published study of the acoustics of consonants in female speech (Karlsson, 1992). The speech material for that study consists of all Swedish consonants produced in VCV sequences in a carrier phrase. The vowels were short /a/, /u/ or /i/, and the initial and final vowels were the same. The initial vowel was always stressed, Swedish accent 1. In Figure 3, an example of the sequence /ili/ is shown together with two synthesised versions of the same sequence. Data for some consonants in different vowel contexts are given in Table 1.

Glottal parameters for consonants

The LF-model (Fant et al., 1985) approximates the derivative of the glottal flow pulse using four parameters. The parameter EE determines the excitation strength, i.e. the amplitude. RG determines the balance between the first and second harmonic, and is normally higher in a more tense speaking mode. RK is lower in a tense voice than in a soft voice. FA is typically low for a soft voice and high for a tense voice. In Table 1 some examples of LF-parameter values for consonants are given. These values are fairly typical for all consonants investigated so far.
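To make the shape parameters more concrete, the sketch below converts F0, RG, RK and FA into the time points of the LF pulse. It assumes the conventional LF definitions RG = T0/(2·Tp), RK = (Te − Tp)/Tp and FA = 1/(2π·Ta), with RG and RK given as percentages; the function name is hypothetical, and the input values are those listed with Figure 2.

```python
import math

def lf_timing(f0_hz, rg_percent, rk_percent, fa_hz):
    """Convert LF shape parameters to pulse timing in milliseconds.

    Assumed conventional definitions:
      RG = T0 / (2 * Tp)    Tp: time of peak glottal flow
      RK = (Te - Tp) / Tp   Te: time of main excitation
      FA = 1 / (2*pi*Ta)    Ta: effective duration of the return phase
    """
    rg = rg_percent / 100.0
    rk = rk_percent / 100.0
    t0 = 1.0 / f0_hz
    tp = t0 / (2.0 * rg)
    te = tp * (1.0 + rk)
    ta = 1.0 / (2.0 * math.pi * fa_hz)
    return {"t0_ms": 1e3 * t0, "tp_ms": 1e3 * tp, "te_ms": 1e3 * te, "ta_ms": 1e3 * ta}

# Parameter values listed with Figure 2.
timing = lf_timing(f0_hz=250.0, rg_percent=95.0, rk_percent=45.0, fa_hz=180.0)

# A higher FA gives a shorter return phase, i.e. more high-frequency energy.
ta_tense_ms = lf_timing(250.0, 95.0, 45.0, 1000.0)["ta_ms"]

for name, value in timing.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions a low FA corresponds to a long return time Ta, which acts approximately as an additional low-pass filter on the source with a cut-off near FA.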
The excitation strength, EE, was about equal for the nasals and weaker in /h/ and /l/ compared to the preceding stressed vowels. The consonant /l/ shows a characteristic pattern. During the first few pulses, there was an additional 1-3 dB dip in the EE level. This can be seen in the spectrogram of natural speech in Figure 3.

Figure 3. Spectrograms and spectral sections of /ili/ produced by a female speaker and by old and new synthesis rules. The old synthesis does not contain the pole/zero pair; all other parameters are the same for the old and the new synthesis. The spectral sections are taken from the middle of the /l/.

The RG parameter is not included in the table as the variation in this parameter was very small and did not seem to be related to the specific segment.

The parameter RK was generally higher for the consonants compared to adjacent vowels. In Table 1 the largest differences in RK values are found in /h/ and in /m/ in /a/-context. A high RK value means that the first harmonic is comparatively strong.

The parameter FA was lower, i.e. the return time was longer, in all the consonants shown in Table 1 than in the surrounding vowels. This implies that the voice source in consonants contained relatively less high frequency energy. The lowest FA values were found in the nasals and the voiced plosives. FA was also low in nasalised parts of vowels; compare vowels before nasals in Table 1 with the other vowels. There is also a consistent difference between vowels: high front vowels always have lower FA values than open vowels.

In Swedish, /h/ is voiced and aspirated in fully voiced contexts. The higher formants are to a large degree excited by aspirative noise. FA for /h/ was determined by studying the frequency content below approximately 2 kHz, as FA pertains to properties of the voiced source and, in /h/, the energy above 2 kHz is mainly aspiration.

Pole/zero analysis

Pole/zero pairs were identified in the voiced consonants and in transitions from vowels to consonants, see Figure 1. Examples of the most distinct poles and zeros in some consonants are listed in Table 1. The pole/zero pairs in these consonants are of different origins.

[...] compared to the other vowel contexts. This may be due to a different articulation of /l/. The pole/zero pair in /i/ before /l/ is presumably due to incomplete glottal closure.

In a transition from vowel to nasal, the passage through the nose is opened before the oral passage is closed (Nord, 1976). This results in poles and zeros in the vowel that are slightly different from those in the nasal. As can be seen in Table 1, /n/ and /m/ have similar pole/zero data. The difference in articulation is manifested in the F2 and F3 targets and, in back rounded vowel context, in the target for the zero. In earlier experiments (Nord, 1976), a pole/zero pair at about 500 Hz in nasals was identified. In the synthesis experiments performed for this study, this pole/zero pair was simulated by a broadening of the first formant bandwidth in combination with a low FA. The Nord study and the present experiment show that the pole/zero pair between 1000 and 2000 Hz has a perceptual impact.

Text-to-speech rules

The analysis data were formulated as definitions and rules for the KTH text-to-speech synthesis, following the guidelines below. Some examples of parameter settings achieved by these rules are shown in Table 2. The rules were used in listening test 1, which is described later in this paper.

[Table 2 (numerical values missing). Rows for /l/: basic setting; after back rounded vowel; columns FP, BP, FZ, BZ, EE1, EE2. Rows for /m/: basic setting (in nasalised vowel; in /m/); back rounded context (in back rounded nasalised vowel; in /m/ after back rounded vowel); columns FP, BP, FZ, BZ. Rows for /n/: basic setting (in nasalised vowel; in /n/); high front context (in high front nasalised vowel; in /n/ after high front vowel); columns FP, BP, FZ, BZ.]

Table 2. Values of some synthesis parameters decided by rules. Frequency and bandwidth values for the pole/zero pair (FP,BP / FZ,BZ) are given in Hz. The LF source parameter EE is given in dB relative to EE in the preceding vowel. EE1 gives the initial value, EE2 the value after 20 ms.

For /h/, a pole/zero pair was introduced. The frequencies and bandwidths were the same for all vowel contexts, according to analysis data. The LF parameter EE was decreased considerably in /i/ and /u/ context, less in /a/ context. RK was increased and FA was set to a low value. The aspiration was produced using NA, which introduces a modulation of the noise source.

For /l/, two different aspects were formalised. The pole/zero pair was specified in one data point. The specifications were the same for /i/ and /a/ contexts; for /u/-context the pole/zero pair as well as the formant values were lower. The excitation energy was reduced for the first 20 ms of the segment, see also Figure 3, which improved the naturalness notably.

For the nasals /m/ and /n/, three data points for the pole/zero pair were specified. The first and last of these define the nasalisation of the preceding and succeeding vowels; the second defines the nasal. Separate rules modifying the pole/zero pair had to be formulated for /m/ adjacent to a back, rounded vowel and for /n/ adjacent to a high, front vowel. The pole/zero values for the different contexts are given in Table 2. The nasal branch was not used in this experiment. The excitation energy was set to about 7 dB less than for the preceding, stressed vowel. For /n/ in /i/-context the excitation energy was set equal to the vowel value. FA was decreased to a low value, typically 400 Hz, for the nasalised part of the vowels, and even lower in the nasal.

Listening tests

Two tests were performed on different versions of the preliminary rule system. In the first test, a subset of the Swedish consonants was tested. The subjects were asked to mark the better of two versions of a consonant in a VCV context and to rate the difference. We believe that the test gives an estimate of both quality and naturalness. The test was an ABAB-test where A and B were different synthesis-by-rule versions of the same VCV. The intended VCV stimulus was given on the answer sheet. The stimulus pairs were produced using an old and a new set of synthesis rules. The differences between the old and the new version consisted mainly of the inclusion of the pole/zero pair and modifications of the source parameters. Six subjects participated in this listening test. They had some experience of male speech synthesis.

Test stimuli

The consonants /m/, /n/, /ŋ/, /l/, /h/, /v/, /d/ and /f/ in /a/, /i/ and /u/-contexts were tested. The context vowels were short and the first vowel was stressed, Swedish accent 1.
The difference within the nasal pairs was that the old version contained nasal murmur shaped by the nasal branch, whereas the new nasal was produced using only the vocal branch with the pole/zero pair. The new versions of /l/ and /v/ contained the pole/zero pair, as opposed to the old versions. For /h/, the new version used a different aspiration source, NA, in combination with the pole/zero pair. In the new /d/, the voice bar was produced by setting the LF-source parameter FA very low and RK high; in the old /d/ the nasal branch produced the voice bar. There was no difference between the old and the new /f/; these pairs were included to verify the test.

Test results

The results were encouraging, even though the old version was sometimes preferred. Subjects commented on a high degree of naturalness of the female speech. The new version of /l/ was clearly preferred by the subjects. For /v/ there was a weak but clear tendency towards a preference for the new version. The nasals showed less clear results. The new /m/ and /n/ were preferred in four contexts, but for two pairs the old version was strongly preferred. The old /ŋ/ was preferred, as was the old /h/, but the difference was small. The new /d/ was not as good as the old one. The /f/ pairs got equal ratings and some subjects remarked that they were the same.

The results of the first test of the new synthesis version were promising, but improvements both in rules and synthesiser were indicated. A close inspection of the new version nasals showed that many of these contained a large amount of noise that was caused by a less-than-optimal timing of parameter changes. This may in some cases explain the strong preference for the old version. The noise disappeared after a modification of the timing. In the new /d/, the amplitude change was too slow, which meant that it sounded more like a sonorant. After a speeding up of the parameter changes, the new /d/ is greatly improved. In the new /h/, the spectral content of the NA source needs to be optimised.

Second test

Before the second test was run, a major revision of the rule system was performed. The work was focused on getting a complete inventory of consonants for the female text-to-speech system, as opposed to producing a limited set of VCV-sequences. The modifications indicated by listening test 1 were included in the rules.

In the second test, the participants were asked to identify the consonant in a VCV stimulus, following the SAM procedure (Goldstein & Till, 1992). To get data on the complete consonant inventory, the retroflex allophones were included. This resulted in 23 consonant phonemes. Each consonant was presented in three different vowel contexts, /a-a/, /i-i/ and /u-u/. In this test, the stimuli were produced with Swedish accent 2. The vowels were short. Stimuli were presented once in a forced-choice test and the vowel context was indicated on the answer form. Ten subjects participated in the test.

Overall error rate was 48% for correct identification. When only manner of articulation was considered, the error rate was 20%. Some consonants gave a very high confusion rate, in particular /h/ and the retroflex consonants.
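The two scores reported for the identification test can be computed from (stimulus, response) pairs as in the sketch below. The trial list and the manner-of-articulation table are hypothetical toy data chosen only to show the calculation; they are not the responses collected in the study.

```python
# Illustrative manner-of-articulation classes for a handful of consonants.
MANNER = {
    "p": "plosive", "b": "plosive",
    "m": "nasal", "n": "nasal",
    "l": "lateral",
    "h": "fricative", "f": "fricative",
}

def error_rates(trials):
    """Return (identification error rate, manner-of-articulation error rate)."""
    n = len(trials)
    ident_errors = sum(1 for stim, resp in trials if stim != resp)
    manner_errors = sum(1 for stim, resp in trials if MANNER[stim] != MANNER[resp])
    return ident_errors / n, manner_errors / n

# Hypothetical (stimulus, response) pairs from a forced-choice test.
toy_trials = [
    ("p", "b"),  # voicing confusion, same manner
    ("m", "n"),  # place confusion, same manner
    ("m", "l"),  # nasal heard as lateral: manner error
    ("l", "l"),
    ("h", "f"),  # wrong consonant, same manner in this table
    ("f", "f"),
    ("n", "n"),
    ("b", "b"),
]

ident_rate, manner_rate = error_rates(toy_trials)
print(f"identification errors: {ident_rate:.1%}, manner errors: {manner_rate:.1%}")
```

A response can count as an identification error without being a manner error, which is why the manner score is always the lower of the two.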
One possible reason for the low intelligibility of the consonant /h/ is that we so far lack knowledge of how to control the NA parameter by rule. The retroflex consonants are comparatively rare in the language and the voiced retroflexes hardly ever occur after a short vowel. They are often not included in this type of test. When the retroflexes were excluded, the overall error rate decreased to 31% for correct identification. The corresponding overall error rate for natural male speech is 5.6%, and for the latest revision of the KTH male speech synthesis 8.7% (Goldstein & Till, 1992). Measuring manner of articulation, the error rate was 18% on average.

Only a few comments about the results will be made, as this test is meant to be a starting point for future work. The use of glottal source control to produce the voice bar in the voiced plosives gives the desired responses. Very few confusions with other manners of articulation occurred. The production of the voice bar using glottal parameters makes it possible to generate more or less reduced occlusions. The introduction of a pole/zero pair and an amplitude dip gave a convincing /l/. There was a trend to perceive the unvoiced plosives as voiced, which was particularly strong in /a/-context. A possible explanation is that the voice offset of the preceding vowel is too slow. The nasals, which in test 1 were rated equal in quality to the old nasals, appear to have lost their nasality. Confusions with non-nasals are particularly frequent in the /u/ context. It seems that the transfer of the hand-edited rules tested in test 1 into the general definitions and rules tested in test 2 has not always been completely successful.

Conclusions

In this report we have presented a framework for future text-to-speech development for generating more natural-sounding synthetic speech. The inclusion of an extended GLOVE model within the RULSYS/HISYS environment provides a powerful tool for generating improved synthetic speech. Inverse filtering with matching against the LF-model supplies adequate analysis data for the formulation of synthesis rules. The experiments have focused on female speech, as the quality of synthetic female speech cannot be improved from its present status without the inclusion of a versatile voice source and more complex articulation possibilities.

Analysis data on consonants and vowel-consonant transitions suitable for the synthesis model have been presented. The analysis data have been formalised and incorporated within the rule system. The inclusion of an extra pole/zero pair in the vocal branch gives us a possibility to generate more natural synthetic speech. Results from a listening test with hand-edited VCV-sequences verify this. Formalising the data to rules is still at an initial stage. The second, diagnostic test reported in this study indicates where future efforts are needed.

Acknowledgement

This work has been supported by FRN, NUTEK, HSFR, KTH and the European CEC-ESPRIT project SPEECH MAPS.

References

Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Lindell, R. & Neovius, L. (1993). "An experimental dialogue system: Waxholm", Proc. of Eurospeech '93, pp.
Carlson, R., Granström, B. & Hunnicutt, S. (1990). "Multilingual text-to-speech development and applications", Advances in speech, hearing and language processing (B. Ainsworth, ed.), Vol. 1, London: JAI Press Ltd., pp.
Carlson, R., Granström, B. & Karlsson, I. (1991). "Experiments with voice modelling in speech synthesis", Speech Communication, Vol. 10, pp.
Carlson, R., Granström, B. & Nord, L. (1992). "Experiments with emotive speech - acted utterances and synthesized replicas", Proc. of ICSLP '92, pp.
Carlson, R. & Nord, L. (1991). "Positional variants of some Swedish sonorants in an analysis-synthesis scheme", Journal of Phonetics, Vol. 19, pp.
Carlsson, G. & Neovius, L. (1990). "Implementations of synthesis models for speech and singing", STL-QPSR 2-3/1990, pp.
Chafcouloff, M. (1985). "The spectral characteristics of the lateral /l/ in French", Travaux de l'Institut de Phonétique d'Aix, Vol. 10, pp.
Cranen, B. & Boves, L. (1987). "On subglottal formant analysis", JASA, Vol. 81, pp.
Fant, G., Liljencrants, J. & Lin, Q. (1985). "A four-parameter model of glottal flow", STL-QPSR, pp.
