Voice conversion through vector quantization

J. Acoust. Soc. Jpn. (E) 11, 2 (1990)

Voice conversion through vector quantization

Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara

ATR Interpreting Telephony Research Laboratories, Sanpeidani, Inuidani, Seika-cho, Souraku-gun, Kyoto, 619-02 Japan

(Received 13 May 1989)

A new voice conversion technique through vector quantization and spectrum mapping is proposed. This technique is based on mapping codebooks which represent the correspondence between different speakers' codebooks. The mapping codebooks for spectrum parameters, power values, and pitch frequencies are separately generated from training utterances. This technique makes it possible to precisely control voice individuality. The performance of this technique is confirmed by spectrum distortion and pitch frequency difference measurements. To evaluate the overall performance of this technique, listening tests are carried out on two kinds of voice conversion: one between male and female speakers, the other between male speakers. In the male-to-female conversion experiment, all converted utterances are judged as female, and in the male-to-male conversion, 57% of them are identified as the target speaker.

PACS number: 43.72.Ja

1. INTRODUCTION

In daily communication, voice individuality is one of the most important aspects of human speech. It is especially important for identifying the other person in a telephone conversation. A technique to control speech individuality therefore has an important role and offers many applications. Our present study is concerned with converting voice quality from one speaker to another and with developing a technique which enables us to give individuality to synthesized speech. One system goal we can imagine is shown in Fig. 1. Using a voice conversion system as a post-processor for a synthesis-by-rule system, various kinds of speech, such as a particular person's voice, a child-like voice, a husky voice, etc., can be synthesized. For voice conversion, as shown in Fig. 1, it is necessary to have a database of voice individuality.

Speech individuality generally consists of two major factors: acoustic features and prosodic features. As the first step in this research, we are trying to control the acoustic features.1,2) According to previous studies, the acoustic features that contribute to speech individuality are distributed among various parameters, such as formant frequencies, formant bandwidths, spectral tilt, and glottal waveforms.3,4) Because speech individuality is determined by all of these, it is difficult to control it by modifying each parameter independently. On the other hand, the codebooks used in vector quantization represent all of these parameters together. Therefore, the speech individuality of a speaker is represented by the code-vectors in that speaker's codebook, and a conversion of acoustic features from one speaker to another is reduced to the problem of finding a correspondence between the codebooks of the two speakers. The basic problem is, therefore, to find a mapping function from one codebook to another, which we will call a "mapping codebook." This is the basic idea of this conversion technique. In our proposed technique, the mapping codebooks which represent the correspondence between different speakers' codebooks provide the database of speech individuality in Fig. 1.

In section 2, a method of making mapping codebooks and a synthesis procedure are described. In section 3, the mapping codebooks are evaluated by measuring distortion. In section 4, the overall performance of this technique is evaluated by listening tests.
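Concretely, "vector quantizing" a speech frame against a speaker's codebook means picking the nearest codevector. The following is a minimal sketch of that step; the function name, the Euclidean metric, and the array shapes are assumptions for illustration, not details given in the paper.

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each analysis frame to its nearest codevector.

    frames   : (T, d) array of spectrum parameter vectors
    codebook : (K, d) array of one speaker's codevectors
    returns  : (T,) array of codebook indices
    """
    # Squared Euclidean distance from every frame to every codevector.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```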

Fig. 1 Synthesis-by-rule system with voice conversion.

2. VOICE CONVERSION THROUGH VECTOR QUANTIZATION

Our voice conversion technique consists of two steps: a learning step and a conversion-synthesis step. The learning step generates the mapping codebooks, and the conversion-synthesis step synthesizes speech using the mapping codebooks.

2.1 Learning Step

The mapping codebooks are codebooks that describe a mapping function between the vector spaces of two speakers. Figure 2 illustrates the block diagram of the procedure for generating a mapping codebook for spectrum parameters.

1. Two speakers, A and B, pronounce a learning word set. Using these learning utterances, a codebook is generated for each speaker. Next, a word uttered by speaker A is vector quantized using his/her codebook. The same word uttered by speaker B is vector quantized in the same way.
2. The correspondence between vectors of the same words from the two speakers is determined using Dynamic Time Warping (DTW).
3. The vector correspondences between the two speakers are accumulated as histograms. Histograms of vector correspondence are made by applying the same procedure to the other learning words.
4. Using the histogram for each codevector of speaker A as a weighting function, a mapping codebook from speaker A to B is defined as a linear combination of speaker B's vectors (a code sketch of steps 3 and 4 appears at the end of this section).
5. The codevectors of speaker A are replaced with the codevectors of the mapping codebook.
6. If the decrease in average DTW distortion over all training words has not converged, steps 2 through 5 are repeated to refine the mapping codebook.

Pitch frequencies and power values contribute a great deal to speech individuality. Mapping codebooks for these parameters are also generated at the same time, using almost the same procedure as above. The differences are:

1. Pitch frequencies and power values are each scalar-quantized.
2. The mapping codebook for pitch frequencies is defined based on the maximum occurrence in the histogram.

2.2 Conversion-synthesis Step

Figure 3 shows the block diagram of the conversion-synthesis step. First, speaker A's speech is analyzed by the linear prediction method. Then the spectrum parameters are vector-quantized using his/her codebook, and the pitch frequency and power parameters are scalar-quantized using his/her codebooks. Next, all parameters are decoded using the mapping codebooks between speakers A and B. Finally, speech is synthesized by an LPC vocoder. The output speech will have the voice individuality of speaker B.

Fig. 2 Method for generating a mapping codebook.
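The excerpt gives the learning procedure only as prose, so here is a minimal sketch of steps 3 and 4 (histogram accumulation and histogram-weighted combination) and of the spectrum decoding in the conversion-synthesis step. The function names and the pre-computed DTW alignments are assumptions; DTW itself, the iterative refinement of step 6, and the LPC vocoder are omitted.

```python
import numpy as np

def train_mapping_codebook(alignments, codebook_b, size_a):
    """Steps 3-4: accumulate the correspondence histogram, then define
    the A-to-B mapping codebook as histogram-weighted linear
    combinations of speaker B's codevectors.

    alignments : iterable of (i, j) index pairs from DTW-aligned
                 training words (i indexes A's codebook, j indexes B's)
    codebook_b : (K_b, d) codevectors of speaker B
    size_a     : K_a, the size of speaker A's codebook
    """
    hist = np.zeros((size_a, len(codebook_b)))
    for i, j in alignments:                    # step 3: histogram
        hist[i, j] += 1.0
    weights = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1e-12)
    return weights @ codebook_b                # step 4: weighted average

def convert_spectrum(frames_a, codebook_a, mapping_codebook):
    """Conversion-synthesis step for spectrum parameters: quantize with
    speaker A's codebook, decode with the A-to-B mapping codebook
    (the LPC vocoder that resynthesizes speech is not shown)."""
    d2 = ((frames_a[:, None, :] - codebook_a[None, :, :]) ** 2).sum(axis=2)
    return mapping_codebook[d2.argmin(axis=1)]
```

For pitch, the paper instead takes the maximum-occurrence entry of the histogram rather than a weighted average, i.e. `hist.argmax(axis=1)` indexing into B's scalar codebook.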

Fig. 3 Block diagram of voice conversion from speaker A to speaker B.

3. CONVERSION EXPERIMENTS

To evaluate the performance of the conversion technique, distortion measurements were carried out for the spectrum parameters as well as the pitch frequencies.

3.1 Spectrum Conversion Experiments

The experiment conditions are listed in Table 1. A set of 100 phonetically balanced words was used for learning to produce the mapping codebooks. Spectrum conversions were made between female and male voices, between male and male voices, and between female and female voices. Six speakers (3 male and 3 female, all professional announcers) provided the speech material.

Table 1 Experiment conditions.

Table 2 Spectrum distortion.

Table 2 shows the results of the open test. After vector quantization, two kinds of spectrum distortion between two speech samples were calculated: between the input and target speakers' speech ("before conversion" in Table 2), and between the converted and target speakers' speech ("after conversion" in Table 2). In the female-to-female conversion, the distortion decreased by 27% compared to no conversion; in the male-to-male conversion, by 49%; and in the male-to-female conversion, by 66%. These results show that this conversion technique is highly effective when there is a large enough difference between the two speakers' voices.

3.2 Pitch Frequency Conversion Experiments

Pitch frequency conversion was also carried out through the same process described in 3.1. The experiment results are shown in Fig. 4. This figure shows the relationship between the number of learning words and the average pitch frequency difference after conversion. The value at the point where the number of learning words is 0 shows the natural average pitch frequency difference between the two speakers. According to this figure, 60 words are considered to be enough to make a mapping codebook for pitch frequency regardless of speaker combination, and the average pitch frequency difference decreases to less than 15 Hz.

Fig. 4 Pitch frequency differences for the number of learning words (panels: 1. male-male conversion, 2. female-female conversion; horizontal axes: number of learning words).
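The "before/after conversion" comparison of Table 2 amounts to a few lines of code. The sketch below assumes DTW-aligned frame pairs and a Euclidean spectral distance; the excerpt does not name the exact distortion measure, so both are assumptions.

```python
import numpy as np

def mean_distortion(aligned_pairs):
    """Mean Euclidean distance over DTW-aligned (frame, frame) pairs."""
    return float(np.mean([np.linalg.norm(x - y) for x, y in aligned_pairs]))

def distortion_decrease(input_vs_target, converted_vs_target):
    """Percent decrease in spectrum distortion, as reported in Table 2:
    'before conversion' compares input and target speech,
    'after conversion' compares converted and target speech."""
    before = mean_distortion(input_vs_target)
    after = mean_distortion(converted_vs_target)
    return 100.0 * (before - after) / before
```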

4. EVALUATION BY LISTENING TEST

To evaluate the overall performance of this technique, three kinds of listening tests were carried out. The first experiment deals with the male-to-female conversion, and the other two experiments deal with the male-to-male conversion.

4.1 Experiment Procedure

4.1.1 Experiment 1

Experiment 1 was designed to evaluate the voice quality of the male-to-female voice conversion by a pair-comparison listening test. In addition to the fully converted speech, conversion was also done for the pitch and spectrum parameters separately, in order to examine the individual contribution of these parameters to speech individuality. The following is a list of the 5 different speech conversions performed in this experiment.

1. vector-quantized original male speech (m)
2. male-to-female converted speech: pitch frequency conversion only (mp-fp)
3. male-to-female converted speech: spectrum conversion only (ms-fs)
4. male-to-female converted speech: all parameters (m-f)
5. vector-quantized original female speech, which is the target for the conversions (f)

In order to avoid unnecessary cues for the judgement of voice quality, 2 different words were used to make the speech pairs for the listening test. A set of speech pairs consists of all possible combinations of stimuli from the 5 different conversions, 40 in total. They were presented to listeners through a loudspeaker in a soundproof room. Twelve listeners were asked to rate the similarity of each pair in five categories: "similar," "slightly similar," "difficult to decide," "slightly dissimilar," and "dissimilar."

4.1.2 Experiment 2

Experiment 2 was designed to evaluate the conversion between two male speakers by the so-called ABX method. Stimuli A and B are vector-quantized original speech tokens for speakers N and M, respectively. X is either a converted token (N to M, or M to N) or a vector-quantized original token (N or M). Four different words were used for the conversions, and each triad was a combination of 3 different words. A total of 96 speech triads were presented to the listeners as described above. The listeners were required to select the stimulus (A or B) which more closely resembled stimulus X.

4.1.3 Experiment 3

Experiment 3 was designed to evaluate the conversion between male speakers in the same way as in 4.1.1, except that conversions for pitch frequencies alone and spectrum parameters alone were excluded. The following is a list of the 4 conversions.

1. vector-quantized male speech (male 1)
2. same as 1 but for another male speaker (male 2)
3. converted speech from male 1 to male 2 (male 1 to male 2)
4. converted speech from male 2 to male 1 (male 2 to male 1)

A total of 72 speech pairs were generated using the same procedures as in Experiment 1.

4.2 Experiment Results

4.2.1 Evaluation of male-to-female conversion (Results of Experiment 1)

Hayashi's fourth method of quantification5) was applied to the experimental data obtained by the listening test. This method places stimuli in a space according to the similarities between every two stimuli. Its formulation maximizes the measure

Q = -Σ_{i,j} e(i,j) {x(i) - x(j)}²,

where e(i,j) denotes the similarity between stimuli i and j, and x(i) represents the location of stimulus i in the space (one way to compute such a configuration is sketched at the end of this section). The projection onto a two-dimensional space is shown in Fig. 5.

Fig. 5 Distribution of psychological distances for the male-to-female voice conversion.

This figure represents the relative similarity distance between stimuli. In this figure, the converted speech "m-f" is placed very close to the speech "f." This indicates that this technique properly converted the male speech to the target female speech. Judging from the positions of "mp-fp" and "ms-fs," it is observed that the first and second axes roughly correspond to pitch frequency and spectrum differences, respectively. The result indicates that neither pitch frequency nor spectrum alone carries enough information about speech individuality; both are necessary.

4.2.2 Evaluation of male-to-male conversion (Results of Experiments 2 and 3)

The results of Experiment 2 are shown in Table 3.

Table 3 Percentage of correct responses.

The numbers in this table represent the percentage of responses in which stimulus X was judged correctly. According to the table, listeners cannot always correctly identify the speaker even when the original speaker's speech is used as stimulus X; i.e., the correct answer rate is about 85% for the male 1-male 2 pair, and about 60% for the male 1-male 3 pair. Judging from these scores, the average score of 72.2% for the male 1-male 2 conversions shows satisfactory performance. As total performance, 57% of the converted utterances are judged correctly. The relatively poor performance of the male 1-male 3 conversion stems from the fact that male 1's voice quality is very close to male 3's. To confirm this, the "distance" between the original speaker and the target speaker is shown in Fig. 6.

Fig. 6 Speaker distance in spectrum distortion and pitch frequency difference.

This figure shows the spectrum distance and pitch frequency distance before and after conversion. From this figure, the pitch frequency ranges of male 1 and male 3 are found to be very close.

Figure 7 represents the result for Experiment 3, analyzed by the same method as in 4.2.1.

Fig. 7 Distribution of psychological distances for the male-to-male voice conversion.

It is observed that the converted speech samples "male 1 to male 2" and "male 2 to male 1" are both placed closer to their target speech. This indicates that the proposed technique can also convert speech individuality between same-sex speakers.
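The paper cites Hayashi (1985) for the scaling method but gives no algorithm, so the following is only a hedged illustration: maximizing Q = -Σ e(i,j){x(i) - x(j)}² under a norm constraint on each coordinate axis reduces to taking the smallest non-trivial eigenvectors of the Laplacian of the symmetrized similarity matrix. The reduction and the function name are assumptions, not the published procedure.

```python
import numpy as np

def scale_2d(sim):
    """Place n stimuli in 2-D so that Q = -sum_ij e(i,j){x(i)-x(j)}^2
    is maximized under a unit-norm constraint per coordinate axis.

    sim : (n, n) matrix of pairwise similarities e(i, j).
    """
    s = (sim + sim.T) / 2.0                # symmetrize e(i, j)
    lap = np.diag(s.sum(axis=1)) - s       # Laplacian L = D - S
    vals, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    return vecs[:, 1:3]                    # drop the constant eigenvector
```

Applied to a listening-test similarity matrix, this yields a two-dimensional stimulus configuration of the kind plotted in Figs. 5 and 7.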

5. CONCLUSION

A new voice conversion technique through vector quantization and spectrum mapping was proposed. The advantages of this technique are summarized as follows:

1. The mapping codebooks, which make it possible to give an individuality to synthesized speech, are generated from a limited number of word utterances.
2. The mapping codebooks enable voice conversion of high quality between any two speakers.
3. The synthesis process requires little computation and produces speech in real time.

The performance of this technique was confirmed by spectrum distortion and pitch frequency difference measurements. The spectrum distortion between original speech and target speech decreased by 27% to 66%, and the pitch frequency difference decreased to less than 15 Hz. The overall performance of this technique was also confirmed by listening tests. It can be concluded that the converted speech has a voice quality very close to the target speaker's.

ACKNOWLEDGMENTS

We are grateful to Dr. Kurematsu, president of ATR Interpreting Telephony Research Laboratories, for his continuous support of this work. We also wish to thank Dr. Tohkura, head of the Hearing & Speech Perception Department, for helpful discussions.

REFERENCES

1) K. Shikano, K. Lee, and R. Reddy, "Speaker adaptation through vector quantization," ICASSP 86, 2643-2646 (1986).
2) S. Nakamura and K. Shikano, "Spectrogram normalization based on vector quantization," Tech. Rep. Speech Acoust. Soc. Jpn. SP87-17, 9-16 (1987) (in Japanese).
3) H. Kuwabara and T. Takagi, "Quality control of speech by modifying formant frequencies and bandwidth," 11th Int. Congr. Phonetic Sciences, 281-284, August (1987).
4) D. G. Childers, B. Yegnanarayana, and K. Wu, "Voice conversion: factors responsible for quality," ICASSP 85, 748-751 (1985).
5) C. Hayashi, "Recent theoretical and methodological developments in multidimensional scaling and its related methods in Japan," Behaviormetrika No. 18 (1985).