Voice conversion through vector quantization

J. Acoust. Soc. Jpn. (E) 11, 2 (1990)

Voice conversion through vector quantization

Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara

ATR Interpreting Telephony Research Laboratories, Sanpeidani, Inuidani, Seika-cho, Souraku-gun, Kyoto, 619-02 Japan

(Received 13 May 1989)

A new voice conversion technique through vector quantization and spectrum mapping is proposed. This technique is based on mapping codebooks which represent the correspondence between different speakers' codebooks. The mapping codebooks for spectrum parameters, power values, and pitch frequencies are separately generated from training utterances. This technique makes it possible to precisely control voice individuality. The performance of this technique is confirmed by spectrum distortion and pitch frequency difference measurements. To evaluate the overall performance of this technique, listening tests are carried out on two kinds of voice conversion: one between male and female speakers, the other between male speakers. In the male-to-female conversion experiment, all converted utterances are judged as female, and in the male-to-male conversion, 57% of them are identified as the target speaker.

PACS number: 43.72.Ja

1. INTRODUCTION

In daily communication, voice individuality is one of the most important aspects of human speech. It is especially important for identifying the other person in a telephone conversation. A technique to control speech individuality therefore has an important role and offers many applications. Our present study is concerned with converting voice quality from one speaker to another and with developing a technique which enables us to give individuality to synthesized speech. One system goal we can imagine is shown in Fig. 1. Using a voice conversion system as a post-processor for a synthesis-by-rule system, various kinds of speech, such as a particular person's voice, a child-like voice, a husky voice, etc., can be synthesized. For voice conversion, as shown in Fig. 1, it is necessary to have a database of voice individuality.

Speech individuality generally consists of two major factors: acoustic features and prosodic features. As the first step in this research, we are trying to control the acoustic features.1,2) According to previous studies, the acoustic features that contribute to speech individuality are distributed among various parameters, such as formant frequencies, formant bandwidths, spectral tilt, and glottal waveforms.3,4) Because speech individuality is determined by all of these, it is difficult to control it by modifying each parameter independently. On the other hand, the codebooks used in vector quantization represent all of these parameters together. Therefore, the speech individuality of a speaker is represented by the code-vectors in that speaker's codebook, and a conversion of acoustic features from one speaker to another is reduced to the problem of finding a correspondence between the codebooks of the two speakers. The basic problem is, therefore, to find a mapping function from one codebook to another, which we will call a "mapping codebook." This is the basic idea of this conversion technique. In our proposed technique, the mapping codebooks which represent the correspondence between different speakers' codebooks provide the database of speech individuality in Fig. 1.

In section 2, a method of making mapping codebooks and a synthesis procedure are described. In section 3, the mapping codebooks are evaluated by measuring distortion. In section 4, the overall performance of this technique is evaluated by listening tests.
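Concretely, "vector quantizing" a speech frame against a speaker's codebook means picking the nearest codevector. The following is a minimal sketch of that step; the function name, the Euclidean metric, and the array shapes are assumptions for illustration, not details given in the paper.

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each analysis frame to its nearest codevector.

    frames   : (T, d) array of spectrum parameter vectors
    codebook : (K, d) array of one speaker's codevectors
    returns  : (T,) array of codebook indices
    """
    # Squared Euclidean distance from every frame to every codevector.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```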

Fig. 1 Synthesis-by-rule system with voice conversion.

2. VOICE CONVERSION THROUGH VECTOR QUANTIZATION

Our voice conversion technique consists of two steps: a learning step and a conversion-synthesis step. The learning step generates the mapping codebooks, and the conversion-synthesis step synthesizes speech using the mapping codebooks.

2.1 Learning Step

The mapping codebooks are codebooks that describe a mapping function between the vector spaces of two speakers. Figure 2 illustrates the block diagram of the procedure for generating a mapping codebook for spectrum parameters.

1. Two speakers, A and B, pronounce a learning word set. Using these learning utterances, a codebook is generated for each speaker. Next, a word uttered by speaker A is vector quantized using his/her codebook. The same word uttered by speaker B is vector quantized in the same way.
2. The correspondence between vectors of the same words from the two speakers is determined using Dynamic Time Warping (DTW).
3. The vector correspondences between the two speakers are accumulated as histograms. Histograms of vector correspondence are made by applying the same procedure to the other learning words.
4. Using the histogram for each codevector of speaker A as a weighting function, a mapping codebook from speaker A to B is defined as a linear combination of speaker B's vectors (a code sketch of steps 3 and 4 appears at the end of this section).
5. The codevectors of speaker A are replaced with the codevectors of the mapping codebook.
6. If the decrease in average DTW distortion over all training words has not converged, steps 2 through 5 are repeated to refine the mapping codebook.

Pitch frequencies and power values contribute a great deal to speech individuality. Mapping codebooks for these parameters are also generated at the same time, using almost the same procedure as above. The differences are:

1. Pitch frequencies and power values are each scalar-quantized.
2. The mapping codebook for pitch frequencies is defined based on the maximum occurrence in the histogram.

2.2 Conversion-synthesis Step

Figure 3 shows the block diagram of the conversion-synthesis step. First, speaker A's speech is analyzed by the linear prediction method. Then the spectrum parameters are vector-quantized using his/her codebook, and the pitch frequency and power parameters are scalar-quantized using his/her codebooks. Next, all parameters are decoded using the mapping codebooks between speakers A and B. Finally, speech is synthesized by an LPC vocoder. The output speech will have the voice individuality of speaker B.

Fig. 2 Method for generating a mapping codebook.
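The excerpt gives the learning procedure only as prose, so here is a minimal sketch of steps 3 and 4 (histogram accumulation and histogram-weighted combination) and of the spectrum decoding in the conversion-synthesis step. The function names and the pre-computed DTW alignments are assumptions; DTW itself, the iterative refinement of step 6, and the LPC vocoder are omitted.

```python
import numpy as np

def train_mapping_codebook(alignments, codebook_b, size_a):
    """Steps 3-4: accumulate the correspondence histogram, then define
    the A-to-B mapping codebook as histogram-weighted linear
    combinations of speaker B's codevectors.

    alignments : iterable of (i, j) index pairs from DTW-aligned
                 training words (i indexes A's codebook, j indexes B's)
    codebook_b : (K_b, d) codevectors of speaker B
    size_a     : K_a, the size of speaker A's codebook
    """
    hist = np.zeros((size_a, len(codebook_b)))
    for i, j in alignments:                    # step 3: histogram
        hist[i, j] += 1.0
    weights = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1e-12)
    return weights @ codebook_b                # step 4: weighted average

def convert_spectrum(frames_a, codebook_a, mapping_codebook):
    """Conversion-synthesis step for spectrum parameters: quantize with
    speaker A's codebook, decode with the A-to-B mapping codebook
    (the LPC vocoder that resynthesizes speech is not shown)."""
    d2 = ((frames_a[:, None, :] - codebook_a[None, :, :]) ** 2).sum(axis=2)
    return mapping_codebook[d2.argmin(axis=1)]
```

For pitch, the paper instead takes the maximum-occurrence entry of the histogram rather than a weighted average, i.e. `hist.argmax(axis=1)` indexing into B's scalar codebook.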

Fig. 3 Block diagram of voice conversion from speaker A to speaker B.

3. CONVERSION EXPERIMENTS

To evaluate the performance of the conversion technique, distortion measurements were carried out for the spectrum parameters as well as the pitch frequencies.

3.1 Spectrum Conversion Experiments

The experiment conditions are listed in Table 1. A set of 100 phonetically balanced words was used for learning to produce the mapping codebooks. Spectrum conversions were made between female and male voices, between male and male voices, and between female and female voices. Six speakers (3 male and 3 female, all professional announcers) provided the speech material.

Table 1 Experiment conditions.

Table 2 Spectrum distortion.

Table 2 shows the results of the open test. After vector quantization, two kinds of spectrum distortion between two speech samples were calculated: between the input and target speakers' speech ("before conversion" in Table 2), and between the converted and target speakers' speech ("after conversion" in Table 2). In the female-to-female conversion, the distortion decreased by 27% compared to no conversion; in the male-to-male conversion, by 49%; and in the male-to-female conversion, by 66%. These results show that this conversion technique is highly effective when there is a large enough difference between the two speakers' voices.

3.2 Pitch Frequency Conversion Experiments

Pitch frequency conversion was also carried out through the same process described in 3.1. The experiment results are shown in Fig. 4. This figure shows the relationship between the number of learning words and the average pitch frequency difference after conversion. The value at the point where the number of learning words is 0 shows the natural average pitch frequency difference between the two speakers. According to this figure, 60 words are considered to be enough to make a mapping codebook for pitch frequency regardless of speaker combination, and the average pitch frequency difference decreases to less than 15 Hz.

Fig. 4 Pitch frequency differences for the number of learning words (panels: 1. male-male conversion, 2. female-female conversion; horizontal axes: number of learning words).
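The "before/after conversion" comparison of Table 2 amounts to a few lines of code. The sketch below assumes DTW-aligned frame pairs and a Euclidean spectral distance; the excerpt does not name the exact distortion measure, so both are assumptions.

```python
import numpy as np

def mean_distortion(aligned_pairs):
    """Mean Euclidean distance over DTW-aligned (frame, frame) pairs."""
    return float(np.mean([np.linalg.norm(x - y) for x, y in aligned_pairs]))

def distortion_decrease(input_vs_target, converted_vs_target):
    """Percent decrease in spectrum distortion, as reported in Table 2:
    'before conversion' compares input and target speech,
    'after conversion' compares converted and target speech."""
    before = mean_distortion(input_vs_target)
    after = mean_distortion(converted_vs_target)
    return 100.0 * (before - after) / before
```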

4. EVALUATION BY LISTENING TEST

To evaluate the overall performance of this technique, three kinds of listening tests were carried out. The first experiment deals with the male-to-female conversion, and the other two experiments deal with the male-to-male conversion.

4.1 Experiment Procedure

4.1.1 Experiment 1

Experiment 1 was designed to evaluate the voice quality of the male-to-female voice conversion by a pair-comparison listening test. In addition to the fully converted speech, conversion was also done for the pitch and spectrum parameters separately, in order to examine the individual contribution of these parameters to speech individuality. The following is a list of the 5 different speech conversions performed in this experiment.

1. vector-quantized original male speech (m)
2. male-to-female converted speech: pitch frequency conversion only (mp-fp)
3. male-to-female converted speech: spectrum conversion only (ms-fs)
4. male-to-female converted speech: all parameters (m-f)
5. vector-quantized original female speech, which is the target for the conversions (f)

In order to avoid unnecessary cues for the judgement of voice quality, 2 different words were used to make the speech pairs for the listening test. A set of speech pairs consists of all possible combinations of stimuli from the 5 different conversions, 40 in total. They were presented to listeners through a loudspeaker in a soundproof room. Twelve listeners were asked to rate the similarity of each pair in five categories: "similar," "slightly similar," "difficult to decide," "slightly dissimilar," and "dissimilar."

4.1.2 Experiment 2

Experiment 2 was designed to evaluate the conversion between two male speakers by the so-called ABX method. Stimuli A and B are vector-quantized original speech tokens for speakers N and M, respectively. X is either a converted token (N to M, or M to N) or a vector-quantized original token (N or M). Four different words were used for the conversions, and each triad was a combination of 3 different words. A total of 96 speech triads were presented to the listeners as described above. The listeners were required to select the stimulus (A or B) which more closely resembled stimulus X.

4.1.3 Experiment 3

Experiment 3 was designed to evaluate the conversion between male speakers in the same way as in 4.1.1, except that conversions for pitch frequencies alone and spectrum parameters alone were excluded. The following is a list of the 4 conversions.

1. vector-quantized male speech (male 1)
2. same as 1 but for another male speaker (male 2)
3. converted speech from male 1 to male 2 (male 1 to male 2)
4. converted speech from male 2 to male 1 (male 2 to male 1)

A total of 72 speech pairs were generated using the same procedures as in Experiment 1.

4.2 Experiment Results

4.2.1 Evaluation of male-to-female conversion (Results of Experiment 1)

Hayashi's fourth method of quantification5) was applied to the experimental data obtained by the listening test. This method places stimuli in a space according to the similarities between every two stimuli. Its formulation maximizes the measure

Q = -Σ_{i,j} e(i,j) {x(i) - x(j)}²,

where e(i,j) denotes the similarity between stimuli i and j, and x(i) represents the location of stimulus i in the space (one way to compute such a configuration is sketched at the end of this section). The projection onto a two-dimensional space is shown in Fig. 5.

Fig. 5 Distribution of psychological distances for the male-to-female voice conversion.

This figure represents the relative similarity distance between stimuli. In this figure, the converted speech "m-f" is placed very close to the speech "f." This indicates that this technique properly converted the male speech to the target female speech. Judging from the positions of "mp-fp" and "ms-fs," it is observed that the first and second axes roughly correspond to pitch frequency and spectrum differences, respectively. The result indicates that neither pitch frequency nor spectrum alone carries enough information about speech individuality; both are necessary.

4.2.2 Evaluation of male-to-male conversion (Results of Experiments 2 and 3)

The results of Experiment 2 are shown in Table 3.

Table 3 Percentage of correct responses.

The numbers in this table represent the percentage of responses in which stimulus X was judged correctly. According to the table, listeners cannot always correctly identify the speaker even when the original speaker's speech is used as stimulus X; i.e., the correct answer rate is about 85% for the male 1-male 2 pair, and about 60% for the male 1-male 3 pair. Judging from these scores, the average score of 72.2% for the male 1-male 2 conversions shows satisfactory performance. As total performance, 57% of the converted utterances are judged correctly. The relatively poor performance of the male 1-male 3 conversion stems from the fact that male 1's voice quality is very close to male 3's. To confirm this, the "distance" between the original speaker and the target speaker is shown in Fig. 6.

Fig. 6 Speaker distance in spectrum distortion and pitch frequency difference.

This figure shows the spectrum distance and pitch frequency distance before and after conversion. From this figure, the pitch frequency ranges of male 1 and male 3 are found to be very close.

Figure 7 represents the result for Experiment 3, analyzed by the same method as in 4.2.1.

Fig. 7 Distribution of psychological distances for the male-to-male voice conversion.

It is observed that the converted speech samples "male 1 to male 2" and "male 2 to male 1" are both placed closer to their target speech. This indicates that the proposed technique can also convert speech individuality between same-sex speakers.
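The paper cites Hayashi (1985) for the scaling method but gives no algorithm, so the following is only a hedged illustration: maximizing Q = -Σ e(i,j){x(i) - x(j)}² under a norm constraint on each coordinate axis reduces to taking the smallest non-trivial eigenvectors of the Laplacian of the symmetrized similarity matrix. The reduction and the function name are assumptions, not the published procedure.

```python
import numpy as np

def scale_2d(sim):
    """Place n stimuli in 2-D so that Q = -sum_ij e(i,j){x(i)-x(j)}^2
    is maximized under a unit-norm constraint per coordinate axis.

    sim : (n, n) matrix of pairwise similarities e(i, j).
    """
    s = (sim + sim.T) / 2.0                # symmetrize e(i, j)
    lap = np.diag(s.sum(axis=1)) - s       # Laplacian L = D - S
    vals, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    return vecs[:, 1:3]                    # drop the constant eigenvector
```

Applied to a listening-test similarity matrix, this yields a two-dimensional stimulus configuration of the kind plotted in Figs. 5 and 7.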

5. CONCLUSION

A new voice conversion technique through vector quantization and spectrum mapping was proposed. The advantages of this technique are summarized as follows:

1. The mapping codebooks, which make it possible to give an individuality to synthesized speech, are generated from a limited number of word utterances.
2. The mapping codebooks enable voice conversion of high quality between any two speakers.
3. The synthesis process requires little computation and produces speech in real time.

The performance of this technique was confirmed by spectrum distortion and pitch frequency difference measurements. The spectrum distortion between original speech and target speech decreased by 27% to 66%, and the pitch frequency difference decreased to less than 15 Hz. The overall performance of this technique was also confirmed by listening tests. It can be concluded that the converted speech has a voice quality very close to the target speaker's.

ACKNOWLEDGMENTS

We are grateful to Dr. Kurematsu, president of ATR Interpreting Telephony Research Laboratories, for his continuous support of this work. We also wish to thank Dr. Tohkura, head of the Hearing & Speech Perception Department, for helpful discussions.

REFERENCES

1) K. Shikano, K. Lee, and R. Reddy, "Speaker adaptation through vector quantization," ICASSP 86, 2643-2646 (1986).
2) S. Nakamura and K. Shikano, "Spectrogram normalization based on vector quantization," Tech. Rep. Speech Acoust. Soc. Jpn. SP87-17, 9-16 (1987) (in Japanese).
3) H. Kuwabara and T. Takagi, "Quality control of speech by modifying formant frequencies and bandwidth," 11th Int. Congr. Phonetic Sciences, 281-284, August (1987).
4) D. G. Childers, B. Yegnanarayana, and K. Wu, "Voice conversion: factors responsible for quality," ICASSP 85, 748-751 (1985).
5) C. Hayashi, "Recent theoretical and methodological developments in multidimensional scaling and its related methods in Japan," Behaviormetrika No. 18 (1985).