PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING. Lifa Sun, Kun Li, Hao Wang, Shiyin Kang and Helen Meng


Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China

ABSTRACT

This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system. It is assumed that these PPGs can represent articulation of speech sounds in a speaker-normalized space and correspond to spoken content speaker-independently. The proposed approach first obtains PPGs of target speech. Then, a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) structure is used to model the relationships between the PPGs and acoustic features of the target speech. To convert arbitrary source speech, we obtain its PPGs from the same SI-ASR and feed them into the trained DBLSTM for generating converted speech. Our approach has two main advantages: 1) no parallel training data is required; 2) a trained model can be applied to any other source speaker for a fixed target speaker (i.e., many-to-one conversion). Experiments show that our approach performs equally well or better than state-of-the-art systems in both speech quality and speaker similarity.

Index Terms: voice conversion, phonetic posteriorgrams, non-parallel, many-to-one, SI-ASR, DBLSTM

1. INTRODUCTION

Voice conversion (VC) aims to modify the speech of one speaker to make it sound as if it were spoken by another specific speaker. VC can be widely applied to many fields including customized feedback of computer-aided pronunciation trimming systems, development of personalized speaking aids for speech-impaired subjects, movie dubbing with various persons' voices, etc. Typical VC training works as follows: speech segments (e.g., frames) with the same spoken content are aligned first. Then, the mapping from source acoustic features to target acoustic features is found.

Many previous efforts on VC rely on parallel training data in which speech recordings come in pairs by the source speaker and the target speaker uttering the same sentences. Stylianou et al. [1] proposed a continuous probabilistic transformation approach based on Gaussian Mixture Models (GMMs). Toda et al. [2] improved the performance of the GMM-based method by using global variance to alleviate the over-smoothing effect. Wu et al. [3] proposed a non-negative matrix factorization-based method to use speech exemplars to synthesize converted speech directly. Nakashika et al. [4] used a Deep Neural Network (DNN) to map the source and target in high order space. Sun et al. [5] proposed a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM)-based approach to model the relationships between source and target speeches by using spectral features and their context information.

All the above approaches provide reasonably good results. However, in practice, parallel data is not easily available. Hence, some researchers proposed approaches to VC with non-parallel data, which is a more challenging problem. Most of these approaches focused on finding proper frame alignments, which is not so straightforward. Erro et al. [6] proposed an iterative alignment method to pair phonetically equivalent acoustic vectors from non-parallel utterances. Tao et al. [7] proposed a supervisory data alignment method, where phonetic information was used as the restriction during alignment. Silén et al. [8] extended a dynamic kernel partial least squares regression-based approach for non-parallel data by combining it with an iterative alignment algorithm. Benisty et al. [9] used temporal context information to improve the iterative alignment accuracy of non-parallel data. Unfortunately, the experimental results [6-9] show that the performance of VC with non-parallel data is not as good as that of VC with parallel data. This outcome is reasonable because it is difficult to make non-parallel alignment as accurate as parallel alignment.

Aryal et al. [10] proposed a very different approach that made use of articulatory behavior estimated by electromagnetic articulography (EMA). With the belief that different speakers have the same articulatory behavior (if their articulatory areas are normalized) when they speak the same spoken content, the authors took normalized EMA features as a bridge between the source and target speakers. After modeling the mapping between EMA features and acoustic features of the target speaker, VC can be achieved by driving the trained model with EMA features of the source speaker.

Our approach is inspired by [10]. However, instead of EMA features, which are expensive to obtain, we use easily accessible Phonetic PosteriorGrams (PPGs) to bridge between speakers. A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance [11, 12]. Our proposed approach generates PPGs by employing a speaker-independent automatic speech recognition (SI-ASR) system for equalizing speaker differences. Then, we use a DBLSTM structure to model the mapping between the obtained PPGs and the corresponding acoustic features of the target speaker for speech parameter generation. Finally, we perform VC by driving the trained DBLSTM model with the source speaker's PPGs (obtained from the same SI-ASR). Note that we are not using any underlying linguistic information behind PPGs from SI-ASR in VC. Our proposed approach has the following advantages: 1) no parallel training data is required; 2) no alignment process (e.g., DTW) is required, which avoids the influence of possible alignment errors; 3) a trained model can be applied to any other source speakers as long as the target speaker is fixed (as in many-to-one conversion). By contrast, for the state-of-the-art approach with parallel training data, a trained model is only applicable to a specific source speaker (as in one-to-one conversion).

The rest of the paper is organized as follows: Section 2 introduces a state-of-the-art VC system that relies on parallel training data as our baseline. Section 3 describes our proposed VC approach with PPGs. Section 4 presents the experiments and the comparison of our proposed approach against the baseline in terms of both objective and subjective measures. Section 5 concludes this paper.

2. BASELINE: DBLSTM-BASED APPROACH WITH PARALLEL TRAINING DATA

The baseline approach is based on a DBLSTM framework which is trained with parallel data [5].

2.1. Basic Framework of DBLSTM

As shown in Fig. 1, DBLSTM is a sequence to sequence mapping model. The middle section, the left section and the right section (marked with t, t-1 and t+1 respectively) stand for the current frame, the previous frame and the following frame respectively. Each square in Fig. 1 represents one memory block, which contains self-connected memory cells and three gate units (i.e., input, output and forget gates) that can respectively provide write, read and reset operations. Furthermore, bidirectional connections of each layer can make full use of the context information in both forward and backward directions.

Fig. 1. Architecture of DBLSTM. (Layer labels in the figure: Output Layer, Layer-2 Backward, Layer-2 Forward, Layer-1 Backward, Layer-1 Forward, Input Layer.)

The DBLSTM network architecture, including memory blocks and recurrent connections, makes it possible to store information over a longer period of time and to learn the optimal amount of context information [5, 13].

2.2. Training Stage and Conversion Stage

The baseline approach is divided into a training stage and a conversion stage as illustrated in Fig. 2.

Fig. 2. Schematic diagram of the DBLSTM-based approach for VC with parallel training data.

In the training stage, the spectral envelope is extracted by STRAIGHT analysis [14]. Mel-cepstral coefficients (MCEPs) [15] are extracted to represent the spectral envelope and then MCEPs features from the same sentences of the source speech and the target speech are aligned by dynamic time warping (DTW). Then, paired MCEPs features of the source and target speeches are treated as the training data. Back-propagation through time (BPTT) is used to train the DBLSTM model.
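
The DTW alignment step of the baseline training stage can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dtw_align`, the Euclidean frame distance and the three-way step pattern are assumptions chosen for illustration, and the inputs are assumed to be arrays of shape (frames, dimensions).

```python
import numpy as np

def dtw_align(source: np.ndarray, target: np.ndarray) -> list:
    """Return the DTW warping path as (source_frame, target_frame) index pairs."""
    n, m = len(source), len(target)
    # Pairwise Euclidean distances between all source and target frames.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    # cost[i, j] is the best accumulated cost of aligning the first i source
    # frames with the first j target frames.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the last frame pair to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy one-dimensional "feature" sequences of different lengths.
source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [1.0], [1.0], [2.0]])
path = dtw_align(source, target)
```

Indexing the two sequences with the returned pairs yields equally long, frame-paired training data; any error in this path propagates into the paired features, which is the limitation noted for the baseline.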


In the conversion stage, fundamental frequency (F0), MCEPs and an aperiodic component (AP) are extracted for one source utterance first. Then, parameters of the converted speech are generated as follows: MCEPs are mapped by the trained DBLSTM model. Log F0 is converted by equalizing the mean and the standard deviation of the source and target speeches. AP is directly copied. Finally, the STRAIGHT vocoder is used to synthesize the speech waveform.

2.3. Limitations

Despite its good performance, the DBLSTM-based approach has the following limitations: 1) it relies on parallel training data which is expensive to collect; 2) the influence of DTW errors on VC output quality is unavoidable.

3. PROPOSED APPROACH: VC WITH PHONETIC POSTERIORGRAMS (PPGS)

To solve the limitations of the baseline approach, we propose a PPGs-based approach with the belief that PPGs obtained from an SI-ASR system can bridge across speakers.

3.1. Overview

Fig. 3. Schematic diagram of VC with PPGs. SI stands for speaker-independent. Target speech and source speech do not have any overlapped portion. The shaded part will be presented in Fig. 5. (In the diagram, the two trained SI-ASR models marked with * are the same model.)

As illustrated in Fig. 3, the proposed approach is divided into three stages: training stage 1, training stage 2 and the conversion stage. The role of the SI-ASR model is to obtain a PPGs representation of the input speech. Training stage 2 models the relationships between the PPGs and MCEPs features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with PPGs of the source speech (obtained from the same SI-ASR) for VC. The computation of PPGs and the three stages will be presented in the following subsections.

3.2. Phonetic PosteriorGrams (PPGs)

A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance [11, 12]. A phonetic class may refer to a word, a phone or a senone. In this paper, we treat senones as the phonetic class. Fig. 4 shows an example of PPG representation for the spoken phrase "particular case".

Fig. 4. PPG representation of the spoken phrase "particular case". The horizontal axis represents time in seconds and the vertical one contains indices of phonetic classes. The number of senones is 131. Darker shade implies a higher posterior probability.

We believe that PPGs obtained from an SI-ASR can represent articulation of speech sounds in a speaker-normalized space and correspond to speech content speaker-independently. Therefore, we regard these PPGs as a bridge between the source and the target speakers.

3.3. Training Stages 1 and 2

In training stage 1, an SI-ASR system is trained for PPGs generation using a multi-speaker ASR corpus. The equations are illustrated by the example of one utterance. The input is the MFCC feature vector of the t-th frame, denoted as X_t. The output is the vector of posterior probabilities P_t = (p(s | X_t) | s = 1, 2, ..., C), where p(s | X_t) is the posterior probability of each phonetic class s.

Fig. 5. Schematic diagram of DBLSTM model training.

As shown in Fig. 5, training stage 2 trains the DBLSTM model (speech parameter generation model) to get the mapping relationships between the PPG and the MCEPs sequence. For a given utterance from the target speaker, t denotes the frame index of this sequence. The input is the PPG (P_1, ..., P_t, ..., P_N), computed by the trained SI-ASR model. The ideal value of the output layer is the MCEPs sequence (Y_1^T, ..., Y_t^T, ..., Y_N^T), extracted from the target speech. The actual value of the output layer is (Y_1^R, ..., Y_t^R, ..., Y_N^R). The cost function of training stage 2 is

    \min \sum_{t=1}^{N} \left\| Y_t^R - Y_t^T \right\|^2    (1)

The model is trained to minimize the cost function through the BPTT technique mentioned in Section 2. Note that the DBLSTM model is trained using only the target speaker's MCEPs features and the speaker-independent PPGs without using any other linguistic information.

3.4. Conversion Stage

In the conversion stage, the conversion of log F0 and AP is the same as that of the baseline approach. First, to get the converted MCEPs, MFCC features of the source speech are extracted. Second, PPGs are obtained from the trained SI-ASR model where the input is MFCC features. Third, PPGs are converted to MCEPs by the trained DBLSTM model. Finally, the converted MCEPs together with the converted log F0 and AP are used by the vocoder to synthesize the output speech.

4. EXPERIMENTS

4.1. Experimental Setup

The data we use for VC is the CMU ARCTIC corpus [16]. The within-gender conversion experiment (male-to-male: BDL to RMS) and the cross-gender conversion experiment (male-to-female: BDL to SLT) are conducted. The baseline approach uses parallel speech of the source and target speakers while our proposed approach uses only the target speaker's speech for model training. The signals are sampled at 16 kHz with mono channel, windowed with 25 ms and shifted every 5 ms. Acoustic features, including spectral envelope, F0 (1 dimension) and AP (513 dimensions) are extracted by STRAIGHT analysis [14]. The 39th order MCEPs plus log energy are extracted to represent the spectral envelope.

Two systems are implemented for comparison:

Baseline system: DBLSTM-based approach with parallel training data. Two tasks: male-to-male (M2M) conversion and male-to-female (M2F) conversion.

Proposed system: Our proposed approach uses PPGs to augment the DBLSTM. Two tasks: male-to-male (M2M) conversion and male-to-female (M2F) conversion.
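
To make the PPG input of the proposed system concrete, the following sketch builds a time-versus-class posterior matrix from frame-level class scores. It assumes, hypothetically, that the SI-ASR acoustic model exposes unnormalized per-frame scores and that posteriors are obtained with a softmax; the function name `posteriorgram` and the random scores are placeholders, not the paper's pipeline.

```python
import numpy as np

def posteriorgram(scores: np.ndarray) -> np.ndarray:
    """Turn per-frame class scores (N frames x C classes) into a PPG.

    Row t of the result plays the role of P_t = (p(s | X_t) | s = 1..C).
    """
    # Subtract the per-frame maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C = 200, 131  # 200 frames (arbitrary); 131 senone classes as in the paper
scores = rng.normal(size=(N, C))  # stand-in for hypothetical SI-ASR outputs
ppg = posteriorgram(scores)
```

Each row of `ppg` sums to one, so the matrix can be fed frame by frame as the 131-dimensional input sequence of the speech parameter generation model.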
In the PPGs-based approach, the SI-ASR system is implemented using the Kaldi speech recognition toolkit [17] with the TIMIT corpus [18]. The system has a DNN architecture with 4 hidden layers, each of which contains 1024 units. Senones are treated as the phonetic class of PPGs. The number of senones is 131, which is obtained by clustering in training stage 1. The hardware configuration of the SI-ASR model training is dual Intel Xeon E5-2640, 8 cores, 2.6 GHz. The training time is about 11 hours.

Then, the DBLSTM model is adopted to map the relationships of the PPGs sequence and the MCEPs sequence for speech parameter generation. The implementation is based on the machine learning library CURRENNT [19]. The number of units in each layer is [ ] respectively, where each hidden layer contains one forward LSTM layer and one backward LSTM layer. BPTT is used to train this model with a learning rate of and a momentum of 0.9. The training process of the DBLSTM model is accelerated by an NVIDIA Tesla K40 GPU and it takes about 4 hours for a 100-sentence training set. The baseline DBLSTM-based approach has the same model configuration except that its input has only 39 dimensions (instead of 131). It takes about 3 hours for a 100-sentence training set.

4.2. Objective Evaluation

Mel-cepstral distortion (MCD) is used to measure how close the converted speech is to the target speech. MCD is the Euclidean distance between the MCEPs of the converted speech and the target speech, denoted as

    \mathrm{MCD}[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{N} \left( c_d - c_d^{converted} \right)^2}    (2)

where N is the dimension of MCEPs (excluding the energy feature). c_d and c_d^{converted} are the d-th coefficient of the target and converted MCEPs respectively.

To explore the effect of the training data size, all the systems are trained using different amounts of training data: 5, 20, 60, 100 and 200 sentences. For the baseline approach, the training data consists of parallel pairs of sentences from the source and target speakers. For the proposed approach, the training data consists only of the sentences from the target speaker. The test data set has 80 sentences from the source speaker.
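
Eq. (2) can be evaluated per frame and averaged, as in the sketch below. The function name `mel_cepstral_distortion` is hypothetical, and two points are assumptions rather than details given in the paper: the two MCEPs sequences are taken to be already time-aligned with the energy coefficient removed, and the per-frame values are averaged with a plain mean.

```python
import numpy as np

def mel_cepstral_distortion(target: np.ndarray, converted: np.ndarray) -> float:
    """Average MCD in dB over frames; inputs have shape (frames, N)."""
    if target.shape != converted.shape:
        raise ValueError("sequences must be time-aligned and equally sized")
    diff = target - converted
    # Eq. (2) applied to every frame: (10 / ln 10) * sqrt(2 * sum_d diff_d^2).
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

# Toy example: two frames, three coefficients, every coefficient off by one.
target = np.zeros((2, 3))
converted = np.ones((2, 3))
mcd = mel_cepstral_distortion(target, converted)
```

A lower value means the converted MCEPs are closer to the target; identical sequences give 0 dB.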

Fig. 6. Average MCD of baseline and proposed approaches. Male-to-male conversion experiment. (Axes: No. of Sentences versus Mel-cepstral Distortion in dB.)

Fig. 7. Average MCD of baseline and proposed approaches. Male-to-female conversion experiment. (Axes: No. of Sentences versus Mel-cepstral Distortion in dB.)

Fig. 6 and Fig. 7 show the results of male-to-male and male-to-female experiments respectively. As shown, when the training size is at 5, 20 and 60 sentences, the MCD value becomes smaller with the increase of the data size. The MCD value tends to converge when the training size is larger than 60 sentences. The results indicate that the baseline approach and the proposed approach have similar performance in terms of objective measure.

4.3. Subjective Evaluations

We conducted a Mean Opinion Score (MOS) test and an ABX preference test as subjective evaluations for measuring the naturalness and speaker similarity of converted speech. 100 sentences are used for training each system and 10 sentences (not in the training set) are randomly selected for testing. 21 participants are asked to do the MOS test and the ABX test. The questionnaires of these two tests and some samples are presented in https://sites.google.com/site/2016icme/.

In the MOS test, listeners are asked to rate the naturalness and clearness of the converted speech on a 5-point scale. The results of the MOS test are shown in Fig. 8. The average scores of the baseline and proposed PPGs-based approaches are 3.20 and 3.87 respectively.

Fig. 8. MOS test results with the 95% confidence intervals. M2M: male-to-male experiment. M2F: male-to-female experiment. 5-point scale: 5: excellent, 4: good, 3: fair, 2: poor, 1: bad.

For the ABX preference test, listeners are asked to choose which of the converted utterances A and B (generated by the two approaches) sounds more like the target speaker's recording X, or to indicate no preference. Each pair of A and B is shuffled to avoid preferential bias. As shown in Fig. 9, the PPGs-based approach is frequently preferred over the baseline approach.

Fig. 9. ABX preference test results. N/P stands for no preference. M2M: male-to-male experiment. M2F: male-to-female experiment. The p-values of the two experiments are and respectively. (Bar labels in the figure: M2M: 11%, 37%, 52%; M2F: 17%, 50%, 33%; legend entries: Baseline, N/P.)

Results from both the MOS test and the ABX test show that our proposed PPGs-based approach performs better than the baseline approach in both speech quality and speaker similarity. Possible reasons include: 1) the proposed PPGs-based approach does not require alignment (e.g., DTW), which avoids the influence caused by possible alignment errors; 2) the DBLSTM model of the proposed approach is trained using only the speaker-normalized PPGs and the target speaker's acoustic features. This minimizes the interference from the source speaker's signal.

5. CONCLUSIONS

In this paper, we propose a PPGs-based voice conversion approach for non-parallel data. PPGs, obtained by an SI-ASR model, are used to bridge between the source and target speakers. The relationships between PPGs and acoustic features are modeled by a DBLSTM structure. The proposed approach does not require parallel training data and is very flexible for many-to-one conversion, which are the two main advantages over the approach for voice conversion (VC) using parallel data. Experiments suggest that the proposed approach improves the naturalness of the converted speech and its similarity with target speech.

We have also tried applying our proposed model to cross-lingual VC and have obtained some good preliminary results. More investigation on the cross-lingual applications will be conducted in the future.

6. ACKNOWLEDGEMENTS

The work is partially supported by a grant from the HKSAR Government's General Research Fund (Project Number: ).

7. REFERENCES

[1] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2.
[2] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8.
[3] Z. Wu, T. Virtanen, T. Kinnunen, E. S. Chng, and H. Li, "Exemplar-based voice conversion using non-negative spectrogram deconvolution," in Proc. 8th ISCA Speech Synthesis Workshop.
[4] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, "Voice conversion in high-order eigen space using Deep Belief Nets," in Proc. Interspeech.
[5] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional Long Short-Term Memory based Recurrent Neural Networks," in Proc. ICASSP.
[6] D. Erro, A. Moreno, and A. Bonafonte, "INCA algorithm for training voice conversion systems from nonparallel corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5.
[7] J. Tao, M. Zhang, J. Nurminen, J. Tian, and X. Wang, "Supervisory data alignment for text-independent voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5.
[8] H. Silén, J. Nurminen, E. Helander, and M. Gabbouj, "Voice conversion for non-parallel datasets using dynamic kernel partial least squares regression," Convergence, vol. 1, p. 2.
[9] H. Benisty, D. Malah, and K. Crammer, "Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion," in Proc. ICASSP.
[10] S. Aryal and R. Gutierrez-Osuna, "Articulatory-based conversion of foreign accents with Deep Neural Networks," in Proc. Interspeech.
[11] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in Proc. ASRU.
[12] K. Kintzley, A. Jansen, and H. Hermansky, "Event selection from phone posteriorgrams using matched filters," in Proc. Interspeech.
[13] M. Wollmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. ICASSP.
[14] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3.
[15] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in Proc. ICASSP.
[16] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis.
[17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Dec.
[18] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus."
[19] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, 2015.
