PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING

Lifa Sun, Kun Li, Hao Wang, Shiyin Kang and Helen Meng
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China
{lfsun,kli,hwang,sykang,hmmeng}@se.cuhk.edu.hk

ABSTRACT

This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system. It is assumed that these PPGs can represent the articulation of speech sounds in a speaker-normalized space and correspond to spoken content speaker-independently. The proposed approach first obtains PPGs of target speech. Then, a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) structure is used to model the relationships between the PPGs and acoustic features of the target speech. To convert arbitrary source speech, we obtain its PPGs from the same SI-ASR and feed them into the trained DBLSTM to generate converted speech. Our approach has two main advantages: 1) no parallel training data is required; 2) a trained model can be applied to any other source speaker for a fixed target speaker (i.e., many-to-one conversion). Experiments show that our approach performs equally well or better than state-of-the-art systems in both speech quality and speaker similarity.

Index Terms: voice conversion, phonetic posteriorgrams, non-parallel, many-to-one, SI-ASR, DBLSTM

1. INTRODUCTION

Voice conversion (VC) aims to modify the speech of one speaker to make it sound as if it were spoken by another specific speaker. VC can be widely applied in many fields, including customized feedback in computer-aided pronunciation training systems, development of personalized speaking aids for speech-impaired subjects, movie dubbing with various persons' voices, etc.

Typical VC training works as follows: speech segments (e.g., frames) with the same spoken content are aligned first. Then, the mapping from source acoustic features to target acoustic features is found. Many previous efforts on VC rely on parallel training data, in which speech recordings come in pairs of the source speaker and the target speaker uttering the same sentences. Stylianou et al. [1] proposed a continuous probabilistic transformation approach based on Gaussian Mixture Models (GMMs). Toda et al. [2] improved the performance of the GMM-based method by using global variance to alleviate the over-smoothing effect. Wu et al. [3] proposed a non-negative matrix factorization-based method that uses speech exemplars to synthesize converted speech directly. Nakashika et al. [4] used a Deep Neural Network (DNN) to map the source and target in a high-order space. Sun et al. [5] proposed a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM)-based approach to model the relationships between source and target speech using spectral features and their context information. All the above approaches provide reasonably good results. However, in practice, parallel data is not easily available. Hence, some researchers have proposed approaches to VC with non-parallel data, which is a more challenging problem.

Most of these approaches focused on finding proper frame alignments, which is not straightforward. Erro et al. [6] proposed an iterative alignment method to pair phonetically equivalent acoustic vectors from non-parallel utterances. Tao et al. [7] proposed a supervisory data alignment method, where phonetic information was used as the restriction during alignment. Silén et al. [8] extended a dynamic kernel partial least squares regression-based approach to non-parallel data by combining it with an iterative alignment algorithm. Benisty et al. [9] used temporal context information to improve the iterative alignment accuracy on non-parallel data. Unfortunately, the experimental results [6-9] show that the performance of VC with non-parallel data is not as good as that of VC with parallel data. This outcome is reasonable because it is difficult to make non-parallel alignment as accurate as parallel alignment.

Aryal et al. [10] proposed a very different approach that made use of articulatory behavior estimated by electromagnetic articulography (EMA). With the belief that different speakers have the same articulatory behavior (if their articulatory areas are normalized) when they speak the same spoken content, the authors took normalized EMA features as a bridge between the source and target speakers. After modeling the mapping between EMA features and acoustic features of the target speaker, VC can be achieved by driving the trained model with EMA features of the source speaker.

Our approach is inspired by [10]. However, instead of EMA features, which are expensive to obtain, we use easily accessible Phonetic PosteriorGrams (PPGs) to bridge between speakers. A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance [11, 12]. Our proposed approach generates PPGs by employing a speaker-independent automatic speech recognition (SI-ASR) system to equalize speaker differences. Then, we use a DBLSTM structure to model the mapping between the obtained PPGs and the corresponding acoustic features of the target speaker for speech parameter generation. Finally, we perform VC by driving the trained DBLSTM model with the source speaker's PPGs (obtained from the same SI-ASR). Note that we are not using any underlying linguistic information behind the PPGs from the SI-ASR in VC. Our proposed approach has the following advantages: 1) no parallel training data is required; 2) no alignment process (e.g., DTW) is required, which avoids the influence of possible alignment errors; 3) a trained model can be applied to any other source speakers as long as the target speaker is fixed (as in many-to-one conversion). By contrast, for the state-of-the-art approach with parallel training data, a trained model is only applicable to a specific source speaker (as in one-to-one conversion).

The rest of the paper is organized as follows: Section 2 introduces a state-of-the-art VC system that relies on parallel training data as our baseline. Section 3 describes our proposed VC approach with PPGs. Section 4 presents the experiments and the comparison of our proposed approach against the baseline in terms of both objective and subjective measures. Section 5 concludes this paper.

2. BASELINE: DBLSTM-BASED APPROACH WITH PARALLEL TRAINING DATA

The baseline approach is based on a DBLSTM framework that is trained with parallel data [5].

2.1. Basic Framework of DBLSTM

As shown in Fig. 1, the DBLSTM is a sequence-to-sequence mapping model. The middle, left and right sections (marked with t, t-1 and t+1 respectively) stand for the current frame, the previous frame and the following frame. Each square in Fig. 1 represents one memory block, which contains self-connected memory cells and three gate units (i.e., input, output and forget gates) that respectively provide write, read and reset operations. Furthermore, the bidirectional connections of each layer make full use of the context information in both the forward and backward directions.

Fig. 1. Architecture of DBLSTM: an input layer, two hidden layers (each with a forward and a backward LSTM sublayer) and an output layer.

The DBLSTM network architecture, including memory blocks and recurrent connections, makes it possible to store information over a longer period of time and to learn the optimal amount of context information [5, 13].

2.2. Training Stage and Conversion Stage

The baseline approach is divided into a training stage and a conversion stage, as illustrated in Fig. 2.

Fig. 2. Schematic diagram of the DBLSTM-based approach for VC with parallel training data.

In the training stage, the spectral envelope is extracted by STRAIGHT analysis [14]. Mel-cepstral coefficients (MCEPs) [15] are extracted to represent the spectral envelope, and the MCEP features from the same sentences of the source speech and the target speech are aligned by dynamic time warping (DTW). Then, the paired MCEP features of the source and target speeches are treated as the training data. Back-propagation through time (BPTT) is used to train the DBLSTM model.

In the conversion stage, the fundamental frequency (F0), MCEPs and an aperiodic component (AP) are extracted from a source utterance first. Then, the parameters of the converted speech are generated as follows: MCEPs are mapped by the trained DBLSTM model; log F0 is converted by equalizing the mean and standard deviation of the source and target speeches; AP is directly copied. Finally, the STRAIGHT vocoder is used to synthesize the speech waveform.
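As an illustration of the sequence-to-sequence regression just described, the following is a minimal PyTorch sketch of a bidirectional LSTM stack trained with an MSE loss via BPTT. It is not the paper's implementation (which uses CURRENNT; see Section 4.1); the 64-unit hidden layers are borrowed from the configuration reported there, and the random tensors stand in for DTW-aligned MCEP pairs.

```python
import torch
import torch.nn as nn

class DBLSTM(nn.Module):
    """Stacked bidirectional LSTM mapping an input feature sequence to an
    output feature sequence frame by frame (a sketch, not the paper's code)."""
    def __init__(self, in_dim=39, hidden=64, out_dim=39, layers=2):
        super().__init__()
        # Each hidden layer has one forward and one backward LSTM sublayer.
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # both directions concatenated

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)         # h: (batch, frames, 2 * hidden)
        return self.proj(h)        # (batch, frames, out_dim)

model = DBLSTM()
src = torch.randn(1, 200, 39)      # one utterance: 200 frames of source MCEPs
tgt = torch.randn(1, 200, 39)      # DTW-aligned target MCEPs (dummy values here)
loss = nn.functional.mse_loss(model(src), tgt)
loss.backward()                    # gradients flow back through time (BPTT)
```

In the proposed approach of Section 3, the same structure is reused with a 131-dimensional PPG input instead of source MCEPs.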

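Two signal-level operations in this baseline are easy to make concrete: the DTW alignment of source and target MCEP sequences used in training, and the mean/standard-deviation equalization of log F0 used in conversion. The NumPy sketch below is illustrative only, assuming a plain Euclidean-distance DTW; the paper does not specify its exact DTW variant or path constraints.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two MCEP sequences (frames x dims) by dynamic time warping;
    returns the warping path as a list of (source_frame, target_frame) pairs."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def convert_log_f0(log_f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Equalize the mean and standard deviation of source log F0 to the target's."""
    return (log_f0_src - src_mean) / src_std * tgt_std + tgt_mean
```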
2.3. Limitations

Despite its good performance, the DBLSTM-based approach has the following limitations: 1) it relies on parallel training data, which is expensive to collect; 2) the influence of DTW errors on VC output quality is unavoidable.

3. PROPOSED APPROACH: VC WITH PHONETIC POSTERIORGRAMS (PPGS)

To address the limitations of the baseline approach, we propose a PPG-based approach with the belief that PPGs obtained from an SI-ASR system can bridge across speakers.

3.1. Overview

As illustrated in Fig. 3, the proposed approach is divided into three stages: training stage 1, training stage 2 and the conversion stage. The role of the SI-ASR model is to obtain a PPG representation of the input speech. Training stage 2 models the relationships between the PPGs and the MCEP features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with PPGs of the source speech (obtained from the same SI-ASR) for VC. The computation of PPGs and the three stages will be presented in the following subsections.

Fig. 3. Schematic diagram of VC with PPGs. SI stands for speaker-independent. The target speech and the source speech do not have any overlapped portion. The two SI-ASR models marked with * are the same. The shaded part will be presented in Fig. 5.

3.2. Phonetic PosteriorGrams (PPGs)

A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance [11, 12]. A phonetic class may refer to a word, a phone or a senone. In this paper, we treat senones as the phonetic class. Fig. 4 shows an example of the PPG representation for the spoken phrase "particular case".

Fig. 4. PPG representation of the spoken phrase "particular case". The horizontal axis represents time in seconds and the vertical axis contains the indices of the phonetic classes (senones). The number of senones is 131. A darker shade implies a higher posterior probability.

We believe that PPGs obtained from an SI-ASR can represent the articulation of speech sounds in a speaker-normalized space and correspond to speech content speaker-independently. Therefore, we regard these PPGs as a bridge between the source and the target speakers.

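In data-structure terms, a PPG is simply the per-frame posterior vectors stacked into a matrix. A small illustrative sketch follows, in which `dummy_posterior` is a hypothetical stand-in (a random softmax) for the trained SI-ASR's output layer described in Section 3.3:

```python
import numpy as np

def extract_ppg(frames, posterior_fn):
    """Stack per-frame senone posterior vectors into a (frames x classes) PPG."""
    ppg = np.stack([posterior_fn(x) for x in frames])
    assert np.allclose(ppg.sum(axis=1), 1.0)  # each row is a probability distribution
    return ppg

# Hypothetical stand-in for the trained SI-ASR's softmax output (C = 131 senones).
rng = np.random.default_rng(0)
def dummy_posterior(frame, C=131):
    logits = rng.normal(size=C)
    e = np.exp(logits - logits.max())
    return e / e.sum()

mfcc = rng.normal(size=(200, 39))          # 200 frames of 39-dim MFCCs (dummy data)
ppg = extract_ppg(mfcc, dummy_posterior)   # shape (200, 131): time versus senone class
```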
3.3. Training Stages 1 and 2

In training stage 1, an SI-ASR system is trained for PPG generation using a multi-speaker ASR corpus. The equations are illustrated with the example of one utterance. The input is the MFCC feature vector of the t-th frame, denoted as $X_t$. The output is the vector of posterior probabilities $P_t = \big(p(s \mid X_t) \mid s = 1, 2, \ldots, C\big)$, where $p(s \mid X_t)$ is the posterior probability of each phonetic class $s$ and $C$ is the number of phonetic classes.

As shown in Fig. 5, training stage 2 trains the DBLSTM model (the speech parameter generation model) to learn the mapping relationships between the PPG sequence and the MCEP sequence. For a given utterance from the target speaker, let $t$ denote the frame index of this sequence. The input is the PPG sequence $(P_1, \ldots, P_t, \ldots, P_N)$, computed by the trained SI-ASR model. The ideal value of the output layer is the MCEP sequence $(Y_1^T, \ldots, Y_t^T, \ldots, Y_N^T)$, extracted from the target speech. The actual value of the output layer is $(Y_1^R, \ldots, Y_t^R, \ldots, Y_N^R)$. The cost function of training stage 2 is

$$\min \sum_{t=1}^{N} \left\| Y_t^R - Y_t^T \right\|^2 \qquad (1)$$

Fig. 5. Schematic diagram of DBLSTM model training: the PPG sequence is the input and the MCEP sequence is the output.

The model is trained to minimize the cost function through the BPTT technique mentioned in Section 2. Note that the DBLSTM model is trained using only the target speaker's MCEP features and the speaker-independent PPGs, without using any other linguistic information.

3.4. Conversion Stage

In the conversion stage, the conversion of log F0 and AP is the same as that of the baseline approach. First, to get the converted MCEPs, MFCC features of the source speech are extracted. Second, PPGs are obtained from the trained SI-ASR model with the MFCC features as input. Third, the PPGs are converted to MCEPs by the trained DBLSTM model. Finally, the converted MCEPs, together with the converted log F0 and the copied AP, are used by the vocoder to synthesize the output speech.

4. EXPERIMENTS

4.1. Experimental Setup

The data we use for VC is the CMU ARCTIC corpus [16]. A within-gender conversion experiment (male-to-male: BDL to RMS) and a cross-gender conversion experiment (male-to-female: BDL to SLT) are conducted. The baseline approach uses parallel speech of the source and target speakers, while our proposed approach uses only the target speaker's speech for model training. The signals are sampled at 16 kHz with a mono channel, windowed with 25 ms and shifted every 5 ms. Acoustic features, including the spectral envelope, F0 (1 dimension) and AP (513 dimensions), are extracted by STRAIGHT analysis [14]. The 39th-order MCEPs plus log energy are extracted to represent the spectral envelope. Two systems are implemented for comparison:

Baseline system: the DBLSTM-based approach with parallel training data. Two tasks: male-to-male (M2M) conversion and male-to-female (M2F) conversion.

Proposed system: our proposed approach, which uses PPGs as the input to the DBLSTM. Two tasks: male-to-male (M2M) conversion and male-to-female (M2F) conversion.

In the PPG-based approach, the SI-ASR system is implemented using the Kaldi speech recognition toolkit [17] with the TIMIT corpus [18]. The system has a DNN architecture with 4 hidden layers, each of which contains 1024 units. Senones are treated as the phonetic class of the PPGs. The number of senones is 131, which is obtained by clustering in training stage 1. The hardware configuration for the SI-ASR model training is dual Intel Xeon E5-2640, 8 cores, 2.6 GHz. The training time is about 11 hours. Then, the DBLSTM model is adopted to map the PPG sequence to the MCEP sequence for speech parameter generation. The implementation is based on the machine learning library CURRENNT [19]. The number of units in each layer is [131 64 64 64 64 39] respectively, where each hidden layer contains one forward LSTM layer and one backward LSTM layer. BPTT is used to train this model with a learning rate of 1.0 × 10^-6 and a momentum of 0.9. The training process of the DBLSTM model is accelerated by an NVIDIA Tesla K40 GPU, and it takes about 4 hours for a 100-sentence training set. The baseline DBLSTM-based approach has the same model configuration except that its input has only 39 dimensions (instead of 131). It takes about 3 hours for a 100-sentence training set.

4.2. Objective Evaluation

Mel-cepstral distortion (MCD) is used to measure how close the converted speech is to the target speech. MCD is the Euclidean distance between the MCEPs of the converted speech and the target speech, denoted as

$$\mathrm{MCD[dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{N} \left( c_d - c_d^{\mathrm{converted}} \right)^2} \qquad (2)$$

where $N$ is the dimension of the MCEPs (excluding the energy feature), and $c_d$ and $c_d^{\mathrm{converted}}$ are the d-th coefficients of the target and converted MCEPs respectively.
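Equation (2) translates directly into code. Below is a minimal sketch, assuming the converted and target MCEP frames are already time-aligned, and averaging the per-frame distortion over the utterance (a common convention; the paper does not spell the averaging out):

```python
import numpy as np

def mcd_db(mcep_target, mcep_converted):
    """Mel-cepstral distortion (dB) per equation (2), averaged over frames.
    Inputs: (frames x N) arrays of MCEPs with the energy feature excluded."""
    diff = mcep_target - mcep_converted
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return per_frame.mean()
```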
To explore the effect of the training data size, all systems are trained using different amounts of training data: 5, 20, 60, 100 and 200 sentences. For the baseline approach, the training data consists of parallel pairs of sentences from the source and target speakers. For the proposed approach, the training data consists only of sentences from the target speaker. The test set has 80 sentences from the source speaker.
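Before turning to the results, the conversion stage of Section 3.4 can be summarized as a short pipeline. The skeleton below is a sketch under stated assumptions: `extract_features`, `si_asr_posteriors`, `dblstm` and `synthesize` are hypothetical placeholders for the trained components and the STRAIGHT analysis/synthesis steps, not the paper's code or any real toolkit API; only the order of operations follows the paper.

```python
import numpy as np

def convert_utterance(source_wav, si_asr_posteriors, dblstm,
                      f0_stats_src, f0_stats_tgt,
                      extract_features, synthesize):
    """Many-to-one conversion stage (Section 3.4), as a skeleton."""
    # 1) STRAIGHT-style analysis of the source utterance.
    f0, mcep, ap, mfcc = extract_features(source_wav)
    # 2) Speaker-independent PPGs from the SI-ASR, with MFCCs as input.
    ppg = si_asr_posteriors(mfcc)
    # 3) PPGs -> target-speaker MCEPs via the trained DBLSTM.
    mcep_conv = dblstm(ppg)
    # 4) Log F0: equalize source mean/std to the target's; AP is copied as-is.
    mu_s, sd_s = f0_stats_src
    mu_t, sd_t = f0_stats_tgt
    log_f0_conv = np.zeros_like(f0)
    voiced = f0 > 0                      # unvoiced frames keep F0 = 0
    log_f0_conv[voiced] = (np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t
    # 5) Vocoder synthesis from converted MCEPs, converted log F0, copied AP.
    return synthesize(mcep_conv, log_f0_conv, ap)
```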

Fig. 6. Average MCD (dB) of the baseline and proposed approaches versus the number of training sentences (5 to 200). Male-to-male conversion experiment.

Fig. 7. Average MCD (dB) of the baseline and proposed approaches versus the number of training sentences (5 to 200). Male-to-female conversion experiment.

Fig. 6 and Fig. 7 show the results of the male-to-male and male-to-female experiments respectively. As shown, when the training size is 5, 20 or 60 sentences, the MCD value becomes smaller as the data size increases. The MCD value tends to converge when the training size is larger than 60 sentences. The results indicate that the baseline approach and the proposed approach have similar performance in terms of the objective measure.

4.3. Subjective Evaluations

We conducted a Mean Opinion Score (MOS) test and an ABX preference test as subjective evaluations for measuring the naturalness and speaker similarity of the converted speech. 100 sentences are used for training each system and 10 sentences (not in the training set) are randomly selected for testing. 21 participants are asked to do the MOS test and the ABX test. The questionnaires of these two tests and some samples are presented at https://sites.google.com/site/2016icme/.

In the MOS test, listeners are asked to rate the naturalness and clearness of the converted speech on a 5-point scale. The results of the MOS test are shown in Fig. 8. The average scores of the baseline and the proposed PPG-based approach are 3.20 and 3.87 respectively.

Fig. 8. MOS test results with 95% confidence intervals (M2M: baseline 3.08, proposed 3.86; M2F: baseline 3.32, proposed 3.87). M2M: male-to-male experiment. M2F: male-to-female experiment. 5-point scale: 5: excellent, 4: good, 3: fair, 2: poor, 1: bad.

For the ABX preference test, listeners are asked to choose which of the converted utterances A and B (generated by the two approaches) sounds more like the target speaker's recording X, or to state no preference. Each pair of A and B is shuffled to avoid preferential bias. As shown in Fig. 9, the PPG-based approach is frequently preferred over the baseline approach.

Fig. 9. ABX preference test results (M2M: proposed 52%, no preference 37%, baseline 11%; M2F: proposed 50%, no preference 33%, baseline 17%). N/P stands for no preference. M2M: male-to-male experiment. M2F: male-to-female experiment. The p-values of the two experiments are 2.94 × 10^-16 and 4.94 × 10^-3 respectively.

Results from both the MOS test and the ABX test show that our proposed PPG-based approach performs better than the baseline approach in both speech quality and speaker similarity. Possible reasons include: 1) the proposed PPG-based approach does not require alignment (e.g., DTW), which avoids the influence of possible alignment errors; 2) the DBLSTM model of the proposed approach is trained using only the speaker-normalized PPGs and the target speaker's acoustic features, which minimizes the interference from the source speaker's signal.
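The paper does not name the statistical test behind these p-values. Purely as one plausible reconstruction (an assumption, not the authors' procedure), the sketch below runs a two-sided binomial test on the A/B choices with the no-preference responses excluded, using back-of-envelope counts derived from the Fig. 9 percentages and the 21 listeners × 10 test sentences; the resulting p-values are illustrative only.

```python
from scipy.stats import binomtest

# Illustrative counts assuming 210 judgments per experiment (21 listeners x 10
# sentences) and the Fig. 9 percentages; N/P responses are excluded.
for name, prop_proposed, prop_baseline in [("M2M", 0.52, 0.11), ("M2F", 0.50, 0.17)]:
    k = round(210 * prop_proposed)       # judgments preferring the proposed system
    n = k + round(210 * prop_baseline)   # judgments expressing an A/B preference
    p = binomtest(k, n, p=0.5, alternative="two-sided").pvalue
    print(f"{name}: {k}/{n} prefer proposed, p = {p:.2e}")
```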

5. CONCLUSIONS

In this paper, we propose a PPG-based voice conversion approach for non-parallel data. PPGs, obtained from an SI-ASR model, are used to bridge between the source and target speakers. The relationships between PPGs and acoustic features are modeled by a DBLSTM structure. The proposed approach does not require parallel training data and is very flexible for many-to-one conversion, which are its two main advantages over approaches to voice conversion (VC) that use parallel data. Experiments suggest that the proposed approach improves the naturalness of the converted speech and its similarity to the target speech. We have also tried applying our proposed model to cross-lingual VC and have obtained some good preliminary results. More investigation of the cross-lingual applications will be conducted in the future.

6. ACKNOWLEDGEMENTS

This work is partially supported by a grant from the HKSAR Government's General Research Fund (Project Number: 14205814).

7. REFERENCES

[1] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

[2] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.

[3] Z. Wu, T. Virtanen, T. Kinnunen, E. S. Chng, and H. Li, "Exemplar-based voice conversion using non-negative spectrogram deconvolution," in Proc. 8th ISCA Speech Synthesis Workshop, 2013.

[4] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, "Voice conversion in high-order eigen space using deep belief nets," in Proc. Interspeech, 2013.

[5] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional Long Short-Term Memory based Recurrent Neural Networks," in Proc. ICASSP, 2015.

[6] D. Erro, A. Moreno, and A. Bonafonte, "INCA algorithm for training voice conversion systems from nonparallel corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 944-953, 2010.

[7] J. Tao, M. Zhang, J. Nurminen, J. Tian, and X. Wang, "Supervisory data alignment for text-independent voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 932-943, 2010.

[8] H. Silén, J. Nurminen, E. Helander, and M. Gabbouj, "Voice conversion for non-parallel datasets using dynamic kernel partial least squares regression," in Proc. Interspeech, 2013.

[9] H. Benisty, D. Malah, and K. Crammer, "Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion," in Proc. ICASSP, 2014.

[10] S. Aryal and R. Gutierrez-Osuna, "Articulatory-based conversion of foreign accents with deep neural networks," in Proc. Interspeech, 2015.

[11] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in Proc. ASRU, 2009.

[12] K. Kintzley, A. Jansen, and H. Hermansky, "Event selection from phone posteriorgrams using matched filters," in Proc. Interspeech, 2011.

[13] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. ICASSP, 2013.

[14] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187-207, 1999.

[15] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in Proc. ICASSP, 1983.

[16] J. Kominek and A. W. Black, "The CMU ARCTIC speech databases," in Proc. 5th ISCA Workshop on Speech Synthesis, 2004.

[17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. ASRU, Dec. 2011.

[18] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," 1993.

[19] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547-551, 2015.