Non-Linear Pitch Modification in Voice Conversion using Artificial Neural Networks

Bajibabu Bollepalli, Jonas Beskow, Joakim Gustafson
Department of Speech, Music and Hearing, KTH, Sweden

Abstract. The majority of current voice conversion methods do not model the local variations of the pitch contour; they apply only a linear modification of the pitch values based on means and standard deviations. However, a significant amount of speaker-related information is also present in the pitch contour. In this paper we propose a non-linear pitch modification method for mapping the pitch contours of the source speaker onto the pitch contours of the target speaker. The work is carried out within the framework of voice conversion based on Artificial Neural Networks (ANNs). The pitch contours are represented by Discrete Cosine Transform (DCT) coefficients at the segmental level. Results evaluated with subjective and objective measures confirm that the proposed method mimics the target speaker's speaking style better than the linear modification method.

1 Introduction

The aim of a voice conversion system is to transform an utterance of an arbitrary speaker, referred to as the source speaker, so that it sounds as if spoken by a specific speaker, referred to as the target speaker: listeners should perceive the source speaker's speech as if it were uttered by the target speaker. Voice conversion is also referred to as voice transformation or voice morphing. For the past two decades, voice conversion has been an active research topic in the area of speech synthesis [1], [2], [3], [4]. Applications such as text-to-speech (TTS) synthesis, speech-to-speech translation, mimicry generation and human-machine interaction systems benefit greatly from a voice conversion module.

In the literature, the majority of voice conversion techniques focus mainly on the modification of short-term spectral features [5], [6]. However, prosodic features, such as the pitch contour and speaking rhythm, also carry important cues to speaker identity. In [7] it was shown that pure prosody alone can be used, to an extent, to recognize speakers that are familiar to us. To build a good-quality voice conversion system, the prosodic features therefore need to be modified along with the spectral features.

The pitch contour is one of the most important prosodic features related to speaker identity. The most common method for pitch contour transformation is

\log(f_t) = \frac{\log(f_s) - \mu^{s}_{\log f}}{\sigma^{s}_{\log f}} \, \sigma^{t}_{\log f} + \mu^{t}_{\log f}    (1)

where f_s and f_t are the pitch values at frame level, and \mu^{s}_{\log f}, \sigma^{s}_{\log f}, \mu^{t}_{\log f} and \sigma^{t}_{\log f} are the means and standard deviations of the log-domain pitch values of the source and target speakers, respectively. In this paper we refer to this method as the linear transformation.
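As a point of reference, a minimal numpy sketch of this linear (mean/variance) log-F0 baseline of Eq. (1) might look as follows; the function and variable names are illustrative and not taken from the paper.

import numpy as np

def linear_f0_transform(f0_src_hz, mu_s, sigma_s, mu_t, sigma_t):
    """Linear log-domain pitch transformation of Eq. (1).

    f0_src_hz      : source F0 values in Hz (0 marks unvoiced frames)
    mu_s, sigma_s  : mean and std of log-F0 for the source speaker
    mu_t, sigma_t  : mean and std of log-F0 for the target speaker
    """
    f0_out = np.zeros_like(f0_src_hz, dtype=float)
    voiced = f0_src_hz > 0                      # leave unvoiced frames at 0
    log_f0 = np.log(f0_src_hz[voiced])
    f0_out[voiced] = np.exp((log_f0 - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_out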

The local shapes of the pitch contour segments are neither modelled nor transformed by the linear transformation. To capture the local dynamics of the pitch contour, we propose a non-linear transformation method using artificial neural networks (ANNs), in which the pitch contours of the voiced segments are represented by their discrete cosine transform (DCT) coefficients.

Several studies have used the DCT for the parametric representation and modelling of pitch contours [8], [9], [10]. In [8] it is shown that using the DCT for analysis and synthesis of pitch contours is beneficial. In [9] the DCT is used to model the pitch contours of syllables for converting neutral speech into expressive speech with Gaussian mixture models (GMMs). In [10] a DCT representation is used for modelling and transformation of prosodic information in a voice conversion system with a code book generated by classification and regression tree (CART) methods. The work presented in this paper differs from [10] in the following aspects:

1. The proposed method does not use any linguistic information for pitch contour modification.
2. The proposed method uses ANNs to model the non-linear mapping between the pitch contours of the source and target speakers.
3. The proposed method represents the pitch contour of a voiced segment with two sets of parameters: one set represents the statistics and the other the fine variations of the pitch contour.

This paper is organised as follows. Section 2 describes the database, the feature extraction and the parametrization of the pitch contour. Section 3 outlines the ANN-based voice conversion system. The experimental results obtained with both subjective and objective tests are presented in Section 4, and Section 5 gives a summary of the work.

2 Database and feature extraction

The experiments are carried out on the CMU ARCTIC database, which contains utterances recorded by seven speakers. Each speaker recorded the same set of 1132 phonetically balanced utterances. The database comprises the speakers SLT (US female), CLB (US female), BDL (US male), RMS (US male), JMK (Canadian male), AWB (Scottish male) and KSP (Indian male). To extract features from a given speech signal we used the high-quality STRAIGHT vocoder [11]. Three kinds of features were extracted every 5 ms of speech: 1) mel-cepstral coefficients (MCEPs), 2) band aperiodicity coefficients (BAPs) and 3) the fundamental frequency (pitch contour). All three features were used for voice conversion. Section 2.1 explains the parametrization of the pitch contour.

2.1 Parametrization of the pitch contour

The proposed pitch contour model is defined on a voiced-segment basis. For voiced speech, the pitch contour varies slowly and continuously over time, so it is well modelled by the DCT, an orthogonal transform. One advantage of the DCT representation is that the mean square error between two linearly time-aligned pitch contours can be estimated directly from the mean square error between their coefficients. A pitch contour is parametrized in the following steps:

1. Derive the pitch contours from the utterances spoken by the source speaker.
2. Segment the pitch contour according to the voiced segments present in the utterance.
3. Parametrize only voiced segments whose duration exceeds a minimum threshold; for shorter segments, transform the pitch values with the linear transformation.
4. Map the pitch contour of each voiced segment onto the equivalent rectangular bandwidth (ERB) scale using Equation 2:

   F_{ERB} = \log_{10}(0.00437 \, F_0 + 1)    (2)

5. Compute the DCT coefficients of each voiced segment using Equation 3:

   c_n = \sum_{i=0}^{M-1} F_0(i) \cos\!\left(\frac{\pi}{M}\, n \left(i + \frac{1}{2}\right)\right)    (3)

   where a pitch contour F_0 of length M is decomposed into N DCT coefficients [c_0, c_1, c_2, ..., c_{N-1}]. The first coefficient represents the mean value, and the remaining DCT coefficients represent the variations in the pitch contour, such as those due to syllable stress.
6. Represent each segment by two sets of parameters,

   F_{shape} = [c_1, c_2, c_3, ..., c_{N-1}]  and  F_{limits} = [c_0, var_{F_0}, max_{F_0}, min_{F_0}, \log(dur)]    (4)

   where F_{shape} captures the local variations and F_{limits} the constraints of a pitch contour; [c_0, c_1, c_2, ..., c_{N-1}] are the DCT coefficients, and var_{F_0}, max_{F_0}, min_{F_0} and \log(dur) are the variance, maximum value, minimum value and logarithm of the duration of the pitch contour, respectively.
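A minimal numpy sketch of this voiced-segment parametrization (Eqs. 2-4) is given below. The ERB constant follows Eq. (2); the function name, the default number of coefficients, the 5 ms frame shift and the choice to compute the statistics on the ERB-scaled contour are illustrative assumptions.

import numpy as np

def parametrize_segment(f0_hz, n_coef=10, frame_shift=0.005):
    """Parametrize one voiced F0 segment as (F_shape, F_limits), Eqs. (2)-(4).

    f0_hz       : F0 values (Hz) of a single voiced segment
    n_coef      : number of DCT coefficients N (illustrative default)
    frame_shift : frame shift in seconds (5 ms in the paper)
    """
    # Eq. (2): map the contour onto the ERB scale
    f_erb = np.log10(0.00437 * f0_hz + 1.0)

    # Eq. (3): un-normalized DCT-II of the ERB-scaled contour, truncated to N terms
    m = len(f_erb)
    i = np.arange(m)
    c = np.array([np.sum(f_erb * np.cos(np.pi / m * n * (i + 0.5)))
                  for n in range(n_coef)])

    f_shape = c[1:]                                   # local variations
    f_limits = np.array([c[0],                        # segment-level constraints
                         np.var(f_erb),               # statistics on the ERB contour (assumption)
                         np.max(f_erb),
                         np.min(f_erb),
                         np.log(m * frame_shift)])    # log duration in seconds
    return f_shape, f_limits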

3 Voice conversion using ANNs

Fig. 1. A block diagram of the voice conversion system.

Figure 1 shows the block diagram of both the training and the transformation stages of a voice conversion system. In this work we use parallel utterances to build a mapping function between the source and target speakers. Even though both speakers speak the same utterances, the durations of their utterances differ. To align the feature vectors of the source speaker with those of the target speaker we use the dynamic time warping (DTW) method, which enables us to build a mapping function at frame level.
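As a rough illustration of this frame-level alignment step, a basic DTW implementation in numpy could look as follows; it is a generic textbook formulation, not necessarily the exact alignment procedure used by the authors.

import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (rows = frames) and return paired frames."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # pairwise Euclidean distances
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],  # match
                                                  cost[i - 1, j],      # source frame repeated
                                                  cost[i, j - 1])      # target frame repeated
    # Backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    src_idx, tgt_idx = zip(*path)
    return src[list(src_idx)], tgt[list(tgt_idx)]

# Usage (hypothetical arrays): src_pairs, tgt_pairs = dtw_align(src_mceps, tgt_mceps)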

For mapping the acoustic features between the source and target speakers, various models have been explored in the literature. These models are specific to the kind of features being mapped. For instance, GMMs [3], vector quantization (VQ) [1] and ANNs [4] are widely used for mapping the vocal tract characteristics. The changes in vocal tract shape across speakers are highly non-linear, so a model is required that can capture the non-linear relations present in the patterns. Hence, to capture the non-linear relations between the acoustic features, we use a neural network model (multilayer feed-forward neural networks) for mapping the MCEPs, the BAPs and the pitch contour coefficients.

During training, the acoustic features of the source and target speakers are presented as input-output pairs to the network. The network learns from these two data sets and tries to capture a non-linear mapping function based on the minimum mean square error. Generalized back-propagation learning [12] is used to adjust the weights of the neural network so as to minimize the mean squared error between the desired and the actual output values. The selection of initial weights, the ANN architecture, the learning rate, the momentum and the number of iterations are some of the optimization parameters in training. Once training is complete, we obtain a weight matrix that represents the mapping function between the acoustic features of the given source and target speakers. This weight matrix can then be used to predict the acoustic features of the target speaker from the acoustic features of the source speaker.

Different network structures are possible by varying the number of hidden layers and the number of nodes in each hidden layer. In [13] it is shown that a four-layer network is optimal for mapping the vocal tract characteristics of the source speaker to those of the target speaker. Therefore, we use four-layer networks with the architectures 40L 80N 80N 40L, 21L 42N 42N 21L, 9L 18N 18N 9L and 5L 10N 10N 5L for mapping the MCEPs, the BAPs, F_shape and F_limits, respectively. The first and fourth layers are the input and output layers with linear units (L) and have the same dimensions as the input and output acoustic features. The second layer (first hidden layer) and the third layer (second hidden layer) consist of non-linear nodes (N), which help capture the non-linear relationship that may exist between the input and output features.
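For illustration, a sketch of such a four-layer mapping network in PyTorch is given below; the toolkit, optimizer settings and function names are assumptions, since the paper does not specify an implementation.

import torch
import torch.nn as nn

class MappingNet(nn.Module):
    """Four-layer feed-forward mapping network, e.g. 9L 18N 18N 9L for F_shape."""
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden), nn.Tanh(),      # first hidden layer, non-linear nodes
            nn.Linear(dim_hidden, dim_hidden), nn.Tanh(),  # second hidden layer, non-linear nodes
            nn.Linear(dim_hidden, dim_out),                # linear output layer
        )

    def forward(self, x):
        return self.net(x)

def train_mapping(src_feats, tgt_feats, dim_hidden, epochs=50, lr=1e-3):
    """Train on DTW-aligned source/target feature pairs with an MSE criterion."""
    model = MappingNet(src_feats.shape[1], dim_hidden, tgt_feats.shape[1])
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # gradient descent with momentum
    loss_fn = nn.MSELoss()
    x = torch.as_tensor(src_feats, dtype=torch.float32)
    y = torch.as_tensor(tgt_feats, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model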

4 Experiments and Results

As described in Section 2, we picked one male speaker (RMS) and one female speaker (SLT) from the ARCTIC database for our experiments. For each speaker, we considered 8 parallel utterances for training and a separate set of 32 utterances for testing. We extracted the acoustic features every 5 ms of speech: MCEPs of dimension 40, BAPs of dimension 21, and 10 DCT coefficients. Given these features, the training data are aligned using dynamic time warping to obtain paired feature vectors, as explained in Section 3. We build separate mapping functions for the spectral, band aperiodicity and pitch contour transformations. After the mapping functions are trained, we use the test sentences of the source speaker to predict the acoustic features of the target speaker. The pitch contour is reconstructed by applying the IDCT to the predicted features.

An instance of a pitch contour converted from the source speaker (RMS) to the target speaker (SLT) is illustrated in Figure 2. From Figure 2(b) we can observe that the linear modification of the pitch contour is not able to model the local variations of the target speaker, whereas in Figure 2(c) the non-linear method is able to model these local variations. Note that the durations of the source speaker were retained here.

Fig. 2. Conversion of a pitch contour (F0 in Hz versus time in seconds) from the source speaker to the target speaker: (a) original source speaker pitch contour, (b) linear modification of the source speaker pitch contour, (c) non-linear modification of the source speaker pitch contour and (d) original target speaker pitch contour.

In order to evaluate the performance of the proposed method, we estimate the root mean square error (RMSE) between the target and converted pitch contours of the test set. The RMSE is calculated after the durations of the predicted contours have been normalized with respect to the actual contours of the target speaker. As Table 1 shows, the non-linear transformation method performed better than the linear method.

Table 1. RMSE (in Hz) between target and converted contours with the linear and non-linear transformation methods.

    Speaker pair    Linear modification    Non-linear modification
    RMS-to-SLT      18.28                  14.36
    SLT-to-RMS      15.92                  12.0

An informal perceptual test was also conducted: 10 transformed speech signals were randomly chosen for each conversion pair and presented to 10 listeners. We used the STRAIGHT vocoder to synthesize the transformed speech signals. The subjects were asked to rate the similarity of the transformed speech signals to the original target speaker's speech signals on a scale of 1-5, with 5 for an excellent match and 1 for no match at all. The scores are shown in Table 2: the non-linear modification also performs better than the linear modification in the perceptual test.

Table 2. Speaker similarity scores.

    Speaker pair    Linear modification    Non-linear modification
    RMS-to-SLT      3.0                    3.3
    SLT-to-RMS      2.55                   3.1
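For completeness, a sketch of the reconstruction and objective evaluation steps (inverse DCT of the predicted coefficients, followed by RMSE against a length-normalized target contour) is given below; the truncated inverse-DCT scaling and the linear-interpolation length normalization are assumptions, as the paper does not spell them out.

import numpy as np

def reconstruct_segment(f_shape, f_limits, length):
    """Rebuild an F0 segment (Hz) from predicted DCT parameters (inverse of Eqs. 2-3)."""
    c = np.concatenate(([f_limits[0]], f_shape))        # c0 from F_limits, c1..cN-1 from F_shape
    n = np.arange(len(c))
    i = np.arange(length)
    # Truncated inverse of the un-normalized DCT-II in Eq. (3)
    basis = np.cos(np.pi / length * np.outer(n, i + 0.5))
    f_erb = c[0] / length + (2.0 / length) * (c[1:] @ basis[1:])
    return (10.0 ** f_erb - 1.0) / 0.00437               # undo the ERB mapping of Eq. (2)

def rmse_hz(f0_pred, f0_target):
    """RMSE (Hz) after linearly resampling the predicted contour to the target length."""
    x_pred = np.linspace(0.0, 1.0, len(f0_pred))
    x_tgt = np.linspace(0.0, 1.0, len(f0_target))
    f0_resampled = np.interp(x_tgt, x_pred, f0_pred)
    return float(np.sqrt(np.mean((f0_resampled - f0_target) ** 2)))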

5 Conclusion

A non-linear pitch modification method was proposed for mapping the pitch contours of the source speaker onto the pitch contours of the target speaker. In this method, the pitch contour was compressed to a few coefficients using the DCT, and a four-layer ANN model was used to model the non-linear relation between the pitch contours of the source and target speakers. Both the objective and the subjective scores showed a clear preference for the proposed method over the linear modification method in mimicking the target speaker's speaking style.

References

1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, Voice conversion through vector quantization, in Proc. of ICASSP, New York, USA, pp. 655-658, Apr. 1988.
2. Y. Stylianou, O. Cappe, and E. Moulines, Continuous probabilistic transform for voice conversion, IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, Mar. 1998.
3. Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation, in Proc. of INTERSPEECH, Pittsburgh, USA, pp. 2266-2269, Sep. 2006.
4. B. Bollepalli, A. W. Black, and K. Prahallad, Modeling a noisy-channel for voice conversion using articulatory features, in Proc. of INTERSPEECH, Portland, USA, Sep. 2012.
5. T. Dutoit, A. Holzapfel, M. Jottrand, A. Moinet, J. Perez, and Y. Stylianou, Towards a voice conversion system based on frame selection, in Proc. of ICASSP, pp. 513-516, 2007.
6. Y. Stylianou, Voice transformation: A survey, in Proc. of ICASSP, pp. 3585-3588, 2009.
7. E. Helander and J. Nurminen, On the importance of pure prosody in the perception of speaker identity, in Proc. of INTERSPEECH, pp. 2665-2668, 2007.
8. J. Teutenberg, C. Watson, and P. Riddle, Modeling and synthesizing F0 contours with the discrete cosine transform, in Proc. of ICASSP, pp. 3973-3976, 2008.
9. C. Veaux and X. Rodet, Intonation conversion from neutral to expressive speech, in Proc. of INTERSPEECH, pp. 2765-2768, 2011.
10. E. Helander and J. Nurminen, A novel method for prosody prediction in voice conversion, in Proc. of ICASSP, pp. IV-509-IV-512, 2007.
11. H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication, vol. 27, pp. 187-207, 1999.
12. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall Inc., NJ, 1999.
13. S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for voice conversion, IEEE Trans. Audio, Speech and Language Processing, vol. 18, no. 5, pp. 954-964, 2010.