
Allophone Synthesis Using a Neural Network

G. C. Cawley and P. D. Noakes
Department of Electronic Systems Engineering, University of Essex,
Wivenhoe Park, Colchester CO4 3SQ, UK
email: ludo@uk.ac.essex.ese

Abstract

Most people reading this paper will be aware of the NETtalk system of Sejnowski and Rosenberg [1], in which a multi-layer perceptron was trained to select the correct allophone for combinations of letters occurring in plain English text. Once suitable allophones have been selected, the problem remains of how the sounds corresponding to a sequence of allophones should be produced. The most straightforward approach is to store pre-recorded examples of each allophone and simply concatenate them to form the required utterance. Unfortunately, the boundaries between adjacent allophones in continuous speech are not distinct, an effect known as coarticulation, and such a simplistic approach leads to very unnatural-sounding speech. This paper presents some initial findings of experiments to evaluate different parametric forms of speech, based on linear predictive coding (LPC), for training neural networks. These experiments were performed as part of a project to improve the subjective quality of speech synthesizers through the use of neural networks for allophone synthesis.

Introduction

The realisation of an allophone is context-sensitive due to the inertia of articulators such as the lips, jaw and tongue. Articulators can only move at a finite speed in recovering from the position assumed during the previous allophone, causing a gradual transition from one allophone to the next. Coarticulation can also be caused by low-level neural processes within the brain, whereby articulators position themselves in anticipation of subsequent allophones. These movements are redundant in that they convey little of the semantic content of the utterance; however, we subconsciously expect to hear the effects of these movements in natural speech.

The simplest speech synthesis systems do not attempt to model coarticulation at all, but simply concatenate pre-recorded allophones. A more sophisticated approach concatenates diphones, each consisting of the adjacent halves of two allophones. Diphones capture the immediate effects of coarticulation and abut during the relatively steady-state conditions in the central part of each allophone. However, this comes at the expense of increased storage, as about 1200 diphones are required to cover the allowable permutations of around 60 allophones.

If a parametric description of speech is used, such as formant data, which records the frequencies and amplitudes of the spectral peaks known as formants, templates may be used to interpolate the value of each parameter between target values set for each allophone. The rate at which each parameter changes is determined by the rank of each allophone, which reflects the degree to which it affects others. This allows more natural-sounding speech to be produced, but at the expense of increased complexity, and it requires manual analysis of human speech to determine targets and rankings for each allophone.

Our research has been concerned with investigating the use of neural networks for allophone synthesis based on formant data [2, 3]. Unfortunately, formant analysis of continuous speech is a complex and computationally expensive procedure, making it difficult to obtain the large amounts of training data needed. This paper therefore presents the results of initial experiments evaluating coding techniques based on linear predictive coding, which are less complex and less computationally expensive.

Linear predictive coding (LPC) [4] is a technique for finding the coefficients a_k of an all-pole filter, with transfer function H(z), such that its spectral properties are similar to those of a segment of sampled speech. Given a suitable excitation signal, speech can be reconstructed from these coefficients, which are updated every 10 ms to allow for the time-varying nature of speech. For voiced speech the excitation signal can be approximated by a periodic train of impulses, and for unvoiced speech by random noise.

$$H(z) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_n z^{-n}}$$
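The paper does not give its analysis code; the following is a minimal sketch of how the coefficients a_k might be obtained in practice, using the standard autocorrelation method with the Levinson-Durbin recursion, which also yields the PARCOR (reflection) coefficients used below as a by-product. The frame handling and Hamming window are illustrative assumptions.

```python
import numpy as np

def lpc_analysis(frame, order=10):
    """Return ([1, a1, ..., an], parcor) for one speech frame."""
    frame = frame * np.hamming(len(frame))
    # Autocorrelation of the windowed frame for lags 0..order.
    r = np.array([frame[:len(frame) - m] @ frame[m:] for m in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    parcor = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection (PARCOR) coefficient for stage i of the recursion.
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        parcor[i - 1] = k
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k  # residual prediction error shrinks each stage
    return a, parcor
```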

In this paper, neural networks trained using three coding schemes based on linear predictive coding are compared: PARCOR [4, 5], log area ratio [4] and line spectral pair (LSP) [5]. Table 1 summarises some of the relative merits of each method.

Table 1: A comparison of the properties of PARCOR, log area ratio and LSP coding schemes

  Property                               PARCOR                     Log Area Ratio             LSP
  Inter-parameter spectral sensitivity   Lower-order coefficients   Lower-order coefficients   Uniform
                                         more sensitive             more sensitive
  Individual parameter sensitivity       Non-uniform                Uniform                    Uniform
  Overall spectral sensitivity           Good                       Good                       Very good
  Interpolation properties               Poor                       Poor                       Good
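As a rough illustration of how the three parameter sets in Table 1 relate to one another, the sketch below derives each from the output of the LPC analysis above. The PARCOR coefficients fall directly out of the Levinson-Durbin recursion; the log area ratios are an elementwise transformation of them (the sign convention varies between texts; the form below follows g_i = log((1 - k_i)/(1 + k_i))); and the LSP frequencies are the unit-circle root angles of the sum and difference polynomials formed from A(z).

```python
import numpy as np

def parcor_to_lar(parcor):
    # Log area ratios: an elementwise, invertible transformation of the
    # PARCOR coefficients that flattens their spectral sensitivity.
    return np.log((1.0 - parcor) / (1.0 + parcor))

def lpc_to_lsp(a):
    # Sum and difference polynomials P(z) = A(z) + z^-(n+1) A(1/z) and
    # Q(z) = A(z) - z^-(n+1) A(1/z). For a stable A(z) their roots lie on
    # the unit circle; the root angles in (0, pi), taken from P and Q
    # together, are the n line spectral frequencies.
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    w = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    return np.sort(w[(w > 1e-9) & (w < np.pi - 1e-9)])
```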

Network architecture

An architecture similar to that employed in the NETtalk system [1] was used, with the input layer forming a sliding window over the input stream of allophones (see Figure 1). The input layer consists of three groups of neurons corresponding to the current allophone and its left and right context allophones. Each allophone is represented by a vector of phonetic features, such as its broad phonetic class and place of articulation. In addition, one input neuron indicates the duration of the current allophone, and an index neuron indicates how much of the current allophone has already been generated. To synthesize the speech parameters for a complete allophone, the input layer is set to the appropriate pattern for the central and context allophones and the required duration, and a ramp input is then applied to the index neuron. As the index increases, the outputs of the network step out the parameters required to synthesize the allophone; a sketch of this encoding follows below.

All ten sentences from one speaker in the TIMIT database [6] were then analysed using tenth-order LPC analysis to generate PARCOR, log area ratio and LSP training data. The network was trained using the backpropagation algorithm, with a simulator written in C running on a Sun SPARCstation.
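The paper does not list its phonetic feature inventory, so the following sketch of the input encoding is hypothetical; it illustrates only the mechanism described above: a three-allophone context window of feature vectors, a duration input, and an index input ramped from 0 to 1 to step the network through the allophone.

```python
import numpy as np

# Toy feature table (invented for illustration; the paper states only that
# features such as broad class and place of articulation are used):
# [vowel, plosive, velar, voiced].
FEATURES = {
    "sil": np.array([0.0, 0.0, 0.0, 0.0]),
    "ae":  np.array([1.0, 0.0, 0.0, 1.0]),
    "k":   np.array([0.0, 1.0, 1.0, 0.0]),
}

def input_frames(left, current, right, duration_s, frames=12):
    """Yield one network input vector per 10 ms frame of `current`."""
    window = np.concatenate([FEATURES[left], FEATURES[current], FEATURES[right]])
    for i in range(frames):
        index = i / (frames - 1)  # ramp applied to the index neuron
        yield np.concatenate([window, [duration_s, index]])

# Presenting each vector to the trained network in turn makes its outputs
# step out the synthesizer control parameters (e.g. the LSP coefficients).
for x in input_frames("sil", "ae", "k", duration_s=0.12):
    pass  # x would be fed to the network here
```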

Results

The results obtained are displayed in Figures 2 and 3, which show graphs of RMS error and spectral distortion against cycles trained for each coding scheme. The results given here were obtained using a hidden layer of 50 neurons; similar results were obtained with other hidden layer sizes.

The log area ratio is a transformation of the PARCOR parameter set designed to flatten the spectral sensitivity of the individual parameters, and was expected to produce marginally better results for this reason. This proved to be the case; the improvement was especially noticeable during voiced sounds, where PARCOR coefficients tend to approach 1 and their spectral sensitivity is at its greatest.

LSP coding was expected to outperform the other coding schemes: firstly, the overall spectral sensitivity of LSP parameters is slightly lower; secondly, LSP coefficients exhibit better interpolation properties; and lastly, each LSP coefficient has roughly the same spectral sensitivity. The spectral sensitivity of the lower-order PARCOR and log area ratio coefficients is higher, so some method is required to concentrate training on reducing the error in the low-order coefficients. The networks trained using LSP data appeared to train faster than those trained on PARCOR and log area ratio data, and the sentences were learned with less spectral distortion. Speech generated by the network trained on LSP data was also judged to be subjectively better.
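The paper does not define the spectral distortion plotted in Figure 3. A common choice, assumed here purely for illustration, is the RMS log-spectral distance in dB between the reference all-pole filter and the filter rebuilt from the network's outputs.

```python
import numpy as np
from scipy.signal import freqz

def log_spectral_distortion_db(a_ref, a_pred, n_points=256):
    """RMS log-spectral distance between two all-pole filters 1/A(z)."""
    _, h_ref = freqz(1.0, a_ref, worN=n_points)
    _, h_pred = freqz(1.0, a_pred, worN=n_points)
    diff_db = 20.0 * np.log10(np.abs(h_ref) / np.abs(h_pred))
    return np.sqrt(np.mean(diff_db ** 2))
```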

Conclusions

We have shown that the use of LSP parameters in training neural networks for speech synthesis results in faster training and higher objective and subjective speech quality than is obtained using PARCOR or log area ratio parameters. Work is currently underway to produce a complete neural network allophone speech synthesizer using the line spectral pair representation.

References

[1] T. J. Sejnowski and C. R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987.
[2] G. C. Cawley and A. D. P. Green. The application of neural networks to cognitive phonetic modelling. In Proc. 2nd IEE Int. Conf. on Artificial Neural Networks, pages 280-284, 1991.
[3] G. C. Cawley and P. D. Noakes. Diphone synthesis using a neural network. In Proc. 1992 Int. Conf. on Artificial Neural Networks (ICANN-92), volume 1, pages 795-798, 1992.
[4] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals, chapter 8. Prentice-Hall, 1978.
[5] N. Sugamura and F. Itakura. Speech analysis and synthesis methods developed at ECL in NTT: from LPC to LSP. Speech Communication, 5:199-215, 1986.
[6] National Technical Information Service (NTIS), Computer Systems Laboratory, Gaithersburg, MD 20899, USA. DARPA acoustic-phonetic continuous speech corpus (TIMIT).

Figure 1: Schematic drawing of the network architecture (input layer: allophone context window, allophone duration and time index; hidden layer; output layer producing the synthesizer control parameters).

Figure 2: RMS error against cycles trained (0-10,000) for the LSP, log area ratio and PARCOR coding schemes.

Figure 3: Spectral distortion against cycles trained (0-10,000) for the LSP, log area ratio and PARCOR coding schemes.