
Ceylon Journal of Science (Physical Sciences) 18 (2014) 45-49
Computer Science

A STUDY ON THE EFFECT OF THE NEIGHBOR PHONEMES IN NATURAL SYNTHESIS OF SPEECH

H.M.L.N.K. Herath 1 and J.V. Wijayakulasooriya 2
1 Postgraduate Institute of Science, University of Peradeniya, Sri Lanka
2 Department of Electronic and Electrical Engineering, Faculty of Engineering, University of Peradeniya, Sri Lanka
(Corresponding authors' emails: 1 lakminiherath0@gmail.com, 2 jan@ee.pdn.ac.lk)
(Received: 13 January 2014 / Accepted after revision: 16 June 2014)

ABSTRACT

Natural synthesis of speech requires identifying the minute variations a phoneme undergoes during reproduction, which are affected by many factors. This paper presents an empirical study of the correlations between consecutive phonemes in a speech signal. The short /a/ phoneme was selected for the study. In order to examine the effect of neighboring phonemes more clearly, words consisting of three or four phonemes were chosen. The correlations between all possible pairs were then calculated by comparing one cycle of each /a/ sound, taken from words starting with the same phoneme. Furthermore, one cycle was taken from each of three places, the start, middle and end of the /a/ phoneme, and the correlations between the different pairs were calculated. The correlation values clearly show that the middle phoneme follows the preceding phoneme's energy to build the articulation between the two phonemes smoothly, and that there is a smooth variation within the /a/ phoneme itself.

University of Peradeniya 2014

INTRODUCTION

Speech synthesis is the artificial production of human speech. One of the main focus areas in speech synthesis research is to reduce the amount of data needed to synthesize speech while maintaining an acceptable quality. In the recent past, more emphasis has been given to improving the naturalness of synthesized speech. In this regard, many methods, ranging from low bit rate to high bit rate approaches, have been proposed (Bristow-Johnson, 1996). However, the holy grail of natural synthesis of speech remains a challenging task, particularly for low bit rate applications.

There are two main computer-based speech synthesis techniques. The first is concatenative synthesis (wavetable synthesis in music), which stores raw waveforms corresponding to each phoneme in a database called a wavetable and concatenates them according to the phonemes to be synthesized (Holmes and Holmes, 2001; Smith, 2006). Although this method produces more natural speech than the mathematical coding based models, the high capacity needed for storing the speech and the high bit rates involved in its transmission are the main concerns. In contrast, mathematical coding based techniques such as Linear Predictive Coding (LPC), which is based on Auto Regressive (AR) modeling of speech, significantly reduce the bit rate. However, the speech is then modeled as the response of a Linear Time Invariant (LTI) system to an input excitation signal. The problem with an LTI system is the occurrence of audible discontinuities at phoneme boundaries, which leads to unnaturalness of the synthetic speech.
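To make the LPC/AR idea above concrete, the following is a minimal sketch, not code from this study, of estimating the AR (prediction) coefficients of one speech frame by least squares in Python. The synthetic frame, sampling rate and model order are illustrative assumptions.

import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate AR/LPC coefficients of one speech frame by least squares."""
    # Each sample frame[n] is predicted from the `order` preceding samples.
    X = np.column_stack([frame[order - k - 1:len(frame) - k - 1] for k in range(order)])
    y = frame[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a  # the residual y - X @ a plays the role of the excitation signal

# Illustrative frame: a decaying sinusoid standing in for a short voiced segment.
fs = 16000
t = np.arange(400) / fs
frame = np.exp(-30 * t) * np.sin(2 * np.pi * 220 * t)
print(lpc_coefficients(frame, order=10))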

Time varying nature of phonemes

Speech does not simply consist of a string of target articulations linked by simple movements between them (Ohala, 1993). In fact, the articulation of individual sound segments, or phonemes, is almost always influenced by the articulation of neighboring segments, often to the point of considerable overlap of articulator activities (Ohala, 1993). A phoneme is the smallest contrastive unit in the sound system of a language. Phonemes are combined with other phonemes to form meaningful units such as words or morphemes. Without appropriate transitions between phonemes, the resulting speech sounds unnatural and is hard to understand.

In 1933, Menzerath and Lacerda introduced the term co-articulation (Hardcastle and Hewlett, 1999). It was coined to denote instances where two successive sounds are articulated together. Many decades of experimental phonetic research have produced a large literature on the topic. The elementary fact highlighted here is that coarticulation is manifested as a temporal overlap between any two channels recruited by different phonemes. In the most basic articulatory model, the Locus theory (Delattre, 1969), each phoneme has a single ideal articulatory target for each contrastive articulator, independent of the neighboring phonemes (Phung et al., 2011). Under the effects of neighboring phonemes, the transition between two phonemes is described as the movement between the two ideal targets of the phonemes. The Kozhevnikov-Chistovich model describes co-articulation within a syllable but not across syllables (Phung et al., 2011). Although many co-articulation models have been proposed, there is still a lack of simple models that are easy to implement in speech applications and that operate directly on acoustic data (Phung et al., 2011).

Most mathematical speech synthesis models assume that the changes between phonemes are time invariant; in other words, that the parameters of a phoneme do not change with time. In reality, linear systems produce their output as a linear combination of the current and previous inputs and the previous outputs (Tatham and Morton, 2005). But the nature of the transition between phonemes is time variant. Figures 1 and 2 show how the formant values change from one phoneme to another in time invariant and time variant systems. If the changes between phonemes were time invariant, the formant contours would be constant throughout the duration of a phoneme, as shown in Figure 1. However, in natural speech, the formant values vary from one phoneme to another as well as within a phoneme, as shown in Figure 2. The objective of this study is to find the effect of the neighboring phonemes in this linear time variant setting by calculating the Pearson's correlation between phonemes.

Figure 1: Formant values in a time invariant system
Figure 2: Formant values in a time variant system

METHOD

Of the nearly forty-four phonemes in the English language, the short /a/ phoneme was studied in this research. Recording phoneme sounds separately was infeasible, so words that include the short /a/ sound were selected for recording. To examine the effect of neighboring phonemes more clearly, words consisting of three or four phonemes were chosen. From the recorded words, the /a/ phoneme was extracted separately. The segmentation of the short /a/ was carried out manually by inspecting the time waveform and listening to the segmented phoneme. Then the Pearson's correlation coefficient (Wikipedia, 2014) between all possible pairs of different words was calculated by comparing one cycle of each /a/ sound. In this case, pairs of words starting with the same phoneme as well as pairs of words starting with different phonemes were considered. In addition, one cycle taken from each of three places, the start, middle and end of the /a/ phoneme, was selected and the correlation between the different pairs was calculated. A hypothesis test was conducted to assess the significance of the correlation values. Sound processing and the statistical calculations were done using MATLAB.
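The comparison step described above was done in MATLAB; purely as an illustration, here is a minimal Python/SciPy sketch of correlating one cycle of /a/ from two words. The file names, sample indices and the load_cycle helper are assumptions (the cycle boundaries were located manually in the study), and mono recordings are assumed. Resampling both cycles onto a common grid is one simple way to compare cycles of unequal length.

import numpy as np
from scipy.io import wavfile
from scipy.stats import pearsonr

def load_cycle(path, start, stop):
    """Load one manually located cycle of the /a/ segment (indices found by inspection)."""
    _, samples = wavfile.read(path)          # mono recording assumed
    return samples[start:stop].astype(float)

def cycle_correlation(cycle_a, cycle_b, n_points=200):
    """Resample both cycles to a common length, then return Pearson's r and its p-value."""
    grid = np.linspace(0.0, 1.0, n_points)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(cycle_a)), cycle_a)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(cycle_b)), cycle_b)
    return pearsonr(a, b)

# Hypothetical usage: one /a/ cycle from "bad" against one /a/ cycle from "bag".
r, p = cycle_correlation(load_cycle("bad.wav", 2400, 2520),
                         load_cycle("bag.wav", 2310, 2435))
print(f"r = {r:.4f}, p = {p:.4g}")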

RESULTS AND DISCUSSION

The following correlation values were obtained by comparing short /a/ sounds taken from words starting with the same phoneme (Table 1). The same procedure was conducted with a different starting phoneme, and similar results were obtained. As shown in Table 1, every pair of words starting with the same phoneme has a correlation value greater than 0.75, and in the Pearson's correlation statistical hypothesis tests all pairs of /a/ phonemes obtained p-values close to 0. This shows that all of the calculated pairwise correlations are statistically significant.

The same experiment was then conducted by changing the first phoneme of the word without changing the last phoneme. According to Figure 3, for /a/ sounds extracted from words starting with different phonemes but ending with the same phoneme /t/, the correlation values are less than 0.75. The p-values obtained for these pairwise correlations are also close to 0. This indicates that there are moderate positive correlations between words starting with different phonemes. Several experiments were conducted by changing the last phoneme, and similar results were obtained. This points out that the /a/ waveforms of words starting with the same phoneme are more strongly correlated than the /a/ waveforms of words starting with different phonemes. The relationship between the first phoneme of a word and the following phoneme (the vowel) is therefore stronger than the relationship between the middle phoneme (the vowel) and the next phoneme. The short /a/ waveform depends on the previous phoneme; that is, the preceding letter has a clear impact on the following phoneme's sound.

Figure 3: Correlation values comparing short /a/ sounds from words starting with different phonemes and ending with the phoneme /t/.

According to Figure 4, the correlation values between the /a/ sound of the word "bad" and the short /a/ sounds of other words starting with the letter B were more than 0.7, which means that the similarities between the waveforms (one cycle) are greater than 50%. Most of them have correlation values above 0.85, which means that the similarities of some of the waveforms exceed 75%. When considering the relationship between words starting with different phonemes, however, the correlation values are less than 0.8, and some of them are less than 0.5. This means that the relationship between /a/ sounds depends on the preceding phoneme.

Figure 4: Correlation values comparing the /a/ sound of "bad" with short /a/ sounds of words starting with the letter B and with different letters.
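Table 1 below is a pairwise correlation matrix built from one /a/ cycle per word. As a sketch of how such a table could be assembled (the word names come from Table 1, but the random vectors are placeholders standing in for real resampled cycles, and the pairwise_correlations helper is an assumption, not the study's code):

import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def pairwise_correlations(cycles):
    """cycles: dict mapping word -> 1-D array holding one resampled /a/ cycle."""
    table = {}
    for (w1, c1), (w2, c2) in combinations(cycles.items(), 2):
        r, p = pearsonr(c1, c2)
        table[(w1, w2)] = (r, p)
    return table

# Placeholder data only: random vectors in place of resampled /a/ cycles.
rng = np.random.default_rng(0)
cycles = {w: rng.standard_normal(200) for w in ["bad", "bag", "ban", "bat"]}
for pair, (r, p) in pairwise_correlations(cycles).items():
    print(pair, round(r, 4), round(p, 4))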

Table 1: Pearson's correlation values comparing the short /a/ sounds of words starting with the phoneme /b/.

        bad     bag     ban     bat     back    band    bank    batch   badge   bask    bang    bash
bad     1
bag     0.9412  1
ban     0.8963  0.8438  1
bat     0.8805  0.8522  0.8349  1
back    0.8566  0.8097  0.8561  0.7358  1
band    0.8843  0.8239  0.9268  0.8846  0.894   1
bank    0.8809  0.8149  0.8997  0.8169  0.9353  0.9297  1
batch   0.8602  0.8315  0.8425  0.6833  0.9276  0.7979  0.8913  1
badge   0.9066  0.8345  0.9494  0.8875  0.897   0.9613  0.9258  0.8385  1
bask    0.9677  0.9357  0.8936  0.9051  0.8488  0.8865  0.8794  0.8669  0.9222  1
bang    0.8458  0.8019  0.8764  0.7036  0.937   0.8833  0.9449  0.9099  0.8595  0.8361  1
bash    0.8912  0.8807  0.8723  0.9184  0.7806  0.8852  0.8559  0.7711  0.9169  0.944   0.7784  1

Figure 5 shows the average correlation values of different words, obtained by considering three cycles of the /a/ phoneme taken from different places: one cycle near the first letter, a middle cycle, and a cycle from the end of the /a/ phoneme.

Figure 5: Average correlation values of the /a/ phoneme of different words, with the cycles extracted from three different places.

When the cycles taken from different places are compared, Figure 5 shows that the average correlation value of the starting cycle was always less than that of the middle cycle, which implies that the front cycles of the /a/ sounds are clearly affected by the previous phoneme. This is because the starting cycle lies within the transition region between the two neighboring phonemes. By the middle cycle, the /a/ waveform has stabilized, so the average correlation value is much greater than the previous values. Then, in the transition to the next phoneme, the correlation values vary from word to word, but all of them are less than the middle-cycle correlation values. Figure 5 indicates that there is a time variant linear relationship between neighboring phonemes as well as within the phoneme.

CONCLUSION

The underlying approach was to investigate the effect of correlation between consecutive phonemes in the natural synthesis of speech. This study illustrates that when the starting phoneme changes, the correlation values of the following phoneme also change significantly. Therefore, there is a smooth, linear, time variant transition between consecutive phonemes. In addition, the study points out that the middle phoneme has different correlation values within the phoneme when the start, middle and end waveforms are compared, which shows that there is a smooth variation within the /a/ phoneme itself. Thus, the correlation values have clearly shown that the middle phoneme follows the preceding phoneme's energy to build the articulation between the two phonemes smoothly. The study concludes that the time variant nature of neighboring phonemes, as well as the variation within a phoneme, should be strongly considered when modeling more natural speech in mathematical coding based low bit rate models.

REFERENCES

Ó Cinnéide, A. (2008) Linear Prediction: The Technique, Its Solution and Application to Speech. DIT Internal Technical Report.

Bristow-Johnson, R. (1996) Wavetable Synthesis 101, A Fundamental Perspective. In: 101st AES Convention (Los Angeles, California), Audio Engineering Society (AES) Preprint.

Delattre, P. (1969) Coarticulation and the Locus Theory. Studia Linguistica 23(1): 1-26.

Holmes, J. and Holmes, W. (2001) Speech Synthesis and Recognition, Second Edition. Taylor & Francis, London, UK. 287 pp.

Hardcastle, W. J. and Hewlett, N. (1999) Coarticulation: Theory, Data and Techniques. Cambridge University Press.

Ohala, J. J. (1993) Coarticulation and phonology. Language and Speech 36: 155-170.

Phung, T., Luong, M. C. and Akagi, M. (2012) On the Stability of Spectral Targets under Effects of Coarticulation. International Journal of Computer and Electrical Engineering 4(4): 537-541.

Phung, T., Luong, M. C. and Akagi, M. (2011) An Investigation on Perceptual Line Spectral Frequency (PLP-LSF) Target Stability against the Vowel Neutralization Phenomenon. In: 3rd International Conference on Signal Acquisition and Processing (ICSAP 2011): 512-514.

Rabiner, L. and Juang, B. H. (1993) Fundamentals of Speech Recognition. Prentice Hall International. 497 pp.

Smith, J. (2006) History and Practice of Digital Sound Synthesis. CCRMA, Stanford University, lecture notes, AES 2006.

Shannon, M., Zen, H. and Byrne, W. (2013) Autoregressive Models for Statistical Parametric Speech Synthesis. IEEE Transactions on Audio, Speech, and Language Processing 21(3): 587-597.

Tatham, M. and Morton, K. (2005) Developments in Speech Synthesis. John Wiley & Sons Ltd, England, Chapter 4, pp. 43-44.

Taylor, P. (2009) Text-to-Speech Synthesis. Cambridge University Press.

http://rudirumer.wordpress.com/ Phones, Phonemes, Allophones and Phonological Rules, accessed 2014.

http://en.wikipedia.org/wiki/Correlation_and_dependence, accessed 2014.