Reconstruction of Dysphonic Speech by MELP

H. Irem Türkmen, M. Elif Karsligil
Yildiz Technical University, Computer Engineering Department, 34349 Yildiz, Istanbul, Turkey
{irem,elif}@ce.yildiz.edu.tr

Abstract. Chronic dysphonia is the result of neural, structural, or pathological effects on the vocal cords or larynx, and it causes undesirable changes in the quality of speech. This paper presents a Mixed Excitation Linear Prediction (MELP) based system that reconstructs normally phonated speech from dysphonic speech while preserving the individuality of the patient. The proposed system can be used as a speech prosthesis for patients who have lost the ability to produce voice. To reconstruct normally phonated speech from dysphonic speech, pitch generation based on the relationship between perceived pitch and formant frequencies, together with formant and voicing modification, is performed for each phoneme. The principal novelty of this study is that the acoustic features of voiced phonemes are modified while those of unvoiced phonemes are preserved; a voiced-unvoiced decision is therefore made for each phoneme. The proposed system is composed of three main parts. In the analysis phase, the acoustic differences between normal and dysphonic speech are determined. The acoustic parameters of the voiced phonemes of the dysphonic speech are then modified in order to obtain synthetic speech that is closer to normal speech. Finally, the enhanced speech is synthesized by MELP.

Keywords: Dysphonic speech enhancement, MELP, Formant modification, Pitch and voicing generation

1 Introduction

Verbal communication is one of the most influential and effective forms of social interaction. When voice is produced, airflow from the lungs to the vocal tract is interrupted by the vibration of the vocal cords, and quasi-periodic pulses of air are generated as the excitation. Chronic dysphonia occurs in the presence of organic lesions, vocal cord paralysis, or laryngeal cancer and results in the loss of the ability to speak. Surgery for laryngeal cancer removes the larynx, including the vocal cords. During laryngectomy, the surgeon creates an opening in the patient's neck, called a stoma, through which the patient can breathe. After surgery, oesophageal, electrolarynx, and tracheoesophageal (TE) speech are the available ways to speak. However, these techniques have disadvantages. The major drawback of esophageal speech is that the sound is rough and often limited to relatively short segments of speech. The electrolarynx has a very mechanical tone that does not sound natural, and good hand control is required to use it. A TE voice prosthesis must be removed and cleaned periodically because of the risk of infection [1].

The main purpose of this research is to develop a dysphonic speech enhancement system that can be used as a speech prosthesis for patients who have lost the ability to produce voice. Several studies that analyze and enhance the characteristics of oesophageal and electrolarynx speech have been reported [2-6]. Morris and Clements [7] proposed a system that modifies the formant structure and determines pitch and voicing to reconstruct speech from whisper using MELP.

In the proposed system, Turkish speech samples were recorded from native Turkish speakers who have had their larynx removed or have paralyzed vocal cords. MELP is used to synthesize the enhanced speech, and the relationship between perceived pitch and formant frequencies is used to generate pitch for the dysphonic voice. The system is composed of three major parts: analysis of the dysphonic speech, modification of the acoustic parameters of the dysphonic speech in order to obtain synthetic speech that is closer to normal speech, and synthesis of the enhanced speech from the modified parameters. No modification is applied to unvoiced phonemes, since no significant distortion is observed in dysphonic speech for unvoiced phonemes. Figure 1 shows the block diagram of the proposed system; a code sketch of this pipeline is given below.

Fig. 1. Block diagram of the proposed speech reconstruction system: MELP analysis of the dysphonic speech, unvoiced phoneme detection, modification of the parameters of voiced phonemes, and MELP synthesis of the enhanced speech
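The sketch below is one way to read the pipeline of Figure 1. It is only an illustration of the data flow: MelpParams and every callable passed into enhance_dysphonic_speech are placeholders for a MELP coder and for the stages described in Section 3, not actual APIs from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MelpParams:
    """Illustrative container for the per-frame MELP parameters."""
    pitch: float
    gain: float
    lsf: Sequence[float]
    band_voicing: Sequence[int]


def enhance_dysphonic_speech(
    phonemes: List[Sequence[float]],
    melp_analyze: Callable[[Sequence[float]], MelpParams],
    melp_synthesize: Callable[[MelpParams], Sequence[float]],
    is_unvoiced: Callable[[MelpParams], bool],
    modify: Callable[[MelpParams], MelpParams],
) -> List[Sequence[float]]:
    """Fig. 1 as code: analyze each phoneme with MELP, modify only the
    parameters of voiced phonemes, and resynthesize."""
    enhanced = []
    for phoneme in phonemes:
        params = melp_analyze(phoneme)        # pitch, gain, LSFs, voicing
        if not is_unvoiced(params):           # Section 3.1 (k-NN on LSFs)
            params = modify(params)           # Sections 3.2-3.4
        enhanced.append(melp_synthesize(params))
    return enhanced
```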

2 Acoustic Differences between Dysphonic and Normally Phonated Speech

Dysphonic speech differs from normally phonated speech in terms of voicing, pitch, and formant structure. There is no perceived pitch period in dysphonic speech, and the voice is distinctly noisy. Two spectrograms for the Turkish word "çalışma" (ç is pronounced as in "ch" and ş as in "sh" [8]) are given in Figure 2. The spectrogram in Figure 2a belongs to a patient with paralyzed vocal cords, whereas Figure 2b shows the spectrogram of the normal phonation of the same word.

Fig. 2. Spectrograms of (a) dysphonic speech and (b) normal speech

Several studies demonstrate that the formant locations and bandwidths of dysphonic speech differ from those of normally phonated speech [4]. LPC spectra of dysphonic (solid line) and normally phonated (dashed line) phoneme samples are shown in Figure 3.

Fig. 3. LPC spectra of dysphonic and normal voice for the phonemes (a) /AA/ as in "dark", (b) /r/ as in "rate", (c) /k/ as in "coat", and (d) /s/ as in "sue"

As can be seen in Figures 3a-3b, a formant structure distortion is observed in voiced phonemes, while no significant distortion is observed in unvoiced ones (Figures 3c-3d). Moreover, the voiced frequency bands of unvoiced phonemes pronounced by a dysphonic speaker do not differ from those of normally phonated speech, in contrast to the voiced frequency bands of voiced phonemes. Unvoiced phonemes also have no perceived pitch when they are pronounced by a normal speaker.
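The comparison in Figure 3 rests on LPC spectral envelopes. The sketch below computes such an envelope for a single phoneme frame with the autocorrelation (Levinson-Durbin) method; it is an illustrative analysis recipe under the usual LPC assumptions (windowed frame, order around 12), not the authors' code, and the function names are ours.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve for LPC coefficients a[0..order] (a[0] = 1) from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err


def lpc_envelope_db(frame, order=12, n_fft=512):
    """LPC spectral envelope (dB) of one phoneme frame, as compared in Fig. 3."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a, err = levinson_durbin(r[:order + 1], order)
    # Envelope = gain / |A(e^jw)| evaluated on an FFT grid
    denom = np.abs(np.fft.rfft(a, n_fft)) + 1e-12
    return 20.0 * np.log10(np.sqrt(np.maximum(err, 1e-12)) / denom)
```

Overlaying the envelopes of a dysphonic and a normal sample of the same phoneme reproduces the kind of comparison shown in Figure 3.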

3 Dysphonic Speech Enhancement System

As discussed in Section 2, dysphonic speech has no perceived pitch and no periodic excitation, and a distortion of the formant structure is observed. In order to enhance dysphonic speech, voicing decision, pitch estimation, and gain and formant structure modification should be applied. On the other hand, applying the same procedure to unvoiced phonemes decreases intelligibility. As a novel approach, the proposed system modifies the acoustic parameters of all phonemes except the unvoiced ones in order to increase the quality of the synthetic speech.

3.1 Detection of Unvoiced Phonemes

The need to classify a given speech segment as voiced or unvoiced arises in many speech analysis systems. Pitch analysis, the autocorrelation function, and the zero crossing rate are the methods usually used to make the voiced-unvoiced decision [9]. However, since no perceived pitch is observed in dysphonic speech, it is hard to make the voiced-unvoiced decision using pitch analysis. In addition, autocorrelation coefficients and zero crossing rates are not distinctive features for voiced-unvoiced classification of dysphonic speech. In the proposed system, speaker-dependent classification of voiced and unvoiced phonemes is performed using line spectrum frequencies (LSFs). We manually constructed two classes of phonemes with respect to their articulation: the first class contains unvoiced phonemes and the second contains voiced phonemes. The training set consists of the average line spectrum frequencies of voiced and unvoiced dysphonic phonemes. A k-nearest neighbour classifier with cross validation was applied for the detection of unvoiced phonemes; a code sketch of this step is given below. The classification accuracy for the phoneme groups for k = 3 is given in Table 1. Analysis of the classification errors showed that about 48 percent of the errors occurred when classifying the voiced consonants /z/, /r/, /j/ and /g/, whereas about 2 percent of the errors were observed for /y/, /v/, /m/, /n/, /l/, /d/ and /SH/. Moreover, we observed that the system frequently misclassified the unvoiced phonemes /HH/ and /p/. In the proposed system, the acoustic parameters of voiced phonemes are modified while the acoustic parameters of unvoiced phonemes are preserved.

Table 1. Classification accuracy of phoneme groups (k = 3)

                                           Vowels   Voiced consonants   Unvoiced consonants
Classified as unvoiced consonant            5.12%        17.23%               74.38%
Classified as voiced consonant or vowel    94.88%        82.77%               25.62%
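A minimal sketch of this classification step, assuming the averaged LSF vectors have already been extracted, is given below. scikit-learn is used here as a stand-in implementation of k-NN with cross validation; the function name and the choice of 5 folds are our assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict


def detect_unvoiced_phonemes(lsf_vectors, voicing_labels, k=3):
    """Speaker-dependent voiced/unvoiced decision from averaged LSF vectors.

    lsf_vectors    : (n_phonemes, lsf_order) array, one averaged LSF vector
                     per dysphonic phoneme sample
    voicing_labels : (n_phonemes,) array, 1 = voiced, 0 = unvoiced,
                     assigned manually from the articulation of each phoneme
    Returns the cross-validated k-NN prediction for every phoneme.
    """
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_predict(knn, np.asarray(lsf_vectors),
                             np.asarray(voicing_labels), cv=5)
```

Only the phonemes predicted as voiced are passed on to the pitch, voicing, and formant modification stages of Sections 3.2-3.4; unvoiced phonemes keep their original MELP parameters.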

3.2 Voicing Decision

The proposed method fixes the lower four frequency bands (0-3 kHz) as voiced, while fixing the upper band (3-4 kHz) as unvoiced [7].

3.3 Pitch Estimation

Dysphonic speech has no perceived pitch, yet the synthetic speech should sound natural. To accomplish this, a pitch estimation process is applied to voiced speech segments. Using the observed correlation between intensity and perceived pitch, the pitch parameter is estimated by the following equation, where pitch_new(n) is the estimated new pitch of frame n, gain(n) is the gain of frame n, gain_avg is the average gain of the dysphonic speech segment, and pitch_ref is the reference pitch [7]. While pitch_ref adjusts the tone of the synthetic speech, the scale factor beta adjusts the dynamic range of the pitch period.

pitch_new(n) = ((gain(n) - gain_avg) * beta) + pitch_ref    (1)

In the proposed system, pitch_ref is calculated automatically. Since it is difficult to obtain the normal voice of the dysphonic speaker, and since whispered speech, like dysphonic speech, has no perceived pitch period, the second formant frequency of the whispered /AA/ phoneme is used to calculate the most appropriate pitch for the dysphonic speaker. Several studies point out a relationship between pitch and formant frequencies [10, 11]. To formulate this relationship, the formant frequencies of the /AA/ phoneme produced by different speakers were studied. Spectra of the normally phonated /AA/ phoneme voiced by four speakers with various voice tones are shown in Figure 4. The pitch periods of the speakers were calculated as 20, 36, 52, and 89 using the normalized autocorrelation function. As seen in Figure 4, as the pitch period increases, the second formant frequency decreases.

Fig. 4. Spectra of the normally phonated /AA/ phoneme voiced by four speakers
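A short sketch of the band-voicing rule of Section 3.2 and of equation (1) follows. Note that "beta" is the name adopted above for the dynamic-range scale factor, whose original symbol is not preserved in the extracted text, and the function names are illustrative.

```python
import numpy as np


def melp_band_voicing():
    # Section 3.2: of the five MELP frequency bands, the four bands covering
    # 0-3 kHz are forced voiced and the 3-4 kHz band is forced unvoiced [7].
    return np.array([1, 1, 1, 1, 0])


def generate_pitch_contour(frame_gains, pitch_ref, beta):
    """Eq. (1): pitch_new(n) = ((gain(n) - gain_avg) * beta) + pitch_ref.

    frame_gains : per-frame MELP gain values of the voiced segment
    pitch_ref   : reference pitch derived from the whispered /AA/ formant
    beta        : scale factor controlling the dynamic range of the pitch
    """
    gains = np.asarray(frame_gains, dtype=float)
    return (gains - gains.mean()) * beta + pitch_ref
```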

Spectra of the whispered versions of the same phoneme are shown in Figure 5.

Fig. 5. Spectra of the whispered /AA/ phoneme voiced by four speakers

As is evident from Figure 5, the second formant frequency of the whispered /AA/ phoneme voiced by speaker 1, who has the shortest pitch period, is the highest; pitch period and second formant frequency are inversely related. The reference pitch pitch_ref can therefore be calculated by linear interpolation using the following equations, where f_high and p_high are the second formant frequency and pitch period of the speaker who has the highest pitch, f_low and p_low are those of the speaker who has the lowest pitch, and f_2 is the second formant frequency of the dysphonic speaker's whispered /AA/ phoneme:

a = (p_high - p_low) / (f_high - f_low)    (2)

pitch_ref = (f_2 - f_low) * a + p_low    (3)

In the proposed system, pitch_ref is calculated from the pitch period and second formant frequency values of the speakers in the training set who have the highest and the lowest pitch. Hence, f_low, p_low, f_high, and p_high were set to 897, 89, 1788, and 20, respectively.

3.4 Formant Structure Modification

In the proposed system, LSF-based formant structure modification is applied to obtain narrower bandwidths and altered formant frequencies [12]. The LSP trajectories are smoothed with a median filter during vowels, without destroying the rapidly varying spectral content of the phonemes [7].
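A small sketch of the reference-pitch interpolation of equations (2)-(3) and of the LSP-trajectory smoothing of Section 3.4 is given below. The anchor values are taken from the text; the function names and the median-filter kernel length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import medfilt


def reference_pitch(f2_whispered, f_low=897.0, p_low=89.0,
                    f_high=1788.0, p_high=20.0):
    """Eqs. (2)-(3): interpolate a pitch period from the second formant
    frequency of the dysphonic speaker's whispered /AA/ phoneme."""
    a = (p_high - p_low) / (f_high - f_low)      # Eq. (2)
    return (f2_whispered - f_low) * a + p_low    # Eq. (3)


def smooth_vowel_lsf(lsf_frames, kernel=5):
    """Median-filter each LSF trajectory across the frames of a vowel
    (Section 3.4); non-vowel frames are left untouched elsewhere."""
    lsf_frames = np.asarray(lsf_frames, dtype=float)
    return np.column_stack([medfilt(lsf_frames[:, i], kernel)
                            for i in range(lsf_frames.shape[1])])
```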

4 Experimental Results

In this study, 50 triphone-balanced sentences were recorded from 5 male and 2 female dysphonic native Turkish speakers. Preserving the acoustic features of unvoiced phonemes increases the intelligibility of the synthetic speech. Figure 6a shows the spectrogram of the synthetic speech for the dysphonic word "çalışma" (Figure 2a) produced by modifying every phoneme, whereas Figure 6b shows the spectrogram for the same word produced by modifying only the voiced phonemes.

Fig. 6. Spectrograms of the synthetic speech for the word "çalışma": (a) produced by modification of every phoneme, (b) produced with the acoustic features of unvoiced phonemes preserved

As is evident from Figure 6, preserving the acoustic features of unvoiced phonemes results in synthetic speech that is closer to normally phonated speech. To measure the spectral differences between normal and synthetic speech, log spectral distances were used; the average spectral enhancement obtained is 25%. Because the spectral difference captures only one part of the conversion, subjective testing was also applied to evaluate how well normal speech can be synthesized from dysphonic speech. Five listeners were asked to rate the synthetic speech in terms of intelligibility and similarity to normal speech on a five-point scale, with 5 being best.

Table 2. Subjective listening test results

                             Intelligibility   Similarity to normal speech
Original dysphonic speech          2.1                    1.1
Enhanced speech                    2.7                    2.5
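As a hedged illustration of the objective measure, the sketch below computes a frame-averaged log spectral distance between two time-aligned magnitude spectrograms and the relative improvement quoted above as "spectral enhancement". The exact distance definition and alignment used by the authors are not given in the text, so this is an assumed formulation.

```python
import numpy as np


def log_spectral_distance(spec_a, spec_b, eps=1e-10):
    """Frame-averaged RMS distance (in dB) between two aligned
    magnitude spectrograms of shape (n_frames, n_bins)."""
    log_a = 20.0 * np.log10(np.abs(spec_a) + eps)
    log_b = 20.0 * np.log10(np.abs(spec_b) + eps)
    per_frame = np.sqrt(np.mean((log_a - log_b) ** 2, axis=1))
    return float(per_frame.mean())


def spectral_enhancement(normal, dysphonic, enhanced):
    """Relative reduction of the distance to normal speech after enhancement."""
    d_before = log_spectral_distance(normal, dysphonic)
    d_after = log_spectral_distance(normal, enhanced)
    return (d_before - d_after) / d_before   # e.g. 0.25 for a 25% improvement
```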

5 Conclusion

This paper presents a MELP-based system that enhances dysphonic speech. To reconstruct normal speech from dysphonic speech, pitch generation and formant and voicing modification steps were applied only to voiced phonemes, leaving the unvoiced phonemes unmodified. Subjective listening tests indicate that the resulting synthetic speech is markedly closer to normally phonated speech. Adapting the formant modification to the structure of each phoneme and computing more natural pitch contours would further increase the success rate. The proposed system could be used to improve the quality of life of dysphonic patients in everyday situations such as telecommunication applications.

Acknowledgements

We wish to express our appreciation to the Ear, Nose, Throat and Head & Neck Surgery Department of Istanbul University Cerrahpasa Medical Faculty for their support of this work.

References

1. Eastern Virginia Medical School, http://www.evmsent.org
2. Aguilar, G., Nakano-Miyatake, M.: Alaryngeal Speech Enhancement Using Pattern Recognition Techniques. IEICE Transactions on Information and Systems, vol. E88-D, no. 7, pp. 1618-1622 (2005)
3. Bi, N., Qi, Y.: Speech Conversion and its Application to Alaryngeal Speech Enhancement. In: Proc. ICSP96, pp. 1586-1589 (1997)
4. Sawada, H., Takeuchi, N., Hisada, A.: A Real-time Clarification Filter of a Dysphonic Speech and Its Evaluation by Listening Experiments. In: Proc. International Conference on Disability, Virtual Reality and Associated Technologies (ICDVRAT 2004), pp. 239-246 (2004)
5. Pozo, A., Young, S.: Continuous Tracheoesophageal Speech Repair. In: Proc. EUSIPCO (2006)
6. Qi, Y., Weinberg, B., Bi, N.: Enhancement of Female Esophageal and Tracheoesophageal Speech. Journal of the Acoustical Society of America, vol. 98, pp. 2461-2465 (1995)
7. Morris, R.W., Clements, M.A.: Reconstruction of Speech from Whispers. Medical Engineering and Physics, vol. 24, no. 7, pp. 515-520 (2002)
8. The International Phonetic Association, http://www.arts.gla.ac.uk/ipa/fullchart.html
9. Atal, B.S., Rabiner, L.R.: A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 3 (1976)
10. Thomas, I.B.: Perceived Pitch of Whispered Vowels. Journal of the Acoustical Society of America, vol. 46, no. 2, p. 468 (1969)
11. Higashikawa, M., Nakai, K., Sakakura, A., Takahashi, H.: Perceived Pitch of Whispered Vowels - Relationship with Formant Frequencies: A Preliminary Study. Journal of Voice, pp. 155-158 (1996)
12. McLoughlin, I.V., Chance, R.J.: LSP-based Speech Modification for Intelligibility Enhancement. In: Proc. 13th International Conference on DSP, vol. 2, pp. 591-594 (1997)