PERCEPTUAL RESTORATION OF INTERMITTENT SPEECH USING HUMAN SPEECH-LIKE NOISE


rd International Congress on Sound & Vibration, Athens, Greece, 0- July 06 (ICSV)

Mitsunori Mizumachi, Shouma Imanaga
Kyushu Institute of Technology, - Sensui-cho, Tobata-ku, Kitakyushu, Fukuoka 80-80, Japan
email: mizumach@ecs.kyutech.ac.jp

Toshiharu Horiuchi
KDDI R&D Laboratories, Inc., -- Ohara, Fujimino, Saitama, 6-80 Japan

Mobile phones have caused an explosive increase in packet traffic, which in turn causes a serious problem: packet loss. Packet loss concealment is indispensable for smooth speech communication. On the client side, waveform substitution is a popular concealment method and has been standardized by the ITU. ITU-T G.7 conceals packet loss by inserting a phase-adjusted, amplitude-attenuated copy of the previous packet into each break, but it cannot deal with long breaks over 60 ms, that is, burst loss. The authors have previously proposed an alternative packet loss concealment method that relies on a human auditory capability. This method does not aim to restore the waveform of the intermittent speech signal; instead, it achieves perceptual restoration by relying on phonemic restoration. When a gap in an intermittent speech signal is filled with a loud arbitrary signal, listeners hear the restored speech as continuous even if some segments of the original speech signal are completely lost. There is a trade-off between the smoothness of the restored speech and the noisiness of the gap-filling signal. Previously, the gap-filling signal was composed of a harmonic complex and ambient noise. In this paper, a human speech-like noise is used as the gap-filling signal. The speech-like noise can be prepared by repeatedly overlapping short-term human speech signals. It is confirmed that the proposed gap-filling signal succeeds in reducing the noisiness.

1. Introduction

Digital speech communication faces a serious problem. A rapid increase in packet traffic causes packet loss, and recently long-term packet loss, that is, burst loss, has seriously degraded the quality of speech communication. Packet loss concealment is indispensable for stress-free speech communication. Waveform substitution [1] is one of the most popular packet loss concealment approaches, and ITU-T has standardized it for VoIP speech communication [2]. Model-based waveform regeneration [3] is also a well-known approach, but it is computationally expensive. These methods assume short-term packet loss and cannot cope with burst loss. For example, the ITU-T G.7 method cannot conceal burst loss whose duration exceeds 60 ms.

The authors have proposed an alternative, perceptual restoration of packet loss based on an auditory illusion [4, 5, 6]. Interestingly, an intermittent speech signal can be perceived as continuous when the gaps are filled with noise. This auditory illusion is called the phonemic restoration effect [7]. When a gap in the intermittent speech signal is filled with a wideband signal such that the signal-to-noise ratio is less than -0 dB, the intermittent speech is heard smoothly even if some segments of the original speech signal are completely lost [8]. The authors have proposed less-harsh ambient noise combined with a speech-like harmonic complex as the gap-filling signal [4, 5]. The feasibility of the proposed method has been confirmed under quiet and noisy conditions [6].

In this paper, the gap-filling signal is further improved using a human speech-like noise, which can be prepared by overlapping speech signals. The speech-like noise is expected to decrease the noisiness and incongruity of the insertion signal. Because the characteristics of human speech-like noises vary with the number of overlapping speech signals and other factors, they are investigated by listening tests. The proposed gap-filling signal is then subjectively evaluated against the previously proposed method [6].

Table 1: Experimental conditions for optimizing human speech-like noise.
  Target speech: Japanese sentences uttered by a male speaker
  Duration of speech break (packet loss): 0 ms
  Target speech to insertion noise ratio (SIR): dB, 0 dB, - dB
  Individuality of human speech-like noise: speaker-dependent and speaker-independent
  Number of overlapping speech signals: ,,, 0,, 0, 0

2. Perceptual restoration of intermittent speech

Packet loss is perceptually concealed by relying on the phonemic restoration effect [7]. To achieve perceptual restoration, it is important to design a reasonable gap-filling signal, one that increases the smoothness of the resultant restored speech while keeping its own noisiness low. The gap-filling signal is designed based on static and dynamic characteristics of speech. A broadband signal is suitable as the gap-filling signal because it satisfies the masking potential rule [8]. The authors have confirmed that the phonemic restoration effect occurs when low-frequency components of the target speech are masked by ambient noise such as air conditioner noise [4]. The air conditioner noise is mixed with a harmonic complex, which aims to mask the higher-order harmonic components of speech. The gap-filling signal has also been modified to take the dynamic characteristics of speech into account. It has been confirmed that temporal variation of the gap-filling signal contributes to decreasing the noisiness of the insertion signal [5].

3. Perceptual restoration using human speech-like noise

3.1 Human speech-like noise

A human speech-like noise is prepared by overlapping short-term speech signals. Its characteristics vary with the number of overlaps. If fewer than ten speech signals are added, the speech-like noise is perceived as overlapping speech. When hundreds of speech signals are overlapped, the resultant speech-like noise becomes a stationary noise whose frequency characteristics correspond to the long-term average spectrum of speech.

The human speech-like noise can be prepared using speech signals uttered by a single speaker or by multiple speakers. A variety of human speech-like noises can be designed based on speaker dependence, gender dependence, language dependence, and so on. The International Speech Test Signal (ISTS), which was developed for testing hearing aids [9], is one of the well-known speech-like noises. The ISTS is composed of female speech material in American English, Arabic, Chinese, French, German, and Spanish. In this study, a language-dependent speech-like noise is suitable for packet loss concealment, so the main concerns are speaker individuality and the number of overlaps.
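The dependence on the number of overlaps described in Section 3.1 can be illustrated with a rough numerical sketch. It is not taken from the paper: the "talkers" are synthetic amplitude-modulated white noise standing in for real speech, and the stationarity measure (coefficient of variation of the short-term level) as well as the names `synthetic_talker`, `overlap_talkers`, and `envelope_variation` are assumptions made purely for illustration.

```python
import numpy as np


def synthetic_talker(n_samples, fs, rng):
    """Stand-in for one speech signal: white noise with a slow envelope.

    This is an assumption for illustration only, not real speech.
    """
    t = np.arange(n_samples) / fs
    rate_hz = rng.uniform(3.0, 6.0)          # slow modulation rate (assumed)
    phase = rng.uniform(0.0, 2.0 * np.pi)    # independent phase per talker
    envelope = np.abs(np.sin(2.0 * np.pi * rate_hz * t + phase))
    return envelope * rng.standard_normal(n_samples)


def overlap_talkers(n_overlap, n_samples, fs, seed=0):
    """Add n_overlap independent synthetic talkers sample by sample."""
    rng = np.random.default_rng(seed)
    total = np.zeros(n_samples)
    for _ in range(n_overlap):
        total += synthetic_talker(n_samples, fs, rng)
    return total


def envelope_variation(x, frame_len):
    """Coefficient of variation of the short-term RMS level across frames.

    Smaller values mean a steadier, more stationary-sounding signal.
    """
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    level = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.std(level) / np.mean(level))


fs = 16000                # sampling rate in Hz (assumed)
n_samples = 5 * fs        # five seconds of signal
frame_len = fs // 50      # 20 ms analysis frames (assumed)
for n in (1, 10, 200):
    noise = overlap_talkers(n, n_samples, fs)
    print(n, round(envelope_variation(noise, frame_len), 3))
```

With these stand-in signals the level fluctuation shrinks as more talkers are added, which matches the qualitative description above; real speech material would be needed to say anything about the perceptual boundary between "overlapping speech" and "stationary noise".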

3.2 Subjective optimization of human speech-like noise

Speech signals are divided into segments with a duration of 00 ms, and a part with a duration of 0 ms is then randomly cut out from each segment. A speech-like noise is prepared by adding the designated number of different 0 ms speech segments. The restoration of intermittent speech with the speech-like noises as insertion signals was evaluated subjectively using a five-grade mean opinion score (MOS). Students with normal hearing participated in the listening test and rated each restored speech signal twice, in random order. The experimental conditions are summarized in Table 1.

Figures 1 and 2 show the results for speaker-dependent and speaker-independent gap-filling signals, respectively, with no background noise. There is no significant difference due to speaker dependency. Strictly speaking, the most suitable number of overlaps depends on the speech to insertion noise ratio. On the whole, the results suggest that the number of overlaps should be more than 0 and that up to 0 may be enough.

Figure 1: Feasibility of speaker-dependent speech-like insertion for restoring intermittent speech. (Mean opinion score versus target speech to speech-like insertion noise ratio, for each number of overlaps of the human speech-like noise.)

Figure 2: Feasibility of speaker-independent speech-like insertion for restoring intermittent speech. (Mean opinion score versus target speech to speech-like insertion noise ratio, for each number of overlaps of the human speech-like noise.)
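The preparation procedure in Section 3.2 (split the speech into segments, cut a random excerpt from each, add the excerpts) can be sketched as follows. This is a minimal sketch, not the authors' implementation. The segment and excerpt durations are incomplete in this transcript, so the defaults of 200 ms and 100 ms are placeholders, and the function name `make_speech_like_noise` and the unit-RMS normalization are likewise assumptions.

```python
import numpy as np


def make_speech_like_noise(speech, fs, n_overlap,
                           segment_ms=200.0, excerpt_ms=100.0, seed=0):
    """Add n_overlap random excerpts, each taken from a different segment.

    segment_ms and excerpt_ms are placeholder values (assumptions); the
    result is scaled to unit RMS, which is also an assumption.
    """
    speech = np.asarray(speech, dtype=float)
    rng = np.random.default_rng(seed)
    seg_len = int(round(fs * segment_ms / 1000.0))
    exc_len = int(round(fs * excerpt_ms / 1000.0))
    if exc_len > seg_len:
        raise ValueError("excerpt must not be longer than a segment")
    n_segments = len(speech) // seg_len
    if n_overlap > n_segments:
        raise ValueError("not enough speech material for this many overlaps")

    # Pick n_overlap different segments, then a random excerpt inside each.
    chosen = rng.choice(n_segments, size=n_overlap, replace=False)
    noise = np.zeros(exc_len)
    for k in chosen:
        offset = rng.integers(0, seg_len - exc_len + 1)
        start = k * seg_len + offset
        noise += speech[start:start + exc_len]

    level = np.sqrt(np.mean(noise ** 2))
    return noise / level if level > 0 else noise


# Usage with a synthetic stand-in for recorded speech (white noise here).
fs = 16000
stand_in_speech = np.random.default_rng(1).standard_normal(10 * fs)
filler = make_speech_like_noise(stand_in_speech, fs, n_overlap=30)
print(filler.shape)
```

For a speaker-dependent noise, `speech` would hold material from the talker being restored; for a speaker-independent noise, it could be a concatenation of material from several talkers.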

4. Performance evaluation

4.1 Procedure

The feasibility of the proposed speech-like noise is examined in comparison with the previously proposed method, which employs a mixture of a harmonic complex and ambient noise as the insertion signal [6]. Restored intermittent speech signals were subjectively evaluated with the five-grade MOS on three criteria: smoothness of the restored speech, noisiness of the insertion signal, and comprehensive evaluation. Listening tests were carried out with participants who were student volunteers with normal hearing, under the experimental conditions in Table 2.

Table 2: Experimental conditions for performance evaluation.
  Target speech: Japanese sentences uttered by three male speakers
  Background noise: Station yard noise [10]
  Duration of speech break (packet loss): 0 ms
  Target speech to insertion noise ratio (SIR): - dB
  Target speech to background noise ratio (SNR): 9 dB, 6 dB, dB, and no background noise
  Individuality of human speech-like noise: speaker-dependent and speaker-independent
  Number of overlapping speech signals: 0

4.2 Experimental results

Experimental results are given in Figs. 3, 4, and 5. Concerning the smoothness of restored speech, the speaker-independent speech-like noises are superior to the speaker-dependent insertion noises in restoring intermittent speech. The speaker-dependent insertion noises did not gain an advantage over the previously proposed insertion signals [6]. On the other hand, the speaker-dependent speech-like noises significantly reduce noisiness compared with both the speaker-independent speech-like noises and the previously proposed insertion signals, regardless of the level of the background noise. Figure 5 indicates that the speaker-dependent speech-like noise is generally suitable as an insertion signal for restoring intermittent speech in noisy environments.

5. Perspective for practical application

In a practical situation, a speaker-dependent speech-like noise can be prepared from the received speech signals just after speech communication is established. Then, once packet loss occurs, the prepared speech-like noise is immediately substituted for the lost packets. The proposed method also has an advantage over conventional waveform substitution methods such as the ITU-T G.7 method in terms of computational cost.
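The substitution step outlined in Section 5 can be sketched as below, assuming the speech-like noise has already been prepared. This is a hedged illustration rather than the authors' system: the packet layout (one row per packet), the name `conceal_lost_packets`, the way the speech level is estimated from received packets, and the example SIR value are all assumptions. The noise is scaled so that 20·log10(speech RMS / noise RMS) equals the requested speech to insertion noise ratio.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of an array (0.0 for an empty array)."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean(x ** 2))) if x.size else 0.0


def conceal_lost_packets(packets, received, filler, sir_db):
    """Replace lost packets with gap-filling noise at a target SIR.

    packets  : array of shape (n_packets, packet_len); lost rows are ignored
    received : boolean array, True where the packet arrived
    filler   : 1-D gap-filling noise, reused cyclically across gaps
    sir_db   : target speech to insertion noise ratio in dB (assumed value)
    """
    packets = np.asarray(packets, dtype=float)
    received = np.asarray(received, dtype=bool)
    filler = np.asarray(filler, dtype=float)
    filler_rms = rms(filler)
    if filler_rms == 0.0:
        raise ValueError("filler must not be silent")

    # Speech level is estimated from the packets that did arrive.
    speech_rms = rms(packets[received])
    target_noise_rms = speech_rms / (10.0 ** (sir_db / 20.0))
    gain = target_noise_rms / filler_rms

    out = packets.copy()
    packet_len = packets.shape[1]
    pos = 0
    for i in np.flatnonzero(~received):
        idx = (pos + np.arange(packet_len)) % len(filler)
        out[i] = gain * filler[idx]
        pos = (pos + packet_len) % len(filler)
    return out


# Usage with synthetic stand-ins: ten packets, two of them lost in a burst.
rng = np.random.default_rng(0)
packets = rng.standard_normal((10, 160))
received = np.ones(10, dtype=bool)
received[3:5] = False
restored = conceal_lost_packets(packets, received,
                                rng.standard_normal(1000), sir_db=-10.0)
print(restored.shape)
```

A negative `sir_db` makes the inserted noise louder than the surrounding speech, in line with the loud gap-filling signal that the text relies on. Unlike waveform substitution, no pitch or phase alignment is computed here.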

Figure 3: MOS on smoothness of restored speech. (MOS versus target speech to background noise ratio, for no restoration, the previous restoration method, speaker-dependent speech-like noise, and speaker-independent speech-like noise.)

Figure 4: MOS on noisiness of insertion signal. (Same conditions and axes as Figure 3.)

Figure 5: MOS on comprehensive evaluation. (Same conditions and axes as Figure 3.)

6. Conclusions

Perceptual restoration of intermittent speech, which assumes burst loss of packets, is improved using a human speech-like noise. The perceptual characteristics of human speech-like noises vary depending on the number of overlapping speech signals. It is confirmed that an overlap of 0 speech segments is enough for restoring intermittent speech. The subjective evaluation suggests that a speaker-dependent speech-like noise is the most suitable in practical noisy environments. Future work includes performance evaluation of the proposed method under various conditions of speech communication.

Acknowledgement

This work was partly supported by JSPS KAKENHI Grant Number 6097. The authors thank Professor Christian Giguère for fruitful suggestions.

REFERENCES

1. Goodman, D. J., Lockhart, G., Wasem, O. and Wong, W. C. Waveform substitution techniques for recovering missing speech segments in packet voice communications, IEEE Trans. Acoust., Speech and Signal Process., (6), 0 8, (986).
2. ITU Recommendation G.7 Appendix I, (999), A high quality low-complexity algorithm for packet loss concealment with G.7.
3. Chen, Y. L. and Chen, B. S. Model-based multi-rate representation of speech signals and its application to recovery of missing speech packets, IEEE Trans. on Speech and Audio Process., (), 0, (997).
4. Mizumachi, M., Ohga, K., Fujii, M. and Horiuchi, T. Restoration of intermittent speech with composite gap-filling schemes relying on human auditory capability, Proc. ICSV9, paper ID: 08, (0).
5. Mizumachi, M., Motomura, S., Takakura, T. and Horiuchi, T. Restoration of intermittent speech based on human auditory capability and temporal characteristics of speech, Proc. ICSV0, paper ID: 7, (0).
6. Mizumachi, M., Motomura, S., Takakura, T. and Horiuchi, T. Perceptual restoration of intermittent speech under noisy environments, Proc. ICSV, paper ID: 00, (0).
7. Warren, R. M. Perceptual restoration of missing speech sounds, Science, 67, 9 9, (970).
8. Kashino, M. Phonemic restoration: The brain creates missing speech sounds, Acoustical Science and Technology, 7(6), 8, (006).
9. Holube, I., Fredelake, S., Vlaming, M. and Kollmeier, B. Development and analysis of an international speech test signal (ISTS), Int. J. Audiol., 9(), 89 90, (00).
10. Kawai, K., Fujimoto, K., Iwase, T., Yasuoka, H., Sakuma, T. and Hidaka, Y. Development of a sound source database for environmental/architectural acoustics: Introduction of smile 00 (sound material in living environment 00), Proc. International Congress on Acoustics, pp. 6 6, (00).