PERCEPTUAL RESTORATION OF INTERMITTENT SPEECH USING HUMAN SPEECH-LIKE NOISE


23rd International Congress on Sound & Vibration (ICSV23), Athens, Greece, 10-14 July 2016

Mitsunori Mizumachi, Shouma Imanaga
Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, Fukuoka 804-8550, Japan
email: mizumach@ecs.kyutech.ac.jp

Toshiharu Horiuchi
KDDI R&D Laboratories, Inc., 2-1-15 Ohara, Fujimino, Saitama, 356-8502 Japan

Mobile phones have caused an explosive increase in packet traffic, which leads to a serious problem: packet loss. Packet loss concealment is indispensable for achieving smooth speech communication. As a client-side packet loss concealment method, waveform substitution is popular and has been standardized by the ITU. ITU-T G.711 conceals packet loss by inserting a phase-adjusted, amplitude-attenuated copy of the previous packet into each break, but it cannot deal with long breaks of over 60 ms, that is, burst loss. The authors have previously proposed an alternative packet loss concealment method that relies on a human auditory capability. This method does not aim at restoring the waveform of the intermittent speech signal, but achieves perceptual restoration by relying on phonemic restoration. When a gap in an intermittent speech signal is filled with a loud arbitrary signal, we can listen to the restored speech smoothly even if some segments of the original speech signal are completely lost. There is a trade-off between the smoothness of the restored speech and the noisiness of the gap-filling signal. Previously, the gap-filling signal was composed of a harmonic complex and ambient noise. In this paper, a human speech-like noise is used as the gap-filling signal instead. The speech-like noise can be prepared by repeatedly overlapping short-term human speech signals. It is confirmed that the proposed gap-filling signal succeeds in reducing the noisiness.

1. Introduction

There is a serious problem in digital speech communication.
A rapid increase in packet traffic causes packet loss, and long-term packet loss, that is, burst loss, has recently come to degrade the quality of speech communication seriously. Packet loss concealment is indispensable for achieving stress-free speech communication. Waveform substitution [1] is one of the most popular packet loss concealment approaches, and ITU-T has standardized it for VoIP speech communication [2]. Model-based waveform regeneration [3] is also a well-known approach, but it requires rather high computational costs. These packet loss concealment methods assume short-term packet loss and cannot cope with burst loss. For example, the ITU-T G.711 method cannot conceal burst loss whose duration exceeds 60 ms.

The authors have proposed an alternative, perceptual restoration of packet loss based on an auditory illusion [4, 5, 6]. Interestingly, an intermittent speech signal can be perceived smoothly when its gaps are filled with noise. This auditory illusion is called the phonemic restoration effect [7]. When the gaps of the intermittent speech signal are filled with a wideband signal whose signal-to-noise ratio is less than -0 dB, we can hear the intermittent speech smoothly even if some segments of the original speech signal are completely lost [8]. The authors have proposed a less-harsh ambient noise mixed with a speech-like harmonic complex as the gap-filling signal [4, 5]. The feasibility of the proposed method has been confirmed under quiet and noisy conditions [6].

In this paper, the gap-filling signal is further improved by using a human speech-like noise, which can be prepared by overlapping speech signals. The speech-like noise is expected to decrease the noisiness and incongruity of the insertion signal. Because the characteristics of human speech-like noises vary with the number of overlapping speech signals and other factors, they are investigated by a listening test. The proposed gap-filling signal is then subjectively evaluated in comparison with the previously proposed method [6].

Table 1: Experimental conditions for optimizing human speech-like noise.

Target speech                                   Japanese sentences uttered by a male speaker
Duration of speech break (packet loss)          0 ms
Target speech to insertion noise ratio (SIR)    dB, 0 dB, - dB
Individuality of human speech-like noise        speaker-dependent and speaker-independent
Number of overlapping speech signals            ,,, 0,, 0, 0

2. Perceptual restoration of intermittent speech

Packet loss is perceptually concealed by relying on the phonemic restoration effect [7]. To achieve perceptual restoration, it is important to design a reasonable gap-filling signal, which should increase the smoothness of the resultant restored speech and decrease the noisiness of the gap-filling signal itself. The gap-filling signal is designed based on the static and dynamic characteristics of speech. A broadband signal is suitable as the gap-filling signal in order to satisfy the masking potential rule [8]. The authors have confirmed that the phonemic restoration effect occurs when the low-frequency components of the target speech are masked by ambient noise such as air conditioner noise [4].
The air conditioner noise is mixed with a harmonic complex, which aims at masking the higher-order harmonic components of speech. The gap-filling signal has also been modified in consideration of the dynamic characteristics of speech. It has been confirmed that temporal variation of the gap-filling signal contributes to decreasing the noisiness of the insertion signal [5].

3. Perceptual restoration using human speech-like noise

3.1 Human speech-like noise

A human speech-like noise is prepared by overlapping short-term speech signals. Its characteristics vary with the number of overlaps. If fewer than ten speech signals are added, the speech-like noise is perceived as overlapping speech. When hundreds of speech signals are overlapped, the resultant speech-like noise becomes a stationary noise whose frequency characteristics approach the long-term average spectrum of speech.

The human speech-like noise can be prepared using speech signals uttered by a single speaker or by multiple speakers. A variety of human speech-like noises can be designed based on speaker dependence, gender dependence, language dependence, and so on. The International Speech Test Signal (ISTS), which was developed for testing hearing aids [9], is one of the well-known speech-like noises. The ISTS is composed of female speech materials in American English, Arabic, Chinese, French, German, and Spanish. In this study, a language-dependent speech-like noise is considered suitable for packet loss concealment, so the main concerns are speaker individuality and the number of overlaps.

Figure 1: Feasibility of speaker-dependent speech-like insertion for restoring intermittent speech. (Vertical axis: mean opinion score; horizontal axis: target speech to speech-like insertion noise ratio; one series per number of overlaps of the human speech-like noise.)

Figure 2: Feasibility of speaker-independent speech-like insertion for restoring intermittent speech. (Vertical axis: mean opinion score; horizontal axis: target speech to speech-like insertion noise ratio; one series per number of overlaps of the human speech-like noise.)

3.2 Subjective optimization of human speech-like noise

Speech signals are divided into segments with a duration of 00 ms, and then a part with a duration of 0 ms is randomly cut out from each segment. A speech-like noise is prepared by adding the designated number of different speech parts, each with a duration of 0 ms. A subjective evaluation was carried out on the restoration of intermittent speech with the speech-like noises as insertion signals. The feasibility of the speech-like noises was subjectively examined with the five-grade mean opinion score (MOS). Students with normal hearing participated in the listening test and gave a MOS twice for each restored speech signal, presented in random order. The experimental conditions are summarized in Table 1.

Figures 1 and 2 show the results for the speaker-dependent and speaker-independent gap-filling signals, respectively, under the no-background-noise condition. There is no significant difference due to speaker dependency. Strictly speaking, the most suitable number of overlaps differs depending on the speech to insertion noise ratio. On the whole, it is suggested that the number of overlaps should be more than 0 and that up to 0 might be enough.
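The segment-cutting and overlapping procedure described in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample-based lengths `segment_len` and `cut_len`, the use of a seeded random generator, and the plain unweighted summation are all hypothetical choices, and the actual durations and overlap counts are those listed in the experimental conditions.

```python
import random


def cut_random_parts(speech, segment_len, cut_len, rng):
    """Split speech into consecutive segments and cut one random part from each.

    speech      -- list of float samples (assumed mono)
    segment_len -- segment length in samples (hypothetical parameter)
    cut_len     -- length of the part cut from each segment, in samples
    """
    if cut_len > segment_len:
        raise ValueError("cut_len must not exceed segment_len")
    parts = []
    for start in range(0, len(speech) - segment_len + 1, segment_len):
        # Random start position of the part inside this segment (inclusive bounds).
        offset = rng.randint(0, segment_len - cut_len)
        parts.append(speech[start + offset:start + offset + cut_len])
    return parts


def make_speech_like_noise(speech, n_overlap, segment_len, cut_len, seed=0):
    """Add n_overlap different speech parts sample by sample."""
    rng = random.Random(seed)
    parts = cut_random_parts(speech, segment_len, cut_len, rng)
    if n_overlap > len(parts):
        raise ValueError("not enough segments for the requested overlap")
    chosen = rng.sample(parts, n_overlap)  # parts come from distinct segments
    noise = [0.0] * cut_len
    for part in chosen:
        for i, sample in enumerate(part):
            noise[i] += sample
    return noise
```

For a speaker-dependent noise, `speech` would hold material from the target speaker only; for a speaker-independent noise, material from several speakers could be concatenated before calling `make_speech_like_noise`. Level normalization is deliberately left out here, since the insertion level is set later through the speech to insertion noise ratio.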

Table 2: Experimental conditions for performance evaluation.

Target speech                                      Japanese sentences uttered by three male speakers
Background noise                                   Station yard noise [10]
Duration of speech break (packet loss)             0 ms
Target speech to insertion noise ratio (SIR)       - dB
Target speech to background noise ratio (SNR)      9 dB, 6 dB, dB, and no background noise
Individuality of human speech-like noise           speaker-dependent and speaker-independent
Number of overlapping speech signals               0

4. Performance evaluation

4.1 Procedure

The feasibility of the proposed speech-like noise is examined in comparison with the previously proposed method, which employs a mixture of a harmonic complex and ambient noise as the insertion signal [6]. Restored intermittent speech signals were subjectively evaluated with the five-grade MOS on three items: smoothness of the restored speech, noisiness of the insertion signal, and comprehensive evaluation. Listening tests were carried out with participants who were student volunteers with normal hearing, under the experimental conditions in Table 2.

4.2 Experimental results

The experimental results are given in Figs. 3, 4, and 5. Concerning the smoothness of the restored speech, the speaker-independent speech-like noises are superior to the speaker-dependent insertion noises in restoring intermittent speech. The speaker-dependent insertion noises did not gain an advantage over the previously proposed insertion signals [6]. On the other hand, the speaker-dependent speech-like noises significantly reduce the noisiness compared with both the speaker-independent speech-like noises and the previously proposed insertion signals. This result did not depend on the level of the background noise. Figure 5 indicates that the speaker-dependent speech-like noise is generally suitable as an insertion signal for restoring intermittent speech under noisy environments.
4.3 Perspective for practical application

In a practical situation, a speaker-dependent speech-like noise can be prepared from the received speech signals just after speech communication is established. Then, once packet loss occurs, the prepared speech-like noise is immediately substituted for the lost packets. The proposed method also has an advantage in computational cost over conventional waveform substitution methods such as the ITU-T G.711 method.
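The substitution step described above can be sketched as follows. This is an illustrative sketch only, under several assumptions that the paper does not specify: packet loss is represented by a per-sample boolean mask `lost`, the target speech to insertion noise ratio is interpreted as an RMS ratio in dB measured over the received samples, and the prepared noise is simply looped when a gap is longer than the noise. The names `rms` and `fill_gaps` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def fill_gaps(speech, lost, noise, sir_db):
    """Replace lost samples with noise at a given speech to insertion noise ratio.

    speech -- list of float samples; values at lost positions are ignored
    lost   -- list of bools, True where the sample belongs to a lost packet
    noise  -- prepared gap-filling signal (e.g. a speech-like noise)
    sir_db -- assumed definition: 20*log10(rms(received speech) / rms(inserted noise))
    """
    received = [v for v, is_lost in zip(speech, lost) if not is_lost]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("gap-filling noise must not be silent")
    # Gain that brings the noise to the requested level relative to the speech.
    gain = rms(received) / (noise_rms * 10.0 ** (sir_db / 20.0))

    restored = list(speech)
    k = 0  # read position inside the noise, restarted at every new gap
    for i, is_lost in enumerate(lost):
        if is_lost:
            restored[i] = gain * noise[k % len(noise)]
            k += 1
        else:
            k = 0
    return restored
```

A negative `sir_db` makes the inserted noise louder than the surrounding speech, which is the regime the phonemic restoration effect relies on. In a receiver, `noise` would be prepared once from the first received speech and reused for every subsequent gap, so that the per-gap cost is a single gain multiplication per sample.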

Figure 3: MOS on smoothness of restored speech. (Vertical axis: MOS (Smoothness); horizontal axis: target speech to background noise ratio, including the no-background-noise condition; series: without restoration, previous method, speaker-dependent speech-like noise, speaker-independent speech-like noise.)

Figure 4: MOS on noisiness of insertion signal. (Vertical axis: MOS (Noisiness); horizontal axis: target speech to background noise ratio, including the no-background-noise condition; series as in Figure 3.)

Figure 5: MOS on comprehensive evaluation. (Vertical axis: MOS (Comprehensive Evaluation); horizontal axis: target speech to background noise ratio, including the no-background-noise condition; series as in Figure 3.)

5. Conclusions

Perceptual restoration of intermittent speech, which assumes burst loss of packets, is improved by using a human speech-like noise. The perceptual characteristics of human speech-like noises vary depending on the number of overlapping speech signals. It is confirmed that an overlap of 0 speech segments is enough for restoring intermittent speech. The subjective evaluation suggests that a speaker-dependent speech-like noise is the most suitable under practical noisy environments. Future work includes evaluating the performance of the proposed method under various conditions of speech communication.

Acknowledgement

This work was partly supported by JSPS KAKENHI. The authors thank Professor Christian Giguère for fruitful suggestions.

REFERENCES

1. Goodman, D. J., Lockhart, G., Wasem, O. and Wong, W. C. Waveform substitution techniques for recovering missing speech segments in packet voice communications, IEEE Trans. Acoust., Speech and Signal Process., 34(6), 1440-1448, (1986).
2. ITU-T Recommendation G.711 Appendix I, A high quality low-complexity algorithm for packet loss concealment with G.711, (1999).
3. Chen, Y. L. and Chen, B. S. Model-based multi-rate representation of speech signals and its application to recovery of missing speech packets, IEEE Trans. on Speech and Audio Process., 5(3), 220-231, (1997).
4. Mizumachi, M., Ohga, K., Fujii, M. and Horiuchi, T. Restoration of intermittent speech with composite gap-filling schemes relying on human auditory capability, Proc. ICSV19, paper ID: 08, (2012).
5. Mizumachi, M., Motomura, S., Takakura, T. and Horiuchi, T. Restoration of intermittent speech based on human auditory capability and temporal characteristics of speech, Proc. ICSV20, paper ID: 7, (2013).
6. Mizumachi, M., Motomura, S., Takakura, T. and Horiuchi, T. Perceptual restoration of intermittent speech under noisy environments, Proc. ICSV, paper ID: 00, (0).
7. Warren, R. M. Perceptual restoration of missing speech sounds, Science, 167, 392-393, (1970).
8. Kashino, M. Phonemic restoration: The brain creates missing speech sounds, Acoustical Science and Technology, 27(6), 318-321, (2006).
9. Holube, I., Fredelake, S., Vlaming, M. and Kollmeier, B. Development and analysis of an international speech test signal (ISTS), Int. J. Audiol., 49(12), 891-903, (2010).
10. Kawai, K., Fujimoto, K., Iwase, T., Yasuoka, H., Sakuma, T. and Hidaka, Y. Development of a sound source database for environmental/architectural acoustics: Introduction of SMILE 2004 (sound material in living environment 2004), Proc. International Congress on Acoustics, pp. 6 6, (2004).
