19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007



THE INFLUENCE OF LINGUISTIC AND EXTRA-LINGUISTIC INFORMATION ON SYNTHETIC SPEECH INTELLIGIBILITY

PACS: Bp

Gardzielewska, Hanna
Institute of Acoustics, Adam Mickiewicz University, Umultowska 85, Poznan, Poland

ABSTRACT

The key objective of the present study was to determine the relationship between the data reduction of Polish speech (the number of tones reproducing the speech signal) and its intelligibility. A more specific aim was to determine how synthetic speech intelligibility depends on the linguistic information the signal carries and on so-called extra-linguistic information. A new sine-wave synthesis method was proposed for this analysis, which yielded high intelligibility for Polish synthetic speech. Speech intelligibility was tested on synthetic speech material varying in grammatical structure, semantic information content, and the acoustic characteristics of the talker.

INTRODUCTION

Linguistic information is very resistant to distortion and to reduction of spectral information, as has been confirmed by numerous research results obtained, among others, by Remez et al. [1], [2]. In their studies they processed speech signals with sine-wave synthesis (SWS). In this synthesis, the changing pattern of vocal resonances (formants) is modeled by a limited number of tones reflecting the spectral dynamics and the structure of the signal. The synthesis rejects all the detailed acoustic information carried by a signal, including the fundamental frequency as well as harmonic and noise components. The reproduced sounds lose their naturalness but still remain intelligible. Most of the studies on the intelligibility of SWS-compressed sounds [1], [3], [4] refer only to English, which is a vowel-dominated language. In contrast to English, the Polish language is consonant-dominant.
Polish speech synthesized with the original sine-wave synthesis method turned out to be unintelligible [5]. An alternative technique of sinusoidal speech synthesis was therefore proposed. It is based on a number of dominant frequency components present in the original signal. In the proposed method, only the frequency components with the highest amplitude are reproduced, instead of the exact formant frequencies. The amplitudes and frequencies of the dominant frequency components were derived with 20 ms resolution. Because of the large amount of energy in the high frequency range in Polish speech [6], the dominant frequency components were tracked in a band from 200 Hz up to 8 kHz, at a sampling frequency of 16 kHz. Perceptual tests performed on Polish speech show that signals synthesized with the proposed technique were judged as more natural and more intelligible than SWS speech. With three tones used for Polish speech synthesis, the modified SWS method, developed at Adam Mickiewicz University in Poznan, gave results 2.4 times better than the original method.

The new method was used for the analysis of Polish speech intelligibility. The influence of extra-linguistic information on speech intelligibility was tested on sentences differing in the acoustic characteristics of the talker (a single talker versus different talkers). The intelligibility results for sentences were compared with those for utterances devoid of grammatical structure and logical coherence (three unrelated words) and for utterances devoid of semantic information (three words without meaning).
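The analysis and resynthesis described above can be illustrated with a minimal sketch. It is not the author's implementation: the Hann window, the local-peak picking rule, the phase-accumulating oscillators, and all function names (`frame_spectrum`, `dominant_components`, `analyse`, `synthesise`) are assumptions made for illustration. Only the 20 ms frame length, the 200 Hz to 8 kHz band, the 16 kHz sampling frequency, and the use of three tones come from the text.

```python
import cmath
import math


def frame_spectrum(frame, fs):
    """Return (frequency_hz, amplitude) for DFT bins 0..N/2 of one frame.

    Assumption: a periodic Hann window and a plain DFT are used here;
    the paper does not state which spectral estimator was applied.
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    gain = sum(window)
    bins = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i in range(n):
            acc += window[i] * frame[i] * cmath.exp(-2j * math.pi * k * i / n)
        # Scale so that a bin-centred sinusoid reports its own amplitude.
        bins.append((k * fs / n, 2.0 * abs(acc) / gain))
    return bins


def dominant_components(frame, fs, n_tones=3, fmin=200.0, fmax=8000.0):
    """Pick the n_tones strongest spectral peaks inside [fmin, fmax]."""
    bins = frame_spectrum(frame, fs)
    peaks = []
    # Only interior bins are tested, so the Nyquist bin is never selected.
    for k in range(1, len(bins) - 1):
        freq, amp = bins[k]
        is_peak = amp > bins[k - 1][1] and amp >= bins[k + 1][1]
        if fmin <= freq <= fmax and is_peak:
            peaks.append((freq, amp))
    peaks.sort(key=lambda p: p[1], reverse=True)
    return sorted(peaks[:n_tones])  # strongest peaks, ordered by frequency


def analyse(signal, fs=16000, frame_ms=20, n_tones=3):
    """Track the dominant components frame by frame (no overlap)."""
    hop = int(fs * frame_ms / 1000)
    return [dominant_components(signal[s:s + hop], fs, n_tones)
            for s in range(0, len(signal) - hop + 1, hop)]


def synthesise(tracks, fs=16000, frame_ms=20):
    """Rebuild the signal as a sum of tones, one oscillator per tone slot.

    Assumption: frequency and amplitude are held constant within a frame
    and the phase is accumulated across frames to avoid jumps.
    """
    hop = int(fs * frame_ms / 1000)
    phases = {}
    out = []
    for tones in tracks:
        for _ in range(hop):
            sample = 0.0
            for slot, (freq, amp) in enumerate(tones):
                phase = phases.get(slot, 0.0)
                sample += amp * math.sin(phase)
                phases[slot] = phase + 2.0 * math.pi * freq / fs
            out.append(sample)
    return out
```

With 20 ms frames at 16 kHz each frame holds 320 samples, so the frequency resolution of this sketch is 50 Hz. A practical implementation would likely use an FFT routine and smoother interpolation between frames.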

EXPERIMENT

In order to determine whether, in such unusual phonetic conditions (where the subjects were asked to perceive very distorted linguistic information), changes in the acoustic characteristics of a signal (different talkers) still affect speech intelligibility, sentences uttered by different talkers and by a single talker were analyzed. The influence of grammar and logical structure on speech intelligibility was tested on utterances consisting of three unrelated words, uttered by a single talker. The intelligibility of utterances without semantic information was tested on three unrelated logatoms (words without meaning) uttered by a single talker.

Subjects

Forty-four participants (10 women and 34 men), aged 20 to 24, took part in the experiment. The participants were native speakers of Polish who reported no past or present hearing disorders and qualified as having normal hearing (defined as an audiometric threshold of 20 dB HL or better for a frequency range from 250 Hz to 8000 Hz, ANSI, 1996). The participants had previous experience in synthetic speech intelligibility assessment and were paid for their participation in the experiment.

Speech material and equipment

For sentence intelligibility testing, the CORPORA multi-talker database, designed for automated recognition of Polish speech [7], was used. 27 sentences were picked at random from the 114 sentences contained in the database. The average number of words in a sentence was 5, which gave approximately 140 words for the list. The sentences of the list were pronounced either by different talkers or by a single male or female talker, depending on the experimental condition. The duration of each sentence was on average 2 seconds. For testing utterances without grammatical and logical coherence, words were selected from a frequency and phonetics balanced recorded wordlist for the Polish language, developed by W. Jassem [8].
The list of signals was composed of 27 utterances, each comprising three words. The duration of each utterance corresponded to the duration of the sentences, which was about 2 seconds. All the utterances were produced by one male talker. The signals presented in the last list were logatoms, nonsense words selected from a structurally and phonetically balanced list of logatoms for the Polish language [9]. In total 81 logatoms were randomly selected, from which 27 expressions were composed, each consisting of three logatoms. The duration of each expression was on average 2 seconds. All expressions were synthesized with three tones.

Procedure

Five experimental conditions were tested. In the first listening session, participants listened to synthetic sentences. Participants were randomly divided into two groups of 20 and 24 persons. Sentences were generated either by different talkers or by a single talker (2 conditions: DT, ST). The sentences produced by different talkers were presented to the first group, and the second group listened only to the sentences generated by a single talker. The second group was further divided into two subgroups: twelve participants listened to utterances presented by a female talker and twelve listened to utterances produced by a male talker (2 conditions: STF, STM). This was done to take into account the fact that as the fundamental frequency of the voice increases, the formant frequencies become more difficult to determine, which usually leads to the conclusion that a female voice is less intelligible than a male voice [6], [10]. Comparison of the results obtained from both groups made it possible to evaluate the degree to which a change in the acoustic characteristics of a phonetically distorted signal (extra-linguistic information) affects the perception of linguistic information.
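The random selection of material and the division of participants described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_design`, the dictionary keys, and the use of a seeded random generator are hypothetical, and the paper does not say how the randomization was actually carried out. The group sizes (20, 12, 12) and list sizes (27 sentences from 114, 81 logatoms in 27 triplets) are taken from the text.

```python
import random


def build_design(participants, sentences, logatoms, seed=0):
    """Randomly assign listeners to groups and draw the stimulus lists.

    Hypothetical helper: expects 44 participants, the 114 database
    sentences, and a logatom list with at least 81 entries.
    """
    rng = random.Random(seed)

    # Split listeners into the DT group (20) and the ST group (24),
    # then split the ST group into female-talker and male-talker halves.
    shuffled = list(participants)
    rng.shuffle(shuffled)
    dt_group = shuffled[:20]
    st_group = shuffled[20:44]
    stf_group, stm_group = st_group[:12], st_group[12:]

    # Draw 27 sentences without replacement.
    sentence_list = rng.sample(list(sentences), 27)

    # Draw 81 logatoms and compose 27 three-logatom expressions.
    chosen = rng.sample(list(logatoms), 81)
    triplets = [chosen[i:i + 3] for i in range(0, 81, 3)]

    return {
        "DT": dt_group,
        "STF": stf_group,
        "STM": stm_group,
        "sentences": sentence_list,
        "logatom_triplets": triplets,
    }
```

Fixing the seed makes the assignment reproducible, which is a design choice of this sketch rather than something reported in the study.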
In the second listening session, participants assessed the intelligibility of 27 utterances built of three unrelated words (1 condition: N3). In the last listening session, the participants assessed the recognition of utterances built of three logatoms (1 condition: L3). Participants typed the content of each utterance, the way they heard it, in a special dialogue box. The utterance typed by each participant was then compared to the original utterance. The five experimental conditions were named as follows: DT, STF, STM, N3, L3. On the basis of the collected results, word intelligibility, expressed as a percentage, was assessed.

RESULTS AND GENERAL DISCUSSION

The speech intelligibility results, expressed as the average percentage of words correct for each list, are presented in Figure 1. No statistically significant interaction between participants' gender and responses was obtained (STF and STM). The results obtained from the participants of different gender for sentences were averaged (ST).

Figure 1. The averaged percent words correct in 2 s utterances for the various speech material conditions (sentences generated by a single talker, ST; three unrelated words generated by a single talker, N3; sentences generated by different talkers, DT; three unrelated logatoms generated by a single talker, L3). Vertical axis: percent words correct; horizontal axis: speech material (ST, N3, DT, L3). Error bars indicate one standard deviation.

Comparing the intelligibility results for the two-second utterances obtained in the experiment, it may be concluded that a lower diversity of acoustic attributes and a higher coherence of information content improve the intelligibility of a speech signal. Preservation of grammar and of the logical continuity of an utterance significantly facilitates recognition of individual words. However, three-word phrases of random words, devoid of grammatical and contextual cohesion, turned out to be easier to memorize and recognize than logical sentences uttered by different talkers. According to the results, the acoustic attributes of a talker (extra-linguistic information) cannot be neglected in speech perception, even in the case of synthetic speech. The results once more confirm that paying attention to spoken words involves paying attention to the voice, which is reflected in the speech intelligibility scores [11].
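The percent-words-correct measure reported above can be sketched as follows. The paper does not specify the matching rule, so this example assumes a case-insensitive comparison that ignores word order and counts each reference word at most once; the function names `words_correct` and `percent_words_correct` are hypothetical.

```python
from collections import Counter


def words_correct(reference, response):
    """Count reference words that also appear in the typed response.

    Assumption: case-insensitive, order-independent matching, with each
    reference word credited at most as often as it occurs.
    """
    ref = Counter(reference.lower().split())
    resp = Counter(response.lower().split())
    return sum((ref & resp).values())


def percent_words_correct(pairs):
    """Score a list of (reference, response) pairs as percent words correct."""
    total = sum(len(ref.split()) for ref, _ in pairs)
    correct = sum(words_correct(ref, resp) for ref, resp in pairs)
    return 100.0 * correct / total if total else 0.0
```

A stricter rule, such as position-sensitive matching, would give lower scores for the same responses, so the choice of rule should be reported alongside the results.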
Despite being prepared to receive a semantic message in such unnatural acoustic conditions, listeners showed evidence of integral processing of changes in their acoustic environment, namely talker-specific attributes, along with the processing (recognition) of the linguistic attributes of a signal [12], [13], [14]. Diversity of extra-linguistic information has a great impact on the correct identification of a synthetic speech signal. The results show that a reduction of the linguistic information carried by the signal (sentences versus words without logical or grammatical coherence) has a lower impact on speech intelligibility than variation in the acoustic characteristics of the speaker. Words in utterances devoid of logical and grammatical coherence (N3), uttered by a single talker, were identified 6% better than words in logical sentences uttered by different talkers (DT).

The intelligibility of words in sentences uttered by different talkers (DT) was 12% lower than the intelligibility of the same words uttered by a single talker (ST). The results indicate differences in the perceptual processing of words, resulting not only from the physical realization of utterances [15], [16], but also from the grammatical and semantic information content of the utterance. Preservation of grammar and of the logical continuity of an utterance significantly facilitates recognition of individual words. In the case of logatoms, the lack of any particular meaning made it almost impossible for subjects to reproduce them correctly. The results demonstrate how synthetic speech intelligibility depends on correct perceptual matching of the phonetic characteristics of heard sounds with the phonetic characteristics stored in the listener's long-term memory [17], [18], [19], [20]. When there are no corresponding phonetic characteristics in the listener's long-term memory, synthetic speech perception turned out to be almost impossible on the basis of such limited acoustic information. Perceptual matching is, in principle, facilitated by the invariability of a talker's acoustic characteristics. The way the speech sounds are generated is of secondary importance.

CONCLUSIONS

Talker acoustic characteristics cannot be neglected in analyzing synthetic speech intelligibility. Extra-linguistic information has an even higher impact on synthetic speech intelligibility than the reduction of the linguistic content of a signal.

ACKNOWLEDGMENTS

This research was supported by KBN Grant No. 4T07B

References:

[1] R. E. Remez, P. E. Rubin, D. B. Pisoni, T. D. Carrell: Speech perception without traditional speech cues. Science 212 (1981)
[2] R. E. Remez, P. E. Rubin, S. E. Berns, J. S. Pardo, J. M. Lang: On the perceptual organization of speech. Psychological Review 101 (1994)
[3] R. J. McAulay, T. F. Quatieri: Speech analysis-synthesis based on a sinusoidal representation. IEEE Trans. ASSP 34 (1986)
[4] M. Dorman, P. Loizou, D. Rainey: Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs. Journal of the Acoustical Society of America 102 (1997)
[5] H. Wojciechowska: Speech data reduction versus speech intelligibility. Polish-German Structured Conference on Acoustics (2004)
[6] W. Jassem: Podstawy fonetyki akustycznej. PWN, Warszawa
[7] S. Grocholewski: Corpora-speech database for Polish diphones. Eurospeech'97 (1997)
[8] W. Jassem: Frequency and phonetics balanced Polish wordlists. Speech and language technology, W. Jassem, C. Basztura (Eds.), Vol. I, Polish Phonetic Association, Poznan (1977)
[9] S. Brachmański, P. Staroniewicz: Phonetic structure of test material used for subjective speech quality measurements. Speech and language technology, W. Jassem, C. Basztura (Eds.), Vol. III, WPN Format, Poznan (1999)
[10] D. H. Klatt, L. C. Klatt: Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America 87, No. 2 (1990)
[11] L. C. Nygaard, D. B. Pisoni: Speech perception: New directions in research and theory. Speech, language, and communication, J. L. Miller, P. D. Eimas (Eds.), Academic, San Diego, CA (1995)
[12] J. W. Mullennix, D. B. Pisoni: Stimulus variability and processing dependencies in speech perception. Perception and Psychophysics 47 (1990)
[13] K. P. Green, G. R. Tomiak, P. K. Kuhl: The encoding of rate and talker information during phonetic perception. Perception and Psychophysics 59 (1997)
[14] R. E. Remez, J. M. Fellowes, P. E. Rubin: Voice identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance 23 (1997)
[15] D. Reddy: Speech recognition by machine: A review. Proceedings of IEEE 64, No. 4 (1976)
[16] D. B. Pisoni, P. A. Luce: Acoustic-phonetic representations in word recognition. Cognition 25 (1987)
[17] C. S. Martin, J. W. Mullennix, D. B. Pisoni, W. V. Summers: Effects of talker variability on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory, and Cognition 17 (1989)
[18] P. W. Jusczyk, D. B. Pisoni, J. Mullennix: Some consequences of stimulus variability on speech processing by 2-month-old infants. Cognition 43 (1992)
[19] R. V. Shannon, F. G. Zeng, J. Wygonski, V. Kamath, M. Ekelid: Speech recognition with primarily temporal cues. Science 270 (1995)
[20] J. M. McQueen, A. Cutler, D. Norris: Flow of information in the spoken word recognition system. Speech Communication 41, No. 1 (2003)
