Interspeech'2005 - Eurospeech: Design and Collection of Czech Lombard Speech Database


Available online: http://www.isca-speech.org/archive/interspeech_2005/i05_1577.html
Interspeech'2005 - Eurospeech, Lisbon, Portugal, September 4-8, 2005
Bibliographic reference: Bořil, Hynek / Pollák, Petr (2005): "Design and collection of Czech Lombard speech database", In INTERSPEECH-2005, 1577-1580. ISSN 1018-4074

Design and Collection of Czech Lombard Speech Database

Hynek Bořil & Petr Pollák
Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
borilh@fel.cvut.cz, pollak@fel.cvut.cz

Abstract
In this paper, the design, collection and parameters of the newly proposed Czech Lombard Speech Database (CLSD) are presented. The database focuses on the analysis and modeling of the Lombard effect in order to improve robust speech recognition. The CLSD consists of neutral speech and speech produced in various types of simulated noisy background. In comparison to available databases dealing with the Lombard effect, an extensive set of utterances containing phonetically rich words and sentences was chosen to cover the whole phoneme inventory of the language. For the purposes of Lombard speech recording, the usual noisy-headphones configuration was improved by the addition of an operator qualifying utterance intelligibility while hearing the same noise mixed with the speaker's voice, whose intensity was lowered according to the selected virtual distance. This scenario motivated speakers to react more to the noise background. The CLSD currently comprises 26 speakers.

1. Introduction
The efficiency of automatic speech recognizers decreases significantly in the presence of ambient noise. Performance is affected negatively both by corruption of the speech signal by noise and by the Lombard effect (LE). While a lot of attention has been paid to noise suppression in speech signals recorded in adverse conditions, LE classification and elimination promises further improvements in natural-environment speech recognition accuracy. LE refers to the modifications of speech characteristics a speaker makes in an effort to increase communication intelligibility in a noisy environment [1]. In the speech feature domain, LE introduces nonlinear distortion depending on the speaker and on the level and type of ambient noise. Changes of overall vocal intensity, fundamental frequency (f0) contours, variance and distributions, as well as variations of formant and antiformant locations, formant bandwidths, spectral tilt and frequency-band energy distribution have been observed under LE [2]. Such speech feature changes negatively influence the performance of a recognizer trained on neutral speech.

Basic approaches to Lombard speech recognition can be divided into three groups: robust features, equalization and model adjustment [1]. The first two methods assume a neutral-speech recognizer with a front-end performing speech normalization; the third one assumes training the recognizer on Lombard speech, which is problematic due to the usual lack of a sufficient amount of training data and the large range of speech feature changes depending on speaker and type of noise. The goal of LE analysis is to propose a degradation model representing the relations between Lombard speech and clean speech [1, 3]. If such a relation is found, features or feature equalizations more robust to LE can be designed.

Numerous multilingual speech databases recorded partly or fully in actual noisy environments have recently become available, e.g. SPEECON (public places and car scenarios) [4]. The strong noise background present in the recordings makes it difficult to evaluate the impact of LE on speech recognition separately. Moreover, in the Czech SPEECON, LE can be observed only very rarely, as the speakers did not react much to the ambient noise and just read the text [5]. In the case of special databases dedicated to LE, the noisy background is usually reproduced to the speaker through headphones, hence a high SNR of the recorded speech is preserved [3, 6, 1].
Several small-vocabulary speech databases fully or partly dedicated to LE are also publicly available, e.g. Speech under Simulated and Actual Stress (SUSAS) [1]. In this paper, the structure, recording platform and basic parameters of the CLSD are presented. The database consists of neutral and Lombard speech recorded in various simulated noisy backgrounds (car noises, artificial band-noises). A total of 26 speakers have been recorded so far. The utterances contain phonetically rich words and sentences covering the whole Czech phoneme inventory to allow for an overall analysis and modeling of LE. To evaluate the properties of the database, analyses of selected LE-sensitive speech features were carried out.

2. Database structure
So far, 26 speakers (12 female, 14 male) have participated in the noisy background recordings; 12 of them (11 female, 1 male) were also recorded in neutral conditions, while the neutral speech of the remaining speakers is covered by the Czech SPEECON database. Each recording scenario typically comprises 108 utterances per speaker, which represents 10-12 minutes of continuous speech. The number of words uttered by a speaker in one scenario varies slightly with the items forming the actual utterance list; on average, 780 words per speaker and scenario were uttered.

2.1. Corpus and vocabulary
The content of the database is similar to the SPEECON database. Some very specific application utterances, such as spelled items, internet addresses, spontaneous speech, etc., were omitted. The following items were chosen to be recorded:
- Phonetically rich material: sentences and words.
- Numerals: isolated and connected digits, natural numbers.
- Commands: various application words.
- Special items: dates, times, etc.
In order to cover the whole phoneme material sufficiently, 30 phonetically rich sentences (often complex) were included in each session.
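The paper does not describe the sentence-selection procedure itself; purely as a hypothetical illustration of the coverage requirement above, a check along the following lines (with an assumed word-to-phoneme lexicon, not CLSD data) can report which phonemes of an inventory a candidate sentence set still misses.

```python
from typing import Dict, Iterable, List, Set

def uncovered_phonemes(sentences: Iterable[str],
                       lexicon: Dict[str, List[str]],
                       inventory: Set[str]) -> Set[str]:
    """Return the phonemes of `inventory` that no word of `sentences`
    contains, according to the pronunciation `lexicon` (word -> phonemes)."""
    covered: Set[str] = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            covered.update(lexicon.get(word, []))
    return inventory - covered

# Toy, made-up entries only:
lexicon = {"praha": ["p", "r", "a", "h", "a"], "je": ["j", "e"]}
inventory = {"p", "r", "a", "h", "e", "j", "z"}
print(uncovered_phonemes(["Praha je"], lexicon, inventory))   # -> {'z'}
```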

To allow statistically significant small-vocabulary speech recognition experiments, 470 repeated and isolated digits were added to each session. In the case of SPEECON, only about 40 digits are available per session.

2.2. Label file specification
The label file contains mainly the orthographic and phonetic transcription, completed by information about the recording conditions, speaker information, etc. Our label file originates from the SPEECON one and is extended by items concerning the LE conditions, see Table 1.

Table 1: Label file CLSD-specific items
Item  Meaning                    Format  Description
NTY   Noise type                 %s      Filenames including the noise description code
NLV   Noise level                %f      The noise level set, by measured level from the soundcard output
DES   Speaker-operator distance  %f      Distance (m), i.e. level of speech signal attenuation in the operator recording monitor

2.3. Noise backgrounds
Background noises were selected for observations of speech production changes both in natural noisy environments and in artificial band-noises interfering with the typical locations of f0 and first-formant occurrence. 25 noises recorded in the car environment from the CAR2E database [7] and 4 band-pass noises (62-125, 75-300, 220-1120 and 840-2500 Hz) were chosen. Each car noise sample was about 14 s long, the stationary band-noises were 5 s long. The noise sample was looped in case the utterance was to exceed the sample length. All noises were RMS-normalized to provide a corresponding sound pressure level (SPL) during the reproduction.

3. Recording platform
The database was recorded digitally to hard disc. In the noisy-conditions scenario, the speaker heard his own voice mixed with noise in closed headphones. The level of the speech feedback was adjusted individually to make the speaker feel comfortable. An operator qualified the intelligibility of the utterances while listening to noise of the same level mixed with the utterance, whose intensity was lowered in proportion to the selected virtual speaker-listener distance.

3.1. Hardware configuration
The recording set, see Figure 1, consists of 2 closed headphones AKG K44 and 2 SPEECON microphones: close-talk Sennheiser ME-104 and hands-free Nokia NB2, placed at different distances from the speaker's mouth.

Figure 1: Recording setup (speaker with close-talk and middle-talk microphones and noise + speech feedback; H&T recorder; operator with noise + speech monitor and "OK - next / BAD - again" control).

3.2. Noise level adjustment
To enable noise level adjustment, a transfer function describing the relation between the sound card open-circuit effective voltage V_RMS_OL and the SPL in the headphones was determined by measurement on a dummy head, see Figure 2:

SPL = 20 · log10( V_RMS_OL / 4.386·10^-6 )  (dB).

For the chosen noise level, the corresponding V_RMS_OL was set up at the beginning of each session recording.

Figure 2: V_RMS_OL - noise SPL dependency (soundcard output voltage, 0-400 mV, vs. noise SPL, 60-105 dB).

An average of 90 dB SPL and 3 meters of virtual distance were chosen as defaults for the Lombard speech recording scenarios. In some cases the settings had to be modified according to the particular speaker's capabilities.

3.3. Recording studio
The H&T recorder developed for the CLSD collection was implemented as a .NET application, see Figure 3.

Figure 3: H&T recorder window

The H&T recorder supports two-channel recording and separate noise/speech monitoring for speaker and operator, respecting the virtual distance. An item from the noise list is assigned to each utterance during the recording. Each recorded utterance was weighted by a fading window derived from the Blackman window [8], with fade length

M = (N - 1) / 2,  (1)

w(n) = 0.42 - 0.5·cos(2πn/(N-1)) + 0.08·cos(4πn/(N-1)),  0 ≤ n ≤ M,
w(n) = 1,  M < n < N_u - M,  (2)
w(n) = 0.42 - 0.5·cos(2π(n-N_u+N-1)/(N-1)) + 0.08·cos(4π(n-N_u+N-1)/(N-1)),  N_u - M ≤ n ≤ N_u,

where M is the length (in samples) of the amplitude fade-in and fade-out, N the corresponding length of the original Blackman window and N_u the length of the whole utterance in samples. The weighting was performed to suppress clicking at the utterance boundaries. An example of harmonic signal amplitude weighting by the fading window is shown in Figure 4.

Figure 4: Modified Blackman weighting window
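As a minimal sketch (not the authors' H&T recorder code, which is a .NET application), the window of Eqs. (1)-(2) can be generated with NumPy as follows; the sampling rate and fade length in the usage lines are illustrative assumptions, not CLSD settings.

```python
import numpy as np

def fading_window(n_u: int, n_blackman: int) -> np.ndarray:
    """Fading window of Eqs. (1)-(2): Blackman-shaped fade-in and fade-out
    of M = (n_blackman - 1) // 2 samples around a flat unity middle part;
    n_u is the last sample index of the utterance."""
    N = n_blackman
    M = (N - 1) // 2                                     # Eq. (1)
    n = np.arange(n_u + 1)

    def blackman(m):
        return (0.42 - 0.5 * np.cos(2 * np.pi * m / (N - 1))
                     + 0.08 * np.cos(4 * np.pi * m / (N - 1)))

    w = np.ones(n_u + 1)
    w[:M + 1] = blackman(n[:M + 1])                      # fade-in, Eq. (2) top branch
    w[n_u - M:] = blackman(n[n_u - M:] - n_u + N - 1)    # fade-out, Eq. (2) bottom branch
    return w

# Illustrative use: weight a 0.5 s, 16 kHz signal with ~25 ms fades.
fs = 16000
utterance = np.random.randn(int(0.5 * fs))               # stand-in for real speech
w = fading_window(len(utterance) - 1, n_blackman=801)
weighted = utterance * w
```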

4. Database analyses
Variations of the fundamental frequency distribution, the positions and bandwidths of the first four formants, and the accuracy in a digit recognition task were evaluated to measure the amount and quality of LE captured in the database. The feature analyses were performed in the open-source tool WaveSurfer [9], which provides ESPS algorithms for pitch extraction and formant tracking [10]. For speech recognition, a recognizer built upon HTK [11] was used.

4.1. Fundamental frequency distribution
The fundamental frequency was analyzed in the voiced parts of all neutral and Lombard speech utterances. As shown in Figure 5, a significant shift of the f0 distribution can be observed for Lombard speech. The solid line represents the neutral speech and the dashed line the Lombard speech f0 distribution. The local maxima in both curves relate to the major f0 occurrences in male and female utterances, respectively.

Figure 5: Neutral and Lombard speech f0 distribution (number of frames vs. frequency, 70-570 Hz).

4.2. Formant tracking
A monophone recognizer trained on 70 SPEECON office sessions was used for the CLSD forced alignment. The monophone models involved 32 mixtures; an energy coefficient, 12 mel-cepstral coefficients, and delta and delta-delta coefficients were chosen as the feature vectors. Forced alignment was performed on all CLSD utterances containing digits. 12th-order LPC was chosen for the formant tracking performed by WaveSurfer. Information about the first four formant frequencies and bandwidths was assigned to the corresponding phonemes. As shown in Figure 6, the average positions of the first two formants vary significantly between neutral and Lombard speech for the selected Czech vowels /a/, /e/, /i/, /o/ and /u/.

Figure 6: Female and male vowel formants under LE (F2 vs. F1 positions of /a/, /e/, /i/, /o/, /u/ in neutral and Lombard speech).

4.3. Recognition performance under LE
Finally, the impact of LE on recognition performance was evaluated. The recognizer mentioned in the previous subsection was used in a digit recognition task. The training set consisted of utterances containing isolated, repeated and connected digits. The neutral test set was formed by 4930 and 1423 digits, and the Lombard set included 5360 and 6303 digits, uttered by female and male speakers respectively. The recognition results are shown in Table 2, where F denotes female and M male speakers. The word recognition ratio decreased by 12.5 % for male and by 35.5 % for female speakers.

Table 2: Recognition results, digit vocabulary
Data set    Neutral F   Neutral M   Lombard F   Lombard M
Rec. ratio  92.70 %     96.20 %     57.18 %     83.71 %
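The formant statistics reported in Tables 3 and 4 below were obtained with the ESPS tracker in WaveSurfer; the following is only a rough, self-contained sketch of the kind of 12th-order LPC formant estimation described in Section 4.2 (autocorrelation LPC plus pole root-finding), not the ESPS algorithm itself, and the frequency and bandwidth thresholds are heuristic assumptions.

```python
import numpy as np

def lpc_autocorr(frame: np.ndarray, order: int) -> np.ndarray:
    """LPC coefficients [1, a1, ..., a_order] via the autocorrelation
    method (Levinson-Durbin recursion)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def estimate_formants(frame: np.ndarray, fs: float,
                      order: int = 12, n_formants: int = 4):
    """Rough formant frequency/bandwidth estimates (Hz) from the angles
    and radii of the LPC poles of a single (vowel) frame."""
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    x = x * np.hamming(len(x))
    a = lpc_autocorr(x, order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                        # one pole per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    idx = np.argsort(freqs)
    cands = [(f, b) for f, b in zip(freqs[idx], bws[idx])
             if f > 90.0 and b < 400.0]                      # heuristic thresholds
    return cands[:n_formants]
```

On voiced vowel frames this returns estimates comparable in spirit to the formant positions and bandwidths of Tables 3 and 4, although the exact values depend on the frame length, pre-emphasis and thresholds chosen here.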

Table 3: Neutral speech average vowel formant positions and bandwidths
Vowel  Time (s)  F1 (Hz)  σF1 (Hz)  F2 (Hz)  σF2 (Hz)  B1 (Hz)  σB1 (Hz)  B2 (Hz)  σB2 (Hz)
/a/      251      651      153      1393      218       239       87       254       78
/e/      295      518      101      1874      267       169       78       198       74
/i/      254      394       55      2063      246       130       52       216       69
/o/       49      584      105      1254      329       244       91       328      106
/u/       61      396       76      1183      275       183       94       307      107

Table 4: Lombard speech average vowel formant positions and bandwidths
Vowel  Time (s)  F1 (Hz)  σF1 (Hz)  F2 (Hz)  σF2 (Hz)  B1 (Hz)  σB1 (Hz)  B2 (Hz)  σB2 (Hz)
/a/      615      755       97      1411      196       159       63       166       60
/e/      939      611       91      1894      234       114       49       175       61
/i/      503      457       69      1997      193       118       56       174       60
/o/      147      653       79      1331      215       157       72       210       80
/u/      102      452       82      1266      208       144       76       209      104

Such a significant degradation in female speech recognition may be attributed to the fact that under LE, f0 often shifts into the location of a typical neutral-speech first formant, while the formant frequencies rise to locations where they never appeared during the training of the neutral recognizer. In Tables 3 and 4, the average positions of the first two formants, their bandwidths and the corresponding standard deviations are shown as detected for the selected Czech vowels in the CLSD. For size reasons, male and female data are presented together in this case.

5. Conclusions
The structure, recording platform and basic parameters of the newly proposed Czech Lombard Speech Database are presented in this paper. The database currently consists of neutral speech and Lombard speech produced in simulated noisy conditions by 26 speakers. Covering the complete phoneme inventory of the Czech language, the database focuses on LE analysis and modeling. To evaluate the amount and quality of LE captured in the database, variations of selected LE-sensitive speech features were analyzed. Both the f0 distribution and the formants display significant changes in Lombard speech, as already known from small-vocabulary databases. The recognition ratio for Lombard speech decreased by 12.5 % for male and by 35.5 % for female speakers in the digit recognition task, which also shows that the CLSD contains challenging data for research in Lombard speech recognition. A sample of the CLSD is available at [12]; the complete database is available upon prior arrangement.

6. Acknowledgements
The presented work was supported by GAČR 102/05/0278 "New Trends in Research and Application of Voice Technology", GAČR 102/03/H085 "Biological and Speech Signals Modeling", and research activity MSM 6840770014 "Research in the Area of the Prospective Information and Navigation Technologies".

7. References
[1] Hansen, J. H. L., Analysis and Compensation of Speech under Stress and Noise for Environmental Robustness in Speech Recognition, Speech Communication, Special Issue on Speech under Stress, 20(2):151-170, November 1996.
[2] Womack, B. D., Hansen, J. H. L., Classification of Speech under Stress Using Target Driven Features, Speech Communication, Special Issue on Speech under Stress, 20(1-2):131-150, November 1996.
[3] Chi, S. M., Oh, Y. H., Lombard Effect Compensation and Noise Suppression for Noisy Lombard Speech Recognition, Proc. ICSLP '96, 4:2013-2016, Philadelphia, 1996.
[4] www.speecon.com
[5] Bořil, H., Recognition of Speech under Lombard Effect, Proc. of the 14th Czech-German Workshop on Speech Processing, pp. 110-113, Prague, Czech Republic, 2004.
[6] Wakao, A., Takeda, K., Itakura, F., Variability of Lombard Effects under Different Noise Conditions, Proc. ICSLP '96, 4:2009-2012, Philadelphia, 1996.
[7] Pollák, P., Vopička, J., Sovka, P., Czech Language Database of Car Speech and Environmental Noise, Proc. EUROSPEECH-99, 5:2263-2266, Budapest, Hungary, 1999.
[8] Harris, F. J., On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform, Proc. IEEE, 66:51-83, 1978.
[9] Sjölander, K., Beskow, J., WaveSurfer - an Open Source Speech Tool, Proc. of ICSLP 2000, Beijing, China, 2000.
[10] ESPS (Entropic Signal Processing System 5.3.1), Entropic Research Laboratory, http://www.entropic.com
[11] Young, S. et al., The HTK Book (ver. 2.2), Entropic Ltd., 1999.
[12] http://noel.feld.cvut.cz/speechlab, download section.