OBJECTIVE SPEECH INTELLIGIBILITY MEASURES BASED ON SPEECH TRANSMISSION INDEX FOR FORENSIC APPLICATIONS

GIOVANNI COSTANTINI 1,2, ANDREA PAOLONI 3, AND MASSIMILIANO TODISCO 1

1 Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy
costantini@uniroma2.it, massimiliano.todisco@uniroma2.it
2 Institute of Acoustics "O. M. Corbino", Rome, Italy
3 Fondazione Ugo Bordoni, Rome, Italy
pao@fub.it

In order to decide whether the transposition into written text of speech originating from a lawful interception reflects the views of those who spoke the words or only the views of the transcriber, an objective method for measuring speech intelligibility is required. A lawful interception usually presents two types of distortion: distortions that affect the speech signal itself (speech distortion) and distortions that affect the background noise (noise distortion). Unfortunately, the forensic expert does not have the original clean signal and must therefore make the assessment on the basis of the only signal available. This paper addresses the issue with three different objective approaches, namely the Signal-to-Noise ratio weighted with the "A" curve (S/NA), the Articulation Index (AI) and Speech Transmission Index (STI) based measures, evaluated on a corpus whose speech intelligibility has been measured. The approaches are tested with three different types of noise, and the results are compared with the speech intelligibility scores obtained from subjective tests. Measures based on the STI have proven to be reliable for predicting intelligibility in forensic applications.

INTRODUCTION

Environmental interceptions have become the most frequent sources of evidence in criminal trials, but in order to use the intercepted speech as evidence it is mandatory to transpose it into written text. On numerous occasions the speech is almost unintelligible; nevertheless, some experts have felt that they could draw a correct interpretation from such a signal. Unfortunately, when the signal is barely understandable it often happens that the transcript does not correspond to anything that was actually said. Too often, in the forensic field, we see transcripts that reflect the views of the transcribers rather than the views of those who spoke the words. The difficulty of making a useful transposition of speech into written text is mainly due to words spoken in a low voice and/or covered by environmental noise. Probably the biggest threat to speech comprehension is competing noise, voices or other sounds reaching the listener. In addition, as is well known to linguists, it is almost impossible to transform speech into written text without losing information [1].

The present work studies the problem of the intelligibility of the speech signal. In general, two fundamentally different assessment methods may be applied: subjective assessment, based on the use of listeners, and objective assessment, based on physical parameters of the signal. Subjective tests are often too costly and laborious to deploy and are not well accepted in the courtroom. For these reasons we believe it would be useful to have a method to verify objectively whether a given signal can be transcribed with reasonable assurance of reliability.
The paper is organized as follows: in the next section the assessment of intelligibility is described; in the second section the speech corpus and the subjective intelligibility evaluations are presented; the third section is devoted to the description of the objective measures; in the fourth section we present the results of two experiments involving speech from a real trial. Some comments conclude the paper.

1 ASSESSMENT OF INTELLIGIBILITY

ISO 9921 on the "Assessment of verbal communication" defines the intelligibility of speech as "a measure of effectiveness in understanding the language". To assess the reliability and effectiveness of a system for transcribing speech signals, we must define an objective intelligibility index closely correlated with the subjective performance of a group of listeners. Traditionally, intelligibility is measured subjectively by a group of listeners as the percentage of correct identifications of a known signal. The known signal may consist of phrases, words or simple sounds without meaning (logatoms).

Many objective speech intelligibility measures have been proposed in the past to predict the subjective intelligibility of speech [2], [3]. Most of the literature in this field comes from the telecommunications world, where the problem is to study the impact of the transmission channel and of encoders on the intelligibility of speech signals [4], [5]. To do this, the proposed algorithms often use a double-sided approach based on a comparison between the clean speech signal and the transmitted signal. This approach is not usable in a legal context, because the expert witness has only the noisy version of the signal. It is therefore necessary to assess intelligibility single-sided, i.e. on the basis of the noisy signal alone.

2 SPEECH CORPUS AND SUBJECTIVE INTELLIGIBILITY EVALUATIONS

Both the subjective and the objective tests are conducted on material from the European project SAM EUROM 1 [6]. In particular, 50 rhyming Italian words, with or without meaning, were used, each preceded by the word PRENDI (get) and followed by the word INTANTO (yet). The degradations considered consist of additive noise: the corpus was made noisy by adding Pink, Hammer and Babble noise. Each type of noise was applied at five different signal-to-noise ratios (S/N = 2, 0, -2, -4, -6 dB), and the material was read by 4 different voices, two male and two female. This yields 60 different corpora, each consisting of 50 different words. Table I shows the complete speech corpus.

Tab. I: Speech corpus used for subjective listening tests.

2.1 Subjective listening tests

A first experiment was conducted in order to obtain the subjective intelligibility scores. The speech corpora were presented to a group of 12 listeners, 4 for every degradation condition, using software developed for this purpose in the Max/MSP [7] environment, which delivered the items in random order, each as many times as the listener wished. One test set consists of 50 different test signals. The listener fills in the word heard in the appropriate field. Fig. 1 shows the application interface.

Figure 1: Interface used for the subjective listening tests.

The results of the subjective tests are shown in Fig. 2. We note that, for the same S/N, babble noise leads to significantly higher intelligibility values than the other two types of disturbance.

3 OBJECTIVE MEASURES

Objective measurements do not measure intelligibility directly; they determine physical parameters that predict intelligibility according to a certain model. Three frequently used objective measures were evaluated: the Signal-to-Noise ratio weighted with the "A" curve (S/NA) [8], the Articulation Index (AI) [9], [10] and Speech Transmission Index (STI) based measures [11].

3.1 S/NA

The Signal-to-Noise ratio weighted with the "A" curve is the simplest of the methods considered. It can be formally expressed as

S/N_A = S_A - N_A    (1)

where S_A is the A-weighted long-term average speech level and N_A is the A-weighted long-term average level of the background noise, both measured over a given time interval.
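To make eq. (1) concrete, the following Python sketch computes the A-weighted long-term levels of the speech and of the noise and takes their difference. It is only an illustration, not the implementation used in this work: it assumes that the speech and the noise waveforms are available as separate arrays (the measure is double-sided) and it uses a textbook A-weighting filter design obtained by bilinear transform of the analogue prototype.

import numpy as np
from scipy.signal import bilinear, lfilter

def a_weighting(fs):
    # Digital A-weighting filter for sampling rate fs (bilinear transform of the analogue prototype).
    f1, f2, f3, f4 = 20.598997, 107.65265, 737.86223, 12194.217
    a1000 = 1.9997  # gain correction so that the response is 0 dB at 1 kHz
    num = [(2 * np.pi * f4) ** 2 * 10 ** (a1000 / 20), 0, 0, 0, 0]
    den = np.polymul([1, 4 * np.pi * f4, (2 * np.pi * f4) ** 2],
                     [1, 4 * np.pi * f1, (2 * np.pi * f1) ** 2])
    den = np.polymul(np.polymul(den, [1, 2 * np.pi * f3]), [1, 2 * np.pi * f2])
    return bilinear(num, den, fs)

def snr_a(speech, noise, fs):
    # Eq. (1): S/N_A = S_A - N_A, with A-weighted long-term average levels in dB.
    b, a = a_weighting(fs)
    s_a = 10 * np.log10(np.mean(lfilter(b, a, speech) ** 2))
    n_a = 10 * np.log10(np.mean(lfilter(b, a, noise) ** 2))
    return s_a - n_a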
3.2 AI

The Articulation Index estimates the intelligibility of speech from the spectral properties of the speech and of the masking noise. It has been shown to accurately predict performance in a variety of phonetically balanced intelligibility tests across a wide range of listening environments [10]. The AI was calculated using the 20-band method described in [9].

3.3 STI-based measures

In Speech Transmission Index theory, the intelligibility of speech is related to the preservation of the spectral differences between successive speech elements, the phonemes.

Figure 2: Intelligibility versus Signal-to-Noise ratio.

Figure 3: Block diagram of the STI-based measure.

This can be described by the envelope function, which is determined by the specific sequence of phones in a specific utterance.

The STI-based measure is computed as follows. The noisy signal is first bandpass filtered into seven octave bands from 125 Hz to 8000 Hz. The envelope of each band is computed from the power of the signal. In particular, given a discrete time-domain signal x(n) filtered in the k-th octave band, we define the envelope function as

e_k(m) = \sum_{n=0}^{N_e-1} h(n) x_k^2(mh + n)    (2)

where N_e is the window size, h is the hop size, m \in {0, 1, 2, ..., M} is the hop number, h(n) is a finite-length sliding Hanning window and n is the summation variable.

After that, we compute the normalized envelope spectrum as

m_k(f_i) = |\sum_{p=0}^{N_s-1} w(p) e_k(p) e^{-j 2 \pi f_i p h / F_s}| / \sum_{p=0}^{N_s-1} w(p) e_k(p)    (3)

where N_s is the window size, F_s is the sampling rate, f_i are the 14 frequencies in the range 0.63 Hz to 12.5 Hz at 1/3-octave steps, w(p) is a finite-length rectangular window and p is the summation variable.

The SNR in each band is computed as

SNR_{k,i} = 10 \log_{10} [ m_k(f_i) / (1 - m_k(f_i)) ]    (4)

and subsequently limited to the range [-15, 15] dB. The Transmission Index (TI) in each band is computed by linearly mapping the SNR values between 0 and 1 using the following equation:

TI_{k,i} = (SNR_{k,i} + 15) / 30    (5)

For each octave band, the average TI over the specified modulation frequency range gives the Modulation Transfer Index (MTI):

MTI_k = (1/14) \sum_{i=1}^{14} TI_{k,i}    (6)

Finally, the STI-based measure is obtained as a weighted mean of the MTI over the seven octave bands:

STI = \sum_{k=1}^{7} W_k MTI_k    (7)

The sum of the weighting factors W_k is 1 [9]. Fig. 3 shows the block diagram of the STI-based measure.

4 EXPERIMENTS AND RESULTS

The performance of the objective measures is presented in terms of the Pearson product-moment correlation coefficient r between the subjective intelligibility ratings and the objective measure, given by

r = \sum_{i=1}^{n} (S_i - \bar{S})(O_i - \bar{O}) / (n \sigma_S \sigma_O)    (8)

where S and O are the subjective and objective scores, with means \bar{S} and \bar{O} and standard deviations \sigma_S and \sigma_O respectively, while n is the number of different signal-to-noise ratios considered. The coefficient ranges from -1 to 1, with 1 indicating the highest correlation with the subjective scores and vice versa.

A first experiment assessed intelligibility with the STI-based measure, the S/NA and the AI on the speech corpus described in Section 2. The latter two measures are double-sided, and therefore not suitable for assessing signals in forensics, but they are useful as a comparison for the proposed single-sided STI-based speech intelligibility measure. The experiment highlights the correlation between objective and subjective data under conditions typical of forensic applications. We note that all correlations are above 96%. The results of these experiments are summarized in Figs. 4-6.

A second experiment applied the STI-based measure to two real audio interceptions used in forensics. We calculate a time-varying STI-based measure on a frame-by-frame basis: the short-time STI-based measure gives a running measure of the speech intelligibility. In particular, we compute the STI-based measure using a sliding window of 500 milliseconds with 50% overlap. Finally, we relate the STI-based measure to intelligibility by means of a linear fit on the Pink noise curve shown in Fig. 6.
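As an illustration of the chain of eqs. (2)-(7), the Python sketch below computes a single-sided STI-based measure from the noisy signal alone. It is a minimal sketch, not the implementation used in this work: the octave-band filter design, the envelope window and hop lengths, the handling of bands that do not fit below the Nyquist frequency and the numerical values of the weights W_k are all assumptions made for illustration.

import numpy as np
from scipy.signal import butter, sosfiltfilt, get_window

OCTAVE_CENTERS = [125, 250, 500, 1000, 2000, 4000, 8000]        # Hz, the seven octave bands
MOD_FREQS = 0.63 * 2.0 ** (np.arange(14) / 3.0)                  # 14 modulation frequencies, 0.63-12.5 Hz
WEIGHTS = np.array([0.13, 0.14, 0.11, 0.12, 0.19, 0.17, 0.14])   # illustrative W_k summing to 1 (assumption)

def octave_band(x, fc, fs):
    # Fourth-order Butterworth octave band-pass around centre frequency fc.
    lo, hi = fc / np.sqrt(2.0), min(fc * np.sqrt(2.0), 0.45 * fs)
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def band_envelope(xk, fs, win_ms=20.0, hop_ms=10.0):
    # Eq. (2): short-time power envelope with a sliding Hanning window (window/hop sizes are assumptions).
    n_e, hop = int(round(win_ms * 1e-3 * fs)), int(round(hop_ms * 1e-3 * fs))
    w = get_window("hann", n_e)
    frames = range(1 + (len(xk) - n_e) // hop)
    env = np.array([np.sum(w * xk[m * hop:m * hop + n_e] ** 2) for m in frames])
    return env, fs / hop                                          # envelope and its sampling rate

def modulation_index(env, env_fs, f_mod):
    # Eq. (3): envelope spectrum at f_mod, normalised by the envelope sum (rectangular window over env).
    p = np.arange(len(env))
    spec = np.abs(np.sum(env * np.exp(-2j * np.pi * f_mod * p / env_fs)))
    return spec / (np.sum(env) + 1e-12)

def sti_measure(x, fs):
    # STI-based measure, eqs. (2)-(7), computed from the noisy signal x sampled at fs.
    mtis, weights = [], []
    for fc, w_k in zip(OCTAVE_CENTERS, WEIGHTS):
        if fc / np.sqrt(2.0) >= 0.45 * fs:                        # skip bands that exceed the Nyquist range
            continue
        env, env_fs = band_envelope(octave_band(x, fc, fs), fs)
        m = np.clip([modulation_index(env, env_fs, f) for f in MOD_FREQS], 1e-6, 1 - 1e-6)
        snr = np.clip(10 * np.log10(m / (1 - m)), -15.0, 15.0)    # eq. (4), limited to [-15, 15] dB
        ti = (snr + 15.0) / 30.0                                  # eq. (5)
        mtis.append(ti.mean())                                    # eq. (6)
        weights.append(w_k)
    weights = np.array(weights) / np.sum(weights)                 # renormalise if any band was skipped
    return float(np.dot(weights, mtis))                           # eq. (7)

A call such as sti_measure(x, 16000) returns a value between 0 and 1, which can then be related to an intelligibility percentage through curves such as the one in Fig. 6.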

Figure 4: Intelligibility versus Signal-to-Noise ratio weighted with the "A" curve.

Figure 5: Intelligibility versus Articulation Index.

Figure 6: Intelligibility versus STI-based measure.
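For completeness, the correlation coefficient of eq. (8), used to evaluate the agreement shown in the figures above, can be computed directly. A minimal sketch follows; the example values are invented purely to show the call.

import numpy as np

def pearson_r(subjective, objective):
    # Eq. (8): population means and standard deviations, so the denominator is n * sigma_S * sigma_O.
    s, o = np.asarray(subjective, float), np.asarray(objective, float)
    return float(np.sum((s - s.mean()) * (o - o.mean())) / (len(s) * s.std() * o.std()))

# Hypothetical example: five S/N conditions, subjective intelligibility vs. an objective measure.
print(pearson_r([0.20, 0.45, 0.70, 0.85, 0.95], [0.25, 0.40, 0.68, 0.80, 0.97]))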

Figure 7: Intelligibility score versus time in a real case with a poor quality signal. The intelligibility has been predicted using the STI-based measure.

Figure 8: Intelligibility score versus time in a real case with a good quality signal. The intelligibility has been predicted using the STI-based measure.
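As a rough illustration of the frame-by-frame analysis behind Figs. 7 and 8, the sketch below slides a 500 ms window with 50% overlap over the signal and evaluates, in each window, the STI-based measure of the earlier sketch. The final mapping from STI to an intelligibility percentage is a placeholder linear function, since the coefficients of the fit on the Pink noise curve of Fig. 6 are not reported numerically here.

import numpy as np

def short_time_sti(x, fs, win_s=0.5, overlap=0.5):
    # Running STI-based measure: 500 ms sliding window with 50% overlap, as in the second experiment.
    n = int(round(win_s * fs))
    hop = max(1, int(round(n * (1.0 - overlap))))
    times, scores = [], []
    for start in range(0, len(x) - n + 1, hop):
        times.append((start + n / 2.0) / fs)                  # window centre, in seconds
        scores.append(sti_measure(x[start:start + n], fs))    # sti_measure from the earlier sketch
    return np.array(times), np.array(scores)

def sti_to_intelligibility(sti, a=1.0, b=0.0):
    # Placeholder linear mapping STI -> intelligibility (%); a and b would come from the fit on Fig. 6.
    return np.clip(100.0 * (a * np.asarray(sti) + b), 0.0, 100.0)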

The second experiment concerns the analysis of two speech samples of about 15 seconds each, taken from an actual case. The files are sampled at 8 kHz and quantized with 16 bits. The first signal is of very low quality in terms of S/N. The analysis shown in Fig. 7 allows the intelligibility of individual segments (phrases) to be assessed. The figure shows the trend of the signal amplitude, the sonogram (middle graph) and the intelligibility of the different segments. The estimated intelligibility never exceeds 50%. The second speech sample was analysed for comparison. The result (Fig. 8) shows that in this case the speech intelligibility is 100%.

5 CONCLUSIONS

In this paper we have presented a simple, efficient single-sided speech intelligibility measure based on the STI, suitable for forensic applications. This application is characterized by the following points: the signal is heavily disturbed, with S/N ratios from -6 to +2 dB; the signal is available only as a degraded mono channel; the evaluation has to be made with objective methods, because subjective methods are not well accepted in the courtroom. The evaluation of speech intelligibility is crucial to ensure the reliability of a transcription. STI-based measures have proven to be reliable for predicting intelligibility in forensic applications.

In conclusion, the present study demonstrates that the Speech Transmission Index is a good model for predicting speech intelligibility in additive stationary noise conditions. The evaluation of the STI in the presence of multiplicative noise or bandwidth reduction remains a topic for future research.

REFERENCES

[1] Fraser H., "Issues in transcription: factors affecting the reliability of transcripts as evidence in legal cases", Forensic Linguistics, vol. 10(2), 2002.

[2] Steeneken H. J. M., "The Measurement of Speech Intelligibility", TNO Human Factors, Soesterberg, The Netherlands.

[3] Ma J., Hu Y., Loizou P. C., "Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions", JASA 125, May 2009.

[4] Kitawaki N., Yamada T., "Subjective and Objective Quality Assessment for Noise Reduced Speech", ETSI Workshop on Speech and Noise in Wideband Communication, Sophia Antipolis, France, May 2007.

[5] Liu W. M., Jellyman K. A., Evans N. W. D., Mason J. S. D., "Assessment of Objective Quality Measures for Speech Intelligibility", INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008.

[6] Chan D., Fourcin A., et al., "EUROM: a spoken language resource for the EU", EUROSPEECH '95, Madrid, September 1995.

[7] Cycling '74, Max/MSP, documentation available at: http://cycling74.com/products/maxmspjitter/

[8] Hu Y., Loizou P. C., "A Comparative Intelligibility Study of Speech Enhancement Algorithms", ICASSP 2007, IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, April 15-20, 2007, pp. IV-561 - IV-564.

[9] Kryter K., "Methods for the calculation and use of the Articulation Index", JASA 34, 1689-1697, November 1962.

[10] Kryter K., ANSI S3.5-1969, "American National Standard Methods for the Calculation of the Articulation Index", American National Standards Institute, New York, 1969.

[11] Payton K. L., "A method to determine the speech transmission index from speech waveforms", JASA 106, 3637-3648, 1999.