Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System


J. A. Silvestre-Cerdà, A. Giménez, J. Andrés-Ferrer, J. Civera, and A. Juan
Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain
jsilvestre@dsic.upv.es, http://translectures.eu

Abstract. This paper describes the audio segmentation system developed by the PRHLT research group at the UPV for the Albayzin Audio Segmentation Evaluation 2012. The PRHLT-UPV audio segmentation system is based on a conventional GMM-HMM speech recognition approach in which the vocabulary set is defined by the power set of segment classes. MFCC features were extracted to represent the acoustic signal, and the AK toolkit was used both for training acoustic models and for performing audio segmentation. Experimental results reveal that our system provides excellent performance on speech detection, so it could be successfully employed to provide speech segments to a diarization or speech recognition system.

1 Introduction

Audio segmentation is a task with applications in subtitling, content indexing and analysis that has received notable attention due to the increasing application of automatic speech recognition (ASR) systems to multimedia repositories and broadcast news [6-9]. Formally, this task can be stated as the segmentation of a continuous audio stream into acoustically homogeneous regions. Audio segmentation facilitates subsequent speech processing steps such as speaker diarization or speech recognition.

Previous work on audio segmentation can be classified into approaches that tackle the task at the feature extraction level [1, 10, 4, 3] and those working at the classification level [2, 5]. The latter approach is adopted in our audio segmentation system.

This paper describes the PRHLT-UPV audio segmentation system. First, the Albayzin Audio Segmentation Evaluation 2012 is presented in Section 2. Next, a complete system description is provided in Section 3. Experimental results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 Albayzin 2012 audio segmentation evaluation

2.1 Database description

The database used for the audio segmentation evaluation consists of a Catalan broadcast news database from the 3/24 TV channel, which comprises 87 hours of acoustic data for training purposes. In this dataset, speech can be found in 92% of the segments, music is present 20% of the time, and background noise 40% of the time. An additional class, called others, was also defined; it occurs 3% of the time. Regarding the overlapped classes, speech occurs together with noise 40% of the time and together with music 15% of the time. Table 1 shows the audio time distribution over all overlapping acoustic classes, treated as disjoint sets, for the training set.

In addition, two sets, dev1 and dev2, from the Aragón Radio database of the Corporación Aragonesa de Radio y Televisión (CARTV), are used for development and internal testing purposes, respectively. Both sets together sum up to 4 hours of acoustic data. All audio signals are provided in PCM format: mono, little-endian, 16-bit resolution, with a sampling frequency of 16 kHz.

Table 1. Audio time distribution of all overlapping classes for the training set.

  Class       Time (h)   Time (%)
  sp             31.85       38.2
  mu              4.94        5.9
  no              0.91        1.1
  sp+mu          12.58       15.1
  sp+no          31.36       37.6
  no+mu           0.06        0.1
  sp+no+mu        1.65        2.0
  Total          87          100

2.2 Evaluation metric

In order to assess the quality of our system, we used the Segmentation Error Rate (SER) metric, defined as the fraction of class time that is not correctly attributed to the corresponding class (speech, noise or music):

    SER = \frac{\sum_n T(n)\,\big[\max\big(N_{\text{ref}}(n), N_{\text{sys}}(n)\big) - N_{\text{correct}}(n)\big]}{\sum_n T(n)\,N_{\text{ref}}(n)}    (1)

where T(n) is the duration of segment n, N_{\text{ref}}(n) is the number of reference classes present in segment n, N_{\text{sys}}(n) is the number of system classes present in segment n, and N_{\text{correct}}(n) is the number of reference classes in segment n correctly assigned by the segmentation system.

A forgiveness collar of one second, before and after each reference boundary, is considered in order to account for both inconsistent human annotations and the uncertainty about when a class begins or ends.
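To make the metric concrete, the following is a minimal Python sketch of Eq. (1). It is not the official scoring tool: it assumes reference and system outputs are given as per-frame sets of class labels (one set per 10 ms frame, so each frame plays the role of a segment of constant duration) and it omits the one-second forgiveness collar.

```python
# Minimal SER sketch, Eq. (1): ref and sys are lists of per-frame label sets,
# e.g. {"sp", "no"}; each frame is treated as a segment of duration frame_dur.
from typing import List, Set


def segmentation_error_rate(ref: List[Set[str]], sys: List[Set[str]],
                            frame_dur: float = 0.01) -> float:
    """SER = sum_n T(n) [max(N_ref, N_sys) - N_correct] / sum_n T(n) N_ref."""
    assert len(ref) == len(sys), "reference and system must cover the same frames"
    error_time, ref_time = 0.0, 0.0
    for ref_classes, sys_classes in zip(ref, sys):
        n_ref = len(ref_classes)
        n_sys = len(sys_classes)
        n_correct = len(ref_classes & sys_classes)   # reference classes correctly assigned
        error_time += frame_dur * (max(n_ref, n_sys) - n_correct)
        ref_time += frame_dur * n_ref
    return error_time / ref_time if ref_time > 0 else 0.0


# Toy usage: three speech+noise frames; the system misses the noise in the last one.
ref = [{"sp", "no"}, {"sp", "no"}, {"sp", "no"}]
hyp = [{"sp", "no"}, {"sp", "no"}, {"sp"}]
print(segmentation_error_rate(ref, hyp))   # -> 0.1667
```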

3 System description

Audio segmentation can be viewed as a simplified case of ASR in which the system vocabulary is constituted by the power set of the segment classes: speech (sp), music (mu) and noise (no). For the Albayzin evaluation, the silence (si) class is also included to denote that none of the three classes is present at a given time instant. Thus, the system vocabulary is defined as

    C = \{\text{sp}, \text{mu}, \text{no}, \text{sp+mu}, \text{sp+no}, \text{mu+no}, \text{sp+mu+no}, \text{si}\}    (2)

Given an audio stream x, the segmentation problem can be stated from a statistical point of view as the search for a sequence of class labels ĉ such that

    \hat{c} = \operatorname*{arg\,max}_{c}\; p(x \mid c)\, p(c)    (3)

where, as in ASR, p(x | c) and p(c) are modelled by acoustic and language models, respectively. In our case, it should be noted that each word is composed of a single phoneme.

Acoustic models were trained on MFCC feature vectors computed from the acoustic samples using the HTK HCopy tool. We used a pre-emphasis filter with coefficient 0.97 and a 25 ms Hamming window that moves every 10 ms over the acoustic signal. From each 10 ms frame, a feature vector of 12 MFCC coefficients is obtained using a 26-channel filter bank. Finally, the energy coefficient and the first and second time derivatives of the cepstral coefficients are added to the feature vector (an illustrative sketch of this front-end is given at the end of this section).

Each segment class is represented by a single-state Hidden Markov Model (HMM) without loops, and its emission probability is modelled by a Gaussian Mixture Model (GMM). Acoustic GMM-HMM models were trained using the AK toolkit (http://sourceforge.net/projects/aktoolkit), which implements the conventional Baum-Welch algorithm. For each segment class, the number of mixture components per state was tuned on the development set.

A 5-gram back-off language model with constant discount was trained on the sequences of class labels using the SRILM toolkit [11]. The constant discounts for each order were optimised on the development set. The segmentation process (search) was also carried out with the AK toolkit.
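As an illustration of the front-end described above, the following sketch computes a comparable 39-dimensional feature vector. Note that the actual system used the HTK HCopy tool; the python_speech_features package and the input file name here are stand-ins chosen for readability, not part of the original setup.

```python
# Illustrative MFCC front-end sketch (stand-in for HTK HCopy): 0.97 pre-emphasis,
# 25 ms Hamming window, 10 ms shift, 26-channel filter bank, 12 MFCCs + log-energy,
# plus first and second time derivatives -> 39 dimensions per frame.
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

sample_rate, signal = wav.read("broadcast.wav")   # hypothetical 16 kHz mono file

static = mfcc(signal, samplerate=sample_rate,
              winlen=0.025, winstep=0.01,         # 25 ms window, 10 ms shift
              numcep=13, nfilt=26,                # 12 MFCCs + energy, 26 filters
              preemph=0.97, appendEnergy=True,
              winfunc=np.hamming)

d1 = delta(static, 2)                             # first time derivatives
d2 = delta(d1, 2)                                 # second time derivatives
features = np.hstack([static, d1, d2])            # shape: (num_frames, 39)
print(features.shape)
```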

4 Experimental results

This section describes the experimental setup and the results obtained before submitting our final audio segmentation system. For these experiments, acoustic and language models were trained on the training set, while acoustic and language model parameters were tuned on the dev1 set.

Table 2 shows SER figures computed on the dev1 and dev2 sets. In addition to the overall SER, SER values are provided for each acoustic class (speech, noise, music) in isolation. As observed in Table 2, our audio segmentation system offers excellent performance in speech detection, so it could be successfully employed to provide speech segments to diarization or speech recognition systems.

Table 2. Segmentation error rate (SER) for the three acoustic classes (speech, music, noise) in isolation and overall SER, computed over the dev1 and dev2 sets.

  Set    Speech   Music   Noise   Overall
  dev1      1.2    25.3    71.4      24.9
  dev2      2.2    20.2    71.2      26.4

However, the system performs poorly at detecting the non-speech classes, especially the noise class. This fact can be explained by two reasons. First, we are using a feature representation of the acoustic signal that is designed to highlight human voice characteristics and, conversely, to attenuate the acoustic characteristics of music and noise. Second, too few music and noise samples appear in isolation (5% and 1% of the time, respectively) to robustly estimate acoustic models for these classes. For this reason, the global classifier suffers from a bias towards the isolated speech class. For instance, the posterior probability of an sp+no segment given the isolated speech model parameters is expected to be larger than that given the isolated noise model parameters.

5 Conclusions

This paper has described the PRHLT-UPV audio segmentation system for the Albayzin evaluation. This system tackles the task of audio segmentation from the viewpoint of an ASR system with a reduced vocabulary set. The vocabulary set comprises the speech, music and noise classes, together with the combinations of those classes defined for this task. The experimental results show that the system provides excellent performance when detecting speech segments, but its performance decays when dealing with music or noise segments.

Acknowledgments. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government (itrans2 project, TIN2009-14511; FPU scholarship AP2010-4349).

References

1. Ajmera, J., McCowan, I., Bourlard, H.: Speech/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Communication 40(3), 351-363 (2003)
2. Bugatti, A., Flammini, A., Migliorati, P.: Audio classification in speech and music: A comparison between a statistical and a neural approach. EURASIP J. Audio Speech Music Process. pp. 372-378 (2002)
3. Gallardo-Antolín, A., Montero, J.M.: Histogram equalization-based features for speech, music, and song discrimination. IEEE Signal Processing Letters 17(7), 659-662 (2010)
4. Izumitani, T., Mukai, R., Kashino, K.: A background music detection method based on robust feature extraction. In: Proc. of ICASSP. pp. 13-16
5. Lavner, Y., Ruinskiy, D.: A decision-tree-based algorithm for speech/music classification and segmentation. EURASIP J. Audio Speech Music Process. 2009, 2:1-2:14 (2009)
6. Li, D., Sethi, I.K., Dimitrova, N., McGee, T.: Classification of general audio data for content-based retrieval. Pattern Recognition Letters 22(5), 533-544 (2001)
7. Lu, L., Jiang, H., Zhang, H.: A robust audio classification and segmentation method. In: Proc. of ACM International Conference on Multimedia. pp. 203-211 (2001)
8. Meinedo, H., Neto, J.: Audio segmentation, classification and clustering in a broadcast news task. In: Proc. of ICASSP. vol. 2, pp. 5-8 (2003)
9. Nwe, T., Li, H.: Broadcast news segmentation by audio type analysis. In: Proc. of ICASSP. vol. 2, pp. 1065-1068 (2005)
10. Panagiotakis, C., Tziritas, G.: A speech/music discriminator based on RMS and zero-crossings. IEEE Transactions on Multimedia 7(1), 155-166 (2005)
11. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. of ICSLP'02. pp. 901-904 (September 2002)