Acoustic Scene Classification


1 Acoustic Scene Classification By Yuliya Sergiyenko Seminar: Topics in Computer Music RWTH Aachen 24/06/2015

2 Outline
1. What is acoustic scene classification (ASC)
2. History
   1. Cocktail party problem & CASA
   2. Sets of categories
   3. DCASE challenge
3. Room identification (RI)
   1. Room impulse responses
   2. Room volume classification
   3. State of the art
      1. MFCC
      2. GMM
   4. Competing solution
4. Conclusions

3 Acoustic scene classification Acoustic scene classification (ASC): identifying the location at which an audio or video recording was made. Possible application fields: hearing aids with automatic program adaptation; speech recognition; forensic analysis; etc.

4 History One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. We may call it the cocktail party problem (Cherry, 1957) Computational auditory scene analysis (CASA) is the study of auditory scene analysis (ASA) by computational means (Bregman, 1990). The goal of the CASA system is to be able to separate sound mixtures in the same way that humans are able to do.

5 History Researchers generally define a set of categories, record samples from these environments, and treat ASC as a supervised classification problem within a closed universe of possible classes. [1] In 1997, Sawhney and Maes described a simple classifier for five predefined classes of environmental sounds (people, voices, subway, traffic, and other) based on several discriminating features extracted from the audio.

6 DCASE challenge To evaluate and compare existing algorithms in ASC, the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee organized a challenge in 2013. The DCASE challenge dataset was created specifically to provide researchers with a standardised set of recordings produced in 10 different urban environments. The soundscapes included: bus, busystreet, office, openairmarket, park, quietstreet, restaurant, supermarket, tube (underground railway) and tubestation.

7 New problem - Room identification Room identification (RI) - given the audio, identifying the particular room in which the recording was made. Beneficial in: Location estimation; Music recommendation; Better speech recognition indoors; Law-enforcement and forensics.

8 Location estimation: GPS data gives only a rough estimate and often fails indoors; strength of WiFi signals (2010) requires the location to be estimated during capture and needs sufficient WiFi coverage; visual similarities (video) are sensitive to changes in spatial configuration. Music recommendation: could help create a list of recordings made at a specific venue.

9 Better speech recognition indoors: automatic speech recognition systems are affected by unknown room reverberation; knowing the room helps adapt the recognition system. Law enforcement and forensics: emergency phone calls provide more clues and can help filter out fake calls.

10 Room Impulse Responses (RIR) Rooms can be described through room impulse responses, the acoustic fingerprints of the room. [2] Obtaining RIRs is a very time-consuming process, and specific measurement signals and equipment are needed. [3] It is often too complicated or even impossible to conduct such RIR measurements.
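To make the fingerprint idea concrete, here is a minimal sketch (the file names and the use of soundfile/scipy are assumptions, not part of the slides): a recording made in a room is well approximated by convolving the dry source signal with that room's RIR, which is why an RIR characterizes a room so well and why measuring one is worthwhile despite the effort.

```python
# Minimal sketch: simulate "recording in a room" by convolving a dry signal with an RIR.
import numpy as np
import soundfile as sf                  # assumed I/O library
from scipy.signal import fftconvolve

dry, fs = sf.read("dry_speech.wav")     # hypothetical anechoic source
rir, fs_rir = sf.read("room_rir.wav")   # hypothetical measured room impulse response
assert fs == fs_rir, "source and RIR must share a sample rate"

wet = fftconvolve(dry, rir)             # the room's fingerprint is imprinted here
wet /= np.max(np.abs(wet)) + 1e-12      # normalise to avoid clipping
sf.write("speech_in_room.wav", wet, fs)
```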

11 Room Volume Classification Room volume classification from reverberant speech signals can be useful in acoustic scene analysis applications to help characterize the types of rooms. [4] Previous work required the room impulse response (RIR) explicitly in order to estimate or classify the room volume, so researchers started looking for simpler approaches.

12 State of the art in RI N. Peters, H. Lei, and G. Friedland, Name That Room: Room Identification Using Acoustic Features in a Recording, 2012 [5] Ideas: analyzing the audio component in multimedia data; using machine learning techniques to identify rooms from ordinary audio recordings. This room identification system is derived from a GMM-based system using Mel-frequency cepstral coefficient (MFCC) acoustic features.

13 GMMs Mixture Models are a type of density model which comprise a number of component functions. A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system.
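For reference, the density described above can be written as the standard weighted sum of M Gaussian components (notation assumed, not taken from the slide):

```latex
p(\mathbf{x}\mid\lambda) \;=\; \sum_{i=1}^{M} w_i\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i),
\qquad \sum_{i=1}^{M} w_i = 1,
\qquad \lambda=\{w_i,\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i\}_{i=1}^{M}
```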

14 MFCCs Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel-scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum").
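A minimal sketch of MFCC extraction, assuming librosa and a hypothetical input file; the exact front-end settings used in [5] are not given on the slide:

```python
# Hedged sketch: compute short-term MFCC features for one recording with librosa.
import librosa

y, sr = librosa.load("recording.wav", sr=None)       # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,   # 13 cepstral coefficients
                            n_fft=2048,              # short-term analysis window
                            hop_length=512)          # frame step
print(mfcc.shape)  # (13, number_of_frames)
```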

15 The system One room-dependent GMM is trained for each room, using MFCC features from all audio recordings associated with that room. This is done via MAP adaptation from a room-independent GMM trained on MFCC features from all audio tracks of all rooms in the development set. During testing, the likelihoods of MFCC features from the test audio tracks are computed under the room-dependent GMMs of each room in the training set. [5]
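The pipeline above can be sketched roughly as follows. This is not the implementation from [5]: scikit-learn has no MAP adaptation, so the sketch approximates it by initialising each room model from the room-independent GMM and running only a few EM iterations on that room's data; the feature dictionary layout and component count are assumptions.

```python
# Hedged sketch of GMM-based room identification on MFCC features.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_room_models(features_by_room, n_components=32):
    """features_by_room: {room_name: [mfcc array of shape (n_frames, n_coeffs), ...]}"""
    # Room-independent GMM trained on MFCCs pooled over all rooms (UBM-like model).
    pooled = np.vstack([f for feats in features_by_room.values() for f in feats])
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(pooled)

    room_models = {}
    for room, feats in features_by_room.items():
        # Rough stand-in for MAP adaptation: start EM from the room-independent
        # parameters and adapt briefly to this room's data.
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_, max_iter=5)
        room_models[room] = gmm.fit(np.vstack(feats))
    return room_models

def identify_room(room_models, test_mfcc):
    # Choose the room whose GMM assigns the highest average log-likelihood
    # to the test recording's frames (test_mfcc: (n_frames, n_coeffs)).
    return max(room_models, key=lambda room: room_models[room].score(test_mfcc))
```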

16 Results The room identification system is able to relate audio data to the correct room. The estimation error is not randomly distributed; rather, it depends on the (acoustical) similarities of the tested rooms. For room identification, short-term MFCC features are found to be more suitable than long-term MFCC features.

17 Results Non-parametric multidimensional scaling (MDS) was performed on the confusion data. MDS is a technique in which dissimilarities between data points are modeled as distances in a low-dimensional space: a large dissimilarity is represented by a large distance and vice versa. The system achieved an overall accuracy of 61% for music and 85% for speech signals.
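A small sketch of how such an MDS layout can be produced, assuming scikit-learn and an illustrative dissimilarity matrix (the real one in [5] is derived from the room confusion data):

```python
# Hedged sketch: embed room dissimilarities in 2-D with non-metric MDS.
import numpy as np
from sklearn.manifold import MDS

# Symmetric dissimilarities between four hypothetical rooms (0 = identical).
D = np.array([[0.0, 0.2, 0.7, 0.8],
              [0.2, 0.0, 0.6, 0.9],
              [0.7, 0.6, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]])

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # similar (often confused) rooms end up close together
print(coords)
```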

18 Competing solution A. H. Moore, M. Brookes and P. A. Naylor, Roomprints for forensic audio applications. [6] Ideas: a roomprint is a set of features of a room, inferred from a recording made in the room and compared against a set of reference roomprints in order to perform identification or verification of the recording location. A roomprint can include any aspect of the room which can be explicitly measured.

19 Main questions Verification: if it is claimed that a recording was made in a particular room, is there sufficient evidence to reject the claim? Identification: if it is known that a recording was made in one of a number of rooms, can we determine which one is most likely? The identification question was covered in 2012 by Peters et al. in Name That Room [5]; differences in measurement systems or technique may have caused some of the between-room variability observed there.

20 Roomprints requirements A roomprint must exploit features of a room which allow it to be distinguished from other potentially similar rooms. A roomprint should ideally be invariant to the location of the talker and microphone in the room. A roomprint should ideally be invariant with time.

21 Parameters to include: geometric features (the size and shape of the room); room acoustics parameters; environmental sounds. Frequency-dependent reverberation time is investigated in this paper as a promising characteristic and is used in a room identification experiment. Results: an error rate of 3.9% was obtained in a room identification experiment over 22 rooms (correct identification in 96% of trials).
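Frequency-dependent reverberation time can be estimated in several ways; the roomprint work estimates it blindly from recordings, which is beyond a short example. As a hedged sketch under the simpler assumption that a measured RIR is available, the classic octave-band Schroeder integration looks like this:

```python
# Hedged sketch: RT60 per octave band from a measured RIR via Schroeder integration.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def rt60_band(rir, fs, f_center):
    # Band-pass the RIR to one octave band around f_center.
    lo, hi = f_center / np.sqrt(2), f_center * np.sqrt(2)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, rir)

    # Schroeder energy decay curve in dB.
    edc = np.cumsum(band[::-1] ** 2)[::-1]
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)

    # Fit the -5 dB .. -25 dB decay region and extrapolate to -60 dB.
    idx = np.where((edc_db <= -5) & (edc_db >= -25))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second (negative)
    return -60.0 / slope

# e.g. [rt60_band(rir, fs, fc) for fc in (250, 500, 1000, 2000, 4000)] gives a
# frequency-dependent reverberation profile usable as a simple room feature.
```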

22 Conclusions Different state-of-the-art approaches achieve nearly the same results. There is still room for improvement before any algorithm reaches human performance in acoustic scene classification. Using machine learning techniques to identify room acoustic properties is a very young field of research that is becoming more important. The concept of roomprints is the state of the art in room identification to this day. Frequency-dependent reverberation time is a good characteristic to use in room identification.

23 References
1. D. Barchiesi, D. Giannoulis, D. Stowell and M. D. Plumbley, Acoustic Scene Classification, IEEE, School of Electronic Engineering and Computer Science, November 17, 2014.
2. ISO 3382-1, Acoustics - Measurement of room acoustic parameters - Part 1: Performance spaces, International Organization for Standardization (ISO), Geneva, Switzerland, 2009.
3. G. Stan, J. Embrechts, and D. Archambeau, Comparison of different impulse response measurement techniques, J. Audio Eng. Soc., 50(4):249-262, 2002; R. Stewart and M. Sandler, Database of omnidirectional and B-format impulse responses, in Proc. of ICASSP, Dallas, USA, 2010.
4. N. Shabtai, B. Rafaely, and Y. Zigel, Room volume classification from reverberant speech, in Proc. of the Int'l Workshop on Acoustic Signal Enhancement, Tel Aviv, Israel, 2010.
5. N. Peters, H. Lei, and G. Friedland, Name That Room: Room Identification Using Acoustic Features in a Recording, in Proc. of ACM Multimedia, Nara, Japan, 2012.
6. A. H. Moore, M. Brookes, and P. A. Naylor, Roomprints for forensic audio applications, in Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, October 20-23, 2013.