Bird Sounds Classification by Large Scale Acoustic Features and Extreme Learning Machine


Technische Universität München. Bird Sounds Classification by Large Scale Acoustic Features and Extreme Learning Machine. Kun Qian, Zixing Zhang, Fabien Ringeval, Björn Schuller. Session: Biological and Biomedical Signal Processing. December 16, 2015

Outline: Motivation, Approach, Database, Experiments, Conclusion. Zixing Zhang 2

Motivation: Monitoring CLIMATE CHANGE and HABITAT LOSS. Classifying bird species by their sounds is less expensive than telescope observation and works better in bad weather conditions. Interdisciplinary study: ecology, zoology, bioacoustics, signal processing, machine learning, big data, etc.

Motivation: Systematic Framework. Syllable detection: how to find suitable units for further feature extraction and machine learning? (supervised, semi-supervised, or unsupervised). Feature extraction: how to define descriptors capable of feeding the learning model? (speech-like or new). Feature selection: how to re-generate or modify the original Low-Level Descriptors (LLDs) to reduce the feature dimensionality? (classical methods or deep neural networks). Machine learning: how to set up a feasible learning architecture? (Extreme Learning Machine)

Approach. Syllable detection: unsupervised method based on a p-center detector. Large-scale acoustic feature extraction: opensmile toolkit (INTERSPEECH 2009 Emotion Challenge feature set). Feature selection: ReliefF algorithm (ranks features by their contribution to classification performance). Machine learning: Extreme Learning Machine (ELM).

P-center Detector. Estimates the p-center from the entropy, the average frequency, and the centroid of the rhythmic envelope. No training phase is needed, which spares the time-consuming data collection and human labeling that supervised methods require. Adapts to the audio recording currently being processed (e.g., signal quality, background noise level, and species-specific sound characteristics). S. Tilsen and K. Johnson, "Low-frequency Fourier analysis of speech rhythm," The Journal of the Acoustical Society of America, vol. 124, no. 2, pp. EL34-EL39, 2008.

P-center Detector. The p-center marks the prominent part of the audio signal; syllables can therefore be detected by setting a suitable threshold and a minimum consecutive duration. (Bird species: house sparrow.)
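The threshold-plus-minimum-duration step described above can be sketched as follows. This is an illustrative reading, not the authors' code: `envelope` stands for the p-center prominence curve, and the threshold and minimum-length values are placeholders.

```python
import numpy as np

def detect_syllables(envelope, threshold, min_len):
    """Return (start, end) index pairs where the prominence envelope
    stays above `threshold` for at least `min_len` consecutive samples."""
    above = envelope > threshold
    # rising (+1) and falling (-1) edges of the boolean mask
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if above[0]:                       # segment already active at start
        starts = np.r_[0, starts]
    if above[-1]:                      # segment still active at end
        ends = np.r_[ends, len(envelope)]
    # keep only segments long enough to count as a syllable
    return [(s, e) for s, e in zip(starts, ends) if e - s >= min_len]
```

Short bursts above the threshold are discarded by the duration check, which is what separates genuine syllables from isolated prominence peaks.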

P-center Detector. Detection of syllables by the p-center, with the corresponding spectrogram. (Bird species: house sparrow.)

Large-Scale Acoustic Feature Extraction. INTERSPEECH 2009 Emotion Challenge standard feature set: 16 acoustic Low-Level Descriptors (LLDs) plus their first-order delta regression coefficients, each summarized by 12 functionals, giving 12 x 2 x 16 = 384 dimensions in total. Toolkit: opensmile, http://opensmile.sourceforge.net/

Feature Selection. Feature ranking (to identify which features are useful). ReliefF can be regarded as an evaluator that ranks features: it yields a ranking weight W(i) for the i-th feature. In this study, a contribution rate is introduced to select the better features for the subsequent machine learning phase, where W+ denotes the feature weights sorted in descending order. M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23-69, 2003.
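The contribution-rate formula itself is not reproduced in this transcript; a plausible reading, sketched below as an assumption, keeps the top-ranked features whose cumulative sorted weight W+ reaches a chosen fraction of the total weight.

```python
import numpy as np

def select_by_contribution(weights, rate=0.85):
    """Select feature indices whose cumulative ReliefF weight (sorted
    descending, i.e. W+) reaches `rate` of the total weight.
    NOTE: the exact contribution-rate definition is an assumption here."""
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)  # drop negative weights
    order = np.argsort(w)[::-1]              # descending sort -> W+
    cum = np.cumsum(w[order]) / w.sum()      # cumulative contribution
    k = int(np.searchsorted(cum, rate)) + 1  # smallest k reaching the rate
    return order[:k]
```

With this reading, lowering `rate` keeps fewer but higher-weighted features, matching the slide's goal of shrinking the 384-dimensional set before classification.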

Classifier: Extreme Learning Machine (ELM). Fast and efficient; a feedforward neural network with a single hidden layer; three-step learning model. Parameter settings: activation function: radbas; number of hidden nodes: 30,000. Code available at: http://www.ntu.edu.sg/home/egbhuang/elm_codes.html G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1, pp. 489-501, 2006.
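A minimal sketch of the three-step ELM learning model: randomize the hidden-layer parameters, compute the hidden activations (here radbas, exp(-z^2), as in the MATLAB code), then solve the output weights in closed form. The toy dimensions are illustrative, not the 30,000-node setup used in the paper.

```python
import numpy as np

def train_elm(X, Y, n_hidden=100, seed=0):
    """Single-hidden-layer ELM with radbas activation."""
    rng = np.random.default_rng(seed)
    # Step 1: random input weights and biases (never trained)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden-layer output matrix H
    H = np.exp(-(X @ W + b) ** 2)
    # Step 3: closed-form output weights via Moore-Penrose pseudo-inverse
    beta = np.linalg.pinv(H) @ Y
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = np.exp(-(X @ W + b) ** 2)
    return H @ beta
```

Because only `beta` is solved (one least-squares problem, no backpropagation), training is fast even with very wide hidden layers, which is the speed the slide highlights.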

Database. Free and public database (the picture shown is also from this source): http://gallery.new-ecopsychology.org/en/voices-of-nature.htm (54 bird species, recorded in the field with high audio quality).

Experimental Results. A comparison of different classifiers on the 54-species bird classification task. UAR (Unweighted Average Recall): the sum of the per-class recall values (class-wise accuracies) divided by the number of classes. Accuracy, i.e., WAR (Weighted Average Recall): widely used; the number of correctly classified instances divided by the total number of instances.
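The two metrics defined above can be computed directly from the labels; this sketch follows those definitions.

```python
import numpy as np

def uar_war(y_true, y_pred):
    """UAR: mean of per-class recalls (each class counts equally).
    WAR (accuracy): overall fraction of correctly classified instances."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    uar = float(np.mean(recalls))
    war = float(np.mean(y_pred == y_true))
    return uar, war
```

On imbalanced data the two diverge: a classifier that ignores rare species can still score a high WAR, while UAR penalizes it, which is why both are reported.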

Experimental Results: Feature Selection. Nearly 10% improvement in UAR while using fewer than 15% of the features.

Experimental Results: Classification Results with Different Numbers of Species. Excellent for fewer than 45 species, and still good up to 54 species.

Conclusions. The proposed framework is efficient and feasible. The p-center-based detector can be applied to the unsupervised syllable detection phase. The opensmile toolkit can be used in areas beyond speech emotion recognition. Feature selection is a necessary phase in the classification system. The ELM-based classifier can be regarded as an efficient and robust model.

Future Work: Larger Database Needed. For example, the collection of Xeno-Canto, a website dedicated to sharing bird sounds from all over the world (279,583 recordings, 9,443 bird species, more than 3,700 hours of recording time). Image source: http://www.xeno-canto.org/

Future Work: Truly Large-Scale Features Needed. The opensmile toolkit can extract more than 6,000 feature dimensions for machine learning. Syllable detection methods: other unsupervised techniques should be tested. Classifiers: Deep Neural Networks (DNNs) or advanced ELMs.

Thank you!