Voice Activity Detection. Roope Kiiski

Transcription:

Voice Activity Detection Roope Kiiski Speech recognition 4.12.2015

Content: Basics of Voice Activity Detection (VAD); features, classifier and thresholding; in-depth look at different features; different kinds of classifiers; my project thus far.

VAD's idea is to detect whether a signal contains speech or not. Discuss in groups for a minute: Why would VAD be used? What are the benefits of VAD?

Basics of VAD Voice activity detection is essentially a pre-processing algorithm. In speech coding, it is used to reduce the amount of transmitted data by switching off the transmission when there is no speech. In speech recognition, it saves processing power by sending only the parts that contain speech to the recognition engine. It can also be used to detect the background noise, which can then be compensated for in the speech signal.

Basics of VAD - Trivial case The trivial case of voice activity detection is speech with no background noise whatsoever. In that case, we can assume that whenever there is any activity in the signal, it is speech. This hardly represents any real-world signal.
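The trivial case can be sketched as a plain energy threshold. This is only an illustration, not the presenter's implementation: the function names, the frame length and the threshold value are all assumptions made for the example.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(signal, frame_len=4, threshold=1e-4):
    """Mark a frame as speech when its energy exceeds the threshold.

    frame_len and threshold are illustrative values, not tuned ones.
    """
    return [e > threshold for e in frame_energies(signal, frame_len)]


# Toy noise-free signal: silence, a burst of activity, silence again.
signal = [0.0] * 8 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 8
decisions = energy_vad(signal)
print(decisions)
```

With any background noise the silent frames no longer have zero energy, so a fixed threshold like this stops being reliable.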

Basics of VAD - Example Example of a trivial case:

Basics of VAD - Example Let's add some noise:

Basics of VAD - Performance How to measure the performance of VAD? Which one is worse: a false positive or a false negative? Is it better to find too much speech or to miss some speech? It depends on the application. For speech coding we want to keep speech quality high, so we want to avoid missing speech. False negatives are bad! For keyword spotting we want to save processing power, and thus we want to avoid finding too much speech. False positives are bad!
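One possible way to put numbers on these two kinds of error is to compare the VAD output with hand-labelled frames. The sketch below assumes per-frame boolean labels are available; the function name and the toy data are made up for illustration.

```python
def vad_error_rates(reference, predicted):
    """Return (false_positive_rate, false_negative_rate) over frames.

    reference: True where a frame really contains speech.
    predicted: True where the VAD decided "speech".
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have equal length")
    fp = sum(1 for r, p in zip(reference, predicted) if p and not r)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    n_nonspeech = sum(1 for r in reference if not r)
    n_speech = len(reference) - n_nonspeech
    fpr = fp / n_nonspeech if n_nonspeech else 0.0
    fnr = fn / n_speech if n_speech else 0.0
    return fpr, fnr


# Toy labels: the VAD fires one frame early and misses the last speech frame.
reference = [False, False, True, True, True, True, False, False]
predicted = [False, True, True, True, True, False, False, False]
fpr, fnr = vad_error_rates(reference, predicted)
print(fpr, fnr)
```

A speech coder would mostly watch the false negative rate, while a keyword spotter would mostly watch the false positive rate.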

Basics of VAD - Hangover How to increase the performance of the VAD? When do mistakes happen? The most common mistake is that the VAD misses the end of a word or some silent part in the middle. This can be corrected with hysteresis, by adding a hangover. Basically, if any of the previous X frames was detected as speech, then the current one is marked as speech too.
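The hangover rule can be sketched as follows. The sketch assumes that "previous frames" refers to the raw, unsmoothed decisions (otherwise the output could never switch off again); the function name and the default hangover length are illustrative.

```python
def apply_hangover(decisions, hangover=2):
    """Keep the output at "speech" for `hangover` frames after the
    last raw speech decision."""
    smoothed = []
    frames_since_speech = hangover + 1  # start outside the hangover window
    for is_speech in decisions:
        if is_speech:
            frames_since_speech = 0
        else:
            frames_since_speech += 1
        smoothed.append(frames_since_speech <= hangover)
    return smoothed


# The raw decisions drop out for one frame in the middle of a word
# and end abruptly; the hangover bridges the gap and extends the tail.
raw = [False, True, True, False, True, False, False, False, False]
smoothed = apply_hangover(raw, hangover=2)
print(smoothed)
```

A longer hangover misses less speech but also lets more non-speech frames through, so its length is another application-dependent trade-off.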

Basics of VAD - Example Hangover:

Basics of VAD - Features Previous examples were quite simple and sometimes VAD still failed. How to make VAD more robust and accurate? By adding more features! There are various characteristics that differentiate speech and noise. We need to find out what these are and then use them. Measures of these characteristics are features.

Features - Different features There are plenty of different features, but they all try to give an indication of whether the signal is speech or not. Signal energy is a good feature, as seen previously. It is also known that speech has its energy mainly at low frequencies. Zero-crossings can estimate this, as high-frequency signals have more zero-crossings than low-frequency signals. Speech can be modelled by linear prediction, so the linear prediction error indicates whether the signal is speech or not.
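As a rough illustration of the zero-crossing idea, the sketch below counts sign changes in a frame and compares a low-frequency and a high-frequency sine. The helper names, the sample rate and the test frequencies are assumptions chosen only for the example.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def sine(freq_hz, sample_rate, n_samples):
    """Generate n_samples of a unit-amplitude sine wave."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


# Assumed 8 kHz sample rate; 100 Hz stands in for "low-frequency"
# content and 3000 Hz for "high-frequency" content.
zcr_low = zero_crossing_rate(sine(100, 8000, 400))
zcr_high = zero_crossing_rate(sine(3000, 8000, 400))
print(zcr_low, zcr_high)
```

The high-frequency tone gives a much larger zero-crossing rate, which is why this cheap measure can act as a proxy for where the energy sits in frequency.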

Features - Different features Voiced speech also has a pitch, which can be calculated and used as a feature. These features usually change over time, often quite rapidly. Thus the rate of change of the features can be used to gain information about the signal. Even the second difference can contain some information!
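The rate of change and the second difference can be sketched as simple frame-to-frame differences of a feature track. This is one common way to do it, not necessarily the one used in the project; the function name and the toy energy values are illustrative.

```python
def delta(track):
    """First difference of a per-frame feature track.

    The first frame has no predecessor, so its delta is set to 0.0.
    """
    return [0.0] + [b - a for a, b in zip(track, track[1:])]


# Toy per-frame energy: a word starts, peaks and fades out.
energy = [0.0, 0.0, 1.0, 4.0, 4.0, 1.0, 0.0]
d1 = delta(energy)  # rate of change
d2 = delta(d1)      # second difference
print(d1)
print(d2)
```

The deltas are large at the onset and offset of the word even though the energy itself is still low there, which is the extra information the classifier can use.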

Features - Example

Classifier Now we have plenty of data! What to do with it?! We implement a classifier. A classifier is a system that takes all the features and outputs a decision for each frame, based on the features of that frame. It can be implemented in various ways: decision trees, linear classifiers, neural networks, Gaussian mixture models, etc.

Classifier - Decision tree Decision trees are simple to implement. They are hard-coded and thus not very flexible. Overall they perform poorly: they are only good when the system is low-complexity and low-noise, and when accuracy isn't too important.
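A hard-coded decision tree of the kind described might look like the sketch below. The choice of features, the order of the tests and both threshold values are invented for illustration and are not taken from the presentation.

```python
def tree_vad(energy, zcr, energy_threshold=0.01, zcr_threshold=0.3):
    """Hard-coded two-level decision tree for a single frame.

    Both thresholds are hand-picked placeholder values.
    """
    if energy < energy_threshold:
        return False  # too quiet to be speech
    if zcr > zcr_threshold:
        return False  # loud, but dominated by high frequencies
    return True


# (energy, zero-crossing rate) for three toy frames:
# silence, voiced speech, loud hiss.
frames = [(0.001, 0.10), (0.200, 0.08), (0.150, 0.60)]
decisions = [tree_vad(e, z) for e, z in frames]
print(decisions)
```

Every threshold has to be retuned by hand when the noise conditions change, which is the inflexibility mentioned above.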

Classifier - Linear classifier Instead of manually tuning the decisions, we make an estimate based on statistics and observed data. The decision is based on a weighted sum of the features. The weights for each feature can be calculated when we know the feature values for each frame and the desired result for each frame. In short, skipping all the math, the weights can be calculated from $w = (XX^T)^{-1} X y^T = X^{+} y^T$, where $X$ is the feature matrix with one column per frame, $X^{+} = (XX^T)^{-1}X$ is the Moore-Penrose pseudo-inverse (of $X^T$), $y$ is the row vector of desired outputs and $w$ is the vector of weights.
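The weight formula can be tried out on a toy problem. The sketch below solves the normal equations $(XX^T) w = X y^T$ directly in plain Python; in practice a numerical library routine (for example NumPy's `numpy.linalg.pinv`) would be the more usual choice. The feature values, the constant bias row, the +1/-1 targets and all function names are assumptions made for the example.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_linear(X, y):
    """Least-squares weights w = (X X^T)^{-1} X y^T.

    X holds one feature per row and one frame per column.
    """
    xxt = [[sum(a * b for a, b in zip(ri, rj)) for rj in X] for ri in X]
    xy = [sum(a * t for a, t in zip(ri, y)) for ri in X]
    return solve(xxt, xy)


def score(w, features):
    """Weighted sum of the features of one frame."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Rows: energy, zero-crossing rate, constant bias term (toy values).
X = [
    [0.01, 0.02, 0.50, 0.60, 0.03, 0.55],
    [0.50, 0.45, 0.10, 0.12, 0.40, 0.08],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
]
y = [-1, -1, 1, 1, -1, 1]  # desired output: +1 speech, -1 non-speech

w = train_linear(X, y)
decisions = [score(w, frame) > 0 for frame in zip(*X)]
print(decisions)
```

Here the continuous score is thresholded at zero because the targets are +1 and -1; with 0/1 targets the natural threshold would be 0.5.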

Classifier - Example

Classifier - Comparison

Classifier - Other classifiers There are multiple other classifiers, including linear discriminant analysis, Gaussian mixture models, neural networks, K-nearest neighbours, support vector machines, etc. Usually they are more effective, but implementing and training them is more complex. Thus I didn't implement them :)

Classifier - Conclusion Decision trees are simple, but sensitive to noise. A linear classifier is a lot more robust and less sensitive to noise, but it's a bit more complex than a decision tree. More advanced classifiers have some advantages, such as being even less sensitive to noise, but they are much more complex. In reasonably simple cases, a linear classifier is usually enough.

Speech presence probability Basically all classifiers output a continuous number, which can be considered to be the Speech Presence Probability (SPP). With a suitable threshold, we can transform the SPP into a VAD decision.

Common problems for VAD The hardest case for VAD is a situation where there are multiple speakers or speech in the background. Then it is very hard to recognize which parts of the speech are really meant for the VAD and which are just noise. White noise is not as hard, but it is still too hard for the simplest models.

My project thus far I've implemented feature extractors, which get the features for each frame. I've implemented a very crude, hard-coded decision tree and also a linear classifier. All the examples in this presentation are produced by my implementation, and personally I'm pretty happy with its performance on the samples I've tested. I still need to test it with more samples, though.

Conclusion The basic algorithm: Extract features from the signal. Use some classifier to get a likelihood of speech from the features. Threshold the output of the classifier to determine whether the signal includes speech or not. VAD's main use is to reduce bandwidth and/or processing power.

Thank you for your time! Any questions? Sources: http://www.intechopen.com/source/pdfs/104/intech-Voice_activity_detection_fundamentals_and_speech_recognition_system_robustness.pdf https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf