Voice Activity Detection. Roope Kiiski

Speech recognition, 4.12.2015

Content. Basics of Voice Activity Detection (VAD); features, classifier and thresholding; an in-depth look at different features; different kinds of classifiers; my project thus far.

The idea of VAD is to detect whether a signal contains speech or not. Discuss in groups for a minute: Why would VAD be used? What are the benefits of VAD?

Basics of VAD. Voice activity detection is essentially a pre-processing algorithm. In speech coding, it is used to reduce the amount of transmitted data by switching off the transmission when there is no speech. In speech recognition, it saves processing power by sending only the parts that contain speech to the recognition engine. It can also be used to detect background noise, which can then be compensated for in the speech signal.

Basics of VAD - Trivial case. The trivial case of voice activity detection is speech with no background noise whatsoever. In that case, we can assume that any activity in the signal is speech. This hardly represents any real-world signal.
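
The trivial case can be sketched as a plain energy threshold per frame. This is only a minimal illustration, not the implementation behind the slides: the names `frame_signal` and `energy_vad`, the frame length and the threshold value are all assumptions made up for the example.

```python
def frame_signal(samples, frame_len):
    """Split a list of samples into consecutive non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def energy_vad(samples, frame_len=4, threshold=1e-6):
    """Label a frame as speech when its mean energy exceeds the threshold.

    With no background noise, any activity at all is taken to be speech,
    so the (assumed) threshold can be very small.
    """
    decisions = []
    for frame in frame_signal(samples, frame_len):
        energy = sum(x * x for x in frame) / len(frame)
        decisions.append(energy > threshold)
    return decisions


# Toy signal: silence, a short burst of activity, silence again.
silence = [0.0] * 8
burst = [0.5, -0.4, 0.3, -0.2]
print(energy_vad(silence + burst + silence, frame_len=4))
# prints [False, False, True, False, False]
```

As soon as noise is added, the silent frames no longer have zero energy, and a threshold this small would label everything as speech.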

Basics of VAD - Example. A trivial case (figure).

Basics of VAD - Example. Let's add some noise (figure).

Basics of VAD - Performance. How do we measure the performance of a VAD? Which is worse, a false positive or a false negative? Is it better to find too much speech or to miss some speech? It depends on the application. For speech coding we want to keep speech quality high, so we want to avoid missing speech: false negatives are bad! For keyword spotting we want to save processing power, so we want to avoid finding too much speech: false positives are bad!
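
One way to put numbers on this trade-off is to count both error types against hand-labelled frames. The sketch below assumes that such reference labels exist; the function name `vad_error_rates` and the toy labels are invented for illustration.

```python
def vad_error_rates(reference, predicted):
    """Return (false_positive_rate, false_negative_rate) over frames.

    A false positive is a noise frame labelled as speech; a false
    negative is a speech frame that was missed.
    """
    fp = sum(1 for r, p in zip(reference, predicted) if p and not r)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    n_noise = sum(1 for r in reference if not r)
    n_speech = sum(1 for r in reference if r)
    fpr = fp / n_noise if n_noise else 0.0
    fnr = fn / n_speech if n_speech else 0.0
    return fpr, fnr


# Toy labels (1 = speech, 0 = noise), made up for the example.
reference = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
fpr, fnr = vad_error_rates(reference, predicted)
print(fpr, fnr)
```

A speech coder would tune the detector for a low false negative rate, a keyword spotter for a low false positive rate.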

Basics of VAD - Hangover. How can we increase the performance of the VAD? When do mistakes happen? The most common mistake is that the VAD misses the end of a word or some quiet part in the middle of it. This can be corrected with hysteresis, by adding a hangover: if any of the previous X frames was speech, then the current frame is labelled as speech too.
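
The hangover rule can be sketched with a simple countdown over the raw per-frame decisions. This is a minimal sketch under the assumption that "previous frames" refers to the raw decisions before hangover; the name `apply_hangover` and the hangover length of 2 are chosen only for the example.

```python
def apply_hangover(raw_decisions, hangover):
    """Extend each speech decision over the next `hangover` frames.

    A frame is labelled speech if it was detected as speech itself, or
    if any of the previous `hangover` raw decisions was speech.
    """
    smoothed = []
    countdown = 0
    for is_speech in raw_decisions:
        if is_speech:
            countdown = hangover   # restart the hangover period
            smoothed.append(True)
        elif countdown > 0:
            countdown -= 1         # still inside the hangover period
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


raw = [False, True, False, False, False, True, False]
print(apply_hangover(raw, hangover=2))
# prints [False, True, True, True, False, True, True]
```

A longer hangover catches more word endings but also labels more trailing noise as speech, so it trades false negatives for false positives.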

Basics of VAD - Example. Hangover (figure).

Basics of VAD - Features. The previous examples were quite simple, and the VAD still failed sometimes. How can we make VAD more robust and accurate? By adding more features! There are various characteristics that differentiate speech from noise. We need to find out what they are and then use them. Measures of these characteristics are called features.

Features - Different features. There are plenty of different features, but they all try to indicate whether the signal is speech or not. Signal energy is a good feature, as seen previously. It is also known that speech has its energy mainly at low frequencies. The zero-crossing rate can estimate this, as high-frequency signals have more zero-crossings than low-frequency signals. Speech can be modelled by linear prediction, so the linear prediction error indicates whether the signal is speech or not.
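
Two of these features, energy and zero-crossing rate, are easy to sketch for a single frame. The function names and the test tones (100 Hz and 3000 Hz sines at an assumed 8 kHz sampling rate) are illustrative choices, not taken from the project.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def sine(freq_hz, sample_rate, n_samples):
    """Generate a test tone (helper for the example only)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


low = sine(100, 8000, 400)    # low-frequency tone
high = sine(3000, 8000, 400)  # high-frequency tone
print(zero_crossing_rate(low), zero_crossing_rate(high))
```

The high-frequency tone crosses zero far more often than the low-frequency one, which is why the zero-crossing rate works as a cheap indicator of where the energy lies.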

Features - Different features. Voiced speech also has a pitch, which can be estimated and used as a feature. These features usually change over time, often quite rapidly. Thus the rate of change of the features can be used to gain information about the signal. Even the second difference can contain some information!
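
The rate of change can be approximated by the difference between consecutive frames, and the second difference by applying the same step twice. This is a minimal sketch; the name `delta` and the convention that the first frame gets a delta of zero are assumptions made for the example.

```python
def delta(track):
    """First difference of a per-frame feature track.

    The first frame has no predecessor, so its delta is set to 0.0
    (an assumed convention) to keep the output the same length.
    """
    if not track:
        return []
    return [0.0] + [b - a for a, b in zip(track, track[1:])]


# Toy feature track (for example, energy per frame), made up for the example.
energy_track = [0, 1, 4, 9, 16]
first_diff = delta(energy_track)
second_diff = delta(first_diff)
print(first_diff)
print(second_diff)
```

The original feature, its delta and its second difference can then all be handed to the classifier as separate features for each frame.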

Features - Example

Classifier. Now we have plenty of data! What do we do with it? We implement a classifier. A classifier is a system that takes all the features and outputs a decision for each frame, based on the features of that frame. It can be implemented in various ways: decision trees, linear classifiers, neural networks, Gaussian mixture models, etc.

Classifier - Decision tree. Decision trees are simple to implement. They are hard-coded and thus not very flexible. Overall they perform poorly; they are adequate only when the system is low-complexity and low-noise, and when accuracy isn't too important.
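
A hard-coded decision tree over two features might look like the sketch below. The choice of features, the order of the tests and both threshold values are hypothetical and hand-picked for illustration; they are not the tree used in the project.

```python
def decision_tree_vad(energy, zcr, energy_thr=0.01, zcr_thr=0.3):
    """Hard-coded two-level decision tree for one frame.

    Both thresholds are assumed values that would have to be tuned
    by hand for each recording condition.
    """
    if energy < energy_thr:
        return False  # too quiet to be speech
    if zcr > zcr_thr:
        return False  # loud, but dominated by high frequencies
    return True       # loud and mostly low-frequency: call it speech


print(decision_tree_vad(energy=0.001, zcr=0.1))  # quiet frame
print(decision_tree_vad(energy=0.1, zcr=0.1))    # loud, low-frequency frame
print(decision_tree_vad(energy=0.1, zcr=0.6))    # loud, noisy frame
```

Every threshold is a manual choice, which is what makes this approach inflexible once the noise level changes.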

Classifier - Linear classifier. Instead of manually tuning the decisions, we make an estimate based on statistics and observed data. The decision is based on a weighted sum of the features. The weights can be calculated when we know the feature values for each frame and the desired result for each frame. In short, skipping all the math, the weights can be calculated from $w = (XX^T)^{-1} X y^T = (X^T)^{+} y^T$, where $X$ is the feature matrix (one column per frame), $(\cdot)^{+}$ denotes the Moore-Penrose pseudo-inverse, $y$ is the row vector of desired outputs and $w$ is the vector of weights.
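
The least-squares weights can be sketched with NumPy's `np.linalg.pinv`. The sketch assumes the feature matrix has one row per feature and one column per frame, adds a constant bias row, and uses made-up toy numbers; the function names are hypothetical.

```python
import numpy as np


def train_linear_vad(X, y):
    """Least-squares weights for a linear VAD.

    X has shape (n_features, n_frames), one column per frame, and y has
    shape (n_frames,). The result equals (X X^T)^{-1} X y^T when X X^T
    is invertible; the pseudo-inverse is applied to X^T.
    """
    return np.linalg.pinv(X.T) @ y


def linear_vad_output(w, X):
    """Weighted sum of the features for every frame."""
    return w @ X


# Toy training data (made up): energy, zero-crossing rate, constant bias.
X = np.array([
    [0.90, 0.80, 0.70, 0.05, 0.10, 0.02],
    [0.10, 0.15, 0.20, 0.60, 0.50, 0.70],
    [1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # 1 = speech, 0 = noise

w = train_linear_vad(X, y)
scores = linear_vad_output(w, X)
decisions = scores > 0.5
print(decisions)
```

The weights are learned from labelled frames instead of being tuned by hand, so retraining on new data is enough to adapt the detector.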

Classifier - Example

Classifier - Comparison

Classifier - Other classifiers. There are many other classifiers, including linear discriminant analysis, Gaussian mixture models, neural networks, k-nearest neighbours and support vector machines. They are usually more effective, but their implementation and training are more complex. Thus I didn't implement them :)

Classifier - Conclusion. Decision trees are simple but sensitive to noise. A linear classifier is a lot more robust and less sensitive to noise, but it's a bit more complex than a decision tree. More advanced classifiers have some advantages, such as being even less sensitive to noise, but they are much more complex. In reasonably simple cases, a linear classifier is usually enough.

Speech presence probability. Essentially all classifiers output a continuous number, which can be interpreted as the speech presence probability (SPP). With a suitable threshold, we can transform the SPP into a VAD decision.
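
Thresholding is a one-line operation, but the choice of threshold shifts the balance between missed speech and false alarms. The sketch below uses invented SPP values and a hypothetical function name `spp_to_vad`.

```python
def spp_to_vad(spp, threshold=0.5):
    """Turn per-frame speech presence values into binary decisions."""
    return [p >= threshold for p in spp]


# Toy classifier outputs, made up for the example.
spp = [0.1, 0.4, 0.6, 0.9]
print(spp_to_vad(spp, threshold=0.5))  # prints [False, False, True, True]
print(spp_to_vad(spp, threshold=0.3))  # prints [False, True, True, True]
```

Lowering the threshold finds more speech, which suits speech coding; raising it finds less, which suits applications where false positives are the costly error.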

Common problems for VAD. The hardest case for VAD is when there are multiple speakers or speech in the background. Then it is very hard to tell which parts of the speech are meant to be detected and which are just noise. White noise is not as hard, but it is still too hard for the simplest models.

My project thus far. I've implemented feature extractors, which compute the features for each frame. I've implemented a very crude, hard-coded decision tree and also a linear classifier. All the examples in this presentation were produced by my implementation, and personally I'm pretty happy with its performance on the samples I've tested. I still need to test it with more samples, though.

Conclusion. The basic algorithm: extract features from the signal; use a classifier to get a likelihood of speech from the features; threshold the classifier output to decide whether the signal contains speech or not. The main use of VAD is to reduce bandwidth and/or processing power.

Thank you for your time! Any questions? Sources: http://www.intechopen.com/source/pdfs/104/intech-Voice_activity_detection_fundamentals_and_speech_recognition_system_robustness.pdf https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf