Precision Scaling of Neural Networks for Efficient Audio Processing

Jong Hwan Ko
School of Electrical and Computer Engineering, Georgia Institute of Technology
jonghwan.ko@gatech.edu

Josh Fromm
Department of Electrical Engineering, University of Washington
jwfromm@uw.edu

Matthai Philipose, Ivan Tashev, and Shuayb Zarar
Microsoft Research
{matthaip, ivantash, shuayb}@microsoft.com

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

While deep neural networks have shown powerful performance in many audio applications, their large computation and memory demand has been a challenge for real-time processing. In this paper, we study the impact of scaling the precision of neural networks on the performance of two common audio processing tasks, namely, voice-activity detection and single-channel speech enhancement. We determine the optimal pair of weight/neuron bit precision by exploring its impact on both the performance and the processing time. Through experiments conducted with real user data, we demonstrate that deep neural networks that use lower bit precision significantly reduce the processing time (up to 30x). However, their performance impact is low (< 3.%) only in the case of classification tasks such as those present in voice activity detection.

1 Introduction

Voice activity detection (VAD) and speech enhancement are critical front-end components of audio processing systems, as they enable the rest of the system to process only the speech segments of the input audio, with improved quality []. With the rapid development of deep-learning technologies, VAD and speech-enhancement approaches based on deep neural networks (DNNs) have shown performance highly competitive with conventional methods [2, 3, 4]. However, DNNs are inherently complex, with high computation and memory demand [5], which is a critical challenge for real-time speech applications. For example, even a simple 3-layer DNN for speech enhancement requires millions of operations per frame and 56 MB of memory, as shown in Table 1.

Table 1: Computation/memory demand and performance of DNNs with baseline and reduced bit widths. For each entry, the columns are MOPs/frame, memory (MB), processing time per frame (ms), and performance, where performance is the VAD error for voice activity detection and the SNR improvement for speech enhancement. Processing times were measured with the CNTK framework [6] on an Intel CPU.

Voice activity detection, 32/32 []: 3 6.0 3.0%
Voice activity detection, reduced bit width [this work]: 0. ( ) 0.9 (3 ) .6 (30x) .3%
Speech enhancement, 32/32 [3]: 56 ,.0
Speech enhancement, reduced bit width [this work]: .3 ( ) .75 (3 ) 3.9 (30x) 0.33
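To make the computation and memory figures in Table 1 concrete, the following sketch estimates the per-frame operation count and float32 weight storage of a fully connected network from its layer widths. The layer sizes in the example are illustrative placeholders, not the exact architectures used in this paper.

def dnn_footprint(layer_sizes, bytes_per_weight=4):
    """Estimate per-frame operations and weight memory of a fully connected DNN,
    given its layer widths (input, hidden..., output)."""
    macs = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    ops_per_frame = 2 * macs                 # one multiply + one accumulate per weight
    weight_bytes = macs * bytes_per_weight   # float32 weights; biases ignored for simplicity
    return ops_per_frame, weight_bytes

# Hypothetical 3-hidden-layer enhancement network:
# 7 input frames x 257 frequency bins -> 3 x 2048 hidden units -> 257 outputs.
ops, mem = dnn_footprint([7 * 257, 2048, 2048, 2048, 257])
print(f"{ops / 1e6:.1f} MOPs/frame, {mem / 2**20:.1f} MB of float32 weights")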

A recently proposed method for reducing the computation and memory demand is precision scaling, a technique that represents the weights and/or neurons of the network with a reduced number of bits [7]. While several studies have shown effective application of binarized (1-bit) networks to image classification tasks [8, 9], to the best of our knowledge no work has analyzed the effect of different weight/neuron bit-width pairs on the processing time and the performance of audio processing tasks such as VAD and single-channel speech enhancement.

In this paper, we present the design of efficient deep neural networks for VAD and speech enhancement that scale the precision of the data representation within the network. To minimize the bit-quantization error, we use a bit-allocation scheme based on the global distribution of the weight/neuron values. The optimal pair of weight/neuron bit precision is determined by exploring the impact of the bit widths on both the performance and the processing time. Our best results show that the reduced-precision DNN for VAD cuts the processing time by 30x, providing 3.7x lower processing time and a 9.5% lower error rate than the state-of-the-art WebRTC VAD [10]. For speech enhancement, the reduced-precision DNN improves SNR (signal-to-noise ratio) by 0.33 dB with a 30x smaller processing time.

2 Precision Scaling of Deep Neural Networks

While rounding is commonly used for precision scaling [11], it can result in a large quantization error because it does not consider the global distribution of the values. In this work, we use a precision scaling method based on residual error mean binarization [12], in which each bit assignment is associated with a corresponding approximate value determined by the distribution of the original values. As illustrated in Figure 1(a), the first representation bit is assigned deterministically based on the sign of each value, and the approximate value for each bit assignment is computed by adding or subtracting the average distance from the reference value (0 for the first bit assignment). Each approximate value then becomes the reference of its bit segment in the next bit-assignment step. This approach allocates the same number of values to each bit-assignment bin to minimize the quantization error.

We estimate the ideal inference speedup due to reduced bit precision by counting the number of operations in each bit-precision case [see Figure 1(b)]. In a regular 32-bit network, we need two operations (a 32-bit multiplication and an accumulation) per pair of input feature and weight element. When the network has 1-bit neurons and weights, the multiplication can be replaced with XNOR and bit-count operations, which can be performed on 64 elements per cycle. When the network has 2 or more bits for neurons and weights, the three operations must be performed for every combination of the bits. Therefore, the ideal speedup is

    Speedup = max(1, (2 × 64) / (3 × weight bit width × neuron bit width)).

We have implemented our precision-scaling methodology within the CNTK framework [6], which provides optimized CPU implementations of variable-bit-precision DNN layers. Figure 2 shows the ideal speedup and the actual speedup measured on an Intel processor. The measured speedup is similar to, or even higher than, the ideal values because of the benefit of loading the low-precision weights: the bottleneck of the CNTK matrix multiplication is memory access.
Figure 1: The distribution-sensitive approach to extreme precision scaling (binarization) of neural networks: (a) an example bit allocation and (b) the operations of a 32-bit (top) and a 1-bit (bottom) network.
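As a concrete illustration, here is a minimal numpy sketch of one reading of the residual-error-mean bit allocation described above: at each bit level, every bin of values is split by the sign of its residual, and the bin's reference moves by the mean absolute residual of that half. The function name and the per-bin averaging are our assumptions, not the exact CNTK implementation of [12].

import numpy as np

def residual_mean_quantize(x, num_bits):
    """Distribution-aware quantization: per bit level, split each bin by the sign
    of the residual and shift the bin's reference by the mean absolute residual."""
    refs = np.zeros_like(x, dtype=np.float64)      # per-element approximate value
    for _ in range(num_bits):
        residual = x - refs
        step = np.zeros_like(refs)
        # Values currently sharing a reference form one bin; update each bin separately.
        for r in np.unique(refs):
            in_bin = refs == r
            step[in_bin] = np.abs(residual[in_bin]).mean()
        refs = refs + np.sign(residual) * step     # add or subtract the mean distance
    return refs                                    # quantized (approximate) values

weights = np.random.randn(1000)
for bits in (1, 2, 4):
    q = residual_mean_quantize(weights, bits)
    print(bits, "bit(s): mean squared quantization error", np.mean((weights - q) ** 2))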

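The ideal speedups plotted in Figure 2(a) can be tabulated with a sketch like the following. It assumes, following the description above, two operations per weight element for the 32-bit baseline, 64 elements per XNOR/popcount word, and three operations per weight-bit/neuron-bit combination; these constants are our reading of the text, not measured values.

def ideal_speedup(weight_bits, neuron_bits, word_size=64):
    """Ideal speedup over a 32-bit baseline: 2 ops per element at 32 bits versus
    3 ops (XNOR, popcount, accumulate) per word of `word_size` elements for every
    weight-bit x neuron-bit combination."""
    return max(1.0, 2 * word_size / (3 * weight_bits * neuron_bits))

for w in (1, 2, 4, 8):
    row = [f"{ideal_speedup(w, n):5.1f}" for n in (1, 2, 4, 8)]
    print(f"W{w}: " + " ".join(row))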
Figure 2: Speedup due to reduced bit precision of neurons and weights: (a) ideal and (b) measured speedup. Blue bars indicate speedup > 1 and gray bars indicate speedup = 1.

Figure 2 also indicates that reducing weight bits leads to a higher speedup than reducing neuron bits, since the weights can be pre-quantized, which makes their memory loads very efficient.

3 Experimental Framework

Dataset: We created 750/50/50 files of training/validation/test data by convolving clean speech with room impulse responses and adding pre-recorded noise at different SNRs and distances from the microphone. Each clean-speech file included sample utterances collected from voice queries to the Microsoft Cortana Voice Assistant. Further, our noise files contained 5 types of real-world recordings.

VAD: As shown in Figure 3(a), we used noisy-speech spectrogram windows of 6 ms with 50% overlap and a Hann smoothing filter, along with the corresponding ground-truth labels, for DNN training and inference. Our baseline DNN had three hidden layers and a 7-frame input window, as in []. The network was trained to minimize the squared error between the ground-truth and predicted labels. The noisy spectrograms from the test dataset were then used to generate predicted labels, which were compared with the ground-truth labels to compute the performance metrics.

Speech enhancement: The framework in this case was similar to the one for VAD, except that clean-speech spectrograms were used as the training target instead of the ground-truth activity labels. We used the baseline DNN model with three hidden layers presented in [3]. After inference, the denoised speech from the output layer was used to compute the performance metrics listed in Figure 3(b). Due to space limitations, and because they are good proxies for speech quality, we discuss only the SNR and PESQ [13] metrics in this paper.

Figure 3: Experimental framework for (a) VAD and (b) speech enhancement.
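To illustrate the front end of Section 3, the following sketch computes Hann-windowed magnitude-spectrogram frames with 50% overlap and stacks a 7-frame context window as one DNN input. The frame length, FFT size, and sampling rate here are illustrative assumptions, not the exact settings used in the experiments.

import numpy as np

def spectrogram_frames(signal, frame_len=512, hop=256):
    """Magnitude spectrogram with a Hann window and 50% overlap (hop = frame_len / 2)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))     # shape: (n_frames, frame_len // 2 + 1)

def stack_context(spec, context=7):
    """Concatenate `context` consecutive frames as the input for one DNN prediction."""
    return np.stack([spec[i : i + context].ravel()
                     for i in range(len(spec) - context + 1)])

# Toy usage on one second of 16 kHz noise (hypothetical parameters).
x = np.random.randn(16000)
features = stack_context(spectrogram_frames(x))
print(features.shape)    # (n_examples, 7 * 257)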

4 Experimental Results

VAD: Figure 4(a) indicates that the detection accuracy of the DNN is more sensitive to neuron-bit reduction than to weight-bit reduction. Note that even the DNN with 1-bit weights and neurons provides a lower detection error than non-DNN methods such as the classic VAD [] and the WebRTC VAD [10]. To choose the optimal pair of weight/neuron bit precision in terms of both detection accuracy and processing time, we introduce a metric that combines the normalized speedup with the normalized VAD frame error. Figure 4(b) shows the resulting optimal pair of weight/neuron bit precision. Reducing the bit width to this pair cuts the per-sample processing time by roughly 30x, with only a slight increase in the error rate. This DNN outperforms the WebRTC VAD, with 3.7x lower processing time and a 9.5% lower error rate.

Figure 4: VAD performance of the DNN with different pairs of weight/neuron bit precision: (a) frame-level binary detection error (test set with unseen noise) and (b) normalized speedup / normalized VAD frame error. A red bar indicates the optimal pair of bit precision.

Speech enhancement: As Figure 5(a) shows, SNR is improved for all bit-width pairs except those with 1-bit neurons. A reduced-precision pair is optimal when both inference speedup and SNR improvement are considered. However, Figure 5(b) shows that low-bit-precision DNNs do not improve PESQ; the most efficient model that still improves PESQ achieves a 9x speedup. This is mainly because of the limited capability of the baseline DNN model, which improves PESQ by only 0.3. The result also indicates that low-precision values (especially low neuron bit widths) are not well suited to estimation or regression tasks such as speech enhancement.

Figure 5: Speech-enhancement performance of the DNN with different precision: (a) SNR (dB) and (b) PESQ, as functions of the number of weight bits.

5 Conclusions

In this paper, we presented a methodology for efficiently scaling the precision of neural networks for two common audio processing tasks. Through a careful design-space exploration, we demonstrated that a DNN model with the optimal bit-precision pair reduces the processing time by 30x with only a slight increase in the error rate. Even at these modest precision-scaling levels, it outperforms the state-of-the-art WebRTC VAD, with 3.7x lower processing time and a 9.5% lower error rate. The low-bit-precision DNN also enhances the quality of noisy speech, but the precision could not be reduced as much for speech enhancement. Our results indicate that precision scaling of DNNs may be better suited to classification or detection tasks such as VAD than to estimation or regression tasks such as speech enhancement. To validate this hypothesis, we intend to further explore the scaling of neural-network bit precision for other classification tasks, such as source separation and microphone beamforming, and estimation tasks, such as acoustic echo cancellation.
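For reference, the two headline metrics above can be computed from frame-level outputs as follows: the VAD frame error is the fraction of misclassified frames, and the SNR improvement is the gain of the enhanced signal over the unprocessed noisy input, measured against the clean reference. This is a generic formulation of the metrics, not the exact evaluation code used in the experiments.

import numpy as np

def vad_frame_error(pred, truth, threshold=0.5):
    """Fraction of frames whose thresholded speech/non-speech decision is wrong."""
    return np.mean((pred >= threshold) != (truth >= threshold))

def snr_db(clean, estimate):
    """SNR of `estimate` against the clean reference, in dB."""
    noise = estimate - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def snr_improvement(clean, noisy, enhanced):
    """How much the enhancer raises SNR relative to the unprocessed noisy signal."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

# Toy usage: averaging the noisy signal with the clean one halves the noise amplitude,
# which should yield roughly a 6 dB SNR improvement.
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * np.random.randn(t.size)
print(snr_improvement(clean, noisy, 0.5 * clean + 0.5 * noisy))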

References

[1] Xiao-Lei Zhang and DeLiang Wang. Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[2] Ivan Tashev and Seyedmahdad Mirsamadi. DNN-based Causal Voice Activity Detector. In Information Theory and Applications Workshop, 2016.
[3] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. An Experimental Study on Speech Enhancement Based on Deep Neural Networks. IEEE Signal Processing Letters, 21(1):65–68, 2014.
[4] Xiao-Lei Zhang and Ji Wu. Deep Belief Networks Based Voice Activity Detection. IEEE Transactions on Audio, Speech, and Language Processing, 2013.
[5] J. H. Ko, D. Kim, T. Na, J. Kung, and S. Mukhopadhyay. Adaptive Weight Compression for Memory-Efficient Neural Networks. In Design, Automation and Test in Europe Conference and Exhibition (DATE), March 2017.
[6] A. Agarwal et al. An Introduction to Computational Networks and the Computational Network Toolkit. Microsoft Technical Report MSR-TR-2014-112, 2014.
[7] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research, 2018.
[8] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized Convolutional Neural Networks for Mobile Devices. arXiv, 2016.
[9] T. Na and S. Mukhopadhyay. Speeding Up Convolutional Neural Network Training with Dynamic Precision Scaling and Flexible Multiplier-Accumulator. In ISLPED, 2016.
[10] WebRTC VAD. https://webrtc.org/.
[11] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, 2015.
[12] Wei Tang, Gang Hua, and Liang Wang. How to Train a Compact Binary Neural Network with High Accuracy? In AAAI, 2017.
[13] ITU-T Recommendation P.862. Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. International Telecommunication Union, Telecommunication Standardization Sector, 2001.
[14] Ivan Tashev, Andrew Lovitt, and Alex Acero. Unified Framework for Single Channel Speech Enhancement. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), 2009.