Utilizing Domain Knowledge in End-to-End Audio Processing

arXiv:1712.00254v1 [cs.SD] 1 Dec 2017

Tycho Max Sylvester Tax
Corti, Copenhagen, Denmark
tt@cortilabs.com

Hendrik Purwins
Audio Analysis Lab, Aalborg University Copenhagen
hpu@create.aau.dk

Jose Luis Diez Antich
Audio Analysis Lab, Aalborg University Copenhagen
jl.diez.antich@gmail.com

Lars Maaløe
Corti, Copenhagen, Denmark
Technical University of Denmark
lm@cortilabs.com

Abstract

End-to-end neural network based approaches to audio modelling are generally outperformed by models trained on high-level data representations. In this paper we present preliminary work that shows the feasibility of training the first layers of a deep convolutional neural network (CNN) model to learn the commonly used log-scaled mel-spectrogram transformation. Secondly, we demonstrate that upon initializing the first layers of an end-to-end CNN classifier with the learned transformation, convergence and performance on the ESC-50 environmental sound classification dataset are similar to a CNN-based model trained on the highly pre-processed log-scaled mel-spectrogram features.

1 Introduction

End-to-end neural network models on image recognition tasks outperform other machine learning approaches by a large margin [11], but similarly good results are not seen in the audio domain. Modeling audio is particularly challenging because of long-range temporal dependencies [13] and variations of the same sound due to temporal distortions and phase shifts [5]. Various papers within automatic speech recognition (ASR), audio classification, and speech synthesis attempt to model audio from raw waveform [3][13][8][2]. Combinations of autoregressive models and dilated convolutions have shown significant improvements over previous results [13]. Still, on tasks such as ASR and environmental sound classification, using traditional transformations, such as (log-scaled mel-)spectrograms or MFCCs, generally leads to superior performance.
Modern neural network models are initialized using network-architecture-dependent randomized schemes [7][6]. In this paper we present preliminary work on initializing a deep neural network for audio classification by explicitly leveraging domain knowledge instead. To do so, we first show that it is possible to train the first layers of a deep neural network model, using unlabelled data, to learn a high-level audio representation. Secondly, we show that upon initializing the first layers of an end-to-end environmental sound classifier with the learned transformation, and keeping the associated parameters fixed during training, convergence and performance are similar to those of a model trained on the high-level representation. This opens up the possibility of training end-to-end neural network models on raw waveform in contexts where only a limited amount of labeled data is available.

Finally, we discuss several future directions. It will be particularly interesting to see whether fine-tuning the model after convergence, by unfreezing the parameters of the first layers, can allow models trained on raw waveform to surpass those trained on processed features.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Sound classification

2.1 Dataset and baseline results

Several datasets have been generated in recent years to address the previously limited availability of labeled environmental sound data [17][4][15]. Of particular interest is the Environmental Sound Classification (ESC) dataset, which was released by Piczak in 2015 along with reproducible baseline results for several standard machine learning classifiers (k-NNs, SVMs and RF ensembles) [15] as well as for a convolutional neural network model [14], which we denote as the PiczakCNN and use as the baseline for our study.

For our analysis, we use the ESC-50 dataset [15], which contains 2000 five-second audio clips equally divided among 50 categories. The recordings are pre-arranged in five equally sized folds to facilitate cross-validation.

The PiczakCNN consists of two 2-D CNN layers interleaved with max-pooling, followed by two fully connected layers and a softmax output layer. It uses segmented log-scaled mel-spectrograms alongside their deltas as input. The reader is referred to the original paper for more details [14].

In line with Piczak, we increase the effective number of training samples by making four variations of each ESC-50 clip using time-/pitch-shifting. In addition, each of the mel-spectrograms is cut into segments of 101 frames with 50% overlap, and silent segments are discarded. At test time, a label is predicted for each of the segments associated with a clip, and majority or probability voting is used to determine the final predicted class. Using majority voting, the PiczakCNN model achieves 62% accuracy on the test set averaged over the five folds [14].

Since we are interested in an end-to-end approach, we add several 1-D CNN layers to the PiczakCNN architecture so that it accepts raw waveform as input (see footnote 1).
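The segmentation step described above (101-frame segments with 50% overlap, silent segments discarded) can be sketched as follows. This is a minimal illustration, not the authors' or Piczak's code: the hop of 50 frames is one reading of "50% overlap" for an odd segment length, the frame count of 216 for a five-second clip assumes 22050 Hz audio with a hop of 512 samples and centered framing, and `is_silent` is a hypothetical predicate because the text does not state the silence criterion.

```python
def segment_starts(n_frames, seg_len=101, overlap=0.5):
    """Start indices of fixed-length, overlapping segments.

    The hop between segments is seg_len * (1 - overlap), rounded down
    (an assumption; for seg_len=101 and overlap=0.5 this gives 50).
    Trailing frames that do not fill a whole segment are dropped.
    """
    hop = max(1, int(seg_len * (1.0 - overlap)))
    return list(range(0, n_frames - seg_len + 1, hop))


def split_segments(frames, seg_len=101, overlap=0.5, is_silent=None):
    """Cut a sequence of frames into segments and drop the silent ones.

    `is_silent` is a hypothetical callable (segment -> bool); when it is
    None, every segment is kept.
    """
    starts = segment_starts(len(frames), seg_len, overlap)
    segments = [frames[s:s + seg_len] for s in starts]
    if is_silent is not None:
        segments = [seg for seg in segments if not is_silent(seg)]
    return segments


# A 5 s clip at 22050 Hz has 110250 samples; with a hop of 512 samples and
# centered framing (assumed) that is 1 + 110250 // 512 = 216 frames.
print(segment_starts(216))  # -> [0, 50, 100]
```

Under these assumptions each clip yields at most three segments, which is why voting over segments at test time operates on only a handful of predictions per clip.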
The added 1-D CNN layers by themselves form the model that is trained separately to learn the mel-spectrogram transformation, after which the learned weights are used to initialize the end-to-end classifier and are kept frozen for the classification task.

For this study, we first replicate Piczak's results using mel-spectrograms and their deltas to ensure that our set-up performs adequately. As we attempt to learn the mel-spectrogram transformation only, and not also the associated deltas, we benchmark our results on raw waveform against the PiczakCNN performance without deltas, which is 53% on the test set using majority voting, averaged over the five folds.

2.2 Learning the mel-spectrogram transformation

2.2.1 Experimental set-up

After exploring several architectures for the mel-spectrogram transformation model (MSTmodel), we picked the architecture summarized in Table 1. The reason is twofold: firstly, it shows excellent performance on the modelling task, and secondly, its parameters are matched in terms of kernel size and stride to the window and hop size of the short-time Fourier transform (STFT) step in the mel-spectrum calculation. The second and third layers are qualitatively similar to those used for temporal modelling in [2]. SAME padding is used in the MSTmodel so that the output feature maps are of the same spatial dimension as the input feature maps, in order not to lose information at the edges of the signal during the forward pass.

Table 1: Details of the MSTmodel.

Layer type       Number of filters   Filter size   Stride   Activation function   Padding
1D convolution   512                 1024          512      ReLU                  SAME
1D convolution   256                 3             1        ReLU                  SAME
1D convolution   60                  3             1        Tanh                  SAME

The MSTmodel is trained stand-alone, in supervised fashion, by taking raw audio clips as input and their corresponding mel-spectrograms as labels.
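To make the layer dimensions in Table 1 concrete, the sketch below works out the output shape and parameter count of each layer. It is bookkeeping only, under stated assumptions: the input is a single-channel waveform, every layer has a bias term, SAME padding follows the common convention of output length = ceil(input length / stride), and the example segment length of 51200 samples (100 hops of 512) is chosen for illustration rather than taken from the paper.

```python
# (name, input channels, output channels, kernel size, stride), from Table 1.
# The single input channel (mono waveform) is an assumption.
MST_LAYERS = [
    ("conv1", 1, 512, 1024, 512),
    ("conv2", 512, 256, 3, 1),
    ("conv3", 256, 60, 3, 1),
]


def same_output_length(n_in, stride):
    """Output length of a 1-D convolution with SAME padding: ceil(n_in / stride)."""
    return (n_in + stride - 1) // stride


def conv1d_param_count(in_ch, out_ch, kernel, bias=True):
    """Number of weights (plus optional biases) in a 1-D convolution layer."""
    return in_ch * out_ch * kernel + (out_ch if bias else 0)


def describe(n_samples):
    """Return (name, output length, output channels, parameters) per layer."""
    rows = []
    length = n_samples
    for name, in_ch, out_ch, kernel, stride in MST_LAYERS:
        length = same_output_length(length, stride)
        rows.append((name, length, out_ch, conv1d_param_count(in_ch, out_ch, kernel)))
    return rows


for row in describe(51200):  # hypothetical segment of 100 * 512 samples
    print(row)
# -> ('conv1', 100, 512, 524800)
# -> ('conv2', 100, 256, 393472)
# -> ('conv3', 100, 60, 46140)
```

The first layer mirrors the STFT framing (kernel 1024, stride 512), so the time axis is reduced once and the remaining two layers only reshape the channel dimension down to the 60 mel bands.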
The raw waveform recordings are re-sampled at 22050 Hz from their original 44.1 kHz, and their amplitudes are normalized per segment (each segment corresponding to a single mel-spectrogram of 101 frames) to lie between -1 and 1 through division by the maximum absolute value of the segment. A target mel-spectrogram is generated using the Librosa package [12] by applying a mel-filterbank to the magnitude spectrum of each segment (window size = 1024, hop size = 512, and mel bands = 60), and then taking its logarithm. The mel-spectrogram values are normalized and re-scaled between -1 and 1 using training-set statistics.

To train the MSTmodel we use a simple mean squared error (MSE) loss between the predicted representations and the target mel-spectrograms, where the last frame of each label is sliced off to match dimensions (see footnote 2). We use Adam [10] with a constant learning rate (3e-4) and a batch size of 100 for optimization, and prevent overfitting through early stopping based on the performance on the pre-allocated validation set. Separate models are trained for each pre-assigned fold to ensure that, upon initializing the end-to-end classifier, it has not previously seen test or validation data.

2.2.2 Results and discussion

Figure 1 shows the learned mel-spectrogram (right) and its target (left) for a randomly selected example from the test set. The prediction is visually similar to the target, although the spectrum seems slightly smoothed. We have verified that the learned transformation is not domain dependent by comparing the model predictions with their targets on pure tones, speech, and music.

Figure 1: Comparison of a mel-spectrogram (left) and its prediction (right). The signal is randomly chosen from the test set.

In Figure 2, we plot some of the learned filters of the first layer of the MSTmodel. It can be seen that our network learns frequency decompositions such as wavelets and band-pass filters that are qualitatively similar to those reported in previous studies [18][1][8]. It is interesting that our network discovers the same representations despite being trained on a different task.

Figure 2: Subset of the filters learned by the first MSTmodel layer.

Figure 3: Test accuracy averaged over the folds for: (a) baseline PiczakCNN (without deltas), (b) raw-waveform network with random initialization (Xavier), and (c) raw-waveform network initialized with the spectrogram transformation. The plot, titled "Average accuracy on the ESC-50 test set", shows accuracy on the test set (0.0 to 0.5) against epoch number (0 to 200).

Footnote 1: Source code for the project will be made available at: https://github.com/corticph/mstmodel.

Footnote 2: Alternatively, one could pad the input data, but the current approach led to sufficiently accurate mel-spectrograms for our purposes.
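The preprocessing and loss from Section 2.2.1, including the frame mismatch that footnote 2 refers to, can be illustrated with a small pure-Python sketch. The details are assumptions rather than the authors' implementation: the target frame count assumes centered STFT framing (1 + n_samples // hop), the re-scaling is taken to be min-max scaling with training-set minimum and maximum, and spectrograms are represented as lists of frames, each a list of mel-band values.

```python
def peak_normalize(segment):
    """Scale a waveform segment into [-1, 1] by its maximum absolute value."""
    peak = max(abs(x) for x in segment)
    if peak == 0:
        return list(segment)  # all-zero segment: nothing to scale
    return [x / peak for x in segment]


def rescale_minus1_1(values, lo, hi):
    """Min-max rescale to [-1, 1]; lo and hi are training-set statistics (assumed)."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def centered_frame_count(n_samples, hop):
    """Number of frames of a centered STFT (assumed framing of the targets)."""
    return 1 + n_samples // hop


def mse_drop_last_frame(pred, target):
    """MSE between a prediction with T frames and a target with T + 1 frames.

    The last target frame is sliced off so that both have T frames.
    """
    trimmed = target[:-1]
    if len(trimmed) != len(pred):
        raise ValueError("frame counts do not match after trimming")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, trimmed):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


print(peak_normalize([0.5, -2.0, 1.0]))   # -> [0.25, -1.0, 0.5]
print(centered_frame_count(51200, 512))   # -> 101
```

With a hypothetical segment of 51200 samples, the target has 101 frames while a stride-512 convolution with SAME padding produces 100, which is consistent with slicing off one target frame instead of padding the input.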

2.3 End-to-end environmental sound classification

2.3.1 Experimental set-up

We assess the success of learning the mel-spectrogram transformation by performance on the ESC-50 dataset. To do so, we train three models, where in each case we adhere to the pre-defined cross-validation structure for the ESC-50 dataset [14]:

1. the baseline PiczakCNN model on mel-spectrograms without deltas,
2. the PiczakCNN model with the three-layer MSTmodel architecture added, randomly initialized using the Xavier scheme [6], and
3. the same model as in 2, but with the pre-trained MSTmodel layers.

In the second classifier, dropout layers (keep probability = 0.5) are added to the MSTmodel layers after each of the non-linearities to prevent overfitting. For the third classifier, we keep the MSTmodel parameters frozen to the learned mel-spectrogram transformation during training of the deeper layers.

The training scheme used is largely the same as the one proposed by Piczak [14], with minor adaptations to the hyperparameter choices and normalization: after normalization, the mel-spectrogram values are re-scaled between -1 and 1 based on the minimum and maximum of the training set. We use a cross-entropy loss function and stochastic gradient descent with Nesterov momentum (0.9), a batch size of 500, and a learning rate of 5e-3, and we train the models for 200 epochs. The originally proposed L2 weight regularization is not used in the final experiments since it did not improve performance. At test time, majority voting is used to determine the class of the test sample based on all its associated overlapping segments.

2.3.2 Results and discussion

The performance of the three models outlined above is presented in Figure 3 on the test set averaged over the five folds.
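The evaluation protocol behind these numbers (a vote over each clip's segments, then accuracy averaged over the folds) could be sketched as below. This is an illustrative sketch only: the function names and data layout are hypothetical, probability voting is interpreted as summing per-segment class probabilities, and ties in the majority vote are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def majority_vote(segment_labels):
    """Most frequent predicted label among a clip's segments."""
    return Counter(segment_labels).most_common(1)[0][0]


def probability_vote(segment_probs):
    """Class with the highest summed probability over a clip's segments."""
    n_classes = len(segment_probs[0])
    totals = [sum(probs[c] for probs in segment_probs) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


def mean_fold_accuracy(folds):
    """Clip-level accuracy per fold (majority voting), averaged over folds.

    `folds` is a list of folds; each fold is a list of
    (true_label, predicted_segment_labels) pairs, one pair per test clip.
    """
    accuracies = []
    for clips in folds:
        correct = sum(majority_vote(labels) == true for true, labels in clips)
        accuracies.append(correct / len(clips))
    return sum(accuracies) / len(accuracies)


# The two voting schemes can disagree when one segment is very confident.
probs = [[0.6, 0.4], [0.6, 0.4], [0.0, 1.0]]
hard_labels = [max(range(2), key=p.__getitem__) for p in probs]
print(majority_vote(hard_labels), probability_vote(probs))  # -> 0 1
```

Averaging per-fold accuracies, rather than pooling all clips, matches the "averaged over the five folds" wording; with equally sized folds, as in ESC-50, the two give the same value.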
We see that initializing the weights of the first layers with the learned mel-spectrogram transformation, and keeping them fixed throughout training, results in better convergence and increases performance on raw waveform approximately to mel-spectrogram levels.

When using neural networks for audio modelling, architectural choices are sometimes made that appear inspired by deterministic feature extraction methods. For instance, max-pooling is often used to perform a summarizing process comparable to the mel-filter bank operating on a mel-spectrogram. In addition, log-layers are sometimes used to compress the learned internal representation [16][8]. CNNs are highly flexible models that can approximate complex mappings. By leveraging this capability in our approach, and forcing the network to learn such transformations implicitly, we limit the need for ad-hoc architectural choices.

3 Conclusion and future directions

This proof-of-concept study shows that 1) through supervised training, a simple CNN architecture can learn the log-scaled mel-spectrogram transformation from raw waveform, and 2) initializing an end-to-end neural network classifier with the learned transformation yields performance comparable to a model trained on the highly processed mel-spectrograms. These findings show that incorporating knowledge from established audio signal processing methods can improve the performance of neural network based approaches on audio modeling tasks.

This preliminary work opens up the possibility for a myriad of follow-up experiments. Most notably, it will be interesting to fine-tune the previously fixed parameters of the first layers of the classifier to determine whether different representations are learned and whether they are more informative than mel-spectrograms. If so, the robustness of these representations can be further increased through an abundance of available unlabelled audio data. This parallels work by Jaitly and Hinton, who use generative models to leverage unlabeled data to learn robust features [9].

Acknowledgments

The authors would like to thank Karol Piczak for making the code to reproduce his baseline results publicly available, and for answering several questions relating to the ESC dataset. In addition, we would like to thank the Corti team, Lasse Borgholt and Alexander Wahl-Rasmussen in particular, for insightful feedback and for proofreading the manuscript.

References

[1] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892-900, 2016.
[2] R. Collobert, C. Puhrsch, and G. Synnaeve. Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.
[3] S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6964-6968, 2014.
[4] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[5] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur. Acoustic modelling from the signal domain using CNNs. In INTERSPEECH, pages 3434-3438, 2016.
[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 249-256, 2010.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[8] Y. Hoshen, R. J. Weiss, and K. W. Wilson. Speech acoustic modeling from raw multichannel waveforms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4624-4628, 2015.
[9] N. Jaitly and G. Hinton. Learning a better representation of speech soundwaves using restricted Boltzmann machines. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884-5887, 2011.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[12] B. McFee, M. McVicar, O. Nieto, S. Balke, C. Thomé, D. Liang, E. Battenberg, J. Moore, R. Bittner, R. Yamamoto, D. Ellis, F.-R. Stöter, D. Repetto, S. Waloschek, C. Carr, S. Kranzler, K. Choi, P. Viktorin, J. F. Santos, A. Holovaty, W. Pimenta, and H. Lee. librosa 0.5.0, 2017.
[13] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[14] K. J. Piczak. Environmental sound classification with convolutional neural networks. In IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1-6, 2015.
[15] K. J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1015-1018, 2015.
[16] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[17] J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 1041-1044, 2014.
[18] Z. Tüske, P. Golik, R. Schlüter, and H. Ney. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.