Speech Accent Classification

Corey Shih
ctshih@stanford.edu

1. Introduction

English is one of the most prevalent languages in the world, and it is the one most commonly used for communication between native speakers of different languages. As such, people from different regions around the world exhibit unique accents when speaking English. Classifying these accents can provide information about a speaker's nationality and heritage to speech recognition systems, which are becoming increasingly common in day-to-day life. The data gleaned from a speaker's accent can help speech recognition systems identify topics more relevant to the user, for the purposes of search results or advertisements. This project attempts to classify among 4 common accents (British, French, Spanish, and Mandarin) from audio samples of accented speakers speaking English. The 13 lowest-order mel-frequency cepstral coefficients (MFCCs) of the audio signals are used as inputs to the algorithms. A softmax regression model and a long short-term memory (LSTM) neural network are used to predict the accent of the speaker from each audio sample.

2. Related Work

The most successful previous work in this area utilizes a dictionary of words known to be sensitive to foreign accents and develops individual word- and phoneme-based classification algorithms, using MFCCs as features. [1][2] In doing so, a classification accuracy of 93% among 4 different accents is achieved. Unfortunately, I do not have access to such an extensive database, and hence cannot replicate such results. Instead, I attempt to classify accents directly from the MFCCs of each sample. In a more recent paper, Choueiter et al. attempt 23-way accent classification using heteroscedastic linear discriminant analysis and obtain a classification accuracy of 32%. [3] Such an approach is more similar to mine, as a dictionary of accented words is not constructed beforehand. However, the accuracy of the algorithm clearly suffers.

3. Dataset and Features

Audio samples were taken from the Speech Accent Archive, which provides clips of differently accented speakers reciting the same English paragraph, as well as information about their geographical location, gender, and native language. [4] Samples were taken from the 4 accent categories with the most examples (British, French, Spanish, and Mandarin) for a total of 430 examples, split into 386 training examples and 44 test examples. As different speakers speak at different rates, the audio signals were resampled to roughly the same length (~10^6-sample vectors) to better match up with each other.

MFCCs are commonly used features in speech recognition systems because they approximate the important features that the human auditory system detects in audio signals. MFCCs are obtained by taking the Fourier transform of the audio signal, mapping the spectral powers onto the mel scale, and then taking the discrete cosine transform of the logarithms of those powers. The MFCCs represent the amplitudes of the resulting spectrum.
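The extraction pipeline just described (Fourier transform, mel mapping, logarithm, DCT) can be sketched in a few lines. The project itself used a MATLAB toolbox [5], so this NumPy version is only an illustrative approximation; the FFT length, filterbank size, and Hamming windowing below are my own assumptions, not the toolbox's exact settings.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate, frame_ms=25, shift_ms=10, n_filters=26, n_coeffs=13):
    """FFT -> mel filterbank -> log -> DCT, keeping the lowest-order coefficients."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms frames
    shift = int(sample_rate * shift_ms / 1000)       # 10 ms frame shift
    n_fft = 512
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        frames.append(power)
    power_spec = np.array(frames)                                   # (n_frames, n_fft//2+1)
    mel_energies = power_spec @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(mel_energies + 1e-10)                     # avoid log(0)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]

# One second of noise at 16 kHz yields a (frames x 13) MFCC matrix
sig = np.random.randn(16000)
feats = mfcc(sig, 16000)
print(feats.shape)  # (98, 13)
```

Each row of the returned matrix is one 25 ms frame; stacking the 13 coefficients over frames produces the 13-row feature matrices described below.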
Using Ellis' MFCC toolbox for MATLAB, the 13 lowest-order MFCCs of each resampled audio example were extracted using a frame time of 25 ms and a frame shift of 10 ms. [5] As the resulting matrices were extremely large and caused training times to be exceptionally long, the features were trimmed to 13 × 500 matrices for each example, as this seemed to offer the best compromise between model performance and training time. For use in the softmax regression model, the matrices were flattened into 1 × 6500 feature vectors. An example of an MFCC feature matrix is shown in Figure 1.

Figure 1. MFCCs for English Training Example 1.

4. Methods

Two models were constructed for this project: a softmax regression model and an LSTM network. The softmax regression was used as a baseline to evaluate the effectiveness of the LSTM model.

Softmax Regression. Softmax regression is commonly used for classification of multinomial data. For each of the k values the response variable y can take on, the conditional probability of y given the features x and parameters θ is calculated by

p(y = i | x; θ) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)

The value of i with the highest conditional probability becomes the output of the model.

LSTM. LSTMs are a type of recurrent neural network capable of learning long-term temporal dependencies in data, which makes them naturally suited to classifying sequences. LSTMs are already commonly used in natural language processing and automatic speech recognition, making them a natural choice for accent classification. The neurons in a hidden layer of a recurrent neural network form a directed cycle, allowing them to pass information to each other and exhibit temporal behavior. Additionally, in LSTMs, a forget gate allows the LSTM cells to either retain or discard their previous state, allowing for the establishment of long-term dependencies. An illustration of a typical LSTM cell is shown in Figure 2. The input, output, and forget gates of an LSTM typically use sigmoidal or hyperbolic tangent activation functions.

Figure 2. Schematic of LSTM cell. [6]

For this project, I used MATLAB's Neural Network Toolbox to construct a neural network with a single LSTM layer and a softmax activation function for the output node. [7] Mini-batch gradient descent with momentum was used to optimize the parameters, with a learning rate of 0.01 and a mini-batch size of 30. The learning rate and mini-batch size were chosen based on the resulting model accuracy and training time.

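Two of the ingredients above, the softmax probability model and mini-batch gradient descent with momentum, can be sketched together on toy data. This NumPy stand-in is not the MATLAB toolbox actually used; the learning rate 0.01 and batch size 30 come from the text, while the data, dimensions, momentum value, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Row-wise softmax: p(y=i|x) = exp(theta_i^T x) / sum_j exp(theta_j^T x)."""
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy 4-class problem standing in for the 4 accent classes
n, d, k = 300, 20, 4
theta_true = rng.normal(size=(d, k))
X = rng.normal(size=(n, d))
y = softmax(X @ theta_true).argmax(axis=1)  # labels from a known linear model

# Mini-batch gradient descent with momentum on the cross-entropy loss
theta = np.zeros((d, k))
velocity = np.zeros_like(theta)
lr, momentum, batch = 0.01, 0.9, 30
for epoch in range(100):
    order = rng.permutation(n)              # reshuffle each epoch
    for start in range(0, n, batch):
        idx = order[start:start + batch]
        probs = softmax(X[idx] @ theta)
        onehot = np.eye(k)[y[idx]]
        grad = X[idx].T @ (probs - onehot) / len(idx)
        velocity = momentum * velocity - lr * grad
        theta += velocity

# Predict: the class with the highest conditional probability wins
pred = softmax(X @ theta).argmax(axis=1)
acc = (pred == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Because the toy labels are generated by a linear model, this sketch recovers them well above the 25% chance level; the report's actual features are far harder, hence the accuracies in Table 1.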
Models were trained for a maximum of 100 epochs, and L2 regularization was applied with a coefficient of γ = 0.001.

5. Results and Discussion

The performance of each model was evaluated by its accuracy over the training and test sets. Training and test accuracies for both models are given in Table 1.

Table 1. Training and test accuracies for softmax and LSTM models.

Model    | Training Accuracy | Test Accuracy
Softmax  | 54.40%            | 38.64%
LSTM     | 79.02%            | 52.27%

Confusion matrices for both models are given in Figures 3 and 4. The output classes 1, 2, 3, and 4 correspond to British, Spanish, French, and Mandarin accents, respectively.

Figure 3. Confusion matrix for softmax model.
Figure 4. Confusion matrix for LSTM model.

From Table 1, it is clear that both models achieve higher accuracy on the test set than random guessing among the 4 categories (25%), and that the LSTM model performs significantly better than the softmax model, though the resulting accuracy still leaves much to be desired. The better performance of the LSTM model comes as no surprise, as the algorithm is specifically designed to classify time-dependent sequences. Comparing the training and test accuracies suggests that both models were overfit to the training set despite the L2 regularization used; higher values of γ or additional regularization techniques such as dropout or cross-validation may be required to bring the training and test accuracies closer together.

Figures 3 and 4 reveal that both the softmax and LSTM models perform significantly better on Spanish accents than on any other accent. For the LSTM model in particular, the classification accuracy for Spanish accents was 85%, compared to accuracies between 20% and 30% for all other accents. This means that the increase of the LSTM model's overall 52.27% test accuracy over the 25% baseline is almost entirely due to the model's aptitude for correctly classifying Spanish accents.

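Per-class accuracies like these are read off a confusion matrix by dividing each diagonal entry by its row total (the number of true examples of that class). The counts below are invented purely for illustration, chosen only to be consistent with the reported aggregates (44 test examples, 85% on Spanish, 52.27% overall); they are not the project's actual matrices.

```python
import numpy as np

# Hypothetical LSTM confusion matrix: rows = true class, columns = predicted,
# for classes 1-4 (British, Spanish, French, Mandarin). Counts are made up.
confusion = np.array([
    [2,  3, 2, 1],   # British
    [1, 17, 1, 1],   # Spanish
    [2,  3, 2, 1],   # French
    [1,  3, 2, 2],   # Mandarin
])

per_class = confusion.diagonal() / confusion.sum(axis=1)  # row-normalized diagonal
overall = confusion.trace() / confusion.sum()             # total correct / total examples
print(per_class)          # [0.25 0.85 0.25 0.25]
print(round(overall, 4))  # 0.5227
```

The same row-normalization applied to the real Figures 3 and 4 yields the per-accent accuracies discussed here.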
This is likely because there were many more Spanish examples in the dataset than examples of any other accent type. Of the 430 total examples, 199 were Spanish, whereas 68 were British, 67 were French, and 96 were Mandarin. If there were more examples of the other accent types in the dataset, the performance of the model on those accents would likely improve.

The less-than-ideal performance of the models could in part be due to irregularities in the dataset. Upon inspection of individual audio samples, it became clear that not all speakers under a given accent category in the Speech Accent Archive actually exhibited an accent. While the Speech Accent Archive categorizes audio samples based on the native language and geographical location of the speaker, not all of these speakers had a strong accent, if any accent at all. The prevalence of non-accented speakers in the training and test sets likely degraded the performance of the models; to remedy this, I would have to manually sift through the dataset and remove any samples that did not exhibit a strong accent.

The dataset used for this project was also extremely small compared with the datasets used in most machine learning work. I had only a few hundred examples in total, split over 4 categories, while many machine learning systems utilize datasets with sizes in the tens of thousands. This small dataset size negatively impacts the performance of the trained models, as there are simply not enough training examples to obtain accurate values for the parameters. Indeed, the accent category on which the models performed best was the category with the largest number of examples. Finding a larger dataset to train on is imperative for improving model performance.

6. Conclusion and Future Work

Using the MFCCs of audio samples obtained from the Speech Accent Archive, I trained a softmax regression model and an LSTM neural network to classify among British, Spanish, French, and Mandarin accents. The LSTM model performed significantly better than the softmax regression model, achieving a test accuracy of 52.27%, compared to the softmax model's test accuracy of 38.64%. Both models performed better on Spanish accents than on the other accents, most likely due to the much larger number of Spanish examples in the dataset. Model performance was degraded by the low overall number of training examples and by the presence of non-accented speakers in the various accent categories.

A number of additional methods could be utilized to improve the performance of the models trained in this project. Dynamic time warping could be used to align the audio signals more effectively than simple resampling. Alternatively, the audio samples could be split into individual words and fed into the algorithms as separate features. Furthermore, an accent classifier would ideally be able to classify accents regardless of the actual English words being spoken, but the dataset used for this project has all speakers reciting the same English paragraph. Obtaining a much larger dataset of accented speakers speaking various English phrases would be necessary to build a more general accent classifier.

7. Contributions

All work on this project was done by Corey Shih.

8. References

1) Arslan, L.M.; Hansen, J.H.L. Language accent classification in American English. Speech Commun. 1996, 18, 353-367.
2) Arslan, L.M.; Hansen, J.H.L. Foreign accent classification using source generator based prosodic features. In ICASSP-95, 1995, Detroit, MI, USA.
3) Choueiter, G.; Zweig, G.; Nguyen, P. An empirical study of automatic accent classification. In ICASSP 2008, 2008, Las Vegas, NV, USA.
4) Weinberger, S.H. Speech Accent Archive, 2015, http://accent.gmu.edu
5) Ellis, D. PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m, 2012, http://labrosa.ee.columbia.edu/matlab/rastamat
6) Carrier, P.L.; Cho, K. LSTM Networks for Sentiment Analysis, 2017, http://deeplearning.net/tutorial/lstm.html
7) MathWorks. Neural Network Toolbox 11.0, 2017.