Accent Classification
Phumchanit Watanaprakornkul, Chantat Eksombatchai, and Peter Chien

Introduction

Accents are patterns of speech that speakers of a language exhibit; they are normally shared by people of similar ethnic, national, or social background. Accent classification is the problem of determining what accent a speaker has: given speech in a language from speakers of various backgrounds, we want to infer which background each speaker comes from. This provides a way to learn where a person is from simply by listening to them speak, without asking directly. Humans cannot always do this: sometimes they encounter accents they have never heard before, or accents that are only faintly noticeable, and they fail to classify them correctly. This paper applies machine learning to accent classification, in the hope of reliably classifying the accents in a data set far larger than any one person has time to listen to.

Further motivation comes from speech recognition. Accented speech currently poses problems for speech recognition algorithms, because different pronunciations of the same words are not recognized the same way. Indeed, it is sometimes hard even for native speakers of a language to discern exactly what someone with a thick accent is saying. If speech could first be classified by accent, it could then be passed to a more specialized speech recognizer, such as one trained on speakers with the same accent.

Dataset

We used the CSLU Foreign Accented English v1.2 dataset (Linguistic Data Consortium catalog number LDC2007S08). It consists of English speech from native speakers of 23 different languages. The data format is single-channel, 16-bit, linearly encoded WAV files sampled at 8 kHz. Although every file contains English speech, the files contain different words and are not necessarily complete sentences. We did not work on all 23 accents; we picked three (Cantonese, Hindi, and Russian) because running experiments on 23 classes takes far longer than on 3, while classifying speech into 23 or 3 classes is algorithmically equally challenging: both require a multi-class classifier.

Feature Extraction: MFCC and PLP

To convert an array of sample amplitudes from a sound file into a more useful representation, the first features we used are mel-frequency cepstral coefficients (MFCCs), the most popular features in speech recognition. Because a whole sample array holds too much information, MFCC extraction first divides the sound into small windows of time. We run a discrete Fourier transform (DFT) on each window to obtain the power spectrum, which describes the frequency content of the signal over time. We then convert the spectrum into values on the mel scale, a perceptual scale of pitches as judged by human listeners. Next, we take a discrete cosine transform of the log mel-scale values, which gives the vector of MFCCs. We used the first 12 coefficients as our base feature, because using all the coefficients would have given us too many features. We also appended the energy of the window, resulting in 13 features. Finally, we appended the delta and delta-delta (the first- and second-order derivatives of the MFCC vector), giving 39 features in total; the derivatives account for changes in the base features over time.
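The report does not say which tools were used for feature extraction, so the sketch below is only an illustration of the windowing, MFCC, energy, and delta pipeline described above. The python_speech_features package, the sample.wav file name, and the 25 ms window with 10 ms step are assumptions made for the example rather than details taken from the paper.

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

def extract_mfcc_features(wav_path):
    """Return a (num_windows, 39) matrix for one file: 13 base values
    (12 cepstral coefficients plus log energy) with their deltas and
    delta-deltas, as described in the report."""
    rate, signal = wav.read(wav_path)              # CSLU files: mono, 16-bit, 8 kHz
    base = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.010,       # assumed window size and step
                numcep=13, appendEnergy=True)      # 12 cepstra + log energy = 13
    d1 = delta(base, 2)                            # first-order derivatives
    d2 = delta(d1, 2)                              # second-order derivatives
    return np.hstack([base, d1, d2])               # 39 features per window

if __name__ == "__main__":
    feats = extract_mfcc_features("sample.wav")    # hypothetical file name
    print(feats.shape)                             # (num_windows, 39)
```

The PLP features described next would give a second 39-dimensional vector per window in the same fashion, computed over the same windows so the two can later be combined.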
The other feature we tried is perceptual linear prediction (PLP). PLP is similar to MFCC: both separate the signal into small windows (we used the same window size for PLP and MFCC so that we could combine the features) and run a DFT to get the power spectrum. PLP then performs critical-band spectral resolution by converting the spectrum into values on the Bark scale. Next, it applies pre-emphasis based on equal loudness, runs an inverse DFT, and performs linear predictive (LPC) analysis on the result. In our experiments we ran PLP to order 12, and we again added energy, giving 13 features per window. We then computed delta and delta-delta to get 39 features, just as with MFCC.

To summarize, the features for each sound sample are a list of windows; in each window we have 39 features from MFCC and 39 from PLP. The number of windows differs between sound samples because their lengths differ, but the window size is fixed.

Classifier: Support Vector Machine (SVM)

Because we have features for each window, our classifier operates on windows rather than on whole sound samples. The classifier tries to determine which accent a window belongs to. We pooled all the windows in the training data, labeled each window with the accent of the sound file it came from, and trained an SVM on these windows. To predict the accent of a sound sample, we used the SVM to label each window in the sample and picked the most frequent label as the label for the sample.

This approach sounds simple, but it did not work well in practice. We first tried it on just the 12 MFCC features (without energy, deltas, or PLP) and training took prohibitively long. One reason is the sheer amount of data: roughly 2,000 windows per sound sample and about 300 samples per accent. To make the SVM train in a reasonable amount of time, we randomly discarded most of the windows of each sound sample before training. With 39 MFCC features and 10 windows per sound sample, the SVM took an hour to train and achieved an overall accuracy of 48.16%; with 39 PLP features the accuracy was 41.18%. These results are better than what random labeling of the test set would give, but randomly keeping only 10 of roughly 2,000 windows for training is not a reliable method, and we wanted one that is reliable and runs in reasonable time.
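As a rough illustration of the window-level SVM with majority voting just described, here is a sketch that assumes scikit-learn; the per-file feature matrices, the accent labels as strings, the RBF kernel, and the subsample size of 10 windows per file are stand-ins chosen for the example, since the report does not name the SVM implementation it used.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def train_window_svm(per_file_features, per_file_labels, windows_per_file=10, seed=0):
    """Train an SVM on a random subset of windows from each file, labeling
    every window with the accent of the sound file it came from."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for feats, label in zip(per_file_features, per_file_labels):  # feats: (n_windows, 39)
        idx = rng.choice(len(feats), size=min(windows_per_file, len(feats)),
                         replace=False)                           # keep only a few windows
        X.append(feats[idx])
        y.extend([label] * len(idx))
    clf = SVC(kernel="rbf")                                       # kernel choice is assumed
    clf.fit(np.vstack(X), y)
    return clf

def predict_file(clf, feats):
    """Label every window of one file, then return the most frequent label."""
    return Counter(clf.predict(feats)).most_common(1)[0][0]
```

Even with this aggressive subsampling, SVM training scales poorly as the number of retained windows grows, which is consistent with the long training times reported above.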
Classifier: Gaussian Mixture Models (GMMs)

It turned out that GMMs let us do this efficiently. We used three GMMs, one for each accent, and trained the GMM of each accent class on the features extracted from the windows of that class's training samples. To predict the accent of a test sample, we computed the probability of the sample under each accent class and output the class with the highest probability. To compute the probability of a test sample under an accent, we first ran the same feature extraction as before to get the same windows and features. We then used the trained GMM for that accent to compute the probability of each window, and multiplied the window probabilities together (equivalently, summed their logarithms) to get the probability of the sample under that accent. In this way we found which accent is most probable for each sample.

We trained the GMMs with the EM algorithm, initializing EM with the means and covariance matrix produced by k-means; this gave better results than initializing the covariance matrix to the identity. One big problem we encountered was that the covariance matrices often became singular. Because we were working with many features, the covariance matrices frequently became singular, or close enough to singular that floating-point errors occurred: when many entries of the covariance matrix are small, the determinant, which is a sum of products of matrix entries, becomes very small. We dealt with this in two main ways. First, we switched from full covariance matrices to diagonal covariance matrices, which are simpler; it turned out that diagonal covariances actually gave better results. We also applied variance limiting, in which no variance may be smaller than a fixed value, i.e., every entry on the diagonal of the covariance matrix must be at least that large in magnitude. Second, we linearly scaled up the feature values. Linear scaling should not change the decisions the GMM makes, but it increases all of the variances, making a singular covariance matrix less likely.

Using both MFCC and PLP

We tried models on MFCCs alone, on PLP alone, and on the two combined. When combining PLP and MFCCs, we used two GMMs for each accent, one for the MFCC features and one for the PLP features; we handled them separately and combined the results by multiplying the probabilities from each. Another way to combine the features would have been to concatenate the vectors into 78 features per window, but that seemed like too many features. Instead, we ran feature selection with forward search over the 78 features to pick out a small subset. We chose forward search over other feature selection techniques because it is the fastest.
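Below is a minimal sketch, under assumed tooling, of the per-accent diagonal-covariance GMMs and the summed-log-likelihood scoring described above. It uses scikit-learn's GaussianMixture, whose k-means initialization and reg_covar variance floor roughly play the roles of the k-means initialization and variance limiting the authors describe; the accent names, the dictionary layout of the training windows, and the variable names are choices made only for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ACCENTS = ["cantonese", "hindi", "russian"]

def train_accent_gmms(windows_by_accent, n_components=35):
    """Fit one diagonal-covariance GMM per accent on all training windows of
    that accent; reg_covar acts as a simple variance floor (variance limiting)."""
    gmms = {}
    for accent in ACCENTS:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",   # diagonal covariance matrices
                              reg_covar=1e-3,           # floor on every variance
                              init_params="kmeans")     # k-means based initialization
        gmm.fit(windows_by_accent[accent])              # array of shape (num_windows, 39)
        gmms[accent] = gmm
    return gmms

def classify_sample(mfcc_gmms, plp_gmms, mfcc_windows, plp_windows):
    """Sum per-window log-likelihoods for each accent (equivalent to multiplying
    window probabilities), add the MFCC and PLP scores (equivalent to multiplying
    the two model probabilities), and return the highest-scoring accent."""
    def total_log_likelihood(gmms, windows):
        return {a: gmms[a].score_samples(windows).sum() for a in ACCENTS}
    mfcc_scores = total_log_likelihood(mfcc_gmms, mfcc_windows)
    plp_scores = total_log_likelihood(plp_gmms, plp_windows)
    return max(ACCENTS, key=lambda a: mfcc_scores[a] + plp_scores[a])
```

Working in log space also sidesteps the numerical underflow that multiplying roughly 2,000 window probabilities per sample would otherwise cause.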
Configuring our learning model

We looked for the best number of Gaussians per accent class by training on the 39 MFCC features with different numbers of Gaussians. The results are shown in the following diagram. We found that 35 Gaussians per accent class worked best, although the effect of the number of Gaussians on accuracy is very small.

For feature selection, we used forward search to select features from the 78 PLP and MFCC features. The classifier we used during the search was one GMM with 5 Gaussians per accent class. We did not use the number of Gaussians found in the experiment above because we ran the two experiments in parallel; we picked 5 so that the number of Gaussians would be small and feature selection would run faster. Since the impact of the number of Gaussians is very small, this should not have hurt the accuracy of our feature selection much. Even so, the feature selection process took a long time to run. The results are shown in the following graph. Starting from around six features, the accuracy no longer strictly increases and the changes become negligible. We found it surprising that more features did not help; we still believe six features are not enough, but our experiment says otherwise. We do not think we overfit the training data, because we got similar results (an accuracy of 0.4 with the 39 MFCC features, just as when testing the number of Gaussians) even when we used the same data for training and testing. Therefore, the problem probably lies in the features we used. As a final result, the best accuracy we obtained was 51.47% using 8 features, 7 of which the forward selection picked from the MFCCs.
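The following sketch shows the greedy forward search over the 78 combined MFCC and PLP dimensions in the abstract. The evaluate callback, which would train the 5-component-per-accent GMM classifier on the chosen columns and return its accuracy on held-out data, is a hypothetical interface introduced only for this example.

```python
def forward_search(evaluate, num_features=78, max_selected=8):
    """Greedy forward selection: repeatedly add the single feature column whose
    inclusion most improves accuracy.  `evaluate(columns)` is an assumed callback
    that trains the small GMM classifier on just those feature columns and
    returns its accuracy on a held-out set."""
    selected, remaining = [], list(range(num_features))
    history = []
    while remaining and len(selected) < max_selected:
        best_feature, best_accuracy = None, -1.0
        for f in remaining:                       # try every unused feature once
            accuracy = evaluate(selected + [f])
            if accuracy > best_accuracy:
                best_feature, best_accuracy = f, accuracy
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append(best_accuracy)
        print(f"{len(selected)} features selected, accuracy {best_accuracy:.4f}")
    return selected, history
```

Each outer step retrains the classifier once per remaining feature, which is why the search is slow even with only 5 Gaussians per class, as noted above.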
Future Work

The most problematic part of our work is that our features do not model different kinds of accents well: MFCCs and PLP are suited to general audio rather than being specific to voice or accent. In the future we would like to use more accent-specific features. We considered prosodic features, which many papers suggest, but applying them requires knowing where the words are in each sound sample. We would either have to label the words in the dataset, which takes a lot of time if done manually, or run speech recognition before our classifier to locate the words, which seems inappropriate here given that speech recognition performs poorly on accented speech and that one purpose of accent classification is to aid speech recognition. This was therefore not feasible in the limited time we had, but it may be worth considering in the future.

Another thing we failed to take into account is pauses. Many windows contain nothing but pauses, and pauses are the same for every accent, so the features from those windows do not help classification. We could also try reducing variation in the dataset by using only male or only female voices, or by splitting the dataset by gender, given that pitch is often very different in male and female speech.

References

Dan Jurafsky.
Hong Tang and Ali A. Ghorbani. Accent Classification Using Support Vector Machine and Hidden Markov Model. University of New Brunswick.
Ghinwa Choueiter, Geoffrey Zweig, and Patrick Nguyen. An Empirical Study of Automatic Accent Classification.
Karsten Kumpf and Robin W. King. Automatic Accent Classification of Foreign Accented Australian English Speech. Speech Technology Research Group, The University of Sydney, NSW 2006, Australia.
Scott Novich, Phil Repicky, and Andrea Trevino. Accent Classification Using Neural Networks. Rice University.
John H. L. Hansen and Levent M. Arslan. Foreign Accent Classification Using Source Generator Based Prosodic Features. Robust Speech Processing Laboratory, Duke University.
Douglas A. Reynolds and Richard C. Rose. Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE.
Michael J. Carey, Eluned S. Parris, Harvey Lloyd-Thomas, and Stephen Bennet. Robust Prosodic Features for Speaker Identification. Ensigma Ltd.
Andre G. Adami, Radu Mihaescu, Douglas A. Reynolds, and John J. Godfrey. Modeling Prosodic Dynamics for Speaker Recognition.
Hynek Hermansky. Perceptual Linear Predictive (PLP) Analysis of Speech. Speech Technology Laboratory, Division of Panasonic Technologies Inc., Santa Barbara, CA.