PERFORMANCE COMPARISON OF SPEECH RECOGNITION FOR VOICE ENABLING APPLICATIONS - A STUDY
V. Karthikeyan (1) and V. J. Vijayalakshmi (2)
(1) Department of ECE, VCEW, Thiruchengode, Tamilnadu, India, Karthick77keyan@gmail.com
(2) Department of EEE, KPRCET, Coimbatore, Tamilnadu, India, vijik810@gmail.com

Abstract - A performance comparison of speech recognition for voice enabling applications is presented. This paper focuses on a speaker independent system that provides access to mobile phone applications without touching the device, especially for visually challenged users. The proposed system is evaluated with two classification models: template generation using Dynamic Time Warping (DTW), and Hidden Markov Model (HMM) / Vector Quantization (VQ), both with Mel Frequency Cepstral Coefficient (MFCC) features. The two models are compared on average recognition accuracy. The HMM/VQ model with MFCC features gives a recognition rate of 82.77% for the utterances, which is higher than conventional methods. The TMS320C2x DSP processor can be used to implement the voice enabled mobile phone.

Index Terms - Template generation, Dynamic Time Warping (DTW), Hidden Markov Model (HMM) / Vector Quantization (VQ), Mel Frequency Cepstral Coefficient (MFCC)

INTRODUCTION

Speech recognition is a field in which the system recognizes spoken words [1]. It enables the system to identify the words that a person speaks into a microphone and to convert them into written text. Automatic Speech Recognition (ASR) is the process of determining the sequence of words spoken by a human using a machine. The goal of ASR is to have speech as a medium of interaction between man and machine. ASR is commonly used in voice enabled mobile phones, where voice is the input. In these systems, the voice input must be recognized so that the appropriate action can be enabled. A voice enabled mobile phone refers to the voice enabling of mobile applications: a user may dial a contact in the device memory by saying the speed-dial digit or the name of the contact [2], send and receive messages, and put the phone into hold mode or loudspeaker mode.

The first step in voice enabling applications is speech recognition. To do so, it is necessary to extract features from the uttered word; recognition is then performed by comparing the trained sample with the test sample. The most commonly used feature extraction front end is the Mel Frequency Cepstral Coefficient. The power spectrum is filtered with a filter bank that resembles the human ear's sensitivity curve, called the Mel scale filter bank. While the human ear is sensitive to frequency variations at lower frequencies, this sensitivity is reduced for higher frequency
signal components. From experimental results, it is well known that Mel Frequency Cepstral Coefficients (MFCC) are among the best acoustic features used in automatic speech recognition [3]. MFCC features are robust, contain much information about the vocal tract configuration regardless of the source of excitation, and can be used to represent all classes of speech sounds.

This paper is organized as follows. Section 2 discusses the general principles of an ASR system. Section 3 deals with the MFCC feature extraction technique used in the proposed system, and section 4 deals with the two classification models. Section 5 presents the results and conclusion, which show the accuracy of the two classification models.

AUTOMATIC SPEECH RECOGNITION SYSTEM

In general, an Automatic Speech Recognition (ASR) system [4] operates in two modes: a training mode and a testing mode. First the input speech signal is pre-processed and features are extracted. From the extracted features, reference samples are created against which comparison and recognition are performed. The general block diagram of an ASR system is shown in figure 1.

Training Mode: In both speaker dependent and speaker independent systems, the system has to be trained with speech samples collected from different speakers using a microphone as the input device; accuracy is directly proportional to the number of samples. Preprocessing methods are used to extract acoustic characteristics of the speech signal. The software analyzes the extracted feature vectors, generates patterns from them, and stores each pattern in matrix form as a reference pattern. These reference patterns are matched against the input speech during the recognition mode.

Testing Mode: In this mode, the test sample is analyzed for its acoustic characteristics and the important features are extracted from the input speech sample. These feature vectors are used to generate an input pattern, which is also stored in matrix form. This unknown pattern is compared against the known reference patterns, element by element. Once the best match is found, the appropriate action is enabled.

Fig. 1. Block Diagram of Automatic Speech Recognition System
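The training/testing flow just described can be illustrated with a short sketch. The Python fragment below is only a minimal illustration of the idea (store one reference feature matrix per word, then pick the closest one for an unknown pattern); the function names, the frame-averaged Euclidean distance and the assumption that training matrices are already aligned to a common length are ours, not part of the paper.

```python
# Minimal sketch of the train/test pattern-matching flow; illustrative only.
import numpy as np

def train_references(samples):
    """samples: dict mapping word -> list of feature matrices (frames x coeffs).
    Stores one reference pattern per word as the mean feature matrix.
    Assumes all training matrices were already aligned to a common length."""
    refs = {}
    for word, feats in samples.items():
        refs[word] = np.mean(np.stack(feats), axis=0)
    return refs

def recognize(test_feat, refs):
    """Compare the unknown pattern against every stored reference, element by
    element, and return the word whose reference gives the smallest distance."""
    best_word, best_dist = None, np.inf
    for word, ref in refs.items():
        n = min(len(test_feat), len(ref))            # crude length handling
        dist = np.mean(np.linalg.norm(test_feat[:n] - ref[:n], axis=1))
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```

In the actual system the comparison is carried out with DTW alignment or HMM/VQ scoring, as described in the following sections.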
FEATURE EXTRACTION

Feature extraction is the key front-end process in speaker verification systems. The performance of a speaker verification system is highly dependent on the quality of the selected speech features. The speech signal is slowly time-varying and can be regarded as quasi-stationary over short intervals; therefore, short-time spectral analysis is the most common way to characterize the speech signal. Before the features are extracted, the speech signal is preprocessed in two steps: i) framing and ii) windowing.

The speech signal is divided into short, fixed-length frames, each consisting of M samples [5]. Very often successive frames overlap each other. For the proposed system, a frame size of M = 256 samples with an overlap of 50%, i.e. N = 128 samples, has been used. After frame segmentation, windowing is carried out to minimize spectral distortion: the window tapers the signal at both ends, reducing the side effects caused by the signal discontinuity introduced by framing at the beginning and at the end of each frame. A Hamming window is used because its spectral leakage is low. It is multiplied with each frame, and the window function is given in eqn. (1),

w(n) = 0.54 - 0.46 cos(2*pi*n / (M - 1)), 0 <= n <= M - 1    (1)

where M is the number of samples in each frame.

Different feature extraction techniques are available, such as Mel Frequency Cepstral Coefficient (MFCC), Linear Prediction Cepstral Coefficient (LPCC), Bark Frequency Cepstral Coefficient (BFCC), Perceptual Linear Prediction (PLP) and RASTA Perceptual Linear Prediction. Among these, MFCC provides good recognition accuracy. MFCC vectors are used to provide an estimate of the vocal tract filter [6]. The background noise energy level is evaluated at the beginning and the end of the speech signal, and energy thresholds are applied to find the speech beginning and end points. The pre-emphasized speech signal is blocked into frames of M = 256 samples, with adjacent frames separated by N = 128 samples. Windowing is then applied to minimize the signal discontinuities at the beginning and end of each analysis frame. The general block diagram of the MFCC front end is shown in figure 2.

Fig. 2. Block Diagram of MFCC Technique

After windowing, the Fast Fourier Transform (FFT) is applied. The resulting spectrum is passed through a 20-channel Mel scale triangular filter bank. The Mel scale is a critical-band frequency scale that takes into account frequency perception in the human auditory system. The Discrete Cosine Transform (DCT) is applied to the log Mel scale filter outputs, and thus 12 Mel Frequency Cepstral Coefficients are obtained; cepstral filtering (liftering) is then performed as given in eqn. (2). The MFCCs are calculated from the log filter bank amplitudes {mj} using eqn. (3),

c(i) = sqrt(2/N) * sum over j = 1..N of { mj * cos(pi*i*(j - 0.5)/N) }, i = 1, ..., 12    (3)

where N is the number of filter bank channels.
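As a concrete illustration of the front end just described, the following Python/NumPy sketch frames the signal (frame size 256, 50% overlap), applies a Hamming window, computes the power spectrum, passes it through a 20-channel Mel filter bank and keeps 12 DCT coefficients. It is a minimal illustrative implementation, not the authors' MATLAB code; the liftering step of eqn. (2) is omitted and the filter-bank construction details are our own choices.

```python
# Minimal MFCC front-end sketch (framing, Hamming window, FFT, Mel filters, DCT).
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, fs=8000):
    """Triangular filters equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=20, n_ceps=12):
    """signal: 1-D array, assumed at least frame_len samples long."""
    window = np.hamming(frame_len)                       # eqn (1)
    fbank = mel_filterbank(n_filters, frame_len, fs)
    n_frames = 1 + (len(signal) - frame_len) // hop      # 50% overlap
    ceps = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, frame_len)) ** 2        # power spectrum
        logmel = np.log(fbank @ spec + 1e-10)                    # log Mel energies
        # DCT of the log filter-bank amplitudes, eqn (3); keep 12 coefficients
        j = np.arange(1, n_filters + 1)
        c = [np.sqrt(2.0 / n_filters) *
             np.sum(logmel * np.cos(np.pi * i / n_filters * (j - 0.5)))
             for i in range(1, n_ceps + 1)]
        ceps.append(c)
    return np.array(ceps)                                # shape (frames, 12)
```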
SPEAKER MODELLING

Since the extracted features require a large amount of storage memory, it is necessary to convert them into compact models so that the storage requirement can be met; this is done using classification modeling. The proposed system is verified with a template model and with a Hidden Markov Model (HMM) / Vector Quantization (VQ) model. The block diagram of the proposed system is shown in figure 3.

Fig. 3. Block Diagram of Proposed System

The template generation model uses Dynamic Time Warping (DTW) for speech pattern matching [7] in a speaker dependent system, expanding or contracting the time axis non-linearly to match the input speech with the reference template. DTW is used here because DTW-based recognition engines have been widely embedded inside Qualcomm MSM (Mobile Station Modem) chips for phone dialing, owing to their low computational complexity.

Template Generation using DTW

Templates are reference patterns derived from features recorded over the length of the whole word rather than at particular points. The procedure is as follows:
1. Choose one utterance from the training data as the reference.
2. Use the DTW technique to align all the training data with this reference.
3. Once the training data are aligned, compute the reference pattern vector as the centroid of the feature vectors (of cepstral coefficients) corresponding to all occurrences of the digit.

Dynamic Time Warping is used both to create the reference templates and to find the best match between a reference template and the input template derived from the test speech sample. The matching process needs to compensate for length differences and take account of the non-linear nature of the length differences within the words. A DTW grid is used to find the best match between the input data and the stored sequence: a path is found through the grid which minimizes the total distance between them, with the input sequence being stretched or compressed to match the reference. Once an overall path has been found, the total distance between the input sequence and the reference sequence can be calculated for this particular input template. When comparing sequences of different lengths, DTW modifies the sequence length by repeating or omitting frames so that both sequences have the same length; this modification of the sequences is called time warping. A sketch of this alignment is given below.
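The following minimal sketch illustrates the DTW grid and template matching described above, assuming a Euclidean local distance between MFCC frames; the helper names and the path normalization by (n + m) are illustrative choices, not taken from the paper.

```python
# Minimal DTW sketch: fill a cost grid and take the minimum-distance path.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a, seq_b: (frames x coeffs) MFCC matrices, possibly of different length.
    Returns the path-normalized minimum total distance between them."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local frame distance
            # non-linear stretching/compression of the time axis
            D[i, j] = cost + min(D[i - 1, j],      # repeat a test frame
                                 D[i, j - 1],      # repeat a reference frame
                                 D[i - 1, j - 1])  # one-to-one match
    return D[n, m] / (n + m)

def recognize_dtw(test_feat, templates):
    """templates: dict word -> reference MFCC matrix. Pick the closest template."""
    return min(templates, key=lambda w: dtw_distance(test_feat, templates[w]))
```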
Hidden Markov Model with Vector Quantization (HMM/VQ)

In the proposed method, speaker independent isolated word recognition is implemented using Vector Quantization (VQ) and a Hidden Markov Model (HMM), which is well suited to providing a higher accuracy rate as the number of samples grows. One of the most popular approaches to speaker independent [8] speech recognition is the combination of Vector Quantization for the encoding of segments of speech with Hidden Markov Modelling for the classification of sequences of segments [9], as shown in figure 4.

Fig. 4. Block Diagram of Hidden Markov Model with Vector Quantization

After extracting the features, the K-means clustering algorithm is used to iteratively create the vector quantizer codebook until the average distance falls below a preset threshold. A set of such vectors (corresponding to multiple utterances of the same word) is used to re-estimate the Hidden Markov Model [10] for that word. This procedure is repeated for each word in the vocabulary. In the testing mode, the set of MFCC vectors corresponding to the unknown word is quantized by the vector quantizer to give a vector of codebook indices. This index sequence is scored against each word HMM to give a probability score for each word model, and the decision rule chooses the word whose model gives the highest probability.

VQ is the process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center, called a centroid; the collection of all codewords is called a codebook [11]. In speech recognition, vector quantization can be used to train discrete HMMs. A sketch of the codebook generation and quantization step is given below.
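The codebook generation and quantization step can be sketched as follows. This is a plain K-means illustration with the average-distance stopping rule mentioned in the text; the codebook size, random initialization and helper names are assumptions, and the discrete-HMM re-estimation and scoring steps are not shown.

```python
# Minimal K-means VQ codebook sketch; HMM training/scoring not included.
import numpy as np

def train_codebook(features, codebook_size=64, threshold=1e-3, max_iter=100):
    """features: (num_vectors x num_coeffs) MFCC vectors pooled from training data.
    Refines the codebook until the average distance stops improving."""
    rng = np.random.default_rng(0)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)].astype(float)
    prev_avg = np.inf
    for _ in range(max_iter):
        # assign each vector to its nearest codeword
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        avg = dists[np.arange(len(features)), labels].mean()
        if prev_avg - avg < threshold:        # preset threshold on average distance
            break
        prev_avg = avg
        for k in range(codebook_size):        # recompute centroids
            members = features[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(mfcc_vectors, codebook):
    """Map each MFCC vector to the index of its nearest codeword; the resulting
    index sequence is what the discrete word HMMs would be trained on and scored with."""
    d = np.linalg.norm(mfcc_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)
```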
RESULTS AND DISCUSSION

The proposed speaker independent isolated speech recognition system is implemented in MATLAB. A database containing the digits 0-9 and eight words (Dial, Cut, On, Off, Hold, Read, Write and Loudspeaker) is considered for the voice dialing application. The digits are used to dial a contact number assigned to speed dial. The words Dial and Cut control the voice dialing application, On, Off and Hold control the mobile phone on/off and hold operations, and Read, Write and Loudspeaker enable the message reading and writing application and the hands-free mode, respectively. The database is collected from 25 speakers pronouncing each word 10 times, yielding 250 utterances per word and 180 utterances per speaker, 4500 samples in total. The first 8 utterances of each word are used for training and the remaining ones are used for testing. In the training phase, the uttered words are recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz, which preserves the information needed for communication, and converted to a WAV file using the Total Audio Converter software.

The performance of the speech recognition systems is given in terms of the word recognition rate (%), calculated from eqn. (4),

W = (M / N) x 100%    (4)

where N is the total number of samples taken and M is the total number of correctly recognized samples. In the confusion matrix, each uttered word is compared with all other words; the matrix shows how many times each word was correctly recognized, and from it the average recognition accuracy is calculated.

Table 1 shows the result of the speech recognition system using the template generation model with MFCC features. As shown in Table 1, digits 0 and 2 have the highest accuracy rate of 60%, digits 4 and 9 and the word Cut have the lowest accuracy rate of 50%, and the word Off has the second highest accuracy rate of 58%. In the HMM/VQ model, the codebook is generated using the VQ method; during codebook generation, a certain number of clusters is obtained after a number of iterations, as shown in Table 2. As shown in Table 3, digit 0 has the highest accuracy rate of 88% and digit 2 has the lowest accuracy rate of 78%. Digit 9 and the word Dial have the next highest accuracy rate of 87%. The words Hold and Off have an accuracy rate of 85%, and Loudspeaker and On have 84%. Digits 5 and 8 and the word Write have 83%, digit 3 and the word Cut have 82%, and digit 6 has 81%, followed by digit 4 and the word Write at 80%. The comparison of the recognition rates of the two classification models with MFCC features is shown in Table 4, which shows that the HMM/VQ model has a higher recognition rate than the template generation model.
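For completeness, the small example below shows how the recognition rate of eqn. (4) and the per-word accuracies can be read off a confusion matrix; the 3x3 matrix used here is invented for illustration and is not data from the paper.

```python
# Toy confusion-matrix example for eqn. (4); values are made up.
import numpy as np

confusion = np.array([[44,  3,  3],    # rows: uttered word, columns: recognized word
                      [ 2, 45,  3],
                      [ 4,  2, 44]])

M = np.trace(confusion)                # total number of correctly recognized samples
N = confusion.sum()                    # total number of samples taken
W = M / N * 100.0                      # eqn (4): recognition rate in percent
per_word = confusion.diagonal() / confusion.sum(axis=1) * 100.0
print(f"average recognition rate: {W:.2f}%  per-word accuracies: {per_word}")
```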
CONCLUSION

The proposed system designed using the HMM/VQ classification model with MFCC features gives a recognition rate of 82.77% for the utterances, which is higher than the 53.88% obtained by the template generation model using DTW. The system can further be realized on a DSP processor for a hardware implementation so that voice enabling applications can be deployed; the TMS320C2x DSP processor can be used to implement the voice enabled mobile phone.
References

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall Inc., 1978; J. Iso-Sipila, Design and Implementation of a Speaker-Independent Voice Dialing System: A Multi-Lingual Approach, Ph.D. thesis.
[2] T. Schurer, "An experimental comparison of different feature extraction for telephone speech," in Proc. 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, 1994.
[3] J. Chong and R. Togneri, Speaker Independent Recognition of Small Vocabulary, M.S. thesis, Centre for Intelligent Information Processing Systems, The University of Western Australia.
[4] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey.
[5] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, August 1980.
[6] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-23, no. 1, 1975.
[7] R. W. Schafer, "Scientific bases of human-machine communication by voice," Proceedings of the National Academy of Sciences, vol. 92.
[8] M. Hoshimi, M. Miyata and S. Hiraoka, "Speaker independent speech recognition method using training from a small number of speakers," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 1992.
[9] M. Nilsson and M. Ejnarsson, Speech Recognition Using HMM: Performance Evaluation in Noisy Environments, M.S. thesis, Department of Telecommunications and Signal Processing, Blekinge Institute of Technology, 2002.
[10] M. A. Ferrer, I. Alonso and C. Travieso, "Influence of initialization and stop criteria on HMM based recognizers," Electronics Letters, vol. 36, June.
[11] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg and J. G. Wilpon, "Speaker independent recognition of isolated words using clustering techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27.