A Hybrid System for Audio Segmentation and Speech Endpoint Detection of Broadcast News
Maria Markaki 1, Alexey Karpov 2, Elias Apostolopoulos 1, Maria Astrinaki 1, Yannis Stylianou 1, Andrey Ronzhin 2

1 Multimedia Informatics Lab, Computer Science Department, University of Crete (UoC), Greece
2 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

Abstract

A hybrid speech/non-speech detector is proposed for the pre-processing of broadcast news. During the first stage, speech/non-speech classification of uniform overlapping segments is performed. The accuracy in the detection of boundaries is determined by the degree of overlap of the audio segments; it is 250 ms in our case. Extracted speech segments are further processed on a frame basis using the entropy of the signal spectrum. Speech endpoint detection is accomplished with an accuracy of 10 ms. The combination of the two methods in one speech/non-speech detection system exhibits the robustness and accuracy required for subsequent processing stages such as broadcast speech transcription and speaker diarization.

1. Introduction

Automatic audio classification and segmentation is a research area of great interest in multimedia processing for automatic labeling and extraction of semantic information. In the case of broadcast audio recordings, pre-processing for speech/non-speech segmentation greatly improves subsequent tasks such as speaker change detection and clustering, as well as speech transcription. For speaker diarization systems, elimination of non-speech frames is the more critical requirement, whereas for speech transcription accurate detection of speech is equally important. In broadcast news, silence is usually reduced to a minimum, and what mostly appears instead is other noise and music. Moreover, methods that work well on speech/music discrimination usually do not handle efficiently other non-speech classes commonly present in broadcast data, such as environmental noises, moving cars, claps, crowd babble, etc.

Speech/non-speech segmentation can be formulated as a pattern recognition problem where the optimal features and the classifier built on them are application dependent. Many approaches in the literature have examined various features and classifiers; MFCCs and SVMs have been extensively evaluated and seem to be among the most promising ones [1,2]. Furthermore, it has been shown that for successful audio segmentation and classification, the classification unit has to be a segment, i.e. a sequence of frames, rather than a single frame [1,2].

In this work we present a hybrid approach which combines a segment-based classifier with a frame-based speech endpoint detector [3]. We use uniformly spaced overlapping audio segments of 500 ms length during the first classification stage. The mean and standard deviation of the MFCCs are used to parameterize every segment. We have also evaluated two different methods of spectrogram computation before MFCC extraction. Classification is performed using SVMs [4]. During the next stage, only segments characterized as speech are processed on a frame basis (10 ms). Spectral entropy is the feature we use for the detection of silent frames within speech segments.

The organization of the paper is as follows: we review the segment-based speech/non-speech classification algorithm and the speech endpoint detection method in section 2. In section 3 we describe the experimental setup, the database and the experimental results. Finally, in section 4 we present our conclusions.
2. Description of the method

2.1. Segment parameterization and classification

Mel-frequency cepstral coefficients are the most commonly used features in speech and speaker recognition systems. They have also been successfully applied to audio indexing tasks [1,2]. Here we extract 13th-order MFCCs from audio frames of 25 ms with a frame rate of 10 ms, i.e. every 10 ms the signal is multiplied by a Hamming window of 25 ms duration. We perform critical-band analysis of the power spectrum with a set of triangular band-pass filters, as usual. For comparison purposes, we also derive an auditory-like spectrogram by applying equal-loudness pre-emphasis and cube-root intensity-loudness compression according to Hermansky [5]. In each case, Mel-scale cepstral coefficients are computed every 10 ms from the filterbank outputs.

We define each segment as a sequence of 50 frames of 10 ms each. We estimate the mean and standard deviation of the MFCCs over these 50 frames, resulting in a 26-element feature vector per segment. We extract evenly spaced overlapping segments every 25 frames (250 ms overlap) for the test dataset, whereas for the training dataset segments are extracted every 5 frames (to maximize the training data).

Support vector machines (SVMs) are used for the classification of segments. We have used SVM-light [4] with a Radial Basis Function kernel; all other parameters have been set to their default values. We also define a hierarchy of classes similar to [2] for resolving conflicts that arise due to the overlap of segments: frames are classified as non-speech if they are part of any segment that was classified as non-speech; otherwise, they are classified as speech.
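As a concrete illustration of this first stage, the sketch below extracts the 26-dimensional segment features and resolves the overlap conflicts at the frame level. It is a minimal sketch under stated assumptions: we use librosa for the MFCCs and scikit-learn's SVC in place of SVM-light, and all function and variable names are ours, not the authors'.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def segment_features(y, sr, seg_frames=50, hop_frames=25):
    """Mean and std of 13 MFCCs over 50 frames (500 ms) per segment;
    25 ms analysis windows at a 10 ms frame rate, as in section 2.1."""
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        window="hamming")                          # shape: (13, n_frames)
    feats, spans = [], []
    for start in range(0, mfcc.shape[1] - seg_frames + 1, hop_frames):
        seg = mfcc[:, start:start + seg_frames]
        feats.append(np.hstack([seg.mean(axis=1), seg.std(axis=1)]))
        spans.append((start, start + seg_frames))  # span in 10 ms frames
    return np.array(feats), spans

def frame_labels_from_segments(seg_is_speech, spans, n_frames):
    """Conflict resolution for overlapping segments: a frame is
    non-speech if ANY segment covering it was classified non-speech."""
    speech = np.ones(n_frames, dtype=bool)
    for is_speech, (a, b) in zip(seg_is_speech, spans):
        if not is_speech:
            speech[a:b] = False
    return speech

# Training would use labeled segments extracted with hop_frames=5:
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```

For training, hop_frames=5 reproduces the denser segment extraction described above; at test time, hop_frames=25 gives the 250 ms overlap and thus the 250 ms boundary accuracy.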
2.2. Spectral entropy based speech detector

The speech detection method is based on the calculation of the information entropy of the signal spectrum as a measure of uncertainty or disorder in a given distribution [6]. The distinction between the entropy of speech segments and the entropy of background noise is used for speech endpoint detection. Such a criterion is less sensitive to variations of the signal amplitude than energy-based methods. The method is a modification of the speech detection approach proposed by J. L. Shen [7] and introduces new levels into the analysis of the speech signal (Figure 1).

Fig. 1. The algorithm for speech detection based on analysis of the entropy of the signal spectrum (stages: Fast Fourier Transformation, speech spectrum normalization, calculation of the spectral entropy, median smoothing, logical temporal processing).

The audio signal is divided into short segments of 11.6 ms duration each, with 25% overlap. The short-time signal spectrum is computed using the FFT, and the calculated spectrum is normalized over all frequency components, giving the probability density function p_i. Acceptable values of the probability density function are upper and lower bounded. This restriction allows us to exclude noises concentrated in a narrow band as well as noises approximately equally distributed among the frequency components (for instance, white noise). Thus:

p_i = 0, if p_i < δ_2 or p_i > δ_1    (1)

where δ_1 and δ_2 are the upper and lower values of probability density, respectively. They have been experimentally determined to be δ_1 = 0.3 and δ_2 = .

At the next stage the spectral entropy h is estimated, and median smoothing in a window of 5 to 9 segments is applied. Finally, logical temporal processing of h (Figure 2) takes into account the possible durations of speech and non-speech fragments.

Fig. 2. Logical temporal processing of the spectral entropy function (entropy h versus time t, with the adaptive threshold r separating alternating speech and non-speech regions).

The adaptive threshold r for the detection of speech endpoints is calculated as follows:

r = (max(h) - min(h)) / 2 + min(h) * μ    (2)

where μ is a coefficient chosen empirically depending on the recording conditions. Employing the adaptive threshold we obtain alternating speech and non-speech regions of the function h, and apply two criteria to process them: (1) R, the minimal duration of a speech fragment in a phrase; (2) S, the maximal duration of a non-speech fragment in a phrase. These criteria values were determined experimentally, taking into account that a human cannot produce very short speech fragments and that there are always some pauses in speech (for instance, before plosive consonants). So if the number of consecutive speech segments is greater than R and the non-speech interval between them is shorter than S, then all these segments are considered as belonging to the speech class. Such logical-temporal processing is applied iteratively to the whole spectral entropy function, automatically segmenting it into speech and non-speech portions.
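The sketch below reconstructs this detector end to end: bounded spectrum normalization (eq. 1), spectral entropy, median smoothing, the adaptive threshold of eq. 2, and the R/S duration logic. It is a hedged reconstruction, not the authors' code: δ_2, μ, R and S are illustrative values (the paper reports only δ_1 = 0.3 and leaves the others to tuning), and we assume speech frames show lower entropy than noise, a polarity that may need flipping for some noise conditions.

```python
import numpy as np
from scipy.signal import medfilt

def spectral_entropy_vad(y, sr, delta1=0.3, delta2=0.01, mu=0.95,
                         min_speech=5, max_pause=12):
    """Entropy-based endpointing (section 2.2). delta2, mu,
    min_speech (R) and max_pause (S) are illustrative values."""
    frame = int(0.0116 * sr)                   # 11.6 ms frames
    hop = int(frame * 0.75)                    # 25% overlap
    n = 1 + max(0, (len(y) - frame) // hop)
    h = np.zeros(n)
    for t in range(n):
        spec = np.abs(np.fft.rfft(y[t * hop: t * hop + frame])) ** 2
        p = spec / max(spec.sum(), 1e-12)      # normalized spectrum (pdf)
        p[(p < delta2) | (p > delta1)] = 0.0   # eq. (1): bounded pdf
        nz = p[p > 0]
        h[t] = -np.sum(nz * np.log(nz))        # spectral entropy
    h = medfilt(h, kernel_size=7)              # median smoothing (5-9 win.)

    # eq. (2): adaptive threshold; we assume low entropy marks speech
    r = (h.max() - h.min()) / 2 + h.min() * mu
    speech = h < r

    # Logical temporal processing: first bridge non-speech gaps shorter
    # than S (max_pause), then drop speech runs shorter than R (min_speech).
    idx = np.flatnonzero(speech)
    for a, b in zip(idx[:-1], idx[1:]):
        if 0 < b - a - 1 <= max_pause:
            speech[a:b] = True
    out = np.zeros_like(speech)
    t = 0
    while t < n:
        if speech[t]:
            u = t
            while u < n and speech[u]:
                u += 1
            if u - t > min_speech:
                out[t:u] = True
            t = u
        else:
            t += 1
    return out      # one boolean decision per 11.6 ms frame (75% hop)
```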
3. Experiments and Results

We tested the algorithms described in section 2 on audio data collected from Greek TV programs (TV++) and music CDs. The speech data consist of broadcast news and TV shows recorded in different conditions, such as studios or outdoors; some of the speech data have also been transmitted over telephone channels. The non-speech data consist of music (25%), outdoor noise (moving cars, crowd noise, etc.), claps, and very noisy unintelligible speech due to many speakers talking simultaneously (speech babble). The music content consists of the audio signals at the beginning and the end of TV shows as well as songs from music CDs. The audio data are all mono-channel, 16 bits per sample, with a 16 kHz sampling frequency. The database has been manually segmented and labeled at the Computer Science Department, UoC. The speech signals have been partitioned into 30 minutes for training and 90 minutes for testing.

3.1. Speech/non-speech classification results

We evaluate system performance using the detection error trade-off (DET) curve [8]. A DET plot clearly presents the trade-off between the false rejection rate (or speech miss probability) and the false acceptance rate (or false alarm probability). Detection error probabilities are plotted on a nonlinear scale which transforms them by mapping to their corresponding Gaussian deviates; thus DET curves are straight lines when the underlying distributions are Gaussian [8]. We also report the minimum value of the detection cost function for each DET curve according to [8].

For the speech/non-speech segment-based classification, the target is the speech class, with prior probability P_target = 50% in our dataset. Here the costs of miss and false alarm probabilities are considered equally important (C_miss = C_false = 1), although they actually depend on the task. For speaker and language recognition C_false > C_miss, i.e. we should accurately reject non-speech audio (low false alarm probability) whereas the speech miss probability is less important. For speech transcription, on the other hand, C_false < C_miss, i.e. accurate detection of speech is rather more important. The minimum value of the detection cost function (DCF) over the DET curve [8] is then:

DCF_opt = min(C_miss * P_miss * P_target + C_false * P_false * (1 - P_target))    (3)

In the case of common MFCC features, DCF_opt = 9.54%, which corresponds to P_miss,opt = 6.24% and P_false,opt = 12.84%. For the case of MFCC features extracted after loudness equalization and cube-root compression, a remarkable improvement in all aspects is noticed: DCF_opt = 4.96%, P_miss,opt = 4.07% and P_false,opt = 5.84%.

Another commonly used measure of accuracy is the EER (Equal Error Rate), which corresponds to the decision threshold θ_EER at which the false rejection rate (P_miss) equals the false acceptance rate (P_false). Since P_miss and P_false are discrete, we set:

θ_EER = argmin_θ |P_miss(θ) - P_false(θ)|    (4)

EER = (P_miss(θ_EER) + P_false(θ_EER)) / 2    (5)
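A small sketch of how these metrics can be computed from raw classifier scores follows; it is a hedged reconstruction of eqs. (3)-(5), not the NIST evaluation code itself, and the threshold sweep and variable names are ours.

```python
import numpy as np

def det_metrics(scores, labels, p_target=0.5, c_miss=1.0, c_false=1.0):
    """DCF_opt (eq. 3) and EER (eqs. 4-5) from scores (higher = more
    speech-like) and binary labels (1 = speech, the target class)."""
    thresholds = np.unique(scores)
    # Decision rule at threshold t: classify as speech iff score >= t.
    p_miss = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    p_false = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])

    dcf = c_miss * p_miss * p_target + c_false * p_false * (1 - p_target)
    i_opt = dcf.argmin()                        # eq. (3): minimum of DCF
    i_eer = np.abs(p_miss - p_false).argmin()   # eq. (4): |P_miss - P_false|
    eer = (p_miss[i_eer] + p_false[i_eer]) / 2  # eq. (5)
    return dcf[i_opt], p_miss[i_opt], p_false[i_opt], eer
```

With P_target = 0.5 and unit costs, DCF_opt and EER should be close, which is exactly the pattern reported in Table 1 below.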
Figure 3: DET curves for speech/non-speech segment-based classification. Mean and variance of the MFCCs are computed over each segment, with (solid line) or without (dashed line) equal-loudness pre-emphasis and cube-root intensity-loudness compression [5]. The minimal values of the corresponding detection cost functions (DCF) are also shown (circles).

We report in Table 1 the results for the speech/non-speech segment-based classification and present in Figure 3 the corresponding DET curves. Since in this case P_target = 50% and C_miss = C_false = 1, the values of EER and DCF_opt are quite close. The MFCC features extracted after loudness equalization and compression are clearly superior according to the EER as well.

Table 1: Speech/non-speech segment-based classification results

  System                         DCF_opt   P_miss   P_false   EER
  MFCCs baseline                 9.54%     6.24%    12.84%    9.91%
  equal loudness + compression   4.96%     4.07%    5.84%     5.01%

3.2. Speech endpoint detection results

Audio segments classified as speech at the first detection stage are further processed using the entropy-based method for speech endpoint detection with 10 ms accuracy (after rounding). This is a pre-processing step required for subsequent broadcast speech transcription. In this case, the total number of silence frames is much lower than the total number of speech frames: the prior probability of the speech class, which is the target, is P_target = 88.96% for our dataset. If the costs of miss and false alarm probabilities are considered of equal importance, then the minimum value of the detection cost function (DCF_opt) over the DET curve is 6.47%, corresponding to P_miss = 4.48% and P_false = 22.52%. We report in Table 2 the results for speech/silence classification and present in Figure 4 the corresponding DET curve. We can see that the EER is significantly higher than DCF_opt in this case, since it does not take into account the highly unequal prior probabilities of speech and silence.
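To make the effect of the skewed prior concrete, the cost term from the det_metrics sketch above can be evaluated by hand at the Table 2 operating point; this is a worked check on the reported numbers, not additional experimental output.

```python
# Prior-weighted cost (eq. 3) at the Table 2 operating point:
p_target, p_miss, p_false = 0.8896, 0.0448, 0.2252
dcf = p_miss * p_target + p_false * (1 - p_target)  # C_miss = C_false = 1
print(f"{dcf:.4f}")  # 0.0647, i.e. 6.47%: the high P_false is
                     # down-weighted by the small silence prior (11.04%)
```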
Figure 4: DET curve for speech endpoint detection with 10 ms accuracy, applied to the extracted speech segments. The minimal value of the corresponding detection cost function (DCF) is shown as a circle.

Table 2: Speech/silence classification results based on spectral entropy

  DCF_opt   P_miss   P_false   EER
  6.47%     4.48%    22.52%    10.83%

4. Conclusions

In this paper we have applied a two-stage speech detection system. During the first stage, segment-based speech/non-speech classification is performed based on MFCC features and Support Vector Machines, with 250 ms accuracy. An improvement is reported when loudness equalization and cube-root compression are applied to the power spectrogram after critical-band analysis. Extracted speech segments are further processed through an entropy-based method for speech endpoint detection with 10 ms accuracy. The proposed system can successfully address the twofold requirement for robustness and accuracy in the pre-processing stages preceding broadcast speech transcription or speaker diarization.

Acknowledgements

This work has been supported by the General Secretariat of Research and Technology (GGET) in Greece and the Russian Foundation for Basic Research in the framework of project # а. The collaborative research was part of the PhD exchange program of the SIMILAR Network of Excellence, project # FP.
References

1. L. Lu, H. J. Zhang, Stan Li. Content-based audio classification and segmentation by using support vector machines. Multimedia Systems 8.
2. H. Aronowitz. Segmental modeling for audio segmentation. Proc. ICASSP 2007, Hawaii, USA.
3. A. Karpov. A robust method for determination of boundaries of speech on the basis of spectral entropy. Artificial Intelligence Journal, Donetsk, Vol. 4.
4. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, MIT Press.
5. H. Hermansky, B. Hanson, H. Wakita. Perceptually based linear predictive analysis of speech. Proc. ICASSP 1985.
6. J. Ajmera, I. McCowan, H. Bourlard. Speech/music segmentation using entropy and dynamism features in an HMM classification framework. Speech Communication, 40.
7. J. L. Shen, J. W. Hung, L. S. Lee. Robust entropy-based endpoint detection for speech recognition in noisy environments. Proc. ICSLP 1998, Sydney, Australia, paper 0232.
8. The NIST Year 2004 Speaker Recognition Evaluation Plan.