

www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issue 10 Oct. 2016, Page No. 18552-18556

A Review on Feature Extraction Techniques for Speech Processing

Amandeep Singh Gill
Assistant Professor, JMIETI, Radaur
Email id: Amanshergill33@gmail.com

ABSTRACT: Speech and language are considered uniquely human abilities. Speech is a complex signal characterized by varying distributions of energy in time as well as in frequency, depending on the specific sound being produced. The aim of digital speech processing is to take advantage of digital computing techniques to process the speech signal for increased understanding, improved communication, and greater efficiency. The definition of the various types of speech classes, feature extraction techniques, speech classifiers, and performance evaluation are the issues that require attention when designing a speech processing system.

I. INTRODUCTION

The main components of the human speech system are the lungs, trachea, larynx, pharyngeal cavity, oral cavity, and nasal cavity. Normally the pharyngeal and oral cavities are grouped into one unit called the vocal tract, and the nasal cavity is called the nasal tract. The exact placement of the main organs is shown in figure 1.

Fig 1. The Human Speech Production System

Muscle forces press air from the lungs through the larynx. The vocal cords then vibrate, interrupting the airflow and producing a quasi-periodic pressure wave. The pressure impulses are called pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency; it is the part of the signal that defines the speech melody [1]. The vibration frequency of the vocal cords is determined by several factors: the tension exerted by the muscles, their mass, and their length. These factors vary between sexes and with age. The pressure impulses excite the air in the vocal tract and, for certain sounds, also the nasal tract. When the cavities resonate, they radiate a sound wave, which is the speech signal. Both tracts (vocal and nasal) act as resonators with characteristic resonance frequencies.

Data Recording -> Pre-Processing -> Feature Extraction -> Feature Classification -> Result
Fig 2. Speech Processing Block Diagram

The general block diagram for speech processing is shown in figure 2.

II. TYPES OF RECORDED SPEECH DATA

A. Isolated speech
It requires a single utterance at a time. Often, these systems have Listen/Not-Listen states and require the speaker to pause between utterances. "Isolated word" might be a better name for this type [2].

B. Connected word
Connected-word systems require only a minimum pause between utterances so that the speech flows smoothly. They are otherwise very similar to isolated-word systems.

C. Continuous speech
Continuous speech is basically computer dictation: normal human speech, without silent pauses between words. This kind of speech makes machine understanding much more difficult.

D. Spontaneous speech
Spontaneous speech can be thought of as speech that is natural sounding and not rehearsed.

III. FEATURE EXTRACTION FOR SPEECH PROCESSING

Speech feature extraction is responsible for transforming the speech signal into a stream of feature vectors whose coefficients contain only the information required to identify a given utterance. As every speech sound has unique attributes contained in the spoken words, these attributes can be extracted with a wide range of feature extraction techniques and employed for the speech recognition task. The extracted features should meet certain criteria when dealing with the speech signal: they should be easy to measure, consistent over time, and robust to noise and environment [9]. The feature vectors of speech signals are typically extracted using spectral analysis techniques such as Mel-frequency cepstral coefficients, linear predictive coding, and wavelet transforms. The most widely used feature extraction techniques are discussed below.

3.1 LPC (Linear Predictive Coding)

It is desirable to compress a signal for efficient transmission and storage; a digital signal is compressed before transmission for efficient utilization of channels on wireless media, and for medium or low bit-rate coders LPC is the most widely used technique. LPC computes a power spectrum of the signal and is used for formant analysis; it is one of the most powerful speech analysis techniques and has gained popularity as a formant estimation technique. When the speech signal is passed through a speech analysis filter to remove the redundancy in the signal, a residual error is generated as output, which can be quantized with fewer bits than the original signal. So instead of transferring the entire signal, we can transfer this residual error together with the speech parameters and regenerate the original signal. A parametric model is computed based on least mean squared error theory, a technique known as linear prediction (LP). By this method, the speech signal is approximated as a linear combination of its p previous samples. The LPC coefficients obtained in this way describe the formants: the frequencies at which the resonant peaks occur are called the formant frequencies.
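As an illustration of the least-squares formulation above, the following sketch (my own, not code from the paper; it uses NumPy and a synthetic frame) estimates p = 10 LPC coefficients for one windowed frame by solving the autocorrelation (Yule-Walker) normal equations:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients of one windowed frame by solving the
    autocorrelation normal equations of linear prediction."""
    # Autocorrelation r[0..N-1] of the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz system R a = r[1:order+1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a  # a[k] weights sample s[n-1-k] in the prediction of s[n]

# Synthetic voiced-like frame: damped sinusoid plus a little noise
rng = np.random.default_rng(0)
n = np.arange(400)
sig = np.sin(2 * np.pi * 0.05 * n) * np.exp(-0.005 * n)
sig += 0.01 * rng.standard_normal(400)
a = lpc_coefficients(sig * np.hamming(400), order=10)
print(a.shape)  # -> (10,)
```

Peak-picking on the spectrum of the resulting all-pole filter 1/A(z) would then give formant estimates, as described in the text.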
Thus, with this method, the locations of the formants in a speech signal are estimated by computing the linear predictive coefficients over a sliding window and finding the peaks in the spectrum of the resulting LP filter. We have excluded the 0th coefficient and used the next ten LPC coefficients. In speech generation, the vocal cords vibrate harmonically during vowel sounds, producing quasi-periodic signals, while for consonants the excitation source can be considered random noise. The vocal tract works as a filter that shapes the speech response. This biological phenomenon of speech generation can easily be converted into an equivalent mechanical model: a periodic impulse train and random noise serve as the excitation sources, and a digital filter serves as the vocal tract [4].

3.2 MFCC (Mel-frequency cepstral coefficients)

MFCCs are captured from a cepstral representation of the audio clip, and MFCC is the most popular feature extraction method. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filters, namely linearly spaced filters and logarithmically spaced filters. The phonetically important characteristics of the speech signal are expressed on the Mel frequency scale:

Mel(f) = 2595 * log10(1 + f/700)    (1)

The Mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The MFCC block diagram is shown in figure 3.
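Equation (1) can be checked numerically; the short sketch below (my own illustration, not from the paper) converts a few frequencies to mels, showing the near-linear behaviour below 1000 Hz:

```python
import math

def hz_to_mel(f_hz):
    """Mel scale of equation (1): Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (100, 500, 1000, 4000, 8000):
    # 1000 Hz maps to ~1000 mel by construction of the formula
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

A mel filter bank simply places triangular filters at equal spacing on this mel axis, which yields the dense-below-1-kHz, sparse-above-1-kHz arrangement the text describes.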

Figure 3. Block Diagram of MFCC

A. Frame Blocking
Removes the acoustic interference present at the beginning and end of the sound file.

B. Windowing
Improves the sharpness of the harmonics, removes signal discontinuities by tapering the beginning and end of each frame to zero, decreases the spectral distortion created by the overlap, and decreases the error produced by the FFT.

C. FFT
Converts each frame from the time domain to the frequency domain; its calculation time is about ten times lower than that of the classic DFT.

D. Mel Frequency Warping
A mel filter bank is applied at this stage; each filter yields a cepstral coefficient. The signal is mapped onto the Mel spectrum to mimic human hearing.

E. Cepstrum
The Mel cepstrum is converted back to the standard frequency scale. This is the key step for speech recognition [5, 6].

3.3 DWT (Discrete Wavelet Transform)

The basic idea of the DWT is that a one-dimensional signal is divided into two parts, a high-frequency part and a low-frequency part. The low-frequency part is then split into two parts again, and the same process continues until the desired level is reached. The edge components of the signal are contained in its high-frequency part. In DWT decomposition the input signal length must be a multiple of 2^n, where n is the number of levels. The DWT provides sufficient information for the analysis and synthesis of the original signal and requires less computation time [7, 8].

CONCLUSION

Different feature extraction techniques and recognition techniques are discussed in this paper, and it can be concluded that the performance of the MFCC technique is superior to that of LPCC. This paper attempts to provide a comprehensive survey of speech feature extraction techniques. Speech processing has attracted scientists as an important discipline and has created a technological influence on society.
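The level-by-level low/high split described in the DWT section above can be sketched with the simplest wavelet, the Haar wavelet (a minimal illustration of my own, not code from the paper; it assumes the input length is a multiple of 2^n):

```python
import numpy as np

def haar_dwt_step(x):
    """One DWT level: split the signal into a low-frequency
    approximation and a high-frequency detail (Haar wavelet)."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)  # low-frequency part
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)  # high-frequency part
    return approx, detail

def haar_dwt(x, levels):
    """Repeat the split on the low-frequency part, as in the text.
    Input length must be a multiple of 2**levels."""
    details = []
    for _ in range(levels):
        x, d = haar_dwt_step(x)
        details.append(d)
    return x, details

sig = np.sin(2 * np.pi * np.arange(256) / 32)
approx, details = haar_dwt(sig, levels=3)
print(len(approx), [len(d) for d in details])  # -> 32 [128, 64, 32]
```

Because the Haar transform is orthonormal, the approximation and detail coefficients together carry the full signal energy, which is why the DWT suffices for both analysis and synthesis of the original signal.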
REFERENCES

[1] M. A. Anusuya, "Speech Recognition by Machine", International Journal of Computer Science and Information Security, Vol. 6, No. 3, 2009.

[2] S. J. Arora and R. Singh, "Automatic Speech Recognition: A Review", International Journal of Computer Applications, Vol. 60, No. 9, December 2012.

[3] Santosh K. Gaikwad and Bharti W. Gawali, "A Review on Speech Recognition Technique", International Journal of Computer Applications, Vol. 10, No. 3, November 2010.

[4] Lindasalwa Muda, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Volume 2, Issue 3, March 2010.

[5] Nidhi Srivastava and Dr. Harsh Dev, "Speech Recognition using MFCC and Neural Networks", International Journal of Modern Engineering Research (IJMER), March 2007.

[6] Dr. R. L. K. Venkateswarlu, Dr. R. Vasantha Kumari and A. K. V. Nagavya, "Efficient Speech Recognition by Using Modular Neural Network", Int. J. Comp. Tech. Appl., Vol. 2 (3).

[7] Bishnu Prasad Das and Ranjan Parekh, "Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers", International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, May-June 2012.

[8] Om Prakash Prabhakar and Navneet Kumar Sahu, "A Survey On: Voice Command Recognition Technique", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013.

[9] Milind U. Nemade and Prof. Satish K. Shah, "Survey of Soft Computing based Speech Recognition Techniques for Speech Enhancement in Multimedia Applications", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 5, May 2013.