Volume 3, Issue 5, May 2013. ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper. Available online at: www.ijarcsse.com
© 2013, IJARCSSE. All Rights Reserved.

Frequency Analysis of Speech Signals for Devanagari Script and Numerals Using FFT

Umesh Kumar Gupta, M-Tech Student, Electronics Engg. Department, B.V.D.U. College of Engineering, Pune, India.
Dr. R. K. Prasad, Electronics Engg. Department, B.V.D.U. College of Engg., Pune, India.

Abstract: This paper presents a frequency analysis of spoken Devnagari script characters and numerals, computed from the original speech signals. Devnagari vowels and numerals play a vital role in the pronunciation of any word and in counting. Each vowel and number is classified as starting, middle or end according to the duration of its occurrence in the word. The Devnagari script, with 12 vowels and 34 consonants, is used in some Indian languages such as Hindi, and the 10 numerals (0-9) are used in mathematics. Sound samples from multiple speakers were used to extract different features. Initial processing of the data, i.e., normalization and time-slicing, was done using a combination of Simulink and MATLAB. Afterwards, the same tools were used to calculate Fourier descriptors and correlations. The correlation allowed comparison of the same word or numeral spoken by the same and by different speakers. The frequencies were calculated statistically (mean and standard deviation), and a table relating amplitude to frequency was generated. Such a system could potentially be used to implement a voice-driven help setup at call centres of commercial organizations operating in India and other regions. The implementation, the experiments and a discussion of the results are also presented.

Keywords: Spoken Hindi words and numerals, Fourier descriptors, Correlation, Mel Frequency Cepstral Coefficient (MFCC), Feature extraction.

I. INTRODUCTION
Fundamental frequency estimation has been a popular topic in many fields of research, such as speech synthesis, speech processing and speaker identification. Devnagari vowels and numerals cannot be pronounced in two ways; each has only one pronunciation. The 12 Devnagari vowels are classified, following the phonetic transcription structure of phonemes, according to the organ used to produce the sound. Devnagari is based on phonetic principles, and its vowels are organized by place of articulation (POA). The frequency content of these Devnagari vowels is estimated from the original speech signals, recorded in a noisy environment, for analysis and synthesis. The original speech signals are uneven in length and are adjusted to a common interval with the help of feature extraction techniques or the Sound Forge 9.0 software. The initial objective is to estimate the pitch of Devnagari vowels and numerals from speech signals recorded in noisy environments.

When one looks at a person, car or house, one's brain tries to match the incoming picture with the hundreds (or thousands) of pictures already stored in memory. To our knowledge, no work on the processing of Devnagari speech and numerals has been reported in the speech recognition research literature, so we consider our work to be the first attempt in this direction. The process involves extracting distinct characteristics of individual words using Fourier transforms and their correlations. The system is speaker-independent and moderately tolerant to background noise.

II. DEVNAGARI VOWELS
The 12 Devnagari vowels are categorised according to the IPA (International Phonetic Association), as shown in Table II. They are used for speech analysis and synthesis and fall into the following categories.

A. Short Vowels
A short vowel is a single vowel (V) in a short word or syllable; that vowel usually makes a short sound. Short vowels usually appear at the beginning of a word or between two consonants. E.g. the short vowels represent character in Marathi and in Hindi.

B. Long Vowels
A long vowel occurs when a short word or syllable ends with a vowel-consonant (VC) pair. The 'a' at the end of the word is silent. When the word or syllable has a single vowel and that vowel appears at the end of the word or syllable, the vowel usually makes the long sound in Hindi.

C. Conjunct Vowels
The conjunct vowels are combinations of short and long vowels. These phonemes are produced in Hindi, e.g. as shown in Table II.

D. Nasal Vowel

A nasal vowel is produced with a low tone, so that air escapes through the nose as well as the mouth. The term "nasal" is slightly misleading, since the air does not come exclusively out of the nose in nasal vowels.

E. Visarg Vowel
The Visarg symbol is rarely used in Devnagari. The visarg is pronounced as a voiceless sound after the vowel, e.g. in Hindi.

TABLE I. RANGE OF HUMAN SPEECH
Gender    Fundamental frequency (F0) min, Hz    Fundamental frequency (F0) max, Hz
Male      80                                    200
Female    150                                   350

TABLE II. DEVNAGARI VOWELS CLASSIFIED INTO FIVE TYPES
Types of Devnagari vowels (columns 1 to 4): SHORT, LONG, CONJUNCT, NASAL, VISARG.

TABLE III. HINDI CHARACTER SET

III. SPEECH MODELLING USING AVERAGE ENERGY IN THE ZERO-CROSSING INTERVAL
The speech production model suggests that the energy of voiced speech is concentrated about 8 Hz, whereas in unvoiced speech most of the energy is found at higher frequencies. Since a high frequency implies a high zero-crossing rate and a low frequency implies a low zero-crossing rate, there is a strong correlation between the zero-crossing rate and the distribution of energy with frequency. This motivates us to model the speech signal using the average energy in the zero-crossing intervals of the signal. Consider the speech segment shown in Figure 1. $ZC_i$ denotes the ith zero crossing and $ZC_{i+1}$ the (i+1)th zero crossing of the kth observation window. The time interval between these two points is called the ith zero-crossing interval $T_i$ of the kth observation window.
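A minimal sketch of this idea for sampled data, written in Python rather than the MATLAB/Simulink tools used in the paper. It assumes the average energy of an interval is approximated by the mean of the squared samples between two consecutive zero crossings; the function names `zero_crossings` and `aezi` are hypothetical and not taken from the paper.

```python
def zero_crossings(x):
    """Return the indices i at which the sign changes between x[i-1] and x[i].

    Assumption: a sample equal to zero is treated as non-negative.
    """
    return [i for i in range(1, len(x)) if (x[i - 1] < 0) != (x[i] < 0)]


def aezi(x):
    """Average energy in each zero-crossing interval of the sequence x.

    Each interval runs from one zero crossing up to (not including) the next;
    its average energy is the mean of the squared samples inside it.
    """
    zc = zero_crossings(x)
    energies = []
    for start, end in zip(zc, zc[1:]):
        segment = x[start:end]
        energies.append(sum(s * s for s in segment) / len(segment))
    return energies


# Toy signal with zero crossings at indices 2, 4 and 6.
signal = [1, 1, -1, -1, 2, 2, -3]
print(aezi(signal))  # [1.0, 4.0]
```

Plotting the returned list against its index gives the kind of XY plot described for the AEZI coefficient: interval index on the X axis, average energy on the Y axis.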

FIGURE 1. SPEECH SEGMENT IN THE KTH OBSERVATION WINDOW

The average energy in the ith zero-crossing interval can be obtained from the expression:

$$E_i = \frac{1}{T_i} \int_{ZC_i}^{ZC_{i+1}} X^2(t)\,dt$$

$E_i$ is the average energy of the signal in the ith zero-crossing interval $T_i$ of the kth observation window, and $X(t)$ is the instantaneous signal amplitude. The aim of the present study is to find a robust coefficient for speech recognition applications using the average energy in the zero-crossing interval (AEZI). An XY plot is generated by plotting the index number of the zero-crossing intervals along the X axis and the AEZI along the Y axis. Figure 1 represents the average energy in the zero-crossing interval versus the index number of the zero-crossing interval for the Hindi script.

IV. DATA ACQUISITION AND PROCESSING
One of the obvious methods of speech data acquisition is to have a person speak into an audio device such as a microphone or telephone. The act of speaking produces a sound pressure wave that forms an acoustic signal. The microphone or telephone receives the acoustic signal and converts it into an analog signal that can be understood by an electronic system. Finally, in order to store the analog signal on a computer, it must be converted to a digital signal.

The data in this paper was acquired by speaking Hindi words and numerals into a microphone connected to a Windows 7 based PC. The data was saved into .wav format files using MATLAB. The sound files were passed through a (Simulink) filter and saved for further analysis such as the FFT. We recorded the data from speakers who spoke the same word set, i.e. Devnagari script and numerals. In general, the digitized speech waveform has a high dynamic range and can suffer from additive noise. So first, a Simulink model was used to extract and analyze the acquired data; see Fig. 2.

FIGURE 2. SIMULINK MODEL FOR ANALYZING HINDI DATA AND NUMERALS

The Simulink model shown in Fig. 2 was developed to perform analyses such as standard deviation, mean, autocorrelation, magnitude of the FFT and data matrix correlation. We also tried a few other statistical techniques. We started our experiments with Simulink, but soon found this GUI-based tool somewhat limited, because it was not easy to create multiple models containing variations among them. The iterative and variable nature of the models eventually led us to MATLAB's (text-based) .m files. We created these files semi-automatically using a Hindi-language script developed specifically for this purpose. Three main data pre-processing steps were required before the data could be used for analysis.

A. Pre-emphasis
By pre-emphasis, we mean the application of a normalization technique, performed by dividing the speech data vector by its highest magnitude.

B. Data Length Adjustment
The FFT execution time depends on the exact number of samples (N) in the data sequence [x_K]; it is minimal, and proportional to N*log2(N), when N is a power of two. Therefore, it is often useful to choose a data length equal to a power of two.

C. Endpoint Detection
The goal of endpoint detection is to isolate the word to be detected from the background noise. It is necessary to trim the word utterance to its tightest limits in order to avoid errors in the modeling of subsequent utterances of the same word. As the upper part of Fig. 2 shows, a threshold has been applied at both ends of the waveform. The front threshold is normalized to a value such that all the spoken numbers are trimmed to a maximum value. These values were obtained after observing the behaviour of the waveform and the noise in a particular environment. We can see the difference in the frequency characteristics of the words.

D. Fourier Transform
The MATLAB algorithm for the two-dimensional FFT routine is as follows:

    fft2(x) = fft(fft(x).').';

Thus the two-dimensional FFT is computed by first computing the FFT of x, that is, the FFT of each column of x, and then computing the FFT of each row of the result. Note that, as the fft2 command produces even-symmetric data, we only show the lower half of the frequency spectrum in our graphs.

E. Correlation
Correlation coefficients were calculated for the different speakers. As expected, the cross-correlation of the same speaker for the same word came out to be 1. The correlation matrix of a spoken number was generated in three-dimensional form for generating different simulations and graphs.
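The steps above can be sketched end to end as follows. This is an illustrative Python version under stated assumptions, not the authors' MATLAB/Simulink code: the threshold value, the one-dimensional radix-2 FFT (in place of `fft2`) and all function names are hypothetical choices made for the example.

```python
import cmath
import math


def normalize(x):
    """Divide every sample by the largest absolute sample (step A)."""
    peak = max(abs(s) for s in x)
    return [s / peak for s in x] if peak > 0 else list(x)


def pad_to_power_of_two(x):
    """Zero-pad x so that its length is a power of two (step B)."""
    n = 1
    while n < len(x):
        n *= 2
    return list(x) + [0.0] * (n - len(x))


def trim_endpoints(x, threshold):
    """Keep the span from the first to the last sample above threshold (step C)."""
    idx = [i for i, s in enumerate(x) if abs(s) > threshold]
    return list(x[idx[0]: idx[-1] + 1]) if idx else []


def fft(x):
    """Recursive radix-2 FFT (step D); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences (step E).

    Assumes neither sequence is constant, otherwise the denominator is zero.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def magnitude_spectrum(x, threshold=0.1):
    """Normalize, trim, pad, then return the FFT magnitudes."""
    clean = pad_to_power_of_two(trim_endpoints(normalize(x), threshold))
    return [abs(c) for c in fft(clean)]
```

Comparing two utterances then amounts to calling `correlation` on their magnitude spectra (trimmed or padded to the same length). Correlating a spectrum with itself returns 1, which matches the paper's observation for the same speaker and the same word.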
V. RELATED WORK
This section reviews related work, such as the implementation of an experimental, speaker-dependent, real-time recognizer of isolated words for the Hindi language (Devnagari script) using the Dynamic Time Warping (DTW) technique. That work emphasized a template-based recognizer using linear predictive coding with dynamic-programming computation for isolated-word tasks.

A. Standard MFCC
Mel cepstral feature extraction is used in some form or another in virtually every state-of-the-art speech and frequency analysis system. First, the speech samples are divided into overlapping frames. The usual frame length is 25 ms and a new frame starts every 10 ms. Each frame is usually processed by a pre-emphasis filter to amplify the higher frequencies. In the next step, the voiced samples are counted and the Fourier spectrum of the signal is computed. A Mel-spaced bank of filters is then applied to obtain a vector of log energies. Usually 20 to 40 filters are used, depending on the application. The output of the filter bank is then converted to cepstral coefficients using the discrete cosine transform (DCT), and only the first 12 coefficients are retained for computing the feature vector. Finally, the feature vector consists of 39 values: the 12 cepstral coefficients and one energy term, together with their first and second time derivatives.

B. Extended MFCC
Thirteen extra triple-delta features are added to the standard 39 MFCC features, forming a feature vector of 52 values. These 52 values are then reduced to 39 by applying a feature reduction technique. Such techniques are based on linear transformation schemes like principal component analysis (PCA), linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA). HLDA, first proposed by N. Kumar, has been widely used for various feature combination techniques. It maximizes the likelihood of all the training data in the transformed space, and each training sample contributes equally to the objective function.
We have used HLDA for feature reduction, and this procedure is named extended MFCC, as shown in Figure 3.

C. Robust Features
In noisy environments, when the training and testing conditions are severely mismatched, these features cannot work well. Therefore, feature-domain signal processing methods are applied to enhance the distorted speech. Spectral subtraction is widely used as a simple technique to reduce additive noise in the spectral domain; the overall aim of these methods is to eliminate the convolutive channel effect and the noise distortion.

D. Gaussian Mixture HMM
In this method, continuous-density hidden Markov models are used to match the phonetic information of the speech signal with the feature vectors derived at the front end. Multivariate Gaussian mixtures are used to calculate the likelihood of the observation vectors (i.e. spectral features). The representation of phonetic information, the HMM topology and the number of Gaussian mixtures are the key issues for the implementation of these statistical techniques.
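To make the likelihood computation concrete, the following Python sketch evaluates the log-likelihood of one observation vector under a Gaussian mixture, as would be done for a single HMM state. It assumes diagonal covariance matrices, a common simplification that the paper does not specify; the function names and the toy parameters are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Return log( sum_k w_k * N(x; mean_k, diag(var_k)) ).

    The sum is evaluated with the log-sum-exp trick so that very small
    component densities do not underflow to zero.
    """
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


# Toy two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 2.0]]
score = gmm_log_likelihood([0.5, -0.2], weights, means, variances)
```

In a full recognizer, each HMM state would hold its own weights, means and variances, and the per-frame scores would be combined along the state sequence; that decoding step is outside the scope of this sketch.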

FIGURE 3. EXTENDED MFCC

VI. ANALYSIS & RESULTS
We observed that the Fourier descriptor feature was independent for the spoken Devnagari script and numerals. By combining the Fourier transform and correlation commands available in MATLAB, a high-accuracy recognition system can be realized. The recorded data was used in the Simulink model for introductory analysis.

FIGURE 4. THE FFT WAVEFORM OF THE WORD अ IN DEVNAGARI SCRIPT (panels: "Time Series for a" and "Spectrum of speech a")
X = 15: the signal has 15 data points, denoted by X, and, like each and every word, 5 peak values for अ in Devnagari script.

FIGURE 5. THE FFT WAVEFORM OF THE WORD aa IN DEVNAGARI SCRIPT (panels: "Time Series for aa" and "Spectrum of speech aa")
X = 18: the signal has 18 data points, denoted by X, and, like each and every word, 5 peak values for aa in Devnagari script.
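The five peak values reported for each utterance can be picked from a magnitude spectrum along the following lines. This Python sketch rests on assumptions that the paper does not state: a peak is taken to be a local maximum in the lower half of the spectrum, bin k is mapped to the frequency k * sample_rate / N, and the function name `top_peaks` is hypothetical.

```python
def top_peaks(magnitudes, sample_rate, count=5):
    """Return up to `count` (magnitude, frequency_hz) pairs, largest first.

    `magnitudes` is the full FFT magnitude list of a real signal; only the
    lower half is searched because the upper half mirrors it.
    """
    n = len(magnitudes)
    half = magnitudes[: n // 2 + 1]
    peaks = []
    for k in range(1, len(half) - 1):
        # Local maximum: strictly above the left neighbour and not below
        # the right one.
        if half[k - 1] < half[k] >= half[k + 1]:
            peaks.append((half[k], k * sample_rate / n))
    peaks.sort(reverse=True)
    return peaks[:count]


# Toy 16-point spectrum with an assumed 1600 Hz sampling rate (100 Hz bins).
spectrum = [0, 1, 5, 1, 0, 3, 0, 2, 0, 2, 0, 3, 0, 1, 5, 1]
print(top_peaks(spectrum, 1600, count=2))  # [(5, 200.0), (3, 500.0)]
```

If the reported peaks were instead the five largest bins regardless of neighbours, the local-maximum test would simply be dropped and the bins sorted by magnitude.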

FIGURE 6. THE FFT WAVEFORM OF THE ZERO IN NUMERALS (panels: "Time Series for zero" and "Spectrum of speech zero")
The signal has 2 data points, denoted by X, and, like each and every word, 5 peak values for Zero in numerals.

FIGURE 7. THE FFT WAVEFORM OF THE ONE IN NUMERALS (panels: "Time Series for one" and "Spectrum of speech one")
The signal has 14 data points, denoted by X, and, like each and every word, 5 peak values for One in numerals.

VII. CONCLUSION AND FUTURE WORK
In conclusion, an efficient, abstract and fast ASR system for regional languages like Hindi is the need of the hour. The work implemented in this paper is a step towards the development of such systems. The work may be further extended to larger vocabularies and to continuous speech recognition. As shown in the results, the system is sensitive to changes in speaking style and scenario, so the accuracy of the system is a challenging area to work on. Hence, various speech enhancement and noise reduction techniques may be applied to make the system more efficient, accurate and fast.

TABLE IV. PEAKS AND THEIR CORRESPONDING FREQUENCIES
SR. NO.   SPEECH WORD   PEAK           FREQUENCY (HZ)
1         A             P1  1.9565     F1  424
                        P2  71.196     F2  429
                        P3  57.5883    F3  567
                        P4  46.613     F4  415
                        P5  37.328     F5  434
2         AA            P1  95.4134    F1  596
                        P2  77.7759    F2  61
                        P3  7.3393     F3  65
                        P4  46.3746    F4  578
                        P5  44.5413    F5  12
3         ZERO          P1  46.355     F1  62
                        P2  4.2583     F2  67
                        P3  33.3799    F3  68
                        P4  32.5196    F4  129
                        P5  29.6611    F5  13
4         ONE           P1  51.631     F1  162
                        P2  46.8251    F2  164
                        P3  34.845     F3  16
                        P4  33.1       F4  41
                        P5  32.7313    F5  4