Speech to Text Conversion in Malayalam

Preena Johnson 1, Jishna K C 2, Soumya S 3
1 (B.Tech graduate, Computer Science and Engineering, College of Engineering Munnar/CUSAT, India)
2 (B.Tech graduate, Computer Science and Engineering, College of Engineering Munnar/CUSAT, India)
3 (Assistant Professor, Computer Science and Engineering, College of Engineering Munnar/CUSAT, India)

Abstract: Speech recognition and related tasks are becoming common for many languages. For the Malayalam language, however, recognizing speech remains a difficult task. This paper presents a speech to text conversion system for Malayalam. The system considers only isolated words, is speaker dependent, and has a limited vocabulary. The uttered word, which is the input to the system, is displayed in Malayalam as the output. After the recording phase, features are extracted using Mel-Frequency Cepstral Coefficients (MFCC), and the audio files are trained using the Gaussian Mixture Model (GMM).

Keywords: Feature extraction, GMM, Malayalam, MFCC, Speech to text

I. INTRODUCTION
The most common way to interact with computers is through a keyboard and mouse. When a large amount of data has to be entered, this is time consuming, and changing the mode of interaction can solve the problem. Speech is the most natural means of communication between human beings, so a system that understands what a human speaks offers the best method of interaction between a human and a computer. Much work related to natural language processing is going on these days. Speech to text systems take speech as input, recognize it, and convert it into text. A speech to text system supports many applications: aids for blind persons, telephone directory assistance, health care instruments in hospitals, banking, mobile phones, and so on.

Malayalam [1][2][3] is one of the Dravidian languages and the official language of the state of Kerala. The Malayalam alphabet contains 37 consonants and 16 vowels. The consonants are arranged based on the mode of speech production and the flow of air. Numerous works have been carried out for many of the Indian languages; however, far less work has been reported for Malayalam. In this paper we introduce a speech to text system for Malayalam. Our system is speaker dependent and considers five isolated words for training. Initially the words are stored and trained: for each word, a number of samples are recorded and stored, and the uttered word is later compared with these stored words.

The main stage of a speech recognizer is feature extraction. Many feature extraction techniques [4][5] are in use, such as Linear Predictive Coding (LPC) [2], Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Principal Component Analysis (PCA), MFCC [6][1], kernel-based feature extraction, wavelet transform, and spectral subtraction. The most commonly used technique is MFCC: it spaces frequencies according to human perceptual sensitivity, which makes it well suited to speech recognition. For recognizing speech, Hidden Markov Models [3][7][8], GMM [6][9], Vector Quantization (VQ), and Artificial Neural Networks (ANN) [10] are among the techniques [5][8] in use. In our system, the GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. Python is used for the implementation, and a user interface is provided.

II. Methodology
2.1. Recording
First, the words are recorded and stored. For recording in Python we use PyAudio, a library with built-in audio functions. For each word, a good number of samples are stored as separate files. Parameters such as frame size, format, number of channels, and sampling rate are specified when the audio stream is opened.
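As a rough illustration of this step, a minimal PyAudio recording sketch is given below. The paper specifies only that PyAudio is used and that the stream parameters are set when it is opened; the sampling rate, duration, and file name here are assumptions made for the example.

```python
# Minimal recording sketch with PyAudio (parameters are illustrative).
import wave
import pyaudio

CHUNK = 1024              # frames read per buffer
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1              # mono speech
RATE = 16000              # sampling rate in Hz (assumed)
SECONDS = 2               # duration of one sample (assumed)

def record_sample(path):
    """Record SECONDS of audio and store it as a .wav file."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
    stream.stop_stream()
    stream.close()
    sample_width = pa.get_sample_size(FORMAT)
    pa.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

# Several samples per word, each in its own file, e.g.:
# record_sample("amma_01.wav")
```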

2.2. Feature Extraction
An efficient method for feature extraction is MFCC. The MFCC algorithm involves seven steps, discussed below; after this phase, the Mel-frequency cepstral coefficients are obtained.

Fig 1. Block diagram of MFCC

2.2.1. Pre-emphasis
Higher-frequency components, which are suppressed during human sound production, are compensated in the pre-emphasis phase. Pre-emphasis also amplifies the high-frequency content of the spoken word.

2.2.2. Framing
The input is segmented into frames of 20-30 ms with an optional overlap. To facilitate the use of the FFT, the frame size is usually a power of two.

2.2.3. Windowing
A Hamming window is applied to each frame in order to maintain continuity between the first and last points of the frame. If the signal in a frame is denoted by s(n), n = 0, ..., N-1, then after Hamming windowing the signal is given by s(n)*w(n), where w(n) is the Hamming window.

2.2.4. FFT
In speech signals, different timbres correspond to different energy distributions over frequency, so an FFT is performed to obtain the magnitude frequency response of each frame. Performing an FFT [11] on a frame implicitly assumes that the signal in the frame is continuous and periodic when wrapped around. Although this is not actually the case, the FFT can still be performed, but the discontinuity between the frame's first and last points introduces undesirable effects in the frequency response. Two strategies mitigate this problem:
1. Multiply each frame by a Hamming window to improve the continuity between its first and last points.
2. Use a variable frame size so that each frame always contains an integer multiple of the fundamental period of the speech signal.

2.2.5. Mel Filter Bank
A set of 20 triangular bandpass filters, equally spaced along the Mel frequency scale, is used to extract the spectral envelope.

2.2.6. Logging
The magnitude frequency response is multiplied by each filter, and the log energy of each bandpass filter output is computed.

2.2.7. Inverse DFT
Since an FFT was performed, a Discrete Cosine Transform (DCT) transforms the log filter-bank energies from the frequency domain into a time-like domain called the quefrency domain. The resulting features resemble a cepstrum and are therefore referred to as mel-scale cepstral coefficients.
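The seven steps can be sketched compactly in NumPy as follows. This is a simplified illustration rather than the authors' exact implementation: the 256-sample frames and the 20 triangular filters follow the values given in the paper, while the sampling rate, hop size, and number of retained coefficients are assumptions.

```python
# Simplified MFCC pipeline following the seven steps above (illustrative).
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, rate=16000, frame_len=256, frame_step=128,
         n_filters=20, n_ceps=13):
    # 1. Pre-emphasis: boost the high-frequency content.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing: split into overlapping frames (frame_len is a power of two).
    n_frames = 1 + (len(emphasized) - frame_len) // frame_step
    idx = (np.arange(frame_len)[None, :] +
           frame_step * np.arange(n_frames)[:, None])
    frames = emphasized[idx]
    # 3. Windowing: Hamming window smooths the frame edges.
    frames = frames * np.hamming(frame_len)
    # 4. FFT: magnitude spectrum, then power spectrum, of each frame.
    mag = np.abs(np.fft.rfft(frames, frame_len))
    power = (mag ** 2) / frame_len
    # 5. Mel filter bank: 20 triangular filters equally spaced on the Mel scale.
    high_mel = 2595 * np.log10(1 + (rate / 2) / 700)      # Hz -> Mel
    mel_pts = np.linspace(0.0, high_mel, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)           # Mel -> Hz
    bins = np.floor((frame_len + 1) * hz_pts / rate).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    # 6. Logging: log energy of each bandpass filter output.
    feat = np.log(power @ fbank.T + 1e-10)
    # 7. Inverse DFT: DCT into the quefrency domain; keep the first coefficients.
    return dct(feat, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

Each stored sample is converted to such a coefficient matrix during training, and the same function is applied to a test utterance before matching.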

2.3. GMM
The GMM [12] is used as a classifier to compare the features extracted by MFCC with the stored templates. A Gaussian Mixture Model is a probabilistic model: it assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The Gaussian mixture density is a weighted sum of M component densities b_i and is expressed as

    p(x | λ) = Σ_{i=1}^{M} w_i b_i(x)    (1)

where x is a feature vector, b_i(x) are the component Gaussian densities, and w_i are the mixture weights, which sum to one. The mean vectors, covariance matrices, and mixture weights of all component densities describe the GMM. For matching, the Euclidean distance between the various recordings is computed, and hence the correct match is found.
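The paper does not name a library for GMM training. As one plausible reading of the training and matching steps, the sketch below fits a scikit-learn GaussianMixture per vocabulary word on the MFCC frames of that word's samples, then labels a test utterance with the word whose model gives the highest average log-likelihood under equation (1); the number of mixture components is an assumption.

```python
# One GMM per vocabulary word, scored by log-likelihood (a sketch using
# scikit-learn; the paper does not specify the training library).
from sklearn.mixture import GaussianMixture

def train_word_models(features_per_word, n_components=4):
    """features_per_word: {word: MFCC array of shape (frames, n_ceps)},
    built by stacking the frames of all recorded samples of that word."""
    models = {}
    for word, feats in features_per_word.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(feats)  # learns the weights w_i, means, and covariances
        models[word] = gmm
    return models

def recognize(models, test_feats):
    # score() returns the average per-frame log p(x | lambda); pick the
    # word whose model explains the test utterance best.
    scores = {word: gmm.score(test_feats) for word, gmm in models.items()}
    return max(scores, key=scores.get)
```

Since the paper also mentions Euclidean-distance matching between recordings, an alternative sketch would compare the mean MFCC vector of the test utterance with those of the stored samples using numpy.linalg.norm and pick the nearest word.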

III. Implementation
The speech signal is recorded using the PyAudio package, which is convenient since the system is developed in Python, and it is stored as a .wav file. The uttered word is isolated by removing the silence. MFCC is applied to extract the features of the speech signal; the number of samples chosen in a frame is 256. After this phase we obtain the Mel-frequency cepstral coefficients. Next, the GMM model parameters are produced. The Euclidean distance between the various recordings in the database is computed and the matching word is found. The matched word is then displayed in Malayalam. A graphical user interface for recording the voice is provided using PyQt4.

Fig 2. User Interface for recording
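The paper does not describe how the silence surrounding the uttered word is removed. A short-time energy threshold is one common choice for isolated-word input; the sketch below makes that assumption, reusing the 256-sample frame length mentioned above, with an arbitrarily chosen threshold ratio.

```python
# Energy-based silence trimming (an assumption; the paper does not say
# how the uttered word is isolated from the surrounding silence).
import numpy as np

def trim_silence(signal, frame_len=256, threshold_ratio=0.1):
    # Short-time energy per non-overlapping frame.
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len).astype(float)
    energy = (frames ** 2).sum(axis=1)
    # Keep the span of frames whose energy exceeds a fraction of the peak.
    active = np.where(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return signal
    start, end = active[0] * frame_len, (active[-1] + 1) * frame_len
    return signal[start:end]
```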

IV. Results
The system is trained for five words: അമ്മ, പൃഥ്വി, കഥ, വേഴാമ്പൽ, കേരളം. Five samples are recorded for each word. Each word is tested 25 times and the accuracy percentage is calculated.

Fig 3. The word is recorded
Fig 4. The word is displayed

The accuracy of the system is determined using the confusion matrix shown in Table 1.

Table 1. Testing and accuracy of results

Train data   Number of tests   Correct   Errors   Accuracy (%)
അമ്മ          25                19        6        76
കേരളം         25                18        7        72
കഥ           25                20        5        80
വേഴാമ്പൽ      25                17        8        68
പൃഥ്വി        25                19        6        76

V. Conclusion
The system is a first step towards an advanced speech to text conversion system for Malayalam. It gives an accuracy of about 75% when modelled using GMM. We would like to extend this work to a speaker independent system that also handles a large vocabulary.

References
[1]. Cini Kurian, Kannan Balakrishnan, "Speech recognition of Malayalam numbers," World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), 2009.
[2]. Maya Moneykumar, Elizabeth Sherly, Win Sam Varghese, "Malayalam word identification for speech recognition system," An International Journal of Engineering Sciences, Special Issue iDravadian, Vol. 15, December 2014.
[3]. Cini Kurian, Kannan Balakrishnan, "Development & evaluation of different acoustic models for Malayalam continuous speech recognition," International Conference on Communication Technology and System Design, 2011.
[4]. R. K. Aggarwal, Mayank Dave, "Implementing a Speech Recognition System Interface for Indian Languages."
[5]. Prachi Khilari, V. P. Bhope, "A Review on Speech to Text Conversion Methods," International Journal of Advanced Research in Computer Engineering & Technology, Vol. 4, Issue 7, July 2015.
[6]. Tahira Mahboob, Memoona Khanum, Malik Sikandar Hayat Khiyal, Ruqia Bibi, "Speaker Identification Using GMM with MFCC," IJCSI International Journal of Computer Science Issues, Vol. 12, Issue 2, March 2015.
[7]. Nuzhat Atiqua Nafis, Md. Safaet Hossain, "Speech to text conversion in real time," International Journal of Innovation and Scientific Research, ISSN 2351-8014, Vol. 17, No. 2, August 2015.
[8]. L. R. Rabiner, B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, January 1986.
[9]. Virendra Chauhan, Shobhana Dwivedi, Pooja Karale, S. M. Potdar, "Speech to text converter using Gaussian Mixture Model (GMM)," International Research Journal of Engineering and Technology (IRJET), Vol. 3, Issue 2, February 2016.
[10]. Maya Moneykumar, Elizabeth Sherly, Win Sam Varghese, "Isolated Word Recognition System for Malayalam using Machine Learning," 2016.
[11]. Anshul Gupta, Nileshkumar Patel, Shabana Khan, "Automatic Speech Recognition Technique for Voice Command."
[12]. Virendra Chauhan, Shobhana Dwivedi, Pooja Karale, S. M. Potdar, "Speech to Text Converter Using Gaussian Mixture Model (GMM)," International Research Journal of Engineering and Technology (IRJET), Vol. 3, Issue 2, February 2016.