Phonemes based Speech Word Segmentation using K-Means

International Journal of Engineering Sciences Paradigms and Researches
Abdul-Hussein M. Abdullah 1 and Esra Jasem Harfash 2
1, 2 Department of Computer Science, College of Science, University of Basrah, IRAQ
1 Abdo6_24@yahoo.com and 2 Esra_jasem_211@yahoo.com
Publishing Date: April 25, 2016

Abstract

Phoneme segmentation is an important task in many speech processing applications. In this work the k-means algorithm is used to segment a spoken word into its phonemes. K-means is run to separate the vowel regions from the other regions in the input word signal; the segmentation points between vowel phonemes and the remaining phonemes can then be determined, and each phoneme extracted easily. The performance of k-means is measured using different representations of the word waveform: the time domain, the FFT, and the wavelet transform. We apply the system to 1 different words, and the accuracy of correctly determined segmentation points is 8.33%.

Keywords: Audio segmentation, Automatic Speech Segmentation, Clustering, K-Means Algorithm.

1. Introduction

Automatic speech segmentation is useful in many speech processing applications, e.g., automatic speech recognition and automatic annotation of speech corpora [1]. The quality of the segmentation affects recognition performance in several ways: speaker adaptation and speaker clustering methods assume that a segment is spoken by a single speaker, and the language model performs better if segment boundaries correspond to boundaries of sentence-like units [2]. The development of speech systems has created a demand for new and better speech databases (using new voices, new dialects, new special features to consider, etc.), often with phonetic-level annotation. This trend reinforces the importance of automatic segmentation and annotation tools, because of the drastic time and cost reduction they bring to the development of speech corpora, even when some small amount of human intervention is still needed [3].

Speech can be represented phonetically by a limited set of symbols called the phonemes of the language, the number of which depends on the language and the refinement of the analysis. For most languages the number of phonemes lies between 32 and 64. Each phoneme is distinguished by its own unique pattern, and different phonemes are distinguishable in terms of their formant frequencies [4]. Speakers of a language can easily dissect its continuous sounds into words. With more difficulty, they can split words into their component sound segments (phonemes). Phoneme segmentation is the task of isolating the component sounds of a word into its distinctive unit sounds, or phonemes. Automatic speech segmentation is then the process of taking the phonetic transcription of an audio speech segment and determining where in time particular phonemes occur in that segment, using appropriate algorithms [5].

One good approach that can be used in the segmentation process is clustering. Among the formulations of partitional clustering based on the minimization of an objective function, the k-means algorithm is the most widely used and studied; it requires each data object to be describable in terms of numerical coordinates.
This algorithm partitions the data points (objects) into C groups (clusters) so as to minimize the sum of the squared distances between the data points and the centers (means) of the clusters [6,7]. In this paper, a tool for automatic phoneme segmentation using the k-means algorithm is presented.

2. K-Means Clustering

Clustering is unsupervised classification: the partitioning of a data set into a set of meaningful subsets, where each object in a subset shares some common property, often proximity according to some defined distance measure. Among the various clustering techniques, k-means is one of the most popular algorithms. The objective of k-means is to make the distances between objects in the same cluster as small as possible. K-means is a simple, prototype-based partitional clustering technique which attempts to find a user-specified number k of clusters, represented by their centroids. It divides the n objects into k clusters so as to create relatively high similarity within a cluster and relatively low similarity between clusters, minimizing the total distance between the values in each cluster and the cluster center. A cluster centroid is typically the mean of the points in the cluster. The algorithm is simple to implement and run, relatively fast, easy to adapt, and common in practice [8,9].

The basic steps of k-means clustering are simple. In the beginning we determine the number of clusters and assume the centroids (centers) of these clusters. Any random objects can be taken as the initial centroids, or the first k objects in sequence can also serve as the initial centroids. The k-means algorithm then repeats the two steps below until convergence:

1. Each instance Xi is assigned to its closest cluster.
2. Each cluster center Cj is updated to be the mean of its constituent instances.

Here C1, ..., Ck are the k selected initial cluster means. The algorithm aims at minimizing an objective function, in this case the squared error function

J = sum over j = 1..k of sum over Xi in cluster j of d(Xi, Cj)^2,

where d(Xi, Cj) is a chosen distance measure between a data point Xi and the cluster center Cj; J is an indicator of the distance of the n data points from their respective cluster centers [10,11,12].

The main steps of the k-means clustering algorithm can be described as follows [9,13]:

1. Randomly select k data objects from the dataset D as initial cluster centers.
2. Repeat:
   a. Calculate the distance between each data object di (1 <= i <= n) and all k cluster centers cj (1 <= j <= k), and assign di to the nearest cluster.
   b. For each cluster cj, recalculate the cluster center.
   c. Until there is no change in the cluster centers.
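As an illustration, the following is a minimal Python sketch of these steps; the function name, the choice of Euclidean distance, and the library calls are our own assumptions for illustration and are not taken from the paper.

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Basic k-means: data is an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k data objects as the initial cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: assign each object to its nearest center (Euclidean distance assumed).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: recalculate each center as the mean of its assigned objects.
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 2c: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers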

3. Segmentation Framework

The k-means algorithm is used here to determine the segmentation points in the spoken word signal. It does so by grouping the frames of the signal X into two groups, one for the vowel frames and the other for the consonant frames. The steps followed to determine the segmentation points are listed below (a brief code sketch of these steps follows the list):

1. Input: each input word speech signal X is recorded in a room environment. The sampling rate is 8 kHz and each sample is 8 bits long.
2. Preprocessing: this stage includes the following steps:
   a. Normalize the speech signal, where x(i) is the i-th sample of the sound signal and n is the overall number of samples.
   b. Divide the speech signal X into N blocks (frame 1, frame 2, ..., frame N), each of length M samples, using a Hamming window.
3. Run k-means:
   a. Generate the initial values of the centers C1 and C2 randomly, where the length of each Ci is M.
   b. Calculate the distances D1i and D2i between each frame i and the centers C1 and C2 separately, for i = 1 to N.
   c. Select the minimum of (D1i, D2i) to identify which cluster frame i belongs to.
   d. According to the new distribution of the frames over the two groups, recalculate the centers C1 and C2 as the means of the frames assigned to each cluster, where p is the number of frames in cluster 1 and q is the number of frames in cluster 2.
   e. Repeat steps b, c and d until the model is stable.
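A minimal sketch of this framework, under stated assumptions: peak-amplitude normalization (the paper's exact normalization formula is not reproduced here), a frame length M of 256 samples, and the kmeans function sketched in Section 2.

import numpy as np

def frame_signal(x, M=256):
    """Normalize the signal and split it into N Hamming-windowed frames of M samples."""
    x = x / (np.max(np.abs(x)) + 1e-12)      # assumed peak-amplitude normalization
    n_frames = len(x) // M
    window = np.hamming(M)
    return np.array([x[i * M:(i + 1) * M] * window for i in range(n_frames)])

def segmentation_points(frames):
    """Cluster the frames into two groups (vowel vs. consonant) and return boundary indices."""
    labels, _ = kmeans(frames, k=2)          # kmeans as sketched in Section 2
    # A segmentation point is placed wherever consecutive frames change cluster.
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]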

In this work, the k-means algorithm is applied to three types of features extracted from the word speech signal: features extracted in the time domain of the sound signal, Fast Fourier Transform coefficients, and wavelet transform coefficients.

4. Discussion and Experimental Results

Through the experiments we tested which type of features is most efficient at giving a good separation between vowel-region and consonant-region frames. The following discussion shows the overall results of phoneme segmentation after running k-means.

(A) Time domain

For this type of data, the following experimental cases were carried out:

Case 1: the input to k-means is the N frames, each frame with its full M samples.
Case 2: take the average of each frame, so the input to k-means is N frames of one value each.

For the word signal in Fig. 1, Fig. 2 shows the distribution of the frames of this word between the vowel and consonant regions, and Table 1 shows the measured accuracy of the two cases above.

[Figure 1: The input signal of the word /hasan/, with its consonant and vowel regions marked (amplitude versus time).]

[Figure 2: The distribution of the clusters for data in the time domain (mean amplitude of each frame versus frame index).]

Table 1: Segmentation accuracy (%) for time-domain features
        Case 1    Case 2
TD      5.21      9.89

(B) Fourier transform

After applying the Fast Fourier Transform to each frame, the output for each frame is M/2 coefficients, and these coefficients are the input data adopted here. The following cases were tried with these coefficients (a sketch of these feature variants is given at the end of this subsection):

Case 1: the input to k-means is N points, each point a vector of M/2 coefficients.
Case 2: take the average of the FFT coefficients of each frame, so the input to k-means is N points, each a single mean value.
Case 3: reduce the number of FFT coefficients to (M/2)/r, where r = 2^no, no is an integer, and r is less than or equal to M/2. The reduction is performed by taking the largest value out of each group of r coefficients. The input to k-means is then N points, each a vector of length (M/2)/r.
Case 4: take the average of the reduced coefficients of Case 3.

Figure 3 shows the resulting distribution of the frames of the word in Figure 1 for this type of features, and Table 2 shows the measured accuracy.

[Figure 3: The distribution of the clusters for Fast Fourier Transform coefficients (feature value versus frame index).]

Table 2: Segmentation accuracy (%) for FFT features
        Case 1    Case 2    Case 3    Case 4
FT      47.14     77.85     68.39     8.18
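A minimal sketch of how these FFT feature variants could be computed from the windowed frames; the function name, the use of magnitude spectra, and the default value of r are assumptions for illustration.

import numpy as np

def fft_features(frames, r=4):
    """Build the four FFT-based feature variants from an (N, M) frame matrix."""
    M = frames.shape[1]
    # Magnitudes of the first M/2 FFT coefficients of each frame (Case 1).
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, : M // 2]
    case1 = spec
    case2 = spec.mean(axis=1, keepdims=True)        # Case 2: one mean value per frame
    # Case 3: keep the largest value in each group of r coefficients -> (M/2)/r values per frame.
    n_groups = spec.shape[1] // r
    case3 = spec[:, : n_groups * r].reshape(len(spec), n_groups, r).max(axis=2)
    case4 = case3.mean(axis=1, keepdims=True)       # Case 4: mean of the reduced coefficients
    return case1, case2, case3, case4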

(C) Wavelet transform

We use the Discrete Wavelet Transform (DWT) to perform a 4-level one-dimensional wavelet decomposition with respect to the db3 wavelet, where the DWT computes the approximation coefficient vector and the detail coefficient vectors obtained by the decomposition, as shown in Figure 4. A brief code sketch of this feature extraction is given after Table 3.

[Figure 4: The four-level wavelet decomposition applied to frame i.]

The following experiments gave the best results on the wavelet coefficients:

Case 1: take the mean of nodes 11, 21, 31 and 41; the input to k-means is an Nx4 matrix of values.
Case 2: take the mean of nodes 11, 11, 21, 21, 31, 31, 41, 41; the input is Nx8.
Case 3: take the mean of each node in the net of Fig. 4; the result is an Nx3 vector.

Figure 5 and Table 3 show the results with this type of features.

[Figure 5: The distribution of the clusters for wavelet transform coefficients (feature value versus frame index).]

Table 3: Segmentation accuracy (%) for wavelet features
        Case 1    Case 2    Case 3
WT      94.36     92.786    86.724
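A minimal sketch of this feature extraction using PyWavelets; the mapping from the paper's node labels (11, 21, 31, 41) to the standard detail vectors cD1..cD4, and the use of wavedec, are our assumptions for illustration.

import numpy as np
import pywt

def wavelet_features(frames):
    """4-level 1-D DWT with db3 for each frame; one mean value per detail level (Case 1 reading)."""
    feats = []
    for frame in frames:
        # wavedec returns [cA4, cD4, cD3, cD2, cD1] for a 4-level decomposition.
        coeffs = pywt.wavedec(frame, 'db3', level=4)
        details = coeffs[1:]                        # assumed to correspond to nodes 41, 31, 21, 11
        feats.append([c.mean() for c in details])   # N x 4 feature matrix
    return np.array(feats)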

5. Conclusion

We presented an approach to phoneme segmentation based on several types of features, using the k-means algorithm on natural speech recorded in real situations, and we found the following:

1. The method gives a good ability to separate the vowel phonemes (in one cluster) from the consonant phonemes (in another cluster).
2. The ability of the k-means model increases when the dimension of each input data point Xi is reduced to a few values or a single value by taking the mean (or possibly the standard deviation, variance, etc.) of the input data, since the features then become clearer.
3. In all the cases reported in Section 4, the performance is good and acceptable, but the best results were obtained with the wavelet coefficients.

References

[1] O. Johannes, U. Kalervo, and T. Altosaar, "An Improved Speech Segmentation Quality Measure: the R-value," Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland, 2008.
[2] F. Kubala, T. Anastasakos, H. Jin, L. Nguyen, and R. Schwartz, "Transcribing radio news," in Proc. ICSLP, Philadelphia, PA, USA, Oct. 1996, pp. 598-601.
[3] L. Pinto, "Automatic Phonetic Segmentation and Labelling of Spontaneous Speech," Jornadas en Tecnología del Habla, Zaragoza, November 2006.
[4] M. Sarma and K. K. Sarma, "Segmentation of Assamese Phonemes Using SOM," conference paper, January 2012.
[5] B. Bigi, "Automatic Speech Segmentation of French: Corpus Adaptation," LPL, Aix-en-Provence, France, 2012.
[6] J. Burkardt, "K-means Clustering," Advanced Research Computing, Interdisciplinary Center for Applied Mathematics, Virginia Tech, September 2009.
[7] M. B. Al-Zoubi, A. Hudaib, A. Huneiti and B. Hammo, "New Efficient Strategy to Accelerate k-Means Clustering Algorithm," American Journal of Applied Sciences, 5(9): 1247-1250, ISSN 1546-9239, 2008.
[8] R. Yadav and A. Sharma, "Advanced Methods to Improve Performance of K-Means Algorithm: A Review," Global Journal of Computer Science and Technology, Volume 12, Issue 9, April 2012.
[9] H. S. Behera, A. Ghosh, and S. K. Mishra, "A New Improved Hybridized K-MEANS Clustering Algorithm with Improved PCA Optimized with PSO for High Dimensional Data Set," International Journal of Soft Computing and Engineering (IJSCE), Volume 2, Issue 2, May 2012.
[10] K. Teknomo, "Numerical Example of K-Means Clustering," CNV Media, 2006.
[11] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained K-means Clustering with Background Knowledge," Proceedings of the Eighteenth International Conference on Machine Learning, pp. 577-584, 2001.
[12] R. C. de Amorim, "Learning Feature Weights for K-Means Clustering Using the Minkowski Metric," Department of Computer Science and Information Systems, Birkbeck, University of London, April 2011.
[13] O. Nagaraju, B. Kotaiah, R. A. Khan and M. Rami Reddy, "Implementing and Compiling Clustering Using MacQueen's alias K-Means Apriori Algorithm," International Journal of Database Management Systems (IJDMS), Vol. 4, No. 2, April 2012.