Transactions on Information and Communications Technologies vol WIT Press, ISSN


Using Data Mining to Learn the Patterns of Pitch Variation in Chinese Speech

Tingshao Zhu & Wen Gao
Institute of Computing Technology, Academia Sinica, Beijing, P.R. China

Abstract

The pitch model, which mainly describes how pitch varies, is very important in speech synthesis: to synthesize speech with high intelligibility and naturalness, a system needs an appropriate pitch model. We try to find the pitch model from actual speech samples by data mining. A prototype system called SpeechDM has been implemented to extract patterns from two-word phrases of Chinese. Datasets are used for data management, and multi-threaded training tasks are controlled by a training manager. This paper first gives the overall process, then introduces each step in detail. Some results and conclusions are given at the end.

1 Introduction

In the field of speech signal processing, pitch is one of the most important parameters. Tone is determined by pitch variation, and in some languages, especially Chinese, tone distinguishes meaning. When two or more Chinese syllables are combined, each syllable's tone changes. In speech synthesis, knowing the patterns of pitch variation is therefore very useful for improving the intelligibility of the synthesized speech. Although there are many pitch models, they cannot describe the variation completely, because they are built by hand from statistics over a limited set of speech samples, and those samples cannot cover all speech phenomena. On the other hand, it is difficult to deal with massive speech data by hand. Since pitch variation is so important for speech synthesis, reasonable patterns of pitch variation can improve the quality of the synthesized speech, and it is natural to think that patterns extracted from actual speech samples can be helpful. Because a speech database is always very large, it is impossible for people to extract these patterns manually.

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. It employs statistical and computational techniques to extract useful patterns from large databases. KDD is also called data mining, and the term data mining will be used in this paper. To extract pitch variation patterns, a data mining system called SpeechDM has been implemented. In the prototype, datasets are used to manage training and testing examples efficiently, and a training manager is designed to schedule and control the learning tasks.

2 SpeechDM Process

The prototype deals with speech data and finds the patterns of pitch variation. It consists of data preprocessing, data management, data mining and training management, and it provides tools to carry out these operations. Figure 1 depicts the process of SpeechDM.

SpeechDM separates the end user from the data miner by means of a new data management method. End users can extract and process data according to their knowledge discovery tasks and pay no attention to the learning algorithm. Data miners can concentrate on developing new algorithms or improving existing ones; an algorithm accesses training examples through uniform interfaces predefined by the system. End users fully understand the requirements of the task and can specify the data scope exactly, while data miners can test their algorithms to see which one is best and thus extract knowledge more efficiently.

The whole process consists of several steps. Task analysis is done by the data miner and domain experts to specify the data and learning target related to the task. Algorithm selection chooses the algorithm best suited to the task. Data extraction accesses the database to prepare the learning data. Data preprocessing manipulates the extracted data to fit the requirements of learning. Dataset management administrates the datasets used by the learning algorithms. Data mining extracts knowledge from the data. Result analysis evaluates the result to see whether the task has been achieved. Outputting displays the extracted knowledge conveniently.
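The uniform interface between datasets and learning algorithms described above might look roughly like the following abstract accessor. All names and types here are our illustration, not code from SpeechDM:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch of the "uniform interface" idea: learning algorithms see training
// examples only through an abstract accessor, so end users prepare the data
// and data miners can swap algorithms freely.
struct Example { std::vector<double> input, target; };

class ExampleSource {                      // interface predefined by the system
public:
    virtual ~ExampleSource() = default;
    virtual size_t size() const = 0;
    virtual Example get(size_t i) const = 0;
};

class VectorSource : public ExampleSource { // one concrete dataset backend
public:
    explicit VectorSource(std::vector<Example> e) : examples_(std::move(e)) {}
    size_t size() const override { return examples_.size(); }
    Example get(size_t i) const override { return examples_[i]; }
private:
    std::vector<Example> examples_;
};

// Any learning algorithm can iterate a source without knowing its origin.
size_t count_examples(const ExampleSource& src) { return src.size(); }
```

A different backend (e.g. one reading from the dataset base's tables) would implement the same interface, leaving the training code unchanged.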

Figure 1: SpeechDM Process (Extraction, Preprocess, DataSet, Data Mining, Result Analysis; built on the DataBase).

3 System Description

3.1 Data Preprocessing

The speech database we are using is a Chinese speech synthesis database called CoSS-1. CoSS-1 includes the pronunciations of all isolated syllables, the 2-4 word phrases and some sentences. The number of isolated syllables with tone is 1268, and that of two-word phrases is 64. CoSS-1 records the speech wave and the laryngograph synchronously. The sampling rate is 16 kHz, and each sample is stored in two bytes. The data currently used were recorded by a young male speaker. The two-word phrases cover almost all the tone collocations in Chinese pronunciation.

To learn the patterns, the pitch must be calculated first. There are many methods for extracting pitch from the speech wave, but their precision is low. Since we want to learn the patterns and use them to generate pitch after training, accuracy is very important. A tool called Pitcher was therefore implemented to extract pitch from the laryngograph. It works by annotating the beginning and ending point of each cycle and then calculating the pitch: let Xi be the beginning point of one cycle and Xj the ending point; the pitch of this cycle is then 16000/(Xj - Xi), with the positions measured in samples. Pitcher can also be used to split phrases and play the speech data.

In SpeechDM, a neural network is used to learn the patterns. It is well known that the number of a neural net's input units is fixed, but the lengths of the pitch contours that act as training examples differ significantly from each other. A new algorithm was designed to wrap the pitch contours to a common length to make them suitable for training.

For the speech data we used, the pitch values lie in the range 5-26. The following equation is used to normalize the pitch value:

    Normalized_Pitch = (Pitch_value - min) / (max - min)

where max is the maximum of all pitch values and min is the minimum; Pitch_value is the pitch being normalized and Normalized_Pitch is the normalized result.

3.2 Dataset Management

In SpeechDM we propose the dataset to separate the data from the learning algorithms, that is, to make the data stand-alone. A dataset is not tied to one particular algorithm, and an algorithm can use many datasets through the data interface. A dataset describes the data related to a learning task: it gives only the definition of the data referenced by the knowledge extraction task, independent of any algorithm, and it also specifies the methods for creating training and testing examples. When an algorithm is used to extract knowledge, the training and testing data are first created according to the methods in the dataset's description. Since the dataset only gives the scope of the related data, there may be some redundant data, so the algorithm should make its own decisions for optimization.

To implement the dataset, a structure is defined that describes its properties and the method to create the training and testing data. In SpeechDM, two data formats have been defined, and each algorithm accesses the training examples according to one of them. The dataset structure is as follows:

    struct DataSet {
        char *name;         // dataset name
        char *comments;     // comments on the dataset
        char *base;         // the dataset base used
        // ... other properties of the dataset
        int CreateMethod;   // method to create the training and testing data
    };

A dataset manager is implemented for creating and deleting datasets and for example creation. To manage them efficiently, we use a dataset base to define the original data scope; it stores the data in one or more tables on which operations can be performed. When the training and testing data are to be created, operations such as filtering and maintenance can be executed on the data from the base, and the training and testing data are then created from the processed dataset base.

3.3 Data Mining [7,8,9]

There are many kinds of neural networks that can be used for learning. We intend to learn the patterns from the mapping between the pitches of a syllable in isolation and those of the same syllable in a phrase. Since the backpropagation network learns the implicit mapping between its input and output layers and gives very good results, it was chosen for training in SpeechDM.

To learn the pitch variation patterns from this mapping, the pitches of a syllable in a phrase are extracted first, and the pitches of the same syllable in isolation are extracted too. The pitches from the phrase act as output and those from isolation as input. When syllables are combined, their pitches are modified according to tone variation rules, so the tones can be regarded as factors that influence the pitches of the two syllables; the tones of the two syllables are therefore also included in the input layer.

Two networks were built to extract the patterns from all the two-word phrases in CoSS-1: one learns the patterns for the first syllables of the two-word phrases, the other for the second syllables. For the network that learns the first syllables, the input layer consists of 67 units, the hidden layer of 1 unit, and the output layer of 36 units. The other network's input layer also consists of 67 units and its hidden layer of 1 unit, while its output layer has 42 units.

To generate training and testing data, the phrases are first split, the pitches calculated, wrapped to the same length, and their values normalized. The data can then be used to train and test the neural networks; most of them act as training examples and the rest as testing examples.
3.4 Training Manager

In our system, each algorithm is defined as a class containing properties and functions. A training task is defined as a learning thread class, composed of the specification of an algorithm and the interfaces through which the system schedules it; the thread class also defines the interface common to all learning algorithms. The properties and functions of the thread class are as follows:

    class TtrainAlgorithm {
        char algoname[2];       // algorithm name
        char comments[2];       // comments on the task
        int threadid;           // thread ID
        int CurrentTrainTime;   // displays the progress of training
        char datasetname[2];    // dataset name
        union Algorithm {
            TBPTrainPitch    *pitchtrainbp;
            TBPTrainDuration *durationtrainbp;
        } TrainAlgoList;        // all usable learning algorithms
        void Information();     // parameter setting of each algorithm
        void Initialize();      // initialization function of each algorithm
        void Execute();         // the main function of the thread
    };

Based on the thread class, we designed a training manager to control all learning tasks. The training manager acts as the master thread controlling the learning threads: it creates instances of training threads and schedules the training tasks (pause, resume and stop), and it can display the progress of each training task. A new task is created by the following steps: create a new training thread object by instantiating the training thread class; choose a suitable training algorithm and set its parameters, such as the thread ID, the algorithm name and the dataset name; and call the initialization function of the selected algorithm. After the parameters have been set, the training manager puts the new thread into its thread array and displays the progress of training when the task is selected by the user.

3.5 Postprocessing

A tool is designed to show the test examples graphically. From the graph it is easy to see whether the calculated pitch coincides with the actual pitch, and it can display as many as ten syllables' pitch contours at the same time. Figure 2 depicts one of the testing results.

Figure 2: Result of gu3ban3 (isolated pitches, pitches in the phrase, and the testing result).

For iterative learning, the dataset and training manager can be used to refine the data and the patterns learned. In our experiment we found that when all the first syllables and all the second syllables are taken together to train one net, the result is poor. So two new datasets were created separately, and after training the new result was better than the previous one. In refining, only the new datasets had to be defined; the training program could be used without modification.

4 Result

Table 1 gives some statistics of the results of one test. The calculations are based on the deviations between the original syllable's pitches in the phrase and those calculated by the network, with both wrapped to the same length.

              Phrase          Max    Mean        Variance
    First     Chuang1shan4      4    -1.47726     4.37695
              Jin4zhan4        13    -0.24121    42.273769
              Liu2xie4         18    -4.525616   15.284344
              Gong1hui4        15     2.656483    1.66288
              Wang2pai2         9     2.16155     7.791716
    Second    Zhi4xun2          6    -0.193       2.958961
              Bing3yao4        14    -3.638539   23.4646
              Bao3xian3         7    -0.527984    1.8218
              La1suo3           7     0.168661    7.9292
              Xiao4you3        19     4.453946   24.817568

Table 1: The results of one test.
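The per-phrase figures in Table 1 are simple moments of the deviation sequence. A sketch of the computation follows; the names are ours, and since the paper does not say whether population or sample variance was used, population variance is assumed:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Max, min, mean and (population) variance of the deviations between the
// actual pitches in a phrase and those produced by the network.
// Precondition: dev is non-empty.
struct DevStats { double max, min, mean, variance; };

DevStats deviation_stats(const std::vector<double>& dev) {
    DevStats s{dev[0], dev[0], 0.0, 0.0};
    for (double d : dev) {
        s.max = std::max(s.max, d);
        s.min = std::min(s.min, d);
        s.mean += d;
    }
    s.mean /= dev.size();
    for (double d : dev)
        s.variance += (d - s.mean) * (d - s.mean);
    s.variance /= dev.size();
    return s;
}
```

Each Table 1 row would come from one call on the wrapped deviation sequence of the corresponding syllable.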

5 Conclusion

SpeechDM is a data mining system designed to learn the patterns of pitch variation in Chinese two-word phrases. It provides tools for the stages of preprocessing, data management, data mining and training management. The dataset and multi-thread training technologies make knowledge extraction easy and the system itself easy to extend. We hope that by learning from actual speech examples it is possible to improve the intelligibility and naturalness of Chinese synthesized speech.

REFERENCES

1. Lin Tao, Wang Lijia, Acoustics Course of Study, Peking University Press, 1994.
2. Chu Min, Research on Chinese TTS System with High Intelligibility and Naturalness, Ph.D. thesis, Institute of Acoustics, Academia Sinica, September 1995.
3. Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy (editors), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
4. George H. John, Enhancements to the Data Mining Process, Ph.D. thesis, Stanford University, 1997.
5. Yang Xingjun, Chi Huisheng, Speech Signal Digital Processing, Publishing House of Electronic Industry, 199.
6. Wang Wei, Principle of Artificial Neural Network: Rudiment and Implementation, Beijing University of Aeronautics and Astronautics Press, 1995.
7. Kero, B., L. Russell, S. Tsur, W. M. Shen, An Overview of Data Mining Technologies, The KDD Workshop at the 4th International Conference on Deductive and Object-Oriented Databases, Singapore, 1995.
8. Famili, The Role of Data Pre-processing in Intelligent Data Analysis, Proceedings of the IDA-95 Symposium, Baden-Baden, Germany, pp. 54-58, 1995.
9. J. Han, Y. Fu, Y. Huang, Y. Cai, N. Cercone, DBLearn: A System Prototype for Knowledge Discovery in Relational Databases, Proc. 1994 ACM-SIGMOD Int'l Conf. on Management of Data (SIGMOD'94), Minneapolis, MN, May 1994.