Utterance intonation imaging using the cepstral analysis


Annales UMCS Informatica AI 8(1) (2008) 157-163
DOI: 10.2478/v10065-008-0015-3
http://www.annales.umcs.lublin.pl/

Ireneusz Codello*, Wiesława Kuniszyk-Jóźkowiak, Tomasz Gryglewicz, Waldemar Suszyński
Institute of Computer Science, Maria Curie-Sklodowska University, pl. M. Curie-Skłodowskiej 1, 20-031 Lublin, Poland
* Corresponding author: e-mail address: irek.codello@gmail.com

Abstract

Speech intonation consists mainly of the fundamental frequency, i.e. the frequency of vocal cord vibrations. Finding those frequency changes can be very useful, for instance in studying foreign languages, where speech intonation is an inseparable part of the language (like grammar or vocabulary). In our work we present the cepstral algorithm for F0 finding as well as an application for facilitating utterance intonation learning.

1. Introduction

We can divide human speech into two categories:
- voiced speech - the air from the lungs causes the vocal cords to vibrate. The frequency of these vibrations is called the fundamental frequency, vocal tone or zero formant (F0);
- unvoiced speech - the air from the lungs passes through the vocal cords untouched. No vibrations are caused, therefore no fundamental frequency is created.

As we can see in Fig. 1, the vowel "a", as an example of voiced speech, is very regular (due to regular vocal fold vibrations), contrary to the consonant "s", which is very irregular and noisy (due to noise excitation by the air passing through the vocal folds without vibration). A small numeric sketch of this contrast closes this section.

The fundamental frequency determines the intonation of speech. Intonation changes (rising, falling) can have a huge influence on the meaning of a spoken sentence - for example, they let us distinguish a question from an ordinary sentence. We can recognize the intentions of a speaker: whether he is angry, polite or curious. In many languages (English, Japanese) intonation, like vocabulary or grammar, is an inseparable part of the language.
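As a numeric companion to the Fig. 1 contrast, here is a minimal sketch of our own (not taken from the paper): it assumes NumPy, a synthetic 200 Hz harmonic signal standing in for a vowel, and white noise standing in for a fricative. Counting sign changes per frame already separates the two categories roughly.

```python
# Illustrative sketch: a voiced frame is quasi-periodic and crosses zero only
# a few times per period, while an unvoiced (noise-like) frame changes sign
# almost every sample. All signals here are synthetic assumptions.
import numpy as np

fs = 11025                              # assumed sampling rate (Hz)
t = np.arange(int(0.046 * fs)) / fs     # one 46 ms frame, as used in the paper

voiced = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)  # vowel-like, F0 = 200 Hz
unvoiced = np.random.default_rng(0).standard_normal(t.size)               # fricative-like noise

def zero_crossings(x: np.ndarray) -> int:
    """Number of sign changes within the frame."""
    s = np.signbit(x)
    return int(np.count_nonzero(s[1:] != s[:-1]))

print(zero_crossings(voiced))    # ~18: two crossings per 200 Hz period over 46 ms
print(zero_crossings(unvoiced))  # hundreds: noise changes sign roughly every other sample
```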

Fig. 1. Oscillogram of the vowel 'a' (top) and the consonant 's' (bottom)

2. Computation procedure

The human vocal tone varies between 50 Hz and 1000 Hz:
- 50 Hz - 250 Hz: an ordinary speaking man,
- 150 Hz - 350 Hz: an ordinary speaking woman,
- 300 Hz - 500 Hz: an ordinary speaking child,
- up to 1000 Hz: an opera singer (soprano).

The cepstral analysis needs a few periods of vocal cord vibration to detect it in speech. A signal in the 50 Hz - 1000 Hz range has a period between 20 ms and 1 ms, therefore the cepstral analysis frame has to last from 40 ms up to even 100 ms (if we expect to analyze a male voice).

The basic cepstral analysis algorithm consists of the following steps:
1) windowing - we divide the signal x(t) into frames (windows) of the same length. Consecutive frames can overlap each other (usually by 50% of the frame length). After that, each frame is analyzed independently of the others. Each frame is multiplied by a window function (for instance the Hamming window);
2) FFT - we compute the frame spectrum X(k) using the Fast Fourier Transform;
3) filtering - we can filter the spectrum X(k) (in our work we use a low-pass filter with a 5.5 kHz cut-off);
4) decibels - we change the amplitude scale of X(k) from linear to logarithmic. Because we use a real cepstrum (instead of a complex one), we compute a real logarithm using the equation

$$Y(k)_{re} = 20\log_{10}\sqrt{X(k)_{re}^2 + X(k)_{im}^2}, \qquad Y(k)_{im} = 0 \tag{1}$$

where $X(k)$ is the k-th complex spectral line of the frame, instead of the complex logarithm

$$Y(k)_{re} = 20\log_{10}\sqrt{X(k)_{re}^2 + X(k)_{im}^2}, \qquad Y(k)_{im} = \arctan\frac{X(k)_{im}}{X(k)_{re}} \tag{2}$$

5) IFFT - we compute an inverse FFT of Y(k), obtaining the frame cepstrum C(k);
6) F0 finding - we find an extremum of the cepstrum within the range corresponding to 50 Hz - 1000 Hz. The cepstrum's horizontal axis is time t, which can easily be transformed into hertz f using the formula

$$f = \frac{1}{t} \tag{3}$$

The final result is a graph of F0 changes, where the X-axis is time (consecutive frames) and the Y-axis is frequency (the extremum frequency of each frame's cepstrum).
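The six steps above translate almost line-for-line into NumPy. Below is a minimal sketch, not the authors' implementation: it assumes a mono float signal, uses the paper's example parameters as defaults, omits step 3 (the 5.5 kHz spectral low-pass) and the silence handling discussed in section 3, and `cepstral_f0_track` is a hypothetical helper name.

```python
import numpy as np

def cepstral_f0_track(x: np.ndarray, fs: int,
                      frame_ms: float = 46.0, overlap: float = 0.5,
                      f0_min: float = 50.0, f0_max: float = 1000.0) -> np.ndarray:
    n = int(fs * frame_ms / 1000)             # frame length in samples
    hop = max(1, int(n * (1 - overlap)))      # hop between consecutive frames
    window = np.hamming(n)                    # step 1: window function
    # lag (quefrency) range corresponding to the F0 search range, via f = 1/t:
    q_lo = int(fs / f0_max)                   # lag of ~1 ms  -> 1000 Hz
    q_hi = int(fs / f0_min)                   # lag of ~20 ms ->   50 Hz
    f0 = []
    for start in range(0, len(x) - n + 1, hop):
        frame = x[start:start + n] * window            # step 1: windowing
        spectrum = np.fft.rfft(frame)                  # step 2: FFT
        # step 3 (spectral low-pass filtering) is omitted in this sketch.
        # step 4: real logarithm of the magnitude in decibels (eq. 1);
        # the small constant avoids log(0) on silent frames.
        log_mag = 20 * np.log10(np.abs(spectrum) + 1e-12)
        cepstrum = np.fft.irfft(log_mag)               # step 5: inverse FFT
        peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])   # step 6: extremum search
        f0.append(fs / peak)                           # lag in samples -> Hz (eq. 3)
    return np.array(f0)
```

Calling `cepstral_f0_track(signal, fs)` would yield one F0 value per frame, i.e. the kind of graph shown later in Fig. 5; the restricted lag window corresponds to the zeroed (0, 1) ms and (20, 23) ms regions described in the next section.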

3. An example

Here we have an exemplary utterance: the vowel "a" said by a female (her voice intonation rises and then falls in time).

Fig. 2. The source signal - the Polish vowel "a" said by a female. Her voice intonation rises and then falls in time

Let us choose three arbitrary frames t1, t2, t3 (vertical lines in Fig. 2) and compute their Fourier transforms (46 ms frame length, 25% frame overlap, Hamming window). In the spectra (Fig. 3) we can see regular amplitude fluctuations and the period of those fluctuations. This period corresponds to the fundamental frequency and can be extracted by taking the inverse FFT of those log spectra. In Fig. 4 we can see the resulting cepstra, each with its maximum amplitude marked and its time transformed into frequency (equation (3)). We can also see that the beginnings and ends of those graphs are equal to zero. These sections were set to zero because in the (0, 1) ms range, corresponding to frequencies above 1000 Hz, and in the (20, 23) ms range, corresponding to (43, 50) Hz, there is no base tone.

Fig. 3. Spectra of the frames t1, t2, t3 of the source signal

Fig. 4. Cepstra of the frames t1, t2, t3 of the source signal. The X-axis is the time in the (0, 23) ms range. The (23, 46) ms range is a mirror reflection, due to the symmetry property of the Fourier transform, and thus is not depicted in the graph

By combining all the extrema (one from every cepstrum) we obtain our result: the graph of F0 changes.

Fig. 5. F0 changes in time of the source signal

As we can see in Fig. 5, the result is not clean. Firstly, F0 values appear before frame t1, where there is no signal, so silence detection must be performed. In our work we simply compute the envelope of the signal and treat as silence every frame whose envelope is below some value (threshold), which is an input parameter of the algorithm. Secondly, not all cepstra have their extrema at the base tone, which is why there is some discontinuity after frame t3. Therefore we need some sort of filtering to smooth the result - for instance, a low-pass filter.

There is a third problem. The input signal can contain not only silence and voiced speech, but unvoiced and noisy speech as well. Unvoiced speech has no base tone, therefore it has to be treated like silence: the cepstrum maximum should not be taken into account. Noisy speech is problematic too, since it contains additional frequencies which can be mistaken for the base tone.

Besides the envelope, a few more factors can be useful in F0 filtering (a sketch of such post-processing follows this list):
- the signal oscillation number per frame - it lets us roughly distinguish voiced from unvoiced speech;
- the SNR of a cepstrum - it lets us estimate the quality (significance) of the cepstrum maximum (whether it stands out above the other peaks of the cepstrum or not);
- the number of high local extrema in the cepstrum - when there are many significant extrema, there is a greater probability that the highest one is not the base tone.

Research on their usability is still in progress.
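A minimal sketch of this post-processing, under stated assumptions: NumPy; `f0` is the frame-wise track produced by the procedure of section 2; `frames` is a matching 2D array of frame samples (one row per analysis frame); the envelope threshold is, as in the paper, an input parameter; a running median stands in for the "some sort of filtering" mentioned above (the paper suggests a low-pass filter); `clean_f0_track` is a hypothetical helper name.

```python
import numpy as np

def clean_f0_track(f0: np.ndarray, frames: np.ndarray,
                   env_threshold: float, k: int = 5) -> np.ndarray:
    # silence detection: frames whose envelope falls below the threshold
    # carry no base tone, so their F0 value is discarded (set to NaN)
    envelope = np.max(np.abs(frames), axis=1)
    f0 = np.where(envelope < env_threshold, np.nan, f0.astype(float))
    # smooth isolated cepstral-peak errors with a running median of k frames;
    # stretches that are silent across the whole window stay NaN
    half = k // 2
    padded = np.pad(f0, half, mode="edge")
    return np.array([np.nanmedian(padded[i:i + k]) for i in range(f0.size)])
```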

Fig. 6. Potentially useful coefficients for F0 tracking

4. Application

We developed a simple tool for speech intonation learning.

Fig. 7. The application screenshot. The parameters of the algorithm: Hamming window, frame length 46 ms, overlap 100%. The input signal: "aaaaa bbbbb ccccc" said by two men

The application window is divided into 3 sections: teacher (A), student (B) and algorithm (C). A user can open the teacher's speech file in section A and his own speech file in section B. Then he can change the algorithm parameters in section C, such as the window width (in samples or in milliseconds), the frame overlap, the window function and the playback volume. After computing the cepstrum (button in section C), the user can compare the teacher's intonation (A3) with his own (B3). Moreover, he can change the envelope threshold of both files independently with the A4 and B4 sliders, making the graphs clearer (as discussed in section 3 of this article). One can also record speech samples with the B5 buttons, which can later take part in the intonation comparison.

From the example in Fig. 7 we can see that the utterance intonations of "aaaaa bbbbb ccccc" of the teacher and the student roughly match each other. This is a sign for the student that he said it correctly and can pass to the next sample.

Conclusions

In our opinion the software is very helpful in intonation learning. Intonation comparison is easier and more objective when it is based on an intonation graph rather than on hearing alone. As a consequence, one can learn alone (without a teacher) more often, which is undoubtedly a big advantage.