Pitch Pattern Clustering of User Utterances in Human-Machine Dialogue

Takashi YOSHIMURA, Satoru HAYAMIZU, Hiroshi OHMURA and Kazuyo TANAKA
Electrotechnical Laboratory
1-1-4 Umezono, Tsukuba, Ibaraki 305, JAPAN
E-mail: yoshimur@etl.go.jp

ABSTRACT

This paper discusses pitch pattern variations of user utterances in human-machine dialogue. For intelligent human-machine communication, it is essential that machines understand the prosodic characteristics that convey a user's attitudes, emotions and intentions beyond the vocabulary. Our focus is on particularly distinctive pitch patterns and their roles in actual dialogues. We used human-machine dialogues collected by a Wizard of OZ simulation. Many utterance segments belonged to clusters with prosodically flat patterns. From this result, we considered that utterances belonging to the other clusters, and those far from the cluster centroids, included non-verbal information. Among these utterances were remarks spoken to oneself and questions to the machine containing emotional expressions of puzzlement or surprise. Their pitch patterns were not only rich in ups and downs, but their slopes were also upward, whereas pitch patterns in general were flat or slightly downward. These results indicate that peculiar pitch period patterns signal non-verbal expressions. In order to actually utilize such information in human-machine interaction, the representative pitch patterns should be investigated with respect to their relationship to various types of communication.

1. INTRODUCTION

In human-human communication by speech, we emphasize essential points in a dialogue, shift topics, and express various attitudes, emotions and intentions that cannot be transmitted by vocabulary alone, through changes of prosodic characteristics. Likewise, for intelligent human-machine communication, it is essential that machines understand some prosodic characteristics and also generate prosodic characteristics that suit the demands of each scene[1]. Past work on prosodic analysis, however, mainly dealt with imitated human-human dialogues or read sentences, that is, not with human-machine communication data; it detected phrase boundaries[2] and estimated sentence structure[3]. We analyze actual human-machine communication data and examine its prosodic features. Prosodic features have several aspects[4]: accent features[5], intonation features and so on. In this paper, we focus on pitch pattern variations of the user's utterances in human-machine dialogues. We suppose that when users want to convey information beyond the literal expression, they use specific pitch patterns that differ from common ones. We therefore classify the pitch patterns of user utterances and discuss how pitch patterns that are particularly distinct from the others relate to verbal information. First, in the next section, we describe the dialogue data we used. In the third section, we discuss in detail our method for pitch pattern clustering of user utterances. We present the pitch pattern clustering experiments in the fourth section, where the relationship between some peculiar pitch patterns and the expressions they accompany is also discussed.

2. HUMAN-MACHINE DIALOGUE DATA

2.1. Data collection

In this research, we used human-machine dialogues[6]. The data were collected by a Wizard of OZ simulation. The task was related to town information about Shibuya in Tokyo.
In these dialogue data, user utterances and synthesized machine utterances were recorded simultaneously on separate recording channels. Only the user utterances were analyzed in this work.

2.2. Speech segments

The sample set contained 7,495 utterance units pronounced by 40 speakers (males and females), each of which was detected as a speech segment and had its pitch pattern extracted. These units were detected using the power and zero-crossing values of the speech waveforms; segment boundaries were assumed wherever silence continued for more than 300 msec (a sketch of this procedure is given below, after Table 1). We supposed that one speech segment corresponds to one utterance unit. Table 1 shows the length distribution of the utterance units. Most utterances were shorter than 2 sec.

    length              number
    less than 1 sec.      4154
    1 sec. to 2 sec.      2328
    2 sec. to 3 sec.       768
    3 sec. to 4 sec.       179
    4 sec. to 5 sec.        39
    more than 5 sec.        27
    Total                 7495

Table 1: The length distribution of utterance units
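The following is a minimal sketch of this unit detection. The paper specifies only the use of power and zero-crossing values and the 300-msec silence rule; the frame step, both thresholds (which assume samples normalized to [-1, 1]), and all names here are our own assumptions.

import numpy as np

FRAME_MS = 10          # assumed frame step
MIN_SILENCE_MS = 300   # from the paper: silences > 300 msec split units

def is_speech_frame(frame, power_thresh=1e-4, zc_thresh=0.3):
    """Rough speech/silence decision from frame power and zero-crossing rate.
    The OR lets low-power, high-ZCR frames (unvoiced fricatives) count as speech."""
    frame = frame.astype(float)
    power = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return power > power_thresh or zcr > zc_thresh

def detect_units(samples, sample_rate):
    """Return (start, end) sample indices of utterance units."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    flags = [is_speech_frame(samples[i:i + frame_len])
             for i in range(0, len(samples) - frame_len, frame_len)]
    units, start, gap = [], None, 0
    for i, speech in enumerate(flags):
        if speech:
            if start is None:
                start = i          # a new unit begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap * FRAME_MS > MIN_SILENCE_MS:   # silence exceeded 300 msec
                units.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:          # close a unit running to the end of the data
        units.append((start * frame_len, (len(flags) - gap) * frame_len))
    return units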

3. PITCH PATTERN CLUSTERING

For automatic pitch extraction, we used a fundamental wave filtering method, which is especially effective for fine pitch pattern extraction[7]. In earlier methods, an average pitch was obtained in each moving observation window whose width and shift were defined in advance. In this method, we obtain a pitch period that is equivalent to the length of one fundamental wave. Representative pitch patterns of the utterance units were calculated as follows (a code sketch is given after the steps):

(1) The log pitch periods of each unit were warped to 32 sample points. If the difference between a sample pitch period and the value interpolated from the adjoining pitch periods exceeded a threshold, the interpolated value was substituted for the sample.

(2) The first-order regression coefficients and the normalized residuals of these smoothed pitch period patterns were calculated. The normalized residual pitch patterns were used to build the representative pitch patterns; the slope data were kept for observing prosodic characteristics.

(3) The normalized residual pitch patterns were regarded as 32-dimensional vectors and quantized by vector quantization. Each centroid is a representative pitch pattern.
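The sketch below follows steps (1)-(3) under several assumptions of ours: the fundamental wave filtering of [7] is not reproduced (each input is taken to be an already-extracted log pitch period contour), the outlier threshold is a placeholder since the paper does not give its value, the residuals are normalized to unit length (the paper does not specify the normalization), and k-means stands in for the unspecified VQ codebook training. All names are hypothetical.

import numpy as np
from scipy.cluster.vq import kmeans, vq

N_POINTS = 32            # each utterance is warped to 32 sample points
OUTLIER_THRESHOLD = 0.1  # assumed value; the paper does not state it

def warp_to_fixed_length(log_pitch, n_points=N_POINTS):
    """Step (1a): linearly warp a log pitch period contour to n_points samples."""
    src = np.linspace(0.0, 1.0, num=len(log_pitch))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, log_pitch)

def smooth_outliers(pattern, threshold=OUTLIER_THRESHOLD):
    """Step (1b): substitute the value interpolated from the adjoining samples
    when a sample deviates from it by more than the threshold."""
    smoothed = pattern.copy()
    for i in range(1, len(pattern) - 1):
        interpolated = 0.5 * (pattern[i - 1] + pattern[i + 1])
        if abs(pattern[i] - interpolated) > threshold:
            smoothed[i] = interpolated
    return smoothed

def slope_and_residual(pattern):
    """Step (2): first-order regression; return the slope and the
    unit-normalized residual pattern used for clustering."""
    x = np.arange(len(pattern))
    slope, intercept = np.polyfit(x, pattern, deg=1)
    residual = pattern - (slope * x + intercept)
    norm = np.linalg.norm(residual)
    return slope, residual / norm if norm > 0 else residual

def cluster_patterns(utterances, vq_size=8):
    """Step (3): vector-quantize the 32-dimensional residual patterns;
    the centroids are the representative pitch patterns."""
    patterns = np.array([
        slope_and_residual(smooth_outliers(warp_to_fixed_length(u)))[1]
        for u in utterances
    ])
    codebook, _ = kmeans(patterns, vq_size)  # centroids = representative patterns
    labels, dists = vq(patterns, codebook)   # cluster index and distance per unit
    return codebook, labels, dists

The per-unit distances returned by vq() are what allow utterances "far from the centroids" to be singled out, and the per-unit slopes feed the slope analysis of Section 4.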

4. CLUSTERING EXPERIMENTS

Using the human-machine dialogue data described in Sections 2.1 and 3, representative pitch patterns were automatically generated with several VQ codebooks, of sizes 4, 8 and 16. Fig. 1 and Fig. 2 show the representative pitch patterns for the codebook of size 8, and Table 2 shows the number of utterance units in each pitch pattern cluster for that codebook. The results indicate that many pitch patterns belong to clusters No.0 and No.2, so many utterance segments have prosodically flat patterns.

    pitch pattern cluster   number
    No. 0                     1274
    No. 1                      578
    No. 2                     2899
    No. 3                      547
    No. 4                     1034
    No. 5                      298
    No. 6                      630
    No. 7                      235
    Total                     7495

Table 2: Number of utterance units in each pitch pattern cluster (VQ size: 8)

[Figure 1: Representative pitch patterns (VQ size: 8, No.0 to No.2)]
[Figure 2: Representative pitch patterns (VQ size: 8, No.3 to No.7)]

From these results, we considered that utterances belonging to the other clusters, and those far from the centroids, included non-verbal information. We confirmed their transcriptions and checked them by listening. Among these utterances were remarks spoken to oneself and questions to the machine containing emotional expressions of puzzlement or surprise. These pitch patterns were rich in ups and downs.

Table 3 and Table 4 show example transcriptions that include utterances with peculiar pitch patterns, and Fig. 3 and Fig. 4 show the pitch patterns of examples 1 and 2. These two utterances belong to cluster No.6, and each pitch pattern rises and falls twice. For the underlined utterance of example 1, we confirmed that the changes of the pitch pattern just after laughing are much larger than in other common declarative utterances. For the underlined utterance of example 2, we verified by listening that the vowels continue longer at the part "...shi, de...". The figure also reflects the situation in which the user is thinking aloud at the part "...shi, de...".

User:   [Nto] hachikoomaeno koosatenkara, dooyatteikuka wakannai
        ([well] I don't know the way from the crossroad in front of the figure of Hachi)
System: hachikoomaeno koosatenyori migikata, eeto
        (to the right of the crossroads in front of the figure of Hachi, well)
        senroto heikooninobirumichio susumuto
        (walk down the road along a railway and)
        shingoonoarukoosatenga arimasu
        (you will find crossroads with a signal)
User:   senrotoheekoo(nomi,)no michi^. (senrotoheekoo)
        ((the ro,) the road along a railway? (along a railway))
System: motto kantanna shitsumonni shitekudasai
        (please phrase that more simply)
User:   (warai) yokuyuuna.
        ((laugh) you say such a thing.)

Table 3: Example utterances 1; the underlined utterance has a peculiar pitch pattern

User:   wakarimashitaka^ (do you understand?)
System: daitai, wakarimashita (generally, I can understand)
User:   korede, iidesuka^ (now, is this all right?)
System: tochuuninanika, mejirushiwa arimasuka (on the way, are there any landmarks?)
User:   mejirushi, desuka^. (you mean landmarks?)
User:   omoiukabimasen (nothing comes to mind)

Table 4: Example utterances 2; the underlined utterance has a peculiar pitch pattern

[Figure 3: An example of a peculiar pitch pattern; utterance unit 1 (User: (warai) yokuyuuna.)]
[Figure 4: An example of a peculiar pitch pattern; utterance unit 2 (User: mejirushi, desuka^.)]

Table 5 shows the distribution of the average slopes of the pitch patterns obtained by the linear regression analysis. The results indicate that the slope values of most utterances lie between -0.02 and 0.01, so common utterance units have flat or slightly downward slopes. Among the few utterances whose slopes rose sharply, there were utterances containing emotional expressions like those in the examples above; they differed from the general mode of expression. This result indicates that slope information can serve as an element for non-verbal information extraction. In a few segments whose pitch patterns were far from every representative pitch pattern, or whose slopes were upward, however, synthesized machine utterances had accidentally been recorded together with the user's speech. In the future, such segments should be excluded from the analysis and the user utterance set should be reclassified.

    slope (log pitch/point)   number
    less than -0.02              109
    -0.02 to -0.01               894
    -0.01 to 0.00               3607
    0.00 to 0.01                2601
    0.01 to 0.02                 201
    more than 0.02                83
    Total                       7495

Table 5: The average slope distribution of the pitch patterns of the utterance units
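As a small illustration, the sketch below tabulates the two analyses of this section: flagging "peculiar" units as those far from their nearest centroid (the paper gives no numeric cut-off, so a percentile rule is assumed here), and binning the regression slopes with the bin edges of Table 5. It assumes the dists and slopes arrays produced by the Section 3 sketch; all names are our own.

import numpy as np

def peculiar_units(dists, percentile=95.0):
    """Indices of units whose distance to the nearest centroid is large."""
    cutoff = np.percentile(dists, percentile)  # assumed cut-off rule
    return np.nonzero(dists > cutoff)[0]

def slope_distribution(slopes):
    """Counts per slope bin (log pitch / point), as in Table 5."""
    edges = [-np.inf, -0.02, -0.01, 0.00, 0.01, 0.02, np.inf]
    counts, _ = np.histogram(slopes, bins=edges)
    labels = ["less than -0.02", "-0.02 to -0.01", "-0.01 to 0.00",
              "0.00 to 0.01", "0.01 to 0.02", "more than 0.02"]
    return dict(zip(labels, counts.tolist()))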

5. CONCLUDING REMARKS

To investigate prosodic characteristics that are essential to human-machine communication, we analyzed human-machine dialogue data. The pitch pattern clustering showed that most pitch patterns of user utterances are prosodically flat and do not carry information beyond the literal expression. We therefore examined some peculiar pitch period patterns in the data, namely those far from the cluster centroids. The results indicate that these pitch patterns are full of ups and downs and signal non-verbal expressions. In terms of the average slope given by the linear regression analysis, they also differ from common pitch patterns, which are generally flat or slightly downward. In order to actually utilize such information in human-machine interaction, the representative pitch patterns should be investigated with respect to their relationship to various types of communication. We also plan to re-examine the roles of peculiar pitch patterns in actual dialogues and to extract para-linguistic information through principal component analysis of the pitch patterns.

6. ACKNOWLEDGEMENTS

The authors are indebted to Dr. N. Otsu, Director of the Machine Understanding Division in ETL, for his continued valuable support, and to the members of the Speech Processing Section for their useful discussions and technical assistance.

7. REFERENCES

1. S. Hayamizu: "Lively Communications with Spoken Dialogue Systems Utilizing Acoustic-Prosodic Information", LIMSI Technical Report, No. 94-26 (1994-12)
2. M. Nakai and H. Shimodaira: "Accent Phrase Segmentation by Finding N-Best Sequences of Pitch Pattern Templates", Proc. ICSLP-94, S08-10, pp. 347-350 (1994-9)
3. A. Imiya, T. Sekiya and A. Ichikawa: "Estimation of Sentence Structure by Intonation", JSAI Technical Report, SIG-SLUD-9501-2 (1995-6), in Japanese
4. C. W. Wightman and M. Ostendorf: "Automatic Labeling of Prosodic Patterns", IEEE Trans. Speech and Audio Processing, Vol. 2, No. 4, pp. 469-481 (1994-10)
5. T. Yoshimura, S. Hayamizu and K. Tanaka: "Word Accent Pattern Modelling by Concatenation of Mora Hidden Markov Models", Proc. IEEE ICASSP-94, 11.10, pp. I-69 - I-72 (1994-4)
6. K. Itou, T. Akiba, O. Hasegawa, S. Hayamizu and K. Tanaka: "Collecting and Analyzing Nonverbal Elements for Maintenance of Dialog Using a Wizard of OZ Simulation", Proc. ICSLP-94, S17-10, pp. 907-910 (1994-9)
7. H. Ohmura: "Fine Pitch Contour Extraction by Voice Fundamental Wave Filtering Method", Proc. IEEE ICASSP-94, 76.6, pp. II-189 - II-192 (1994-4)

Sound File References: [a415_1.wav] [a415_2.wav]