VOICE-ACTIVATED HOME BANKING SYSTEM AND ITS FIELD TRIAL


Toshihiro Isobe, Masatoshi Morishima, Fuminori Yoshitani, Nobuo Koizumi, Ken'ya Murakami
Laboratory for Information Technology, NTT DATA COMMUNICATIONS SYSTEMS CORPORATION

ABSTRACT

Speech recognition techniques are most useful when used over the telephone. Telephone speech recognizers have been developed and many field trials have been carried out [1][2][3]. We have developed telephone speech recognition hardware for a voice-activated home banking system based on a client-server network configuration [4]. The speech recognition unit is a workstation with six boards for simultaneous multi-channel processing. The speech recognition algorithm implemented on the boards, each of which has three DSPs and an MPU, handles various tasks, such as recognizing connected digits, bank names, branch names, money amounts, and the confirmations needed to complete the service dialogs. Experimental field trials with 90 subjects showed that, with proper instructions and guidance, the service task was completed successfully in 85% of trials. We also sent out a questionnaire; one third of the subjects replied that speech recognition was useful.

1. HOME BANKING SYSTEM

The configuration of the client-server banking system is illustrated in Figure 1. A registered user need only call and talk to the system by telephone to transfer money to another bank account or to get balance information. The speech recognition unit, which can accept six calls at a time, is a workstation with six speech recognition boards. The workstation controls the speech dialogs (an example is shown in Figure 2); it is connected to the network of banks by a LAN (TCP/IP) and is operated by the bank's systems.

Figure 1: Outline of the voice-activated banking system (telephones, public switched telephone network, PBX, six circuits into the six-board speech recognition unit, LAN (TCP/IP) to the network of banks)

"Hello! This is the telephone service center. What kind of service do you want?" <- "Money transfer."
"Did you say money transfer?"
"Please say your account number." <- "1234567"
"Did you say 1234567?"
"Please say your code number." <- "****"
"You are accepted."
"To which bank do you want to transfer?" <- "Fuji Bank"
"Did you say Fuji Bank?"
"To which branch do you want to transfer?" <- "Kyoto"
"Did you say Kyoto?"
"Please say the account number to which you want to transfer." <- "9876543"
"Did you say 9876545?" <- "No."
"Please say the account number to which you want to transfer." <- "9876543"
"Did you say 9876543?"
"Please say the amount of money you want to transfer." <- "13,000 yen."
"Did you say 13,000 yen?"
"I will send the information on your transfer to your FAX. Please check it. Thank you."

Figure 2: Speech dialog (system prompts in quotes; user utterances follow "<-")

2. SPEECH RECOGNITION BOARD

The hardware configuration of the board is shown in Figure 3. Each board is connected to one telephone circuit and has three DSPs (TMS320C31) and one MPU (MC6). The first DSP (DSP1) handles telephone network control, touch-tone detection, A/D conversion, feature extraction, and the calculation of the likelihoods of the basic Gaussian distributions in the code books of the tied-mixture HMMs. The second DSP (DSP2) calculates the output probabilities of the HMM states. The last DSP (DSP3) computes the Viterbi scores. The MPU extracts the recognized words by tracing back the Viterbi scores and controls all of the DSPs.

Figure 3: Speech recognition board (telephone interface, DSP1-DSP3 with 1 MB local memory each, MC6 MPU with 16 MB main memory, VME bus)
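The processing split described above amounts to a four-stage pipeline: front-end feature extraction and codebook Gaussian likelihoods (DSP1), tied-mixture state output probabilities (DSP2), Viterbi scoring over a finite-state network (DSP3), and traceback of the recognized words (MPU). The following is a minimal software analogue of that split on synthetic data; all sizes, names, and the toy left-to-right network are assumptions made for illustration, not the firmware running on the board, and beam pruning and telephony control are omitted.

```python
# Minimal software analogue of the per-board processing split (DSP1 -> DSP2 -> DSP3 -> MPU).
# Synthetic data; hypothetical sizes and names; not the authors' DSP firmware.
import numpy as np

rng = np.random.default_rng(0)
T, D = 40, 12          # frames and feature dimension (hypothetical)
G, S = 64, 6           # codebook Gaussians and HMM states (hypothetical)

features = rng.normal(size=(T, D))            # stand-in for DSP1 feature extraction
means = rng.normal(size=(G, D))               # codebook Gaussian means
var = np.ones((G, D))                         # diagonal covariances
mix_w = rng.dirichlet(np.ones(G), size=S)     # per-state tied-mixture weights

# DSP1: log-likelihood of every codebook Gaussian for every frame.
diff = features[:, None, :] - means[None, :, :]
cb_loglik = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)   # (T, G)

# DSP2: state output probabilities = mixture weights applied to the shared codebook.
state_logprob = np.log(np.exp(cb_loglik) @ mix_w.T + 1e-300)                  # (T, S)

# DSP3: Viterbi scores over a toy left-to-right chain (a stand-in for the
# finite-state word network; beam pruning omitted).
trans = np.full((S, S), -np.inf)
for s in range(S):
    trans[s, s] = np.log(0.5)
    if s + 1 < S:
        trans[s, s + 1] = np.log(0.5)
delta = np.full((T, S), -np.inf)
back = np.zeros((T, S), dtype=int)
delta[0, 0] = state_logprob[0, 0]
for t in range(1, T):
    cand = delta[t - 1][:, None] + trans       # score of every (from, to) transition
    back[t] = np.argmax(cand, axis=0)          # best predecessor of each state
    delta[t] = cand[back[t], np.arange(S)] + state_logprob[t]

# MPU: recover the recognized sequence by tracing the Viterbi path back.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print("best state path:", path[::-1])
```

The sketch only mirrors the data flow of the split; on the actual board each stage runs on its own processor, and the finite-state network represents the task vocabulary rather than a single chain.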

3. RECOGNITION ALGORITHM

We use context-dependent tied-mixture phone HMMs with three states per phone and 501 states in total for all phonemes. A phone HMM consists of one state for a left-context-dependent model, one state for a context-independent model, and one state for a right-context-dependent model (see Figure 4). DSP2 computes the output probabilities of the HMM states by multiplying the likelihood scores from the DSP1 code book by the mixture weights. DSP3, which holds the finite-state automaton network, computes the Viterbi scores using a beam-search algorithm.

Figure 4: Phone model (left-context-dependent, context-independent, and right-context-dependent states)

4. SPEECH DATABASE FOR TRAINING SPEECH MODELS

To make the speech models, we collected telephone speech data from 0 males and 0 females living in seven major cities in Japan, taking into account the balance of age groups and regional dialects [5]. The database consists of 8,000 names of banks and credit associations, 8,000 names of branches, 4,0 phrases of four connected numbers, 4,000 phrases of seven connected numbers, 4,0 phrases that represent amounts of money, and 0 sets of six words needed in the banking services. We trained the speech models on half of the database using maximum-likelihood estimation and tested the system on the other half.

5. EXPERIMENT

In DSP1, in order to reduce the amount of computation for the code book probabilities, we use two layers of code books (Figure 5). One is a normal code book (layer 2) with 1,0 Gaussian distributions, which are the basic distributions of the tied-mixture HMMs; the other is a small code book (layer 1) with 64 Gaussian distributions estimated from the layer-2 distributions by the k-means VQ method within the Baum-Welch algorithm. When the system receives speech, DSP1 first computes the probabilities of the 64 distributions in layer 1 and selects the one with the highest score. It then calculates the probabilities of the 500 distributions in layer 2 that are nearest to the one selected in layer 1. The scores of the layer-2 distributions that are not calculated are set to those of their closest layer-1 distributions.

Figure 5: Tree structure of the mixture (layer 1: 64 distributions; layer 2: the nearest distributions are calculated exactly, the rest take the score of their closest layer-1 distribution)
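The two-layer code book evaluation just described can be sketched as follows. This is a rough illustration under stated assumptions (hypothetical sizes, Euclidean distance between means as the nearness criterion, a shared diagonal covariance); it is not the DSP1 implementation.

```python
# Illustrative sketch of the two-layer code book evaluation (Section 5).
# Sizes, the nearness criterion, and all names are assumptions, not taken
# from the DSP1 implementation.
import numpy as np

rng = np.random.default_rng(1)
D = 12                       # feature dimension (hypothetical)
N1, N2, M = 64, 1024, 500    # layer-1 size, layer-2 size, layer-2 Gaussians evaluated exactly

layer2_means = rng.normal(size=(N2, D))   # basic distributions of the tied-mixture HMMs
layer1_means = rng.normal(size=(N1, D))   # in the paper, estimated from layer 2 by k-means VQ
var = np.ones(D)                          # shared diagonal covariance, for simplicity


def gauss_loglik(x, means, var):
    """Diagonal-covariance Gaussian log-likelihoods of one frame under several means."""
    diff = x[None, :] - means
    return -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)


# Precomputed structure corresponding to the tree of Figure 5.
dist = np.linalg.norm(layer2_means[:, None, :] - layer1_means[None, :, :], axis=-1)  # (N2, N1)
owner = np.argmin(dist, axis=1)             # closest layer-1 Gaussian of each layer-2 Gaussian
nearest_M = np.argsort(dist, axis=0)[:M].T  # (N1, M): the M layer-2 Gaussians nearest to each layer-1 Gaussian


def approximate_codebook_scores(frame):
    """Approximate log-likelihoods of all layer-2 Gaussians for one frame."""
    l1 = gauss_loglik(frame, layer1_means, var)    # evaluate the layer-1 Gaussians
    best = int(np.argmax(l1))                      # select the one with the highest score
    scores = l1[owner]                             # back off to the closest layer-1 score
    sel = nearest_M[best]                          # the M layer-2 Gaussians nearest to the winner
    scores[sel] = gauss_loglik(frame, layer2_means[sel], var)   # are evaluated exactly
    return scores


print(approximate_codebook_scores(rng.normal(size=D))[:5])
```

With these illustrative sizes, each frame requires 64 + 500 exact Gaussian evaluations instead of 1,024, which is the kind of saving the two-layer scheme is intended to provide; the resulting scores then feed the mixture-weight computation on DSP2.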

5.1. Benchmark test

Using the telephone speech database, we measured the performance of the speech recognition board. The recognition tasks included seven connected digits, bank names, and amounts of money. The results are shown in Table 1.

Table 1: Results of the benchmark test

Task                 Vocabulary size   Perplexity   Recognition rate [%]
7 connected digits   12                12.0         92.4
Bank name            0                 4.94         90.8
Amount of money      46                24.3         77.4

5.2. System field trial

We tested the voice-activated home banking system over the public switched telephone network, collecting human-machine dialogs from 90 subjects. We asked the subjects to play-act making a bank transfer by telephone. Trials were run twice for each subject. Between the first and second attempts, we presented to each subject an explanation of how to speak to the system and a sample dialog from a skilled user (Figure 6).

Figure 6: Field trial sequence for each subject (bank transfer; explanation of how to speak to the system; presentation of a sample dialog; bank transfer)

The results of the recognition tasks during the dialogs are shown in Figures 7 through 9. In these figures, the x-axis is the number of repeat utterances and the y-axis is the cumulative accuracy rate. People get used to the system over repeated attempts, so accuracy increases with repetition and converges by the third repetition in all tasks. The explanation and the skilled user's sample dialog are effective in adapting subjects to the system, so the accuracy rates for the second attempt are higher than those for the first in all figures.

Figure 7: Accuracy rate vs. number of repeat attempts in recognizing seven connected digits
Figure 8: Accuracy rate vs. number of repeat attempts in recognizing the bank name
Figure 9: Accuracy rate vs. number of repeat attempts in recognizing the money amount

Figure 10 presents the success rate of the service; its x-axis is the maximum number of repetitions allowed by the system. When users had three chances to repeat, about 85% of them completed the bank transfer service successfully, and three repeats is about the maximum that users can endure. The accuracy rate for "yes"/"no" recognition, which is the most important for controlling the speech dialogs, was 89.3% in the first trial and 92.3% in the second.

Figure 10: Success rate of the money transfer service vs. the number of repeat attempts allowed in each recognition task
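For reference, the cumulative accuracy rate plotted in Figures 7 through 9 can be computed from per-subject attempt logs as in the short sketch below. The data are made up, and this is not the analysis script used in the trial.

```python
# Cumulative accuracy rate as in Figures 7-9: the fraction of subjects whose
# utterance was recognized correctly within the first n attempts.
# first_correct[i] = attempt (1, 2, ...) on which subject i was first recognized
# correctly, or None if never recognized; values below are made up.
first_correct = [1, 1, 2, 1, 3, None, 2, 1, 4, 2]

max_attempts = 7
n_subjects = len(first_correct)
for n in range(1, max_attempts + 1):
    ok = sum(1 for a in first_correct if a is not None and a <= n)
    print(f"within {n} attempt(s): cumulative accuracy = {100.0 * ok / n_subjects:.1f}%")
```

The service success rate of Figure 10 can be tallied the same way, by counting the subjects who completed every recognition task within the maximum number of repetitions allowed.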

6. QUESTIONNAIRE

We investigated how users felt about the SR (speech recognition) system by putting questions to the subjects after the field trial. The questionnaire was multiple choice.

Figure 11: Is SR useful?
Figure 12: What about the duration of the SR service?

Figure 11 shows the users' responses on the usefulness of the SR system. About one third thought it was useful, and about the same number thought it was not. Apart from recognition performance, the main reason was the long service duration, which results from the "yes"/"no" confirmation after every item, such as the bank name and account number, required for security when dealing with money (Figure 12).

Figure 13: Is it easy to think of what to say when you are asked your account number?
Figure 14: Is it easy to think of what to say when you are asked the amount of money?

Figures 13 and 14 show that users were more at a loss for what to say when giving responses that can be phrased in various ways, such as the amount of money, than for responses with fewer possible variations, such as the account number. This is the reason for the low recognition rate for the money amount within the first three attempts (see Figure 9).

Figure 15: How many times did you say words not in the vocabulary?
Figure 16: Do you prefer SR or PB for inputting the account number and money amount?

Figure 15 shows the percentage of people who uttered words outside the specified vocabulary. Figure 16 shows which method users prefer, SR or PB (touch-tone input), for entering connected digits and money amounts. Because of the widespread use of PB and the secrecy required in money transfer services, 51% of the subjects chose PB.

Figure 17: Are you willing to use SR as an actual system?
Figure 18: Which is better for support, an explanation or an example dialog?

About % of the subjects answered that they would be willing to use SR for actual services on the condition that they were accustomed to the system (Figure 17). Sample dialogs made by trained speakers were preferred to explanations of how to speak to the system (Figure 18). These two figures show that getting users used to the system is an important factor in putting SR to practical use.

7. CONCLUSIONS

We have reported on our voice-activated home banking system: the speech recognition board, the algorithm, the field trial, and how users feel about the system. In the last few years, SR techniques have reached the level of practical usefulness, but not for spontaneous speech, so it is important to give users knowledge of the system and to help them get accustomed to it. Giving examples of skilled users' dialogs is highly effective for helping newly registered customers. Both in the experiment using the telephone speech database and in the field trial, the accuracy of recognizing the money amount was lower than that of the other tasks. To maintain accuracy in an actual banking service requiring a high level of security, recognition of a spoken money amount can be replaced by another convenient method such as touch-tone detection. SR remains available for recognizing the bank name or branch name when users cannot easily remember the code number. The popularization of personal computers has enabled customers to use various modes of service such as home banking. We are going to continue the system trials and establish the most effective way to use speech recognition techniques in a human-machine interface.

8. REFERENCES

1. G. Ortel, "Observed long-term changes in customer calling in a telephone application using automatic speech recognition," Proc. EUROSPEECH '95, pp. 273-276, Sep. 1995.
2. M. Lennig and G. Bielby, "Directory assistance automation in Bell Canada: Trial results," Proc. Workshop on IVTTA, pp. 9-13, Sep. 1994.
3. S. Yamamoto, K. Takeda, N. Inoue, S. Kuroiwa and M. Naitoh, "A voice-activated telephone exchange system and its field trial," Proc. Workshop on IVTTA, pp. 21-26, Sep. 1994.
4. T. Isobe, M. Morishima, F. Yoshitani, N. Koizumi and K. Murakami, "Voice-activated home banking system," Proc. Workshop on Automatic Speech Recognition, pp. 163-164, Dec. 1995.
5. T. Isobe and K. Murakami, "Telephone speech data corpus and performances of speaker independent recognition system using the corpus," Proc. Workshop on IVTTA, pp. 101-104, Sep. 1994.

Sound File References: [a265s1.wav]