Reinforcement Learning-based Spoken Dialog Strategy Design for In-Vehicle Speaking Assistant


Reinforcement Learning-based Spoken Dialog Design for In-Vehicle Speaking Assistant

Chin-Han Tsai 1, Yih-Ru Wang 1, Yuan-Fu Liao 2
1 Department of Communication Engineering, National Chiao Tung University, e-mail: yrwang@cc.nctu.edu.tw
2 Department of Electrical Engineering, National Taipei University of Technology, e-mail: yfliao@ntut.edu.tw

Abstract

In this paper, the simulated annealing Q-learning (SA-Q) algorithm is adopted to automatically learn the optimal dialogue strategy of a spoken dialogue system. Several simulations and experiments considering different user behaviors and speech recognizer performance are conducted to verify the effectiveness of the SA-Q learning approach. Moreover, the automatically learned strategy is applied to an in-vehicle speaking assistant prototype system with real user response inputs, enabling a driver to easily control various in-car equipment, including a GPS-based navigation system.

Keywords: reinforcement learning, Q-learning, dialogue strategy, spoken dialog system

1. Introduction

Speech, especially spoken language, is the most natural, powerful and universal human-machine interface. For example, an in-vehicle spoken dialogue system (SDS) could assist a driver in safely navigating the car or in accessing real-time traffic information (see Fig. 1). For elderly homecare, it is crucial to provide elderly people with a friendly SDS for requesting various services or operating complex assistive equipment.

Usr: I want to know the nearest gas station.
Sys: The nearest gas station is at the corner of the King plaza and about 1 km away.
Usr: Please activate the navigation system.
Sys: GPS is on, please turn left at the next street.

Figure 1. The application scenario of a spoken dialogue system for GPS-based car navigation.

The current state-of-the-art SDSs are often mixed-initiative slot-filling systems [1]. This means that both the user and the system may take the initiative to provide information or ask follow-up questions in a dialog session to jointly complete certain tasks. These kinds of SDSs are useful in domains where certain pieces of information need to be elicited from the user, resulting in a set of slots to be filled, which are usually used to make a database query or update. For example, an SDS-based railway ticket reservation system might have the following slots that need to be filled to successfully reserve a ticket: (i) the departure city, (ii) the arrival city, (iii) the date on which to travel and (iv) the time at which to travel. However, how to automatically design an efficient mixed-initiative dialog strategy that assists the user in quickly filling these slots is by no means a trivial problem.

In fact, the dialog flow of an SDS can be mapped onto a Markov decision process (MDP) [2] with an appropriate set of states and actions, and the automatic learning of the dialog strategy can then be described as an intelligent control problem in which an agent learns to improve the performance of the SDS by interacting with different underlying speakers/environments.

Therefore, in this paper, a reinforcement learning-based simulated annealing Q-learning (SA-Q) [3] approach is used to automatically learn the optimal dialogue strategy of a spoken dialogue system (see Fig. 2) and, more specifically, the spoken dialogue strategy of an intelligent in-vehicle speaking assistant. Moreover, a probabilistic user model (also see Fig. 2) considering various user behaviors is built to train the SA-Q learning algorithm. The learned strategy is finally applied to an in-vehicle speaking assistant prototype system to allow a driver to easily use spoken language to control a global positioning system (GPS)-based navigation system and other complex in-car equipment.

Figure 2. The block diagram of the reinforcement learning-based dialogue strategy design for spoken dialogue systems.

The organization of this paper is as follows. The SA-Q learning algorithm is briefly described in the second section. Several simulations and experiments considering different user behaviors are presented in Section 3 to verify the effectiveness of the SA-Q learning method. The automatically learned strategy is applied to the in-vehicle speaking assistant prototype system in Section 4. Finally, some conclusions are given in the last section.
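Before turning to the learning algorithm, the following minimal Python sketch (an illustration, not taken from the paper) shows one way a slot-filling dialogue can be cast as an MDP: the state is the tuple of per-slot statuses and the actions are the system's dialogue moves. The slot statuses and action names anticipate those used in Section 3; all identifiers here are assumptions made only for illustration.

# Illustrative MDP view of a slot-filling dialogue (not the authors' code).
SLOT_STATUSES = ("Unknown", "Unconfirmed", "Grounded", "Cancelled")

SYSTEM_ACTION_TYPES = (
    "Greeting",
    "Query the task",
    "Confirm the task",
    "Give the slot information",
    "Confirm the slot information",
    "Close the system",
)

def make_state(slot_statuses):
    """An MDP state is simply the tuple of current per-slot statuses."""
    assert all(v in SLOT_STATUSES for v in slot_statuses)
    return tuple(slot_statuses)

# With N slots and four statuses per slot, the raw state space has 4**N states;
# the reward R(s, a) and transitions T(s, a, s') are induced by the user responses.
N_SLOTS = 10
initial_state = make_state(["Unknown"] * N_SLOTS)
print(len(SLOT_STATUSES) ** N_SLOTS)  # 1048576 states, i.e. roughly 1M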

2. Reinforcement learning-based dialog strategy design

As shown in Fig. 2, the SA-Q learning method [3] is a reinforcement learning algorithm that does not need a model of its working environment and can be used to adapt system behavior on-line. It is therefore well suited to dealing with various user behaviors when implementing SDSs.

The Q-learning algorithm essentially works by estimating the strength of the correlation between states and actions (so-called state-action pairs). The quality value Q(s,a) is defined as the expected discounted sum of future rewards obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The values of the state-action pairs Q(s,a) can be found by the following steps. First, when the system moves to state s and takes action a, the quality value function Q(s,a) is defined as

Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a'} Q^{*}(s',a'),    (1)

where T(s,a,s') is the state transition probability from the current state s to the next state s', R(s,a) is the reward given for this action, and \gamma is the discount factor applied to rewards received in subsequent steps. The optimal value of state s is then

V^{*}(s) = \max_{a'} Q^{*}(s,a'),    (2)

and the optimal strategy of the dialogue system is

\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a).    (3)

Q-values are estimated on the basis of experience as follows: (i) from the current state s, select an action a; this yields an immediate payoff r and a next state s'. (ii) Update Q(s,a) based on this experience:

Q(s,a) := (1-\alpha) Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
        = Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],    (4)

where \alpha is the learning rate. (iii) Repeat steps (i) and (ii) until convergence.

This algorithm is guaranteed to converge to the optimal strategy when each state is visited potentially infinitely often and the learning rate satisfies \sum_{t} \alpha_t = \infty and \sum_{t} \alpha_t^2 < \infty. For large state and action spaces and finite computing time, however, its performance strongly depends on the time course of \alpha. How to set the learning rate is thus an important issue in Q-learning, and in this paper the simulated annealing Q-learning (SA-Q) algorithm [3] is used.

The SA algorithm is usually applied to the search procedure in order to control the balance between exploration and exploitation. In the SA algorithm, the transition probability between states i and j can be set as

P(i,j) = \begin{cases} 1, & \text{if } f(j) > f(i) \\ e^{(f(j)-f(i))/t}, & \text{otherwise,} \end{cases}    (5)

where f(i) and f(j) are the cost function values and t is the temperature parameter. The same idea can be applied to Q-learning to control the balance between exploration and exploitation. In Q-learning, we can either follow the action a_p suggested by the current strategy (exploitation) or switch to a random action a_r (exploration). The probability of switching to the random action a_r is set to

P(a_p, a_r) = \begin{cases} 1, & \text{if } Q(s,a_r) > Q(s,a_p) \\ e^{(Q(s,a_r)-Q(s,a_p))/t}, & \text{otherwise.} \end{cases}    (6)

Because the Q function satisfies Q(s,a_r) \le Q(s,a_p) when a_p is the greedy action, this reduces to

P(a_p, a_r) = \exp\!\left( \frac{Q(s,a_r) - Q(s,a_p)}{t} \right).    (7)

Furthermore, the temperature-dropping criterion t_{k+1} = \lambda t_k, k = 0, 1, 2, \ldots, with \lambda \in [0.5, 1.0], is used to update the temperature t after the goal state is reached. SA-Q learning can in fact be treated as an \varepsilon-greedy method with a dynamic \varepsilon value.
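To make the update rule of Eq. (4) and the Metropolis-style exploration rule of Eqs. (6)-(7) concrete, the following is a minimal tabular SA-Q learning sketch in Python. It is not the authors' implementation: the environment interface (reset/step), the hyper-parameter values and the temperature floor are assumptions chosen only for illustration.

import math
import random
from collections import defaultdict

def sa_q_learning(env, actions, episodes=40000, alpha=0.1, gamma=0.9,
                  t0=1.0, lam=0.9):
    """Tabular SA-Q learning sketch. `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done), mirroring a dialogue
    simulator driven by a probabilistic user model."""
    Q = defaultdict(float)          # Q[(state, action)], initialized to zero
    t = t0                          # annealing temperature
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploitation candidate: the greedy action under the current policy.
            a_p = max(actions, key=lambda a: Q[(s, a)])
            # Exploration candidate: a uniformly random action.
            a_r = random.choice(actions)
            # Metropolis-style acceptance of the random action, Eqs. (6)-(7).
            accept = math.exp(min(0.0, (Q[(s, a_r)] - Q[(s, a_p)]) / t))
            a = a_r if random.random() < accept else a_p
            s_next, r, done = env.step(a)
            # Q-value update, Eq. (4).
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
        # Temperature-dropping criterion after the goal state is reached
        # (a small floor is added here only to avoid numerical underflow).
        t = max(lam * t, 1e-3)
    return Q   # the greedy policy of Eq. (3) is argmax_a Q[(s, a)]

As training proceeds, the shrinking temperature drives the acceptance probability of worse random actions toward zero, which is why SA-Q can be viewed as an epsilon-greedy scheme with a dynamically decreasing epsilon.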

3. Simulations and experiments on automatic spoken dialogue strategy design

In this section, several simulations and experiments on SDSs were carried out to verify the effectiveness of dialogue strategy learning with the SA-Q learning method. A general dialogue system with N slots and K tasks was used in all simulations; that is, there are K functions in the spoken dialogue system, and activating a specific function requires filling different numbers of slots, including some necessary and some unnecessary ones. In the following simulations, we set K to 4 and N to 6. As shown in Fig. 3, there are some joint and disjoint slots for each function. An unnecessary slot may be the task identifier; for example, in the intelligent in-vehicle speaking assistant, the user may say "U: I want to activate the navigation function." However, if the user gives some distinctive necessary slots, for example the destination or the routing option in a car navigation task, the system should immediately figure out the requested function and does not need the information in the unnecessary slot.

FIGURE 3. The general slot diagram of a spoken dialogue system.

In all the following simulations, each slot can take one of four values, i.e., Unknown, Unconfirmed, Grounded and Cancelled. Thus, the total number of possible states is 4^10, roughly 1M. Since the number of states is so large, automatically finding the optimal strategy is a very difficult, if not impossible, problem. All the possible actions that may be taken by the system are listed as follows: (i) greeting, (ii) query the task, (iii) confirm the task, (iv) give the slot information, (v) confirm the slot information and (vi) close the system. The total number of possible actions in our simulations is therefore 151.

To find a suitable spoken dialogue strategy with the SA-Q learning method, real user (environment) responses to each system move and action are needed. However, for practical reasons, it is reasonable to first use a simulated user behavior model to train an initial dialogue system; the system can then be further trained on-line. Therefore, in our simulations, a probabilistic user model as shown in Fig. 2 was built. The system reward function R(s,a) is given as

R = W_D D + W_R RF + W_M MIS + W_{cancel} CS + W_G G,    (8a)

which depends on the distance D between the current state and the goal state, the number of mismatched slots MIS between the system query and the user answer, whether the user forces the system to close (CS) and whether the task is completed or not (G). The detailed definition of D is

D = \frac{R_u U_i + R_g G_i + R_c C_i}{N_i} + R_f M_i,    (8b)

where U_i, G_i and C_i are the numbers of Unconfirmed, Grounded and Cancelled necessary slots, N_i is the number of necessary slots, M_i is the number of Unknown unnecessary slots and i is the task index. The weighting factors were set to the following values in our simulations:

R_u = 0.5, R_g = 1, R_c = 1, R_f = 3,
W_D = 1, W_R = 85, W_M = 6, W_{cancel} = 85, W_G = 20, W_T = 1.

However, how to build a good user model is another difficult problem and is beyond the scope of this paper. Therefore, a simple probabilistic user model was used in all simulations.
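As a concrete illustration of Eqs. (8a) and (8b), the short sketch below evaluates the reward for one system turn with the weight values listed above. It is only a literal reading of the formulas as written, not the authors' code: the bookkeeping of slot counts is assumed, and the RF term of Eq. (8a), whose exact definition is not spelled out here, is passed in as an opaque value.

# Illustrative evaluation of the reward function of Eqs. (8a)-(8b).
R_U, R_G, R_C, R_F = 0.5, 1.0, 1.0, 3.0
W_D, W_R, W_M, W_CANCEL, W_G = 1.0, 85.0, 6.0, 85.0, 20.0

def distance_to_goal(n_unconfirmed, n_grounded, n_cancelled,
                     n_necessary, n_unknown_unnecessary):
    """Eq. (8b): distance D between the current state and the goal state of task i."""
    return ((R_U * n_unconfirmed + R_G * n_grounded + R_C * n_cancelled)
            / n_necessary + R_F * n_unknown_unnecessary)

def turn_reward(distance, rf_term, n_mismatched_slots, user_closed, task_completed):
    """Eq. (8a) taken literally; `rf_term` stands for the RF quantity of Eq. (8a),
    which the text does not define further."""
    return (W_D * distance
            + W_R * rf_term
            + W_M * n_mismatched_slots
            + W_CANCEL * (1.0 if user_closed else 0.0)
            + W_G * (1.0 if task_completed else 0.0))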
In the following, three experiments considering (i) the convergence of the SA-Q learning algorithm, (ii) different user behaviors and (iii) the performance of the speech recognizer, respectively, were conducted to examine the effectiveness and robustness of the dialogue strategy found by the SA-Q learning algorithm.

EXPERIMENT 1: In this simulation, we consider the convergence of the SA-Q learning algorithm. The recognition rate of the speech recognizer was therefore assumed to be 100%, and the confidence measure threshold for Grounded was set to 0.65, which means that only 65% of the user's answers set the corresponding slot to Grounded while 35% set it to Unconfirmed. The probability that the user gives the information for each necessary slot of the task right after the greeting is 0.8, and the probability that the user answers a system query is also 0.8. Fig. 4 shows the convergence curve of SA-Q learning. Because the size of the state space was 220, the system needed about 40,000 training dialog sessions (epochs) to converge to the optimal strategy.

FIGURE 4. Convergence of the SA-Q learning algorithm.
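The probabilistic user behavior assumed in Experiment 1 can be sketched roughly as follows. The probabilities (0.8 for volunteering or answering a slot, 0.65 for grounding) are the values quoted above; the structure of the simulator itself is an assumption made for illustration and is not the authors' implementation.

import random

P_GIVE_SLOT_AFTER_GREETING = 0.8   # user volunteers each necessary slot after greeting
P_ANSWER_SYSTEM_QUERY = 0.8        # user answers a direct system query
P_GROUNDED = 0.65                  # confidence threshold: 65% Grounded, 35% Unconfirmed

def recognized_status():
    """Apply the confidence-measure threshold to one recognized user answer."""
    return "Grounded" if random.random() < P_GROUNDED else "Unconfirmed"

def answer_after_greeting(necessary_slots):
    """Slots the simulated user volunteers right after the system greeting."""
    return {slot: recognized_status() for slot in necessary_slots
            if random.random() < P_GIVE_SLOT_AFTER_GREETING}

def answer_to_query(current_status):
    """Status of a queried slot after one simulated user turn."""
    if random.random() < P_ANSWER_SYSTEM_QUERY:
        return recognized_status()
    return current_status           # user did not answer; the status is unchanged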

Further analysis shows that a reasonable dialogue strategy, illustrated in Fig. 5, has been found. It is worth noting that the system always tries to ask the user directly for the task-specific slots, since the system knows that the speech recognizer is perfect.

FIGURE 5. A typical example of the dialogue strategy automatically learned by the SA-Q learning algorithm.

EXPERIMENT 2: In this experiment, we examine whether different user behaviors affect the learned dialogue strategy. Four types of user behavior models were used in the following simulations: (i) User1: only one answer is given, no matter how many slots are asked for; (ii) User2: fewer than three answers are given; (iii) User3: the user gives all the answers that the system asks for; (iv) User4: the user gives not only the answers that the system asks for but also the other slots needed for the specified task. Using the above four user behavior models, four dialogue strategies were learned. Tables 1 to 4 show the performance of applying the four learned strategies to the different users. From those tables we can see that (i) the performance is best when the training user and the test user are the same, (ii) the strategy trained with User2 is the most robust one, and (iii) User4 should never be used to train the system, because the system then does not need to do anything. The average objective values are also shown in Tables 1 to 4. As can be seen, the weighting factors defined in Eqs. (8a) and (8b) enable the system to find an optimal dialogue strategy that minimizes the average number of dialogue turns.

EXPERIMENT 3: In this experiment, the robustness of the SA-Q algorithm was studied for different recognition performance and confidence levels of the speech recognizer. In Tables 5 to 9, several experiments using user model User3 were conducted to test the learned dialogue strategies by varying the speech recognizer's recognition rate and confidence measure in both the training and evaluation phases. The results show that the dialogue strategies trained under matched conditions did not, in fact, have the best performance. On the other hand, in order to increase the robustness of the system, it is preferable to train the dialogue strategy under a lower recognition rate and confidence level. Moreover, if the recognition rate and confidence measure of the speech recognizer are too low, the system can learn nothing.

4. In-vehicle speaking assistant prototype system

Since the SA-Q learning algorithm performs very well, a prototype spoken dialogue system, i.e., the in-vehicle speaking assistant, was established; it was first trained using probabilistic user models and then further trained with real users. The block diagram of the system is shown in Fig. 6. The major functions of the system include: (i) GPS/GIS car navigation assistance, (ii) points-of-interest database queries, e.g., gas stations and parking lots, and (iii) mobile phone directory assistance, i.e., direct dialing by name and phone number queries. The assignment between the functions and slots is shown in Fig. 7; there are three functions and 18 slots in total.

FIGURE 6. The block diagram of the in-vehicle speaking assistant prototype system.

FIGURE 7. The assignment between the functions and slots.

The spoken dialogue strategy found by SA-Q learning is shown in Fig. 8. As can be seen from Fig. 8, the dialogue strategy is reasonable and makes sense. Since the system expects medium recognition performance and confidence level, it tries to identify the task when it does not yet know it, and it directly asks for the remaining slot information once the underlying task is known. Finally, Fig. 9 shows a snapshot of the prototype system in action.

FIGURE 8. The dialogue strategy learned by SA-Q learning.

FIGURE 9. The in-vehicle speaking assistant prototype system in action.

5. Conclusions

In this paper, the SA-Q learning algorithm was shown to be capable of automatically learning the optimal dialogue strategy for designing a spoken dialogue system. The automatically learned dialogue strategy was then applied to an in-vehicle speaking assistant prototype system to enable a driver to easily control various in-car equipment, including a GPS-based navigation system. The extension of the SA-Q learning algorithm to more complex spoken dialogue systems is currently under exploration.

6. References

[1] E. Levin, R. Pieraccini, W. Eckert, G. Fabbrizio and S. Narayanan, "Spoken Language Dialogue: from Theory to Practice," Proc. ASRU '99, IEEE Workshop, Keystone, Colorado, Dec. 1999.
[2] E. Levin, R. Pieraccini and W. Eckert, "Using Markov decision process for learning dialogue strategies," Proc. ICASSP '98, Seattle, WA, May 1998; C. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Psychology Department, Cambridge University, Cambridge, England, 1989.
[3] Maozu Guo, Yang Liu and Jacek Malec, "A new Q-learning algorithm based on the metropolis criterion," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 34, no. 5, Oct. 2004.

7. Acknowledgements

This work was supported by the National Science Council of the R.O.C. under grant no. NSC 93-2218-E-009-062.

Table 1. Performance of the different dialogue strategies under user behavior model User1 (SR: success rate, NQ: average number of queries used in the dialogue, O: average objective measure).
Strategy   SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O
1         0.98   6.18  -164  | 0.95   8.73  -242  | 0.99   8.34  -264  | 0.98   9.07  -246
2         1.00   6.92  -180  | 0.95   9.54  -301  | 0.99   8.08  -254  | 0.97   9.03  -253
3         0.98   7.87  -217  | 0.76  11.8   -431  | 0.62  11.75  -459  | 0.50  14.56  -625
4         0.81  12.78  -501  | 0.03  19.53  -913  | 0.24  16.79  -770  | 0.02  19.64  -905

Table 2. Performance of the different dialogue strategies under user behavior model User2.
Strategy   SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O
1         0.97   6.76  -188  | 0.93   8.72  -250  | 0.72   9.44  -358  | 0.95   8.71  -248
2         1.00   7.25  -193  | 0.94   9.25  -294  | 1.00   5.69  -164  | 0.99   8.68  -230
3         0.99   6.48  -169  | 0.76  10.03  -375  | 0.95   7.17  -232  | 0.55  12.13  -510
4         0.92   9.02  -309  | 0.05  19.43  -976  | 0.22  16.67  -774  | 0.01  19.70  -902

Table 3. Performance of the different dialogue strategies under user behavior model User3.
Strategy   SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O
1         0.99   7.12  -211  | 0.96   8.15  -229  | 1.00   5.41  -195  | 0.99   8.06  -202
2         1.00   6.75  -177  | 0.95   8.13  -299  | 1.00   5.85  -164  | 0.95   8.03  -231
3         0.99   6.51  -165  | 0.99   7.29  -209  | 1.00   5.20  -143  | 0.98   7.28  -196
4         0.96   8.07  -263  | 0.26  16.78  -890  | 0.26  15.64  -736  | 0.07  18.98  -868

Table 4. Performance of the different dialogue strategies under user behavior model User4.
Strategy   SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O    |  SR    NQ     O
1         1.00   6.90  -263  | 0.99   8.21  -336  | 1.00   5.39  -228  | 1.00   8.30  -352
2         1.00   6.87  -248  | 0.98   7.94  -375  | 1.00   5.81  -179  | 1.00   7.57  -327
3         1.00   6.48  -209  | 1.00   7.03  -246  | 1.00   5.30  -161  | 1.00   7.29  -266
4         1.00   7.24  -221  | 0.15  17.96  -800  | 0.41  13.58  -613  | 0.14  18.08  -786

Table 5. Performance of strategy 3 (trained under recognition rate R = 1.00 and confidence measure C = 0.90).
R          SR    NQ     O    |  SR    NQ     O    |  SR     NQ      O
R=1.0     0.990  5.08  -160  | 0.940  7.44  -408  | 0.555  15.44  -1509
R=0.9     0.985  5.38  -173  | 0.938  7.83  -419  | 0.515  15.80  -1555
R=0.8     0.993  5.75  -189  | 0.903  8.51  -496  | 0.473  16.13  -1608
R=0.6     0.985  6.59  -233  | 0.913  9.38  -571  | 0.448  16.91  -1680

Table 6. Performance of strategy 3 (trained under recognition rate R = 1.00 and confidence measure C = 0.65).
R          SR    NQ     O    |  SR    NQ     O    |  SR     NQ      O
R=1.0     0.995  4.73  -120  | 0.993  6.70  -186  | 0.823  12.96   -426
R=0.9     0.995  5.05  -141  | 0.990  6.97  -213  | 0.835  13.19   -429
R=0.8     1.000  5.57  -161  | 0.993  7.33  -233  | 0.833  13.67   -476
R=0.6     0.990  6.34  -216  | 0.993  8.56  -325  | 0.745  14.84   -618

Table 7. Performance of strategy 3 (trained under recognition rate R = 0.90 and confidence measure C = 0.90).
R          SR    NQ     O    |  SR    NQ     O    |  SR     NQ      O
R=1.0     0.978  5.04  -156  | 0.853  7.84  -456  | 0.480  13.97  -1421
R=0.9     0.973  5.22  -168  | 0.855  8.20  -476  | 0.478  14.46  -1414
R=0.8     0.965  5.49  -179  | 0.858  8.47  -463  | 0.398  14.61  -1480
R=0.6     0.950  6.75  -262  | 0.828  9.72  -564  | 0.350  15.76  -1666

Table 8. Performance of strategy 3 (trained under recognition rate R = 0.90 and confidence measure C = 0.65).
R          SR    NQ     O    |  SR    NQ     O    |  SR     NQ      O
R=1.0     0.950  5.57  -166  | 0.895   8.05  -273 | 0.733  13.81   -476
R=0.9     0.958  5.69  -175  | 0.880   8.56  -298 | 0.713  14.32   -516
R=0.8     0.930  6.53  -223  | 0.885   9.05  -319 | 0.668  13.96   -547
R=0.6     0.915  7.25  -260  | 0.853  10.25  -410 | 0.645  15.10   -660

Table 9. Performance of strategy 3 (trained under recognition rate R = 0.60 and confidence measure C = 0.65).
R          SR     NQ     O   |  SR     NQ     O   |  SR     NQ      O
R=1.0     0.718   8.86  -334 | 0.273  15.85  -688 | 0.025  19.32   -895
R=0.9     0.718   9.10  -342 | 0.278  15.77  -680 | 0.045  19.24   -891
R=0.8     0.653  10.15  -395 | 0.228  16.62  -732 | 0.030  19.35   -903
R=0.6     0.583  11.78  -477 | 0.175  17.50  -778 | 0.018  19.61   -928