Reinforcement Learning-based Feature Selection For Developing Pedagogically Effective Tutorial Dialogue Tactics

Min Chi, Pamela Jordan, Kurt VanLehn, Moses Hall
{mc31, pjordan+, vanlehn+, mosesh}@pitt.edu
Learning Research and Development Center & Intelligent Systems Program, University of Pittsburgh

Abstract. Given the subtlety of tutorial tactics, identifying effective pedagogical tactical rules from human tutoring dialogues and implementing them in dialogue tutoring systems is not trivial. In this work, we used reinforcement learning (RL) to automatically derive pedagogical tutoring dialogue tactics. Past research has shown that the choice of features significantly affects the effectiveness of the learned tactics. We defined a total of 18 features, which we classified into four types. First, we compared five feature selection methods; overall, the upper-bound method appears to be the most effective. Then we compared the four types of features and found that temporal situation and autonomy related features are significantly more relevant and effective for tutorial decisions than either performance or situation related features.

1 Introduction

One challenging issue confronting dialogue tutoring systems is how to select the best tutoring action for any student state so as to maximize learning gains. We call this decision making tutoring tactics since it generally governs brief episodes of tutoring dialogue, such as a single step, and it seems to be crucial for achieving further improvements in pedagogical effectiveness. Bad tutorial decisions have the potential to bore or frustrate students to the point that they fail to learn. Many existing dialogue systems use hand-crafted tactical rules derived from analyzing human dialogues [6][13]. However, identifying and implementing these tactical rules is not trivial. It has been shown that expert human tutors generally employ a range of tutoring actions and that their decisions depend in subtle ways on the student's competence, the student's self-confidence, and other factors. For example, one important tutoring tactic is whether the tutor should tell the student the answer for a step directly or whether the tutor should elicit the step from the student with a prompt or a series of questions. Collins et al. suggest that when a student is unfamiliar with the target knowledge, the tutor should tell it directly; when a student becomes more familiar with the knowledge, the tutor should elicit the answer via questions; and once a student has mastered the knowledge, it does not matter whether it is told or elicited [4][5]. Most existing dialogue tutoring systems with hand-crafted rules react sensibly when students exhibit less than optimal behaviors; however, their tutorial interactions are much stiffer than those of human tutors, and they have not yet achieved the effectiveness of one-on-one, face-to-face expert human tutors [1].

In recent years, work on designing spoken dialogue systems has proposed several data-driven methodologies; one of them is Reinforcement Learning (RL) [3][12][13]. It has been shown that RL is effective at automatically learning the best action to take at any state in a dialogue, and this success inspired us to use RL for dialogue tutoring systems.

Our work can be divided into three stages. In the first stage, we built an initial tutorial dialogue system and collected an exploratory corpus by using the system to train 64 students. In this initial system, decisions on certain types of tutoring actions were made randomly. In the second stage, we used RL on the exploratory corpus to derive tutoring tactics for these actions and then incorporated the learned tactics into the initial system. The modified system is identical to the initial one except that the tutoring actions that were previously made randomly are now based on the learned tutoring tactics.

In the last stage, the modified system is being used to train a new group of students so that we can compare their learning performance to that of the students previously trained on the initial system. We expect the modified system to be so well attuned to students that it will be more effective than the initial system.

One tactical decision investigated in this study was whether the tutor should tell the next step to the student or elicit it from the student. We defined 18 features to model the dialogue and the students' state. They were selected based upon the four types of features that were shown in [7] to be relevant to human tutors' tutorial decisions: autonomy, temporal situation, situation, and performance. Among the 18 features, features 1-3 are autonomy related, based upon the amount of the work that the tutor has let students do; features 4-8 cover time-related information such as the time spent on the training so far; features 9-11 are situation related, including whether the student and tutor are currently problem solving or in post-problem discussion, the difficulty level of the problem, and so on; and finally, features 12-18 cover performance information such as the correctness of the student's previous answers and the ability of the student.

In an RL model, the size of the state space increases exponentially as the number of features increases. In order to learn effective tutoring tactics, we should have a corpus that covers each of these states at least once, which means 2^18 states in our case. However, it is almost impossible to do so given the high cost of collecting educational data. On the other hand, the learned policy may become more subtle than necessary. Figure 1 shows an example of a learned policy involving five features: [1, 2, 4, 15, 16], in which feature 2 is an autonomy feature defined as the percentage of elicitation the students have received so far. Each of the five features was converted from a real number to a binary value based upon its median score; for example, f2='.4982' means that if the value of feature 2 is above .4982 it is 1, otherwise it is 0. There were a total of 32 rules learned: in 10 situations the tutor should elicit, in 19 it should tell, and in the remaining 3 cases either will do. For example, when all of the features are 0 the tutor should tell, since 0:0:0:0:0 is the first entry in the list of tells. As can be seen, a policy over just five features already covers a large space and is already much more subtle than most of the tutorial tactics derived from analyzing human tutorial dialogues [4][5].

KC22 'features'=[1, 2, 4, 15, 16], 'cutoff'=[f1='1.0000' f2='.4982' f4='56.0000' f15='.6154' f16='.2683'],
'policy':
  'elicit': [0:0:0:0:1, 0:0:0:1:0, 1:0:0:0:0, 1:0:0:0:1, 1:0:0:1:0, 1:1:0:1:1, 0:0:1:0:0, 0:0:1:0:1, 1:0:1:0:0, 1:0:1:0:1]
  'tell': [0:0:0:0:0, 0:0:0:1:1, 0:1:0:0:0, 0:1:0:0:1, 0:1:0:1:0, 0:1:0:1:1, 1:1:0:0:1, 1:1:0:1:0, 0:0:1:1:0, 0:0:1:1:1, 0:1:1:0:0, 0:1:1:0:1, 0:1:1:1:0, 0:1:1:1:1, 1:0:1:1:1, 1:1:1:0:0, 1:1:1:0:1, 1:1:1:1:0, 1:1:1:1:1]
  'else': [1:0:0:1:1, 1:1:0:0:0, 1:0:1:1:0]

Figure 1. A Learned Policy Based On Five Features On KC22

Based on the size of the exploratory corpus we collected in the first stage, we decided that no more than six features should be used. Previous research involving RL and feature selection in dialogue systems either focused on selecting features with certain characteristics [2][7][8] or investigated a relatively small number of features [10][11]. Therefore, in this study, we propose four general RL-based feature selection methods.
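To make the representation in Figure 1 concrete, here is a minimal Python sketch of how a policy of this form could be applied at decision time: each raw feature value is discretized against its cutoff and the resulting binary state is looked up in the elicit/tell lists. The dictionary layout, helper names, and the abbreviated state lists are illustrative assumptions, not Cordillera's actual implementation.

```python
# Minimal sketch (illustrative, not the actual Cordillera code) of applying a
# learned policy of the Figure 1 form: binarize features by their cutoffs,
# then look the binary state up in the 'elicit' / 'tell' lists.

KC22_POLICY = {
    "features": [1, 2, 4, 15, 16],
    "cutoff":   {1: 1.0, 2: 0.4982, 4: 56.0, 15: 0.6154, 16: 0.2683},
    "elicit":   {"0:0:0:0:1", "1:0:0:0:0", "1:0:0:0:1"},   # abbreviated
    "tell":     {"0:0:0:0:0", "0:0:0:1:1", "0:1:0:0:0"},   # abbreviated
}

def binarize(raw_values, policy):
    """Convert raw feature values to the 0/1 state string used by the policy."""
    bits = ["1" if raw_values[f] > policy["cutoff"][f] else "0"
            for f in policy["features"]]
    return ":".join(bits)

def choose_action(raw_values, policy, default="elicit"):
    """Return 'elicit' or 'tell'; states in the 'else' list fall back to a default."""
    state = binarize(raw_values, policy)
    if state in policy["elicit"]:
        return "elicit"
    if state in policy["tell"]:
        return "tell"
    return default  # 'else' states: either action will do

# Example: a student who has received 62% elicitation so far (feature 2).
print(choose_action({1: 0.0, 2: 0.62, 4: 40.0, 15: 0.3, 16: 0.1}, KC22_POLICY))
```

In this hypothetical call the state binarizes to 0:1:0:0:0, which falls in the tell list, so the tutor would tell the step.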

2 Background

Past research on using RL to improve spoken dialogue systems has commonly used Markov Decision Processes (MDPs). An MDP [8][9] describes a stochastic control process and formally corresponds to a 4-tuple (S, A, T, R), in which S = {s_i}, i = 1..n, is a finite set of process states; A = {a_k}, k = 1..m, is a finite set of actions; T : S × A × S → [0, 1] is a set of transition probabilities between states that describe the dynamics of the modeled system, for example, P_t(s_i | s_j, a_k) is the probability that the model transitions from state s_j to state s_i by taking action a_k at time t; and R : S × A × S → ℝ denotes a reward model that assigns rewards to state transitions and models the payoffs associated with such transitions. The goal of using MDPs is to determine the best policy π*: the set of actions the model should take at each state s_i to maximize its expected cumulative utility (V-value), which can be calculated from the following recursive equation:

V(s_i) = \sum_{s_j} P\bigl(s_j \mid s_i, \pi^*(s_i)\bigr)\,\bigl[\, R\bigl(s_i, \pi^*(s_i), s_j\bigr) + \gamma V(s_j) \,\bigr]

As long as a proper state space, action space, and reward function are set up, an MDP allows one to automatically derive and compute the best policy. For dialogue tutoring systems, deriving effective tutorial tactics from tutoring dialogues can be naturally cast in the MDP formalism: for the state space, each s_i can be viewed as a vector of features representing the tutoring dialogue context, the student's knowledge level, and so on; the action space corresponds to tutoring actions, e.g., elicit or tell; and the reward function corresponds to students' learning gains. For tutoring dialogues, the reward is a delayed reward because for each state the reward is not known until students have completed the tutoring and taken the post-test.

In this work, we used an MDP package that was designed for Tetreault & Litman's studies, since it has proven to be both reliable and successful [10][11]. In previous studies, Tetreault & Litman primarily investigated methods for evaluating whether certain features would improve policy effectiveness. There are two main differences between our study and theirs. First, they trained their MDP model on a previously collected corpus that was not exploratory with respect to tutorial tactics, in that the tutor often used only one type of action in many dialogue states, which severely limited the types of questions they could investigate [10][11]; we instead use a more exploratory corpus, collected by training students on a dialogue tutoring system in which multiple actions can often be taken and the tactical decisions were made randomly, which is better suited for creating an MDP. Second, they did not need to address the problem of general feature selection methods since they used only five features, while we had to select up to six of 18.

To evaluate the learned policies, we use three criteria: Expected Cumulative Reward (ECR) and the lower-bound and upper-bound of the 95% confidence interval. ECR is the average reward one would expect in the MDP; it is calculated by normalizing the V-value of each state by the number of times it occurs as a start state in a dialogue and then summing over all states [11]. The confidence interval represents how reliable the ECR of a learned policy is [10]: the wider it is, the less reliable the policy's ECR is. For example, a learned policy A derived from feature 2 alone on the definition of spring potential energy (KC22) is "if the percentage of elicits so far is less than 49.82%, the tutor should elicit; otherwise the tutor should tell." Policy A has ECR = 3.02 (range [-100, 100]) with a 95% confidence interval of [-2.71, 8.45], which means there is a 95% chance that the ECR of the learned policy is between a lower-bound of -2.71 and an upper-bound of 8.45.
The policy in Figure 1 is also for KC22 but involves five features: 1, 2, 4, 15, and 16. It has ECR = 44.29 with a 95% confidence interval of [23.49, 50.51], which makes it much more effective than policy A because even its lower-bound is much higher than policy A's upper-bound.
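To illustrate how the best policy can be computed from the recursive V-value equation above, here is a small value-iteration sketch in Python. The toy two-state MDP and all names are assumptions for illustration only; the study itself used the MDP package from Tetreault & Litman [10][11], not this code.

```python
# Minimal value-iteration sketch for a small MDP (illustrative only).

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """T[s][a][s2] = transition probability, R[s][a][s2] = reward.
    Returns (V, policy) where policy[s] maximizes expected cumulative utility."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = {a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                 for a in actions}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(actions,
                     key=lambda a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in states))
              for s in states}
    return V, policy

# Toy example: two binary states, actions 'elicit' and 'tell'.
states, actions = ["0", "1"], ["elicit", "tell"]
T = {"0": {"elicit": {"0": 0.3, "1": 0.7}, "tell": {"0": 0.8, "1": 0.2}},
     "1": {"elicit": {"0": 0.1, "1": 0.9}, "tell": {"0": 0.5, "1": 0.5}}}
R = {s: {a: {"0": -100.0, "1": +100.0} for a in actions} for s in states}
V, policy = value_iteration(states, actions, T, R)
print(policy)
```

In practice the transition probabilities and rewards would be estimated from the logged state-action sequences in the exploratory corpus before solving for the policy.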

Sometimes we encounter situations in which the ECR for policy A is the same as for policy B, but the confidence interval of A is much narrower than that of B. Using only the three criteria described above, policies A and B cannot be compared. Therefore, we define the hedge of a learned policy as:

Hedge = \frac{ECR}{\text{upper-bound} - \text{lower-bound}}

By using the hedge, we can say that policy A is better than policy B when the hedge of A is higher than the hedge of B.

3 Experiment

The domain chosen for this study is work and energy as covered in college physics. The procedure was as follows: students (1) read a short textbook, which describes the major principles and concepts; (2) take a pre-test; (3) work through seven open-ended training problems with a dialogue tutoring system (Cordillera [14]); and (4) take a post-test that is identical to the pre-test. In stage one, 64 college students who had not taken any college physics completed the experiment, receiving payment for their participation. Students needed 8-15 hours to complete the procedure and usually required 4-6 sessions of about 2 hours each. Thus, the collected corpus comprises 64 human-computer tutoring dialogues, and each dialogue is an extended interaction with one student that covers seven different college-level physics problems. The number of state-action pairs in each of the 64 collected tutorial dialogues varies from 700 to 900.

All training problems and all pre- and post-test problems were selected from 127 quantitative and qualitative problems collected from various physics literature. In order to solve these 127 problems, 32 unique knowledge components (KCs) were identified as necessary. A KC is a generalization of everyday terms like concept, principle, fact, or skill, and of cognitive science terms like schema, production rule, misconception, or facet [14]. For example, KC22 (if an object and a spring are in a system, then their spring potential energy is 0.5*k*d^2, where k is the spring constant and d is the displacement of the object relative to the equilibrium position of the spring) is both a concept and a principle, while KC23 (the unit for energy is the Joule) is a fact. KC22 consists of both procedural and declarative knowledge while KC23 is mainly declarative knowledge, and thus learning these two KCs clearly involves different cognitive skills. Therefore, for different KCs, we expect that different features should be considered when making a tactical decision and that different tactical decisions should be derived. In order to learn KC-specific tutorial tactics, students' pre- and post-test scores were also calculated per KC.

It turned out that our training, pre-test, and post-test problems covered 27 of the 32 possible KCs. Three of the 27 KCs showed up in the tutoring dialogue but were not associated with any action decision points; five KCs coincided only once or twice with decision points; and one KC did not appear in the pre- and post-tests. Therefore, we were left with 18 KCs for which it is possible to derive tutoring tactics. Comparisons of pre- and post-test scores indicated that students did learn during their training with Cordillera: their post-test scores were significantly higher than their pre-test scores, t(126) = 3.36, p = 0.001. Similar to Tetreault & Litman's work, we used students' normalized learning gains (NLGs) for each KC as the delayed reward function, defined as:

NLG(KC) = \frac{Posttest(KC) - Pretest(KC)}{1 - Pretest(KC)}
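To illustrate how the delayed reward could be computed from the test scores, the following Python sketch derives per-KC normalized learning gains and then applies the median split into +100/-100 rewards described in the next paragraph. The data and helper names are assumptions for illustration, not the study's actual code.

```python
from statistics import median

def nlg(pretest, posttest):
    """Normalized learning gain for one KC: (post - pre) / (1 - pre)."""
    return (posttest - pretest) / (1.0 - pretest)

def median_split_rewards(scores_by_student):
    """scores_by_student: {student_id: (pretest_kc, posttest_kc)}.
    Returns {student_id: +100 or -100}, split at the median NLG for this KC."""
    gains = {sid: nlg(pre, post) for sid, (pre, post) in scores_by_student.items()}
    cut = median(gains.values())
    return {sid: (100 if g > cut else -100) for sid, g in gains.items()}

# Illustrative scores for one KC (not real data):
print(median_split_rewards({"s1": (0.2, 0.8), "s2": (0.5, 0.6), "s3": (0.1, 0.3)}))
```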

For each KC, the 64 students were split into two groups by the median value of their NLG on it: the better-performing students' dialogues were given a positive reward of +100, while the other half were assigned a negative reward of -100. The rewards were assigned in the final dialogue state. Additionally, we annotated our tutoring dialogues and action decisions based on which KCs a tutor action or tutor-student pair of turns covered; Table 1 shows an example. Here lines 1, 5, 6, 7, 9, and 11 are tutor turns; the action type and the target KCs are listed on the right. For example, the first utterance is an elicit action regarding KC22 while the fifth is a tell action directly stating KC22. Lines 2 and 12 are student responses and both are correct (as indicated by the "+"); line 2 is about KC22 while line 12 is about KC27.

Table 1. Example of a Tutorial Dialogue with KC Labels

1. T: At T2 we also know the spring is uncompressed. So, what type of energy can we easily find now? [elicit, KC22]
2. S: spe [+KC22]
3. T: Yes, I agree.
4. T: At this point we know the values for KE2 and GPE2. It is time to compute the value for SPE2.
5. T: It is important to remember that we cannot find the SPE of an OBJECT. We need to find the spring potential energy of a pair: object-spring. [tell, KC22]
6. T: It is possible to compute the spring potential energy at T2 because we know the spring is uncompressed. That is, d2 = 0 m. [tell]
7. T: Using this information we can infer that the spring potential energy of the block-spring pair at T2 is SPE2 = 0 J. [tell, KC22, KC23]
8. T: Now we have all we need to ...
9. T: TME of the system at T2 is the sum of the block's KE and the block-earth pair's GPE because the value for the spring potential energy is zero. In this case TME2 = 11.88 J + 1.62 J = 13.5 J. [tell, KC23, KC24]
10. T: At this point we have found the value for TME2. We can now easily find the system's TME at all other time points since the system is isolated.
11. T: What principle gives us the system's total mechanical energy at every time point, knowing its value at T2 and also given that the system is isolated? [elicit, KC27]
12. S: conservation of energy [+KC27]
13. T: Yes.

4 Methods

The four feature selection approaches we propose in this paper are fairly straightforward and share the same procedure, which consists of the three phases described below (a sketch of the overall procedure follows this list).

Phase 1: For each of the 18 features in the state space, use the MDP to learn a single-feature policy.
Phase 2: Sort all of the features, from high to low, by one of the following measures of the learned single-feature policies: ECR, lower-bound, upper-bound, or hedge.
Phase 3: Starting with the first feature in the sorted feature list, add one feature at a time to the MDP and learn a new policy. Repeat this process 5 times.
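The following Python sketch illustrates the three phases. The learn_policy interface (returning a policy's ECR and confidence bounds) and all names here are assumptions made for illustration rather than the interface of the MDP package actually used.

```python
# Sketch of the three-phase, RL-based feature selection procedure (illustrative;
# learn_policy() stands in for the MDP package used in the study).

def select_features(all_features, learn_policy, measure="upper_bound", max_size=6):
    """Phase 1: learn a single-feature policy for every feature.
    Phase 2: sort features by the chosen measure of those policies.
    Phase 3: grow the feature set one feature at a time, learning a policy each step."""
    # Phase 1
    single = {f: learn_policy([f]) for f in all_features}

    # Phase 2: measure is 'ecr', 'lower_bound', 'upper_bound', or 'hedge'
    def score(f):
        p = single[f]
        if measure == "hedge":
            return p["ecr"] / (p["upper_bound"] - p["lower_bound"])
        return p[measure]
    ranked = sorted(all_features, key=score, reverse=True)

    # Phase 3
    policies = []
    for size in range(1, max_size + 1):
        policies.append((ranked[:size], learn_policy(ranked[:size])))
    return policies

# Usage with a stub learner that returns fake statistics:
def fake_learn(features):
    return {"ecr": float(len(features)), "lower_bound": -1.0, "upper_bound": 5.0}

print(select_features(list(range(1, 19)), fake_learn)[-1][0])  # six-feature set
```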

Based on the sorting criterion used in Phase 2, we name our four feature selection methods ECR, lower-bound, upper-bound, and hedge, respectively. Since these values are calculated from the single-feature policies learned with the MDP in Phase 1, the four methods are RL-based, and we expect them to be more effective than random selection. To test this expectation, we created 120 random policies per KC by running the MDP package on randomly selected feature sets (20 rounds for each feature set size, with each set containing between one and six features). Therefore, for each KC and each feature set size, we learned one tutoring tactic for each RL-based method plus 20 for random feature selection. This gave us (1*4 + 20) * 6 = 144 policies per KC. For each KC, we selected the one policy that had the highest ECR, lower-bound, and upper-bound among the 144 learned policies and named it the best policy. To quantitatively evaluate how much less effective a learned tutoring tactic is than the best policy, we defined a normalized ECR (NECR) for a learned tutoring tactic as:

NECR(KC, N, Method) = \frac{\frac{1}{C}\sum_{k=1}^{C} ECR(KC, N, Method_k) - Min\_ECR(KC)}{Max\_ECR(KC) - Min\_ECR(KC)}

Max\_ECR(KC) = \max_{N \in \{1..6\},\, m \in \{\text{all methods}\}} ECR(KC, N, Method_m)
Min\_ECR(KC) = \min_{N \in \{1..6\},\, m \in \{\text{all methods}\}} ECR(KC, N, Method_m)

Max_ECR(KC) and Min_ECR(KC) are the maximum and minimum ECR among all 144 learned policies for the KC. C is a constant, with C = 1 for each of the four RL-based methods and C = 20 for random feature selection. The maximum NECR for a learned policy is 1 if it is the best policy, and the minimum is 0 if it is the worst.
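A minimal sketch of how NECR could be computed for one (KC, feature-set size, method) combination follows; the stand-in data and function name are illustrative assumptions.

```python
# Sketch of the NECR computation for one KC (illustrative data structures).

def necr(ecrs_for_method, all_ecrs_for_kc):
    """ecrs_for_method: ECRs for this (KC, N, method); length 1 for an RL-based
    method, 20 for random selection. all_ecrs_for_kc: the ECRs of all 144 policies."""
    max_ecr, min_ecr = max(all_ecrs_for_kc), min(all_ecrs_for_kc)
    avg = sum(ecrs_for_method) / len(ecrs_for_method)
    return (avg - min_ecr) / (max_ecr - min_ecr)

# Example: a KC whose policies have ECRs between -10 and 50.
all_ecrs = [-10.0, 3.0, 12.5, 44.3, 50.0]       # abbreviated stand-in for 144 values
print(necr([44.3], all_ecrs))                    # an RL-based method's single policy
print(necr([3.0, 12.5, -10.0], all_ecrs))        # averaged random policies
```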

5 Results

We have two main goals in evaluating the feature selection methods and feature effectiveness. First, we compare the four RL-based feature selection methods against random feature selection; second, we investigate which features seem to be most important for deriving the best tutoring tactics for the elicit/tell decision across all KCs.

5.1 Comparing the four RL-based methods against random feature selection

Table 2. Comparing the Average NECR of the Five Selection Methods for Increasing Feature Set Sizes

Number of Features | Upper-bound | ECR | Hedge | Lower-bound | Random
1 | 0.345 | 0.372 | 0.337 | 0.355 | 0.119
2 | 0.355 | 0.370 | 0.335 | 0.314 | 0.196
3 | 0.416 | 0.447 | 0.400 | 0.352 | 0.275
4 | 0.550 | 0.515 | 0.520 | 0.419 | 0.348
5 | 0.682 | 0.579 | 0.573 | 0.435 | 0.422
6 | 0.673 | 0.614 | 0.594 | 0.485 | 0.480

Table 2 shows the average NECR, across all 18 KCs, for each feature selection method and each number of selected features. As expected, random feature selection has the worst average NECR regardless of the number of features involved. Overall, if the number of features is at most 3, the ECR approach is the best feature selection method; if the number of features is between 4 and 6, the upper-bound method is the best. As the number of features increases, the effectiveness of the learned elicit/tell policies tends to improve, except for the upper-bound method, which has a better average NECR with five features than with six. Overall, using the upper-bound method to select five features seems to be the most effective approach across all KCs for the elicit/tell decision.

On the other hand, across all 18 KCs the number of times each method found the best policy is: fourteen for random, three for upper-bound, two for ECR, one for hedge, and none for the lower-bound method. The total exceeds 18 because for KC1 three methods (ECR, hedge, and upper-bound) all found the same best policy. Therefore, although random feature selection has a worse average NECR than the four RL-based methods, it is an effective way to find best policies. However, note that the random feature selection method was run 120 times for each KC and so had a total of 120*18 = 2160 chances to find the 14 best policies, while the upper-bound method was applied only 6 times for each KC and thus had a total of 108 chances. Additionally, because our state space is still relatively small, we expect that the performance of random selection would decrease significantly as the number of features in the state space increases, whereas increasing the number of features would not decrease the effectiveness of the four RL-based feature selection methods, since they do not directly depend on the number of features in the state space. Moreover, we compared the best policy learned for each KC by the upper-bound method with those learned by random selection. Over all 18 KCs, the upper-bound policies, with just 108 attempts, were only 9.46% less effective than the random feature selection policies with 2160 attempts. These results indicate that in education, features that may result in higher learning gains should always be considered by the tutor when making decisions. This is likely because, in the worst case, a student will simply not learn rather than lose information, so the cost of incorporating a superfluous feature is low.

5.2 Frequency of Features in Best Policies

Figure 2. The Frequency of Each Feature Shown in the Best Policies.

Figure 2 shows the frequency with which each feature appears in the best policies. This frequency differs significantly among the four feature types: F(3) = 7.47, p = 0.003. There is no significant difference between the three autonomy and the five temporal situation related features. When combined, they are significantly more frequent than either the performance or the situation related features: t(12) = 2.74, p = 0.018 and t(10) = 4.26, p = 0.002, respectively. Consistent with previous research based on analyzing human tutorial dialogue, autonomy related features seemed to be more relevant for deriving effective tutorial tactics. Additionally, we found that temporal situation related features were also relevant, even more so than the performance related ones, when deciding whether to elicit or tell.

This was not indicated in previous literature on human tutorial dialogue analysis. One possible explanation is that in most of the prior literature, temporal situation related factors were often not considered.

6 Conclusions and Future Work

In this paper, we described our work on applying RL methods to derive effective tutoring tactics for the elicit/tell decision. We showed that deriving effective tutoring tactics from tutoring dialogues can be cast in the MDP formalism. Additionally, we proposed four RL-based, domain-general feature selection methods and found the upper-bound method to be more effective than the others.

One of our goals for future work is to investigate how to still obtain reasonable policies without annotating individual KCs in the dialogues; the annotation process is prohibitively time-consuming and it is not unusual for domain experts to disagree [14]. Another goal is to determine how to learn one reasonable policy for all KCs without sacrificing too much of the expected effectiveness. Two further important issues are how to avoid the expensive initial data collection and how to combine new data with the existing data so that we can learn even more powerful policies.

Acknowledgments

We would like to thank the NLT group and Diane J. Litman for their comments. Support for this research was provided by NSF grant #0325054.

References

[1] B. S. Bloom. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, no. 13, pp. 4-16, 1984.
[2] M. Frampton and O. Lemon. Reinforcement learning of dialogue strategies using the user's last dialogue act. In IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2005.
[3] J. Henderson, O. Lemon, and K. Georgila. Hybrid reinforcement/supervised learning for dialogue policies from Communicator data. In IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2005.
[4] A. Collins and A. Stevens. Goals and strategies of inquiry teachers. In R. Glaser (Ed.), Advances in Instructional Psychology (vol. 2), pp. 65-119. Hillsdale, NJ: Erlbaum, 1982.
[5] A. Collins. Design issues for learning environments. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International Perspectives on the Design of Technology-Supported Learning Environments, pp. 347-361. Mahwah, NJ: Erlbaum, 1996.
[6] M. Evens and J. Michael. One-on-One Tutoring by Humans and Computers. Lawrence Erlbaum Associates, Inc., 2006.
[7] J. D. Moore, K. Porayska-Pomsta, S. Varges, and C. Zinn. Generating tutorial feedback with affect. In FLAIRS Conference, 2004.
[8] S. Singh, M. Kearns, D. Litman, and M. Walker. Reinforcement learning for spoken dialogue systems. In Proc. NIPS, 1999.
[9] R. Sutton and A. Barto. Reinforcement Learning. The MIT Press, 1998.
[10] J. Tetreault, D. Bohus, and D. Litman. Estimating the reliability of MDP policies: a confidence interval approach. In NAACL, 2007.
[11] J. Tetreault and D. Litman. Comparing the utility of state features in spoken dialogue using reinforcement learning. In NAACL, 2006.
[12] J. Williams, P. Poupart, and S. Young. Factored partially observable Markov decision processes for dialogue management. In IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2005.
[13] M. Walker. An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email. JAIR, 12, 2000.
[14] K. VanLehn, P. Jordan, and D. Litman. Developing pedagogically effective tutorial dialogue tactics: Experiments and a testbed. In Proceedings of the SLaTE Workshop, 2007.