Identifying Intention Posts in Discussion Forums


Zhiyuan Chen, Bing Liu
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607, USA
czyuanacm@gmail.com, liub@cs.uic.edu

Meichun Hsu, Malu Castellanos, Riddhiman Ghosh
HP Labs
Palo Alto, CA 94304, USA
{meichun.hsu, malu.castellanos, riddhiman.ghosh}@hp.com

Abstract

This paper proposes to study the problem of identifying intention posts in online discussion forums. For example, in a discussion forum, a user wrote "I plan to buy a camera," which indicates a buying intention. This intention can be easily exploited by advertisers. To the best of our knowledge, there is still no reported study of this problem. Our research found that this problem is particularly suited to transfer learning because in different domains, people express the same intention in similar ways. We then propose a new transfer learning method which, unlike a general transfer learning algorithm, exploits several special characteristics of the problem. Experimental results show that the proposed method outperforms several strong baselines, including supervised learning in the target domain and a recent transfer learning method.

1 Introduction

Social media content is increasingly regarded as an information gold mine. Researchers have studied many problems in social media, e.g., sentiment analysis (Pang & Lee, 2008; Liu, 2010) and social network analysis (Easley & Kleinberg, 2010). In this paper, we study a novel problem which is also of great value, namely intention identification, which aims to identify discussion posts expressing certain user intentions that can be exploited by businesses or other interested parties. For example, one user wrote, "I am looking for a brand new car to replace my old Ford Focus." Identifying such intentions automatically can help social media sites decide what ads to display so that the ads are more likely to be clicked.

This work focuses on identifying user posts with explicit intentions. By explicit we mean that the intention is explicitly stated in the text, with no need to deduce it (a hidden or implicit intention).
For example, in the above sentence, the author clearly expressed that he/she wanted to buy a car. On the other hand, an example of an implicit sentence is "Anyone knows the battery life of iPhone?" The person may or may not be thinking about buying an iPhone.

To our knowledge, there is no reported study of this problem in the context of text documents. The main related work is in Web search, where user (or query) intent classification is a major issue (Hu et al., 2009; Li, 2010; Li, Wang, & Acero, 2008). Its task is to determine what the user is searching for based on his/her keyword queries (2 to 3 words) and his/her click data. We will discuss this and other related work in Section 2.

We formulate the proposed problem as a two-class classification problem since an application may only be interested in a particular intention. We define intention posts (positive class) as the posts that explicitly express a particular intention of interest, e.g., the intention to buy a product. The other posts are non-intention posts (negative class). Note that we do not exploit intention-specific knowledge since our aim is to propose a generic method applicable to different types of intentions.

There is an important feature of this problem which makes it amenable to transfer learning, so that we do not need to label data in every domain: for a particular kind of intention such as buying, the ways to express the intention in different domains are often very similar. This

Proceedings of NAACL-HLT 2013, Atlanta, Georgia, 9-14 June 2013. (c) 2013 Association for Computational Linguistics

fact can be exploited to build a classifier based on labeled data in some domains and apply it to a new/target domain without labeling any training data in the target domain. However, this problem also has some special difficulties that existing general transfer learning methods do not deal with. The two special difficulties of the proposed problem are as follows:

1. In an intention post, the intention is typically expressed in only one or two sentences, while most sentences do not express intention, which provides very noisy data for classifiers. Furthermore, the words/phrases used for expressing intention are quite limited compared to other types of expressions. This means that the set of shared (or common) features in different domains is very small. Most existing advanced transfer learning methods try to extract and exploit these shared features. The small number of such features in our task makes it hard for the existing methods to find them accurately, which in turn leads to poorer classifiers.

2. As mentioned above, in different domains, the ways to express the same intention are often similar. This means that only the positive (intention) features are shared among different domains, while the features indicating the negative class in different domains are very diverse. We then have an imbalance problem, i.e., the shared features are almost exclusively features indicating the positive class. To our knowledge, none of the existing transfer learning methods deals with this imbalance problem of shared features, which also results in inaccurate classifiers.

We thus propose a new transfer learning (or domain adaptation) method, called Co-Class, which, unlike a general transfer learning method, is able to deal with these difficulties in solving the problem. Co-Class works as follows: we first build a classifier using the labeled data from existing domains, called the source data, and then apply the classifier to classify the target (domain) data (which is unlabeled). Based on the target data labeled by this classifier, we perform a feature selection on the target data.
The selected set of features is used to build two classifiers, one from the labeled source data and one from the target data which has been labeled by the first classifier. The two classifiers then work together to perform classification of the target data. The process runs iteratively until the labels assigned to the target data stabilize. Note that in each iteration both classifiers are built using the same set of features selected from the target domain, in order to focus on the target domain. The proposed Co-Class explicitly deals with the difficulties mentioned above (see Section 3).

Our experiments using four real-life data sets extracted from four forum discussion sites show that Co-Class outperforms several strong baselines. What is also interesting is that it works even better than fully supervised learning in the target domain itself, i.e., using both training and test data in the target domain. It also outperforms a recent state-of-the-art transfer learning method (Tan et al., 2009), which has been successfully applied to the NLP task of sentiment classification.

In summary, this paper makes two main contributions:

1. It proposes to study the novel problem of intention identification. User intention is an important type of information in social media with many applications. To our knowledge, there is still no reported study of this problem.

2. It proposes a new transfer learning method, Co-Class, which is able to exploit the above two key issues/characteristics of the problem in building cross-domain classifiers. Our experimental results demonstrate its effectiveness.

2 Related Work

Although we have not found any paper studying intention classification of social media posts, there are some related works in the domain of Web search, where user or query intent classification is a major issue (Hu et al., 2009; Li, 2010; Li et al., 2008). The task there is to classify a query submitted to a search engine to determine what the user is searching for.
It is different from our problem because they classify based on the user-submitted keyword queries (often 2 to 3 words) together with the user's click-through data (which represent the user's behavior). Such intents are typically implicit because people usually do not issue a search query like "I want to buy a digital camera." Instead, they may just type the keywords "digital camera." Our interest is in identifying explicit intents expressed in full text documents (forum posts).

Another related problem is online commercial intention (OCI) identification (Dai et al.,

2006; Hu et al., 2009), which focuses on capturing commercial intention based on a user query and web browsing history. In this sense, OCI is still a user query intent problem.

In NLP, Kanayama and Nasukawa (2008) studied users' needs and wants from opinions. For example, they aimed to identify user needs from sentences such as "I'd be happy if it is equipped with a crisp LCD." This is clearly different from our explicit intention to buy or to use a product/service, e.g., "I plan to buy a new TV."

Our proposed Co-Class technique is related to transfer learning or domain adaptation. The proposed method belongs to "feature representation transfer" from the source domain to the target domain (Pan & Yang, 2010). Aue and Gamon (2005) tried training on a mixture of labeled reviews from other domains where such data are available and testing on the target domain. This is basically one of our baseline methods, 3TR-1TE, in Section 4. Their work does not do multiple iterations and does not build two separate classifiers as we do. Some related methods were also proposed in (Dai, Xue, Yang & Yu, 2007; Tan et al., 2007; Yang, Si & Callan, 2006). More sophisticated transfer learning methods try to find common features in both the source and target domains and then try to map the differences of the two domains (Blitzer, Dredze, & Pereira, 2007; Pan et al., 2010; Bollegala, Weir & Carroll, 2011; Tan et al., 2009). Some researchers also used topic modeling of both domains to transfer knowledge (Gao & Li, 2011; He, Lin & Alani, 2011). However, none of these methods deals with the two problems/difficulties of our task. Co-Class tackles them explicitly and effectively (Section 4).

The proposed Co-Class method is also related to the Co-Training method in (Blum & Mitchell, 1998). We will compare them in detail in Section 3.

3 The Proposed Technique

We now present the proposed technique. Our objective is to perform classification in the target domain by utilizing labeled data from the source domains. We use the term source domains as we can combine labeled data from multiple source domains.
The target domain has no labeled data. Only the source domain data are labeled.

To deal with the first problem in Section 1 (i.e., the difficulty of finding common features across different domains), Co-Class avoids it by using an EM-based method to iteratively transfer from the source domains to the target domain, while exploiting feature selection in the target domain to focus on important features in the target domain. Since our ideas are developed starting from the EM (Expectation Maximization) algorithm and its shortcomings, we now introduce EM.

3.1 EM Algorithm

EM (Dempster, Laird, & Rubin, 1977) is a popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. It is often used to address missing values in the data by computing expected values using existing values. The EM algorithm consists of two steps, the Expectation step (E-step) and the Maximization step (M-step). The E-step basically fills in the missing data, and the M-step re-estimates the parameters. This process iterates until convergence. Since our target data have no labels, which can be treated as missing values/data, the EM algorithm naturally applies. For text classification, each iteration of EM (Nigam, McCallum, Thrun, & Mitchell, 2000) usually uses the naïve Bayes (NB) classifier. Below, we first introduce the NB classifier.

Given a set of training documents D, each document d_i is an ordered list of words. We use w_{d_i,k} to denote the word in position k of d_i, where each word is from the vocabulary V, the set of all words considered in classification. We also have a set of classes C representing the positive and negative classes. For classification, we compute the posterior probability Pr(c_j | d_i). Based on the Bayes rule and the multinomial model, we have:

  Pr(c_j) = ( Σ_{i=1}^{|D|} Pr(c_j | d_i) ) / |D|    (1)

and, with Laplacian smoothing:

  Pr(w_t | c_j) = ( 1 + Σ_{i=1}^{|D|} N(w_t, d_i) Pr(c_j | d_i) ) / ( |V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N(w_s, d_i) Pr(c_j | d_i) )    (2)

where N(w_t, d_i) is the number of times that the word w_t occurs in document d_i, and Pr(c_j | d_i) is the probability of assigning class c_j to d_i. Assuming that word probabilities are independent given the class, we have the NB classifier:

  Pr(c_j | d_i) = ( Pr(c_j) Π_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_j) ) / ( Σ_{r=1}^{|C|} Pr(c_r) Π_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_r) )    (3)

The EM algorithm basically builds a classifier iteratively using NB and both the labeled source data and the unlabeled target data. However, its major shortcoming is that the feature set, even with feature selection, may fit the labeled source data well but not the target data, because the target data have no labels to be used in feature selection. Feature selection is shown to be very important for this application, as we will see in Section 4.

3.2 FS-EM

Based on the discussion above, the key to solving the problem of EM is to find a way to reflect the features of the target domain during the iterations. We propose two alternatives, FS-EM (Feature Selection EM) and Co-Class (Co-Classification). This sub-section presents FS-EM.

EM can select features only before the iterations, using the labeled source data, and keep using the same features in each iteration. However, these features only fit the labeled source data and not the target data. We therefore propose to select features during the iterations, i.e., after each iteration, we redo feature selection. For this, we use the predicted classes of the target data. In naïve Bayes, we define the predicted class for document d_i as

  c*(d_i) = argmax_{c_j} Pr(c_j | d_i)    (4)

The detailed algorithm for FS-EM is given in Figure 1. First, we select a feature set from the labeled source data and then build an initial NB classifier (lines 1 and 2). The feature selection is based on Information Gain, which will be introduced in Section 3.4. After that, we classify each document in the target data to obtain its predicted class (lines 4-6). A new target data set is produced in line 7, which is the target data with the predicted classes added (line 5). Line 8 selects a new feature set from this data (discussed below), from which a new classifier is built (line 9). The iteration stops when the predicted classes of the target data no longer change (line 10).

We now turn to the data set used for feature selection in line 8, which can be formed with one of two methods. The first method (called FS-EM1) merges the labeled source data and the target data (with predicted classes).
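The multinomial naïve Bayes estimates in Equations (1)-(3) can be sketched as follows. This is an illustrative sketch only, with hard labels (i.e., Pr(c_j | d_i) is 0 or 1 for labeled documents); the paper itself uses the LingPipe implementation, and all names here are our own:

```python
import math
from collections import Counter

def train_nb(docs, labels, classes=("pos", "neg")):
    """Estimate multinomial NB parameters with Laplacian smoothing.
    Pr(c) is the fraction of documents in class c (Equation (1));
    Pr(w|c) is the smoothed relative frequency of w among all word
    occurrences in class-c documents (Equation (2))."""
    vocab = sorted({w for d in docs for w in d})
    prior = {c: sum(1 for l in labels if l == c) / len(docs) for c in classes}
    counts = {c: Counter() for c in classes}          # N(w, d) summed per class
    for d, l in zip(docs, labels):
        counts[l].update(d)
    total = {c: sum(counts[c].values()) for c in classes}
    cond = {c: {w: (1 + counts[c][w]) / (len(vocab) + total[c]) for w in vocab}
            for c in classes}
    return prior, cond, vocab

def classify_nb(doc, prior, cond, vocab, classes=("pos", "neg")):
    """Pick argmax_c Pr(c) * prod_k Pr(w_k|c), as in Equation (3),
    working in log space; words outside the vocabulary are skipped."""
    def score(c):
        s = math.log(prior[c])
        for w in doc:
            if w in vocab:
                s += math.log(cond[c][w])
        return s
    return max(classes, key=score)
```

In an EM-style iteration, the posteriors from `classify_nb` would be fed back as the (soft or hard) labels of the unlabeled target documents before re-running `train_nb`.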
However, this method does not work well because the labeled source data can dominate and the target domain features are still not well represented. The second method, denoted FS-EM2, selects features from the target domain data only, based on the predicted classes. The classifiers are built in the iterations (lines 3-10) using only the target domain data. The weakness of this is that it completely ignores the labeled source data after initialization, but the source data do contain some valuable information. Our final proposed method, Co-Class, is able to solve this problem.

3.3 Co-Class

Co-Class is our final proposed algorithm. It considers both the source labeled data and the target data with predicted classes. It uses the idea of FS-EM, but is also inspired by Co-Training (Blum & Mitchell, 1998). It additionally deals with the second issue identified in Section 1 (i.e., the imbalance of shared positive and negative features).

Co-Training was originally designed for semi-supervised learning, to learn from a small labeled and a large unlabeled set of training examples. It assumes that the set of features in the data can be partitioned into two subsets, each of which is sufficient for building an accurate classifier. The proposed Co-Class model is similar to Co-Training in that it also builds two classifiers. However, unlike Co-Training, Co-Class does not partition the feature space. Instead, one classifier is built based on the target data with predicted classes, and the other classifier is built using only the source labeled data. Both classifiers use the same features (this is an important point), which are selected from the target data only, in order to focus on the target domain. The final classification is based on both classifiers. Furthermore, Co-Training only uses data from the same domain.

The detailed Co-Class algorithm is given in Figure 2. Lines 1-6 are the same as lines 1, 2 and 4-7 in FS-EM. Line 8 selects new features from the target data with predicted classes. Two naïve Bayes classifiers are then built, using the source data and the predicted target data respectively, with the same set of

Algorithm FS-EM
Input: Labeled source data Ds and unlabeled target data Dt
1  Select a feature set F based on IG from Ds;
2  Learn an initial naïve Bayes classifier f from Ds based on F (using Equations (1) and (2));
3  repeat
4    for each document d in Dt do
5      Predict the class of d using f;
6    end
7    Produce data Dt' based on the predicted classes of Dt;
8    Select a new feature set F from Dt';
9    Learn a new classifier f based on the new feature set F;
10 until the predicted classes of Dt stabilize
11 Return the classifier f from the last iteration.

Figure 1: The FS-EM algorithm

features (lines 9-10). Lines 11-13 classify each target domain document using the two classifiers, combining their results with an aggregate function defined as:

  g(d) = positive, if both classifiers predict positive for d; negative, otherwise

This aims to deal with the imbalanced feature problem. As discussed before, the expressions for stating a particular intention (e.g., buying) are very similar across domains, but the non-intention expressions across domains are highly diverse, which results in strong positive features and weak negative features. We therefore need to restrict the positive class by requiring both classifiers to give positive predictions. If we instead use the method in Co-Training (multiplying the probabilities of the two NB classifiers), the classification results deteriorate from iteration to iteration because the positive class recall gets higher and higher due to the strong positive features, while the precision gets lower and lower.

Since we build and use two classifiers for the final classification, we call the method Co-Class, short for Co-Classification. Co-Class differs from EM (Nigam et al., 2000) in two main aspects. First, it integrates feature selection into the iterations, which has not been done before. Feature selection refines the features to enhance the correlation between the features and the classes. Second, two classifiers are built based on different domains and combined to improve the classification. Only one classifier is built in existing EM methods, which gives poorer results (Section 4).
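The two-classifier scheme and the conjunctive aggregate function described above can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation; in particular, the real Co-Class also re-selects features by IG inside the loop, which is omitted here, and the `build_target_clf` callback stands in for training the target-side naïve Bayes classifier:

```python
def co_class_aggregate(pred_source, pred_target):
    """Aggregate function g: label a post positive (intention) only when
    BOTH the source-trained and the target-trained classifier predict
    positive; otherwise negative. This restricts the strong shared
    positive features that would otherwise inflate recall."""
    if pred_source == "pos" and pred_target == "pos":
        return "pos"
    return "neg"

def co_class_iterate(classify_source, build_target_clf, target_docs, max_iter=20):
    """Skeleton of the Co-Class loop (Figure 2): classify the target data,
    rebuild the target-side classifier from its own predicted labels, and
    repeat until the assigned labels stabilize.

    classify_source: doc -> "pos"/"neg", trained once on the source data.
    build_target_clf: (docs, labels) -> (doc -> "pos"/"neg") classifier."""
    labels = [classify_source(d) for d in target_docs]       # initialization
    for _ in range(max_iter):
        classify_target = build_target_clf(target_docs, labels)
        new_labels = [co_class_aggregate(classify_source(d), classify_target(d))
                      for d in target_docs]
        if new_labels == labels:                             # labels stabilized
            break
        labels = new_labels
    return labels
```

Using the conjunction rather than a probability product is the design choice that counters the positive-feature imbalance: a single over-confident classifier can no longer push a post into the intention class on its own.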
Algorithm Co-Class
Input: Labeled source data Ds and unlabeled target data Dt
1  Select a feature set F based on IG from Ds;
2  Learn an initial naïve Bayes classifier f from Ds based on F (using Equations (1) and (2));
3  for each document d in Dt do
4    Predict the class of d using f;
5  end
6  Produce data Dt' based on the predicted classes of Dt;
7  repeat
8    Select a new feature set F from Dt';
9    Build a naïve Bayes classifier fs using Ds and F;
10   Build a naïve Bayes classifier ft using Dt' and F;
11   for each document d in Dt do
12     Assign to d the class given by the aggregate function over fs and ft;
13   end
14   Produce data Dt' based on the predicted classes of Dt;
15 until the predicted classes of Dt stabilize
16 Return classifiers fs and ft from the last iteration.

Figure 2: The Co-Class algorithm

3.4 Feature Selection

As feature selection is important for our task, we briefly introduce the Information Gain (IG) method given in (Yang & Pedersen, 1997), a popular feature selection algorithm for text classification. IG is based on entropy, reflecting the purity of the categories or classes given the presence or absence of each feature f. It is defined as:

  IG(f) = - Σ_{i=1}^{m} P(c_i) log P(c_i) + P(f) Σ_{i=1}^{m} P(c_i | f) log P(c_i | f) + P(f̄) Σ_{i=1}^{m} P(c_i | f̄) log P(c_i | f̄)

Using the IG value of each feature, all features can be ranked. As in normal classification tasks, the common practice is to use a set of top-ranked features for classification.

4 Evaluation

We have conducted a comprehensive set of experiments to compare the proposed Co-Class method with several strong baselines, including a state-of-the-art transfer learning method.

4.1 Experiment Settings

Datasets: We created 4 different domain datasets crawled from 4 different forum discussion sites:

Cellphone:
Electronics:
Camera:

TV:

For our experiments, we are interested in the intention to buy, which is our intention or positive class. For each dataset, we manually labeled 1000 posts.

Labeling: We initially had about one fifth of the posts labeled by two human annotators. We found that their labels highly agreed. We then used only one annotator to complete the remaining labeling. The reason for the strong labeling agreement is that we are interested in only explicit buying intentions, which are clearly expressed in each post, e.g., "I am in the market for a new smartphone." There is little ambiguity or subjectivity in labeling.

To ensure that the task is realistic, for all datasets we keep the original class distributions as extracted from their respective websites to reflect the real-life situation. The intention class is always the minority class, which makes it much harder to predict due to the imbalanced class distribution. Table 1 gives the statistics of each dataset. On average, each post contains about 7.5 sentences and 122 words. We have made the datasets used in this paper publicly available at the websites of the first two authors.

Evaluation measures: For all experiments, we use precision, recall and F1-score as the evaluation measures. They are suitable because our objective is to identify intention posts.

4.2 One Domain Learning

The objective of our work is to classify the target domain instances without labeling any target domain data. To set the background, we first give the results of one domain learning, i.e., assuming that there is labeled training data in the target domain (which is traditional fully supervised learning). We want to see how the results of Co-Class compare with fully supervised learning. For this set of experiments, we use naïve Bayes and SVM. For naïve Bayes, we use the LingPipe implementation. For
Dataset       No. of Intention   No. of Non-Intention   Total No. of posts
Cellphone
Electronics
Camera
TV

Table 1: Dataset statistics with the buy intention

SVM, we use SVMlight (Joachims, 1999) with the linear kernel, as it has been shown by many researchers that the linear kernel is sufficient for text classification (Joachims, 1998; Yang and Liu, 1999).

During labeling, we observed that the intention in an intention (positive) post is often expressed in the first few or the last few sentences. Hence, we tried using the full post (denoted by Full), the first 5 sentences (denoted by (5, 0)), and the first 5 and last 5 sentences (denoted by (5, 5)). We also experimented with the first 3 sentences, and the first 3 and last 3 sentences, but their results were poorer.

The experiments were done using 10-fold cross validation. For the number of selected features, we tried 500, 1000, 1500, 2000, 2500 and all. We also tried unigrams, bigrams, trigrams, and 4-grams. To compare naïve Bayes with SVM, we tried each combination, i.e., number of features and n-grams, and found the best model for each method. We found that naïve Bayes works best when using trigrams with 1500 selected features. Bigrams with 1000 features are the best combination for SVM. Figure 3 shows the comparison of the best results (F1-scores) of naïve Bayes and SVM. From Figure 3, we make the following observations:

1. SVM does not do well for this task. We tuned the parameters of SVM, but the results were similar to the default setting, and all were worse than naïve Bayes. We believe the main reason is that the data for this application are highly noisy: apart from one or two intention sentences, the other sentences in an intention post differ little from those in a non-intention post. SVM does not perform well with very noisy data. When there are data points far away from their own classes, SVM

Figure 3: Naïve Bayes vs. SVM

Naïve Bayes           Cellphone        Electronics      Camera           TV
(n-grams, features)   Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5
Unigrams
Bigrams
Trigrams
4-grams

Table 2: One-domain learning using naïve Bayes with n-grams (with the best no. of features)

Naïve Bayes           Cellphone        Electronics      Camera           TV
(n-grams, features)   Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5

Table 3: F1-scores of 3TR-1TE with trigrams and different no. of features

tends to be strongly affected by such points (Wu & Liu, 2007). Naïve Bayes is more robust in the presence of noise due to its probabilistic nature.

2. SVM using only the first few and/or last few sentences performs better than using full posts because full posts have more noise. However, it is still worse than naïve Bayes.

3. For naïve Bayes, using full posts and the first 5 and last 5 sentences (5, 5) gives similar results, which is not surprising as (5, 5) has almost all the information needed. Without using the last 5 sentences (5, 0), the results are poorer.

We also found that without feature selection (using all features), the results are markedly worse for both naïve Bayes and SVM. This is understandable (as we discussed earlier) because most words and sentences in both intention and non-intention posts are very similar. Thus, feature selection is highly desirable for this application.

Effect of different combinations: Table 2 gives the detailed F1-score results of naïve Bayes with the best results for different n-grams (with the best number of features). We can see that using trigrams produces the best results on average, but bigrams and 4-grams are quite similar. It turns out that using trigrams with 1500 selected features performs the best. SVM results are not shown as they are poorer. In summary, naïve Bayes is more suitable than SVM for our application, and feature selection is crucial. In the experiments reported below, we will only use naïve Bayes with feature selection.

4.3 Evaluation of Co-Class

We now compare Co-Class with the baseline methods listed below.
Note that for this set of experiments, the source data contain labeled posts from three domains and the target data contain unlabeled posts in one domain. That is, for each target domain, we merge the three other domains for training and use the target domain for testing. For example, for the target Cellphone, the model is built using the data from the other three domains (i.e., Electronics, Camera and TV). The results are the classification of the model on the target domain Cellphone. The strong baselines are as follows:

3TR-1TE: Use labeled data from three domains to train and then classify the target (test) domain. There is no iteration. This method was used in (Aue & Gamon, 2005).

EM: This is the algorithm in Section 3.1. The combined data from three domains are used as the labeled source data. The data of the remaining domain are used as the unlabeled target data, which also serve as the test data (since they are unlabeled).

ANB: This is a recent transfer learning method (Tan et al., 2009). ANB uses frequently co-occurring entropy (FCE) to pick out generalizable (or shared) features that occur frequently in both the source and target domains. Then, a weighted transfer version of the naïve Bayes classifier is applied. We chose this method for comparison as it is recent, also based on naïve Bayes, and has been applied to the NLP task of sentiment

Figure 4: Comparison of EM, ANB, FS-EM1, FS-EM2, and Co-Class across iterations (iteration 0 is 3TR-1TE)

classification, which to some extent is related to the proposed task of intention classification. ANB was also shown to perform better than EM and a naïve Bayes transfer learning method (Dai et al., 2007).

We look at the results of 3TR-1TE first, which are shown in Table 3. Due to space limitations, we only show the trigram F1-scores as they perform the best on average. Table 3 gives the number of features with trigrams. We observe that on average using 3000 features gives the best F1-score results. This is 1000 more features than in one domain learning because we now combine three domains (3000 posts) for training and thus obtain more useful features. From Table 3, we observe that the F1-score results of 3TR-1TE are worse than those of one domain learning (Table 2), which is intuitive because no training data are used from the target domain. But the results are not dramatically worse, which indicates that there are some common features in different domains, i.e., people express the same intention in similar ways.

Since we found that trigrams with 3000 features perform the best on average, we run EM, FS-EM1, FS-EM2 and Co-Class based on trigrams with 3000 features. For the baseline ANB, we tuned the parameters using a development set (1/10 of the training data). We found that selecting 2000 generalizable/shared features gives the best results (the default is 500 in (Tan et al., 2009)). We kept ANB's other original parameter values.

The F1-scores (averages over all 4 datasets) against the number of iterations are shown in Figure 4. Iteration 0 is the result of 3TR-1TE. From Figure 4, we make the following observations:

1. EM makes a little improvement in iteration 1. After that, the results deteriorate. The gain of iteration 1 shows that incorporating the (unlabeled) target domain data is helpful.
However, the features selected from the source domains can only fit the labeled source data and not the target data, as explained in Section 3.1.

2. ANB improves slightly from iteration 1 to iteration 6, but the results are all worse than those of Co-Class. We checked the generalizable/shared features of ANB and found that they were not suitable for our problem, since they were mainly adjectives, nouns and sentiment verbs, which do not have a strong correlation with intentions. This shows that it is hard to find the truly shared features indicating intentions. Furthermore, ANB's results are almost the same as those of EM.

3. FS-EM2 behaves similarly to FS-EM1. After two iterations, the results start to deteriorate. Selecting features only from the target domain makes sense since it can reflect the target domain data well. However, it also becomes worse with an increased number of iterations: the positive features get stronger and stronger due to the imbalanced feature problem discussed in Section 1.

4. Co-Class performs much better than all the other methods. With an increased number of iterations, the results actually improve, and from iteration 7 they stabilize. Co-Class solves the problem of strong positive features by requiring strong conditions for positive classification and by focusing on features in the target domain only. Although the detailed precision and recall results are not shown, the Co-Class model actually improves the F1-score by improving both the precision and the recall.

Significance of improvement: We now discuss the significance of the improvements by comparing the results of Co-Class with the other models. Table 4 summarizes the results. For Co-Class, we use the converged models at iteration 7. We also include the One-Domain learning results, which are from fully supervised classification in the target domains with trigrams and 1500 features. The results of 3TR-1TE, EM, ANB, FS-EM1, and FS-EM2 are obtained with the settings which give the best results in Figure 4.

Figure 5: Effect of the number of source domains using 3TR-1TE and Co-Class

It is clear from Table 4 that Co-Class is the best method in general. It is even better than fully supervised One-Domain learning, although their results are not strictly comparable because One-Domain learning uses training and test data from the same domain via 10-fold cross validation, while all the other methods use one domain as the test data (with the labeled data from the other three domains). One possible reason is that the labeled data are much bigger than those in One-Domain learning, and thus contain more expressions of buying intention. Note that FS-EM1 and FS-EM2 work slightly better than Co-Class in the domain Camera because it is the least noisy domain, with very short posts, while the other domains (as source data) are quite noisy. With good quality data, FS-EM1 and FS-EM2 (also proposed in this paper) can do slightly better than Co-Class. Statistical paired t-tests show that Co-Class performs significantly better than the baseline methods 3TR-1TE, EM, ANB and FS-EM1 at the confidence level of 95%, and better than FS-EM2 at the confidence level of 94%.

Effect of the number of training domains: In the experiments above, we used 3 source domains and tested on one target domain. We now show what happens if we use only one or two source domains and test on one target domain. We tried all possible combinations of source and target data. Figure 5 gives the average results over the four target/test domains. We can see that using more source domains is better due to more labeled data. With more domains, Co-Class also improves more over 3TR-1TE.

5 Conclusion

This paper studied the problem of identifying intention posts in discussion forums. The problem has not been studied in the social media context. Due to special characteristics of the problem, we found that it is particularly suited to transfer learning. A new transfer learning method, called Co-Class, was proposed to solve the problem.
Unlike a general transfer learning method, Co-Class can deal with two specific difficulties of the problem to produce more accurate classifiers. Our experimental results show that Co-Class outperforms strong baselines, including classifiers trained using labeled data in the target domains and classifiers from a state-of-the-art transfer learning method.

Acknowledgments

This work was supported in part by a grant from the National Science Foundation (NSF) under grant no. IIS , and a grant from the HP Labs Innovation Research Program.

Model        Cellphone        Electronics      Camera           TV
             Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5   Full  5,0  5,5
One-Domain
3TR-1TE
EM
ANB
FS-EM1
FS-EM2
Co-Class

Table 4: F1-score results of One-Domain, 3TR-1TE, EM, ANB, FS-EM1, FS-EM2, and Co-Class

References

Aue, A., & Gamon, M. (2005). Customizing Sentiment Classifiers to New Domains: A Case Study. Proceedings of Recent Advances in Natural Language Processing (RANLP).

Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Blum, A., & Mitchell, T. (1998). Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT).

Bollegala, D., Weir, D. J., & Carroll, J. (2011). Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Dai, H. K., Zhao, L., Nie, Z., Wen, J. R., Wang, L., & Li, Y. (2006). Detecting online commercial intention (OCI). Proceedings of the 15th International Conference on World Wide Web (WWW).

Dai, W., Xue, G., Yang, Q., & Yu, Y. (2007). Transferring naive Bayes classifiers for text classification. Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI).

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1).

Easley, D., & Kleinberg, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.

Gao, S., & Li, H. (2011). A cross-domain adaptation method for sentiment classification using probabilistic latent analysis. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM).

He, Y., Lin, C., & Alani, H. (2011). Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL).

Hu, D. H., Shen, D., Sun, J.-T., Yang, Q., & Chen, Z. (2009). Context-Aware Online Commercial Intention Detection. Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning (ACML).

Hu, J., Wang, G., Lochovsky, F., Sun, J.-T., & Chen, Z. (2009). Understanding user's query intent with Wikipedia. Proceedings of the 18th International Conference on World Wide Web (WWW).

Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML).

Joachims, T. (1999). Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. MIT Press.

Kanayama, H., & Nasukawa, T. (2008). Textual Demand Analysis: Detection of Users' Wants and Needs from Opinions. Proceedings of the 22nd International Conference on Computational Linguistics (COLING).

Li, X. (2010). Understanding the Semantic Structure of Noun Phrase Queries. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Li, X., Wang, Y.-Y., & Acero, A. (2008). Learning query intent from regularized click graphs. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Liu, B. (2010). Sentiment Analysis and Subjectivity. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of Natural Language Processing (2nd ed.).

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2-3).

Pan, S. J., Ni, X., Sun, J.-T., Yang, Q., & Chen, Z. (2010). Cross-domain sentiment classification via spectral feature alignment. Proceedings of the 19th International Conference on World Wide Web (WWW).

Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10).

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2).

Tan, S., Cheng, X., Wang, Y., & Xu, H. (2009). Adapting Naive Bayes to Domain Adaptation for Sentiment Analysis. Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval (ECIR).

Tan, S., Wu, G., Tang, H., & Cheng, X. (2007). A novel scheme for domain-transfer problem in the context of sentiment analysis. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM).

Wu, Y., & Liu, Y. (2007). Robust truncated-hinge-loss support vector machines. Journal of the American Statistical Association, 102(479).

Yang, H., Si, L., & Callan, J. (2006). Knowledge Transfer and Opinion Detection in the TREC 2006 Blog Track. Proceedings of TREC.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Yang, Y., & Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML).


More information

UCLA UCLA Electronic Theses and Dissertations

UCLA UCLA Electronic Theses and Dissertations UCLA UCLA Electronic Theses and Dissertations Title Using Social Graph Data to Enhance Expert Selection and News Prediction Performance Permalink https://escholarship.org/uc/item/10x3n532 Author Moghbel,

More information

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski Problem Statement and Background Given a collection of 8th grade science questions, possible answer

More information

arxiv: v1 [cs.cl] 2 Apr 2017

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Cristian-Alexandru Drăgușanu, Marina Cufliuc, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University,

More information

Using Games with a Purpose and Bootstrapping to Create Domain-Specific Sentiment Lexicons

Using Games with a Purpose and Bootstrapping to Create Domain-Specific Sentiment Lexicons Using Games with a Purpose and Bootstrapping to Create Domain-Specific Sentiment Lexicons Albert Weichselbraun University of Applied Sciences HTW Chur Ringstraße 34 7000 Chur, Switzerland albert.weichselbraun@htwchur.ch

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cqa Services

Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cqa Services Segmentation of Multi-Sentence s: Towards Effective Retrieval in cqa Services Kai Wang, Zhao-Yan Ming, Xia Hu, Tat-Seng Chua Department of Computer Science School of Computing National University of Singapore

More information

A survey of multi-view machine learning

A survey of multi-view machine learning Noname manuscript No. (will be inserted by the editor) A survey of multi-view machine learning Shiliang Sun Received: date / Accepted: date Abstract Multi-view learning or learning with multiple distinct

More information

Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component-Level Mixture Modelling

Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component-Level Mixture Modelling Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component-Level Mixture Modelling Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier 1, Andy Way 2, Josef van Genabith

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

Variations of the Similarity Function of TextRank for Automated Summarization

Variations of the Similarity Function of TextRank for Automated Summarization Variations of the Similarity Function of TextRank for Automated Summarization Federico Barrios 1, Federico López 1, Luis Argerich 1, Rosita Wachenchauzer 12 1 Facultad de Ingeniería, Universidad de Buenos

More information

TextGraphs: Graph-based algorithms for Natural Language Processing

TextGraphs: Graph-based algorithms for Natural Language Processing HLT-NAACL 06 TextGraphs: Graph-based algorithms for Natural Language Processing Proceedings of the Workshop Production and Manufacturing by Omnipress Inc. 2600 Anderson Street Madison, WI 53704 c 2006

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction INTERSPEECH 2015 Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction Akihiro Abe, Kazumasa Yamamoto, Seiichi Nakagawa Department of Computer

More information

Bootstrapping Personal Gesture Shortcuts with the Wisdom of the Crowd and Handwriting Recognition

Bootstrapping Personal Gesture Shortcuts with the Wisdom of the Crowd and Handwriting Recognition Bootstrapping Personal Gesture Shortcuts with the Wisdom of the Crowd and Handwriting Recognition Tom Y. Ouyang * MIT CSAIL ouyang@csail.mit.edu Yang Li Google Research yangli@acm.org ABSTRACT Personal

More information

arxiv: v1 [cs.lg] 15 Jun 2015

arxiv: v1 [cs.lg] 15 Jun 2015 Dual Memory Architectures for Fast Deep Learning of Stream Data via an Online-Incremental-Transfer Strategy arxiv:1506.04477v1 [cs.lg] 15 Jun 2015 Sang-Woo Lee Min-Oh Heo School of Computer Science and

More information

Movie Review Mining and Summarization

Movie Review Mining and Summarization Movie Review Mining and Summarization Li Zhuang Microsoft Research Asia Department of Computer Science and Technology, Tsinghua University Beijing, P.R.China f-lzhuang@hotmail.com Feng Jing Microsoft Research

More information