Evaluation Approaches for an Arabic Extractive Generic Text Summarization System

Ibrahim Sobh 1,2, Nevine Darwish 1, Magda Fayek 1

1 The Department of Computer Engineering, Cairo University, Giza, Egypt.
2 The Research and Development International Company, RDI, http://www.rdi-eg.com.
sobh@rdi-eg.com, {darwish, magdafayek}@eng.cu.edu.eg

Abstract

The advance of technology and the extensive use of the web have prompted the need for summarization of text documents. Users tend to extract the most informative or indicative information instead of reading the whole original documents. Naturally, automatic text summarization will save time and effort for users and enable them to make decisions in less time. This paper introduces evaluation methods for an Arabic extractive text summarization system. The system integrates Bayesian and Genetic Programming (GP) classification methods in an optimized way to extract the summary sentences. The system is trainable and uses a manually annotated corpus. We introduce methods for evaluating the system summary against other human summaries. Moreover, we used human judgment of the system output, and finally we tested the system against a commercial Arabic summarization system.

Introduction

The process of summarization is becoming very important given the large number of information sources available in every field. Summarization work started as early as the 1950s. (Luhn, 1958) extracted abstracts of scientific articles automatically, based on the assumption that frequent words represent the most important concepts of the document. (Edmundson and Wyllys, 1961) presented a survey of the existing methods for automatic summarization. (Edmundson, 1969) implemented document summarization based on cue phrases, key words and title. Basically, these methods still form the core of extraction methods today.

Uses of Summaries

A summary can be indicative, providing a reference function for selecting documents for more in-depth reading, or informative, covering all or most of the salient information in the source text documents.
A summary can be generic, with no focus on a topic or viewpoint provided by the user, or user-focused, where the summary is guided by a user viewpoint statement, topic or question to be answered. The size of the produced summary can be very short (a headline) or relatively short, typically 20% to 25% of the original document size.

Extractive Summarization

Extractive summarization selects important pieces of the original document to produce a shorter result. Human summarizers often rely on cutting and pasting from the full document to generate summaries. By decomposing human summaries, we can learn the kinds of operations that are usually performed to extract and edit sentences, and then develop automatic programs to simulate the most successful operations. A Hidden Markov Model (HMM) solution to the decomposition problem was proposed by (Jing and McKeown, 1999), who found that 78% of summary sentences produced by humans are based on cut-and-paste. The granularity of extraction can be phrases (2 or 3 words) or sentences (Kupiec et al. 1995). Extractive summaries may suffer from coherence problems, but they are trusted by users.

There are different approaches to implementing extractive summarization. The most important ones are: linear methods, which give a score to each sentence depending on heuristic measures; Latent Semantic Analysis (LSA), which is inspired by latent semantic indexing and applies Singular Value Decomposition (SVD) to the document sentence matrix (Gong and Liu 2001); Maximal Marginal Relevance (MMR), which measures the relevance or similarity between each sentence in the full document and the sentences that have already been selected and added to the summary (Carbonell and Goldstein 1998); graph-based methods, which model the document as a graph whose vertices are the sentences; and machine learning approaches (Kupiec et al. 1995).

Abstractive Summarization

Abstraction, on the other hand, generates summaries at least some of whose material is not present in the input text. Abstraction of documents by humans is complex to model, as is any other information processing by humans.
The process of abstraction is too complex to be formulated mathematically or logically (Jing and McKeown, 1999). Abstraction requires text analysis, modeling and language generation techniques.

Summary Evaluation

Summary evaluation methods attempt to determine how adequate and reliable, or how useful, a summary is relative to its source. Generally, there are two types of evaluation methods. The first is intrinsic evaluation, in which users judge the quality of summarization by directly analyzing the summary. Users judge fluency, how well the summary covers stipulated key ideas, or how it compares to an ideal summary written by the author of the source text or a human abstractor. None of these measures is entirely satisfactory. The ideal summary, in particular, is hard to construct and rarely unique: in most cases there is not only one correct ideal summary for a given document. The second type of evaluation is extrinsic. Users judge a summary's quality according to how it affects the completion of some other task, such as how well they can answer certain questions relative to the full source text.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is also used for summary evaluation. It counts the number of overlapping units, such as n-grams, word sequences and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans. Treating extractive summarization as classification enables us to use recall, precision and F-measure to evaluate summaries.

In this paper, we measured how human summaries may differ, and how our system performed relative to different human summaries. We tested our system using the same measures against a well-known commercial summarization system, referred to as S System. In addition, we asked two humans to give each sentence in the system output summary a subjective score, to obtain a measure of summary quality.

System Overview

Typically, extractive summarizers deal with sentences. Rules for sentence scoring are generally heuristic; however, given a training corpus, the problem can be approached as statistical classification: a sentence is classified into the in-summary or out-of-summary class given its feature vector. The importance of a sentence within a document can be determined by various heuristics such as position, cue phrases (Edmundson 1969, Kupiec et al. 1995), word/phrase frequency (Luhn 1958, Edmundson 1969, Kupiec et al. 1995), lexical cohesion (Barzilay and Elhadad 1997), discourse structure (Marcu, 1998), and indicator phrases (Hovy and Lin 1999, Kupiec et al. 1995).

The Naive Bayesian classification method is considered simple and easy to implement, and it does not require heavy processing. However, it assumes independence between features and it may fall into local optima. Naive Bayesian classification has been used for extractive summaries (Kupiec et al. 1995) and key phrase extraction (Witten et al. 1999). Genetic Programming (GP) is also used for classification and can be used for extractive summarization (Turney, 2000). GP uses a beam search to try to find global optima.
The proposed system uses both classification techniques and combines them in an optimized way to get better results using a reduced feature set. The system structure requires an annotated training and testing corpus.

Arabic Processing

Arabic, as a highly inflected and derivational language, requires stemming for information retrieval and summarization applications. Feature extraction requires complex Arabic language processing: stop word removal, stemming and Part Of Speech Tagging (POST). We used the implementation of (Attia, 2005) as a robust method for extracting roots as stems, POST and stop words.

Features

We used only five discriminative features (Sobh et al. 2007) for each sentence: 1) sentence length, 2) sentence position in paragraph, 3) sentence similarity, 4) number of infinitives in the sentence and 5) number of verbs in the sentence.

The Classifiers

We used two classifiers in parallel: a Naive Bayesian classifier and a Genetic Programming classifier.

Naive Bayesian Classifier

A Bayesian classifier classifies each sentence into the in-summary or out-of-summary class based on its feature vector and the training data. For each sentence, the probability that it will be included in the summary can be computed as follows:

$$P(s \in S \mid V_1, V_2, \ldots, V_n) = \frac{P(V_1, V_2, \ldots, V_n \mid s \in S)\, P(s \in S)}{P(V_1, V_2, \ldots, V_n)} \tag{1}$$

where s is the sentence, S is the summary class, V is the feature vector and n is the number of features. Assuming that the features are statistically independent:

$$P(s \in S \mid V_1, V_2, \ldots, V_n) = \frac{\prod_{i=1}^{n} P(V_i \mid s \in S)\, P(s \in S)}{\prod_{i=1}^{n} P(V_i)} \tag{2}$$

The sentence is classified into the summary class if the following condition is fulfilled:

$$\prod_{i=1}^{n} P(V_i \mid s \in S)\, P(s \in S) > \prod_{i=1}^{n} P(V_i \mid s \in NS)\, P(s \in NS) \tag{3}$$

where NS is the non-summary class.

Genetic Programming Classifier

GP is the automated learning of computer programs. Genetic Algorithm (GA) learning was originally inspired by the theory of evolution. Basically, the problem is represented by genes. A first population of genes is initialized, and then applying mutation and crossover operators to the current population results in a new, better population.
A fitness function is used to evaluate how well an individual fits and optimizes the problem. GP represents a problem as the set of all possible computer programs. A program is represented in a gene, and GP uses crossover and mutation as the transformation operators to change candidate solutions (programs) into new candidate solutions. GP uses a beam search in which the population size constitutes the size of the beam and the fitness function serves as the evaluation metric that chooses which candidate solutions are kept and which are discarded. GP has been used successfully in many fields, for example financial markets, image processing, optimization, signal processing and pattern recognition.

In his book, (Holland, 1975) mentioned Artificial Intelligence (AI) as one of the main motivators for the creation of genetic algorithms. He did not experiment with the direct use of GA to evolve programs. Two researchers, (Cramer, 1985) and (Koza, 1989), suggested that a tree structure should be used to represent a program in a genome. Koza, however, was the first to recognize the importance of GP and to demonstrate its feasibility for automatic programming in general. (Koza, 1989) provided evidence in the form of several problems from five different areas. His book (Koza, 1992) sparked the rapid growth of GP.

We chose to use the Discipulus GP system (footnote 1). Discipulus is considered the world's first and fastest commercial Genetic Programming system. It writes computer programs automatically in Java, C and Intel assembler code. Discipulus builds two types of models: regression models and classification models. We used the downloadable free version with the default and recommended settings for crossover and mutation rates when running the tool for classification.

The Dual Classification System

There are many classifier combination topologies. We selected an optimized and simple way of combining the two classifiers to get better results, as follows:

-Bayesian Classifier Union (OR) GP Classifier: Consider a sentence in summary if either classifier selects it.

$$\text{Class} = \text{Class}_{\text{Bayesian}} \cup \text{Class}_{\text{Genetic Programming}} \tag{4}$$

-Bayesian Classifier Intersection (AND) GP Classifier: Consider a sentence in summary if and only if both classifiers agree.

$$\text{Class} = \text{Class}_{\text{Bayesian}} \cap \text{Class}_{\text{Genetic Programming}} \tag{5}$$

The Corpus

The corpus is collected from the "Ahram" web site (footnote 2). Recent "Egypt" and "Arabic Region" news were selected. The documents are transformed from HTML format into plain text. The total corpus size is 213 documents, divided into a training set (80%) and a testing set (20%). The corpus is parsed into paragraphs and sentences. Each sentence is presented on a single line to an Arabic language specialist. Then the specialist is asked to select (check) the most important sentences in the document. The number of selected sentences for each document is left to the judgment of the language specialist, as it depends on the document. This approach should increase the generality of the system by capturing (learning) the appropriate compression ratio. Selected sentences are annotated as the in-summary class, unselected sentences are annotated as the out-of-summary class, and feature vectors are extracted for all sentences. The total number of sentences is 4899 (23 sentences per document on average). The human summary size in the training set is 23.3%.
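As an illustration only, the decision rule of equation (3) and the combination rules of equations (4) and (5) can be sketched as follows. The sketch assumes discretized feature values and additive smoothing, neither of which is specified in this paper, and it requires both classes to appear in the training data. The toy feature values and the `gp_flags` list (standing in for the GP classifier output) are hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Count class and feature-value frequencies from annotated sentences.

    samples: list of (features, in_summary) pairs, where features is a tuple
    of discrete values and in_summary is True for the summary class S.
    alpha: additive smoothing constant (an assumption, not from the paper).
    """
    class_counts = {True: 0, False: 0}
    value_counts = defaultdict(int)   # (class, feature index, value) -> count
    values_seen = defaultdict(set)    # feature index -> observed values
    for features, in_summary in samples:
        class_counts[in_summary] += 1
        for i, value in enumerate(features):
            value_counts[(in_summary, i, value)] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen, alpha


def log_score(model, features, in_summary):
    """Log of prod_i P(V_i | class) * P(class), one side of equation (3)."""
    class_counts, value_counts, values_seen, alpha = model
    total = class_counts[True] + class_counts[False]
    score = math.log(class_counts[in_summary] / total)
    for i, value in enumerate(features):
        numerator = value_counts[(in_summary, i, value)] + alpha
        denominator = class_counts[in_summary] + alpha * len(values_seen[i])
        score += math.log(numerator / denominator)
    return score


def bayes_in_summary(model, features):
    """Equation (3): pick the summary class if its score is larger."""
    return log_score(model, features, True) > log_score(model, features, False)


def combine(bayes_flags, gp_flags, mode):
    """Equation (4) for mode 'OR' (union), equation (5) for 'AND' (intersection)."""
    if mode == "OR":
        return [b or g for b, g in zip(bayes_flags, gp_flags)]
    if mode == "AND":
        return [b and g for b, g in zip(bayes_flags, gp_flags)]
    raise ValueError("mode must be 'OR' or 'AND'")


# Hypothetical training data: (length bin, position bin) -> in summary?
samples = [
    (("long", "first"), True),
    (("long", "first"), True),
    (("short", "first"), True),
    (("short", "middle"), False),
    (("short", "last"), False),
    (("long", "middle"), False),
]
model = train_naive_bayes(samples)
test_sentences = [("long", "first"), ("short", "middle")]
bayes_flags = [bayes_in_summary(model, f) for f in test_sentences]
gp_flags = [True, True]  # hypothetical decisions of a second classifier
union_summary = combine(bayes_flags, gp_flags, "OR")
intersection_summary = combine(bayes_flags, gp_flags, "AND")
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied; the comparison in equation (3) is unchanged because the logarithm is monotonic.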
System Evaluation and Results

We used three methods for evaluating the system-generated summary:
1. Calculating precision, recall and F-measure.
2. Comparing with other human summaries.
3. Using human judgment for each sentence in the system summary.

Moreover, we compared these results with a well-known summarization system, referred to as S System.

Precision and Recall

The classification approach to generating automatic summaries makes it easier to evaluate extractive summaries. Three important measures are commonly used: precision, recall and F-measure; see for example (Steve et al. 2002) and (Gong and Liu 2001).

Precision is a measure of how much of the information that the system returned is correct.

-Precision = Number of system correct summary sentences / Total number of system summary sentences

Recall is a measure of the coverage of the system.

-Recall = Number of system correct summary sentences / Total number of human summary sentences

Recall and precision are antagonistic to one another. A system that strives for coverage will get lower precision, and a system that strives for precision will get lower recall. The F-measure balances recall and precision using a parameter β. The F-measure is defined as follows:

$$F = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R} \tag{6}$$

When β is one, precision P and recall R are given equal weight. When β is greater than one, recall is favored; when β is less than one, precision is favored. In the following experiments β equals one. Our target is to have a large F-measure and at the same time produce a reasonable summary size according to the training set. The (F-measure/summary size) ratio is important when comparing systems.

Table 1 shows the results when using the five features for the Bayesian classification and GP classification, independently and integrated.
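The measures defined in this section can be sketched in a few lines. This is a minimal illustration with β fixed at one, as in the experiments; representing each summary as a set of sentence indices is an assumption, and the example sets are hypothetical.

```python
def precision_recall_f(system, reference, beta=1.0):
    """Precision, recall and F-measure for two sets of selected sentence indices.

    system: indices selected by the system.
    reference: indices selected by the human summarizer.
    """
    system, reference = set(system), set(reference)
    correct = len(system & reference)  # system sentences also chosen by the human
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_measure = (beta ** 2 + 1) * precision * recall / denominator
    return precision, recall, f_measure


# Hypothetical document: the system picks 4 sentences, the human picks 5,
# and 3 sentences are common to both.
p, r, f = precision_recall_f({0, 2, 5, 7}, {0, 1, 2, 3, 5})
```

With these hypothetical sets, precision is 3/4, recall is 3/5 and the F-measure is 2/3.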
System     Recall   Precision   F-measure   Summary Size
Bayesian   0.687    0.533       0.600       30.1%
GP         0.474    0.725       0.573       15.8%
AND        0.464    0.754       0.577       14.40%
OR         0.697    0.525       0.599       31.01%

Table 1: Five features summarization evaluation

Comparing independent human summaries

In order to understand how humans may generate different extractive summaries for the same document, we call the main human summarizer the "reference human summarizer". We asked two additional independent human summarizers to extract sentences from the same testing set. Then we computed the summary compression ratio for each one, and we computed the commonly selected sentences between each pair. Table 2 compares the summary sizes.

Footnote 1: http://www.aimlearning.com
Footnote 2: http://www.ahram.org
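The pairwise comparison of human summaries described in this section can be sketched as follows. The sketch assumes each summary is a set of selected sentence indices and treats the first summary of a pair as the "system" and the second as the "gold" summary, so precision is relative to the first and recall to the second; that convention and the example selections are assumptions made for illustration.

```python
from itertools import combinations


def compression_ratio(selected, total_sentences):
    """Summary size as a fraction of the document's sentences."""
    return len(selected) / total_sentences


def cross_evaluate(summaries):
    """Recall, precision and F-measure for every pair of summaries.

    summaries: dict mapping a summarizer name to a non-empty set of indices.
    Returns a dict keyed by (first, second) with (recall, precision, f) values.
    """
    results = {}
    for first, second in combinations(summaries, 2):
        common = len(summaries[first] & summaries[second])
        precision = common / len(summaries[first])
        recall = common / len(summaries[second])
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        results[(first, second)] = (recall, precision, f_measure)
    return results


# Hypothetical selections over a 10-sentence document.
summaries = {
    "reference": {0, 2, 5},
    "human_1": {0, 1, 2, 6, 8},
    "human_2": {0, 2, 5, 9},
}
sizes = {name: compression_ratio(sel, 10) for name, sel in summaries.items()}
pairs = cross_evaluate(summaries)
```

In this toy example the reference and human_1 summaries share two sentences, giving a precision of 2/3, a recall of 2/5 and an F-measure of 0.5 for that pair.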

System            Summary Size
Reference Human   23.4%
Human 1           35.8%
Human 2           32.3%

Table 2: Human summaries size comparison

Table 3 shows a comparison between the intersections (commonly extracted sentences) of the different human summaries. For example, the intersection between the "Reference" and "Human 1" summaries is 47.4% relative to the reference summary size (this can be read as the recall of the Human 1 summary given the reference summary, or the precision of the reference summary given the Human 1 summary).

             Human 1                 Human 2
             R       P       F       R       P       F
Reference    0.309   0.474   0.374   0.469   0.649   0.544
Human 1                              0.534   0.483   0.507

Table 3. Human summaries cross-evaluation comparison.

The largest F-measure, 0.544, was between the Human 2 and reference summaries. The largest recall, 0.534, was between the Human 1 and Human 2 summaries. The largest precision was between the Human 2 and reference summaries. This also shows that human summaries may differ in size and in the selected sentences.

The following figure compares each pair of summaries. This includes our system (Bayesian, GP, AND, OR) and the human summaries (Reference, Human 1 and Human 2).

Figure 1: System pairs comparisons

The comparison shows that the (our system-reference) summary pair has the largest F-measure of all pairs. Also, the (our system-Human 2) pair has an average F-measure of 0.489, which is larger than that of the (our system-Human 1) pair, whose average F-measure is 0.315. On the other hand, the (AND-Human 1) and (GP-Human 1) pairs have the lowest F-measures, 0.38 and 0.4 respectively. This was expected, since the AND-system summary size is 14.4% and the GP-system summary size is 15.8%, and hence there is no chance of getting high recall against the other human summaries. The (Bayesian-Human 2) and (OR-Human 2) pairs have F-measures of 0.557 and 0.563 respectively, which is better than the (Human 1-Human 2) and (Reference-Human 1) pairs. These results imply that our system operates within the range of human performance and that the difference between the system and the humans is comparable to the difference between humans.

Comparing with S System

Figure 2 compares S System and our system from the reference summarizer's point of view.

Figure 2: Systems comparisons with reference summary

As expected, our system's results were close to the reference summary (as the system was trained on this reference summarizer's summaries), whereas S System had not seen the reference summary before. In order to make a fair comparison, we compared S System and our systems from the point of view of the two new human summaries. Figure 3 shows the results.

Figure 3: Systems comparisons with two human summaries

This comparison shows that S System had the best recall of all the systems, followed by the OR system; on the other hand, all our systems had better precision than S System. In terms of F-measure, the Bayesian and the union systems were slightly better than S System. This comparison does not show the summary size. It is usually required to have a high F-measure at a relatively small summary size; Figure 4 shows the comparison between S System and our systems taking into consideration the F-measure/summary size ratio.
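The F-measure/summary size ratio used for this comparison can be sketched as below. The system names and numbers are hypothetical placeholders, not results from this paper, and expressing the summary size as a fraction of the document (rather than a percentage) is an assumption.

```python
def f_per_size(f_measure, summary_size):
    """F-measure divided by summary size (size given as a fraction in (0, 1])."""
    if not 0 < summary_size <= 1:
        raise ValueError("summary_size must be a fraction in (0, 1]")
    return f_measure / summary_size


# Hypothetical (F-measure, summary size) pairs for three systems.
systems = {
    "system_a": (0.60, 0.30),
    "system_b": (0.58, 0.15),
    "system_c": (0.59, 0.45),
}

# Rank systems so that a high F-measure at a small summary size comes first.
ranking = sorted(systems, key=lambda name: f_per_size(*systems[name]), reverse=True)
```

Here system_b ranks first: its F-measure is slightly lower than the others, but it is achieved with a much smaller summary.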

Figure 4: Systems (F/Size) comparisons with two human summaries

We noted that S System tends to select most of the sentences as summary if the original document is relatively small (8 to 10 sentences). Also, in our system we considered the "comma" character as a separator between sentences, to give human summarizers more flexibility when deciding whether a sentence is in the summary or not. On the other hand, we noted that S System did not consider this character as a separator; this makes its results more coherent but produces larger summaries, which lowers the F-measure/summary size ratio.

Human Evaluation

Although we use automatic techniques for evaluating summaries, because we have a golden/reference summary, it is still important to evaluate the output summaries using human judgments as another way of evaluating a summary, despite the expensive cost of human judgment. We asked the two human summarizers to evaluate the output of the systems. For each summary, they were asked to assign each sentence, given its summary context, a label as follows:

-Good: It is better to include this sentence in this summary. This may be because the sentence is informative, important and does not cause ambiguity with surrounding sentences.
-Fair: The sentence could be in or out of this summary. This may be because the sentence contains marginal information.
-Bad: It is worse to put this sentence in this summary. This may be because the sentence contains repeated, incomplete or useless information.

For example, a sentence could be labeled good in one summary and fair in another. We applied this human judgment to the Intersection system (Bayesian AND Genetic Programming), the Union system (Bayesian OR Genetic Programming) and finally S System. The results are shown in Figure 5.

Figure 5: Systems human evaluation comparisons

These results show that even though the two evaluators' results differ, the best system for both was the AND system, then S System, then the OR system.
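One simple way to turn such per-sentence labels into a per-system score is to compute the share of each label for every evaluator and then average across evaluators. The sketch below assumes this aggregation; the paper does not state how the labels were combined, and the label lists are hypothetical.

```python
from collections import Counter

LABELS = ("Good", "Fair", "Bad")


def label_shares(labels):
    """Fraction of sentences given each label by one evaluator."""
    if not labels:
        raise ValueError("at least one labeled sentence is required")
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABELS}


def average_shares(per_evaluator):
    """Average the label shares over several evaluators of the same summary."""
    shares = [label_shares(labels) for labels in per_evaluator]
    return {label: sum(s[label] for s in shares) / len(shares) for label in LABELS}


# Hypothetical labels from two evaluators for one 4-sentence system summary.
evaluator_1 = ["Good", "Good", "Fair", "Bad"]
evaluator_2 = ["Good", "Good", "Good", "Fair"]
summary_quality = average_shares([evaluator_1, evaluator_2])
```

For these hypothetical labels the averaged shares are 0.625 Good, 0.25 Fair and 0.125 Bad.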
Conclusions

In this paper, an optimized dual classification system for Arabic extractive text summarization has been introduced. Both classification methods have relatively close F-measures, but the GP system tends to produce smaller summaries. The Bayesian classification method is simple, assumes feature independence and may fall into local optima, whereas the GP search is global. By integrating both classifiers, we found that using the union for integration increases the recall and the resulting summary size, so it can be used for informative summaries. However, using the intersection for integration increases the precision and decreases the summary size, so it can be used for indicative summaries.

In order to understand the nature of human summaries, we asked two additional humans to summarize the text. Then we compared each pair in terms of recall, precision and F-measure. We found that our system's performance was in the same range as that of humans.

Moreover, we used S System and compared it against the additional human summaries. We found that S System had the best recall; on the other hand, all our systems had better precision than S System. In terms of F-measure, the Bayesian and the union systems were slightly better than S System. When taking the size of the summary into account, our system was much better than S System.

By applying two human subjective judgments to each sentence given its summary context, we found that the evaluation tends to prefer the AND system over S System and the OR system. Our system got an average of 69% good sentences.

Finally, our system is optimized, easy to train and customize, and able to produce summaries comparable to human-generated summaries. We expect the system to be used for a wide range of applications.

Future Work

Applying a number of suggested techniques is expected to enhance the system results. Adding semantic information from a comprehensive lexical resource such as WordNet (Miller, 1995), but for the Arabic language, may enhance output cohesion and help in feature selection.

One problem with extracted sentences is that they may contain anaphoric links to the rest of the text. This has been investigated by (Paice, 1990). Several heuristics have been proposed to solve this problem, such as including the sentence just before the extracted one. Anaphora resolution seems to be an interesting point of research.

Adopting alternative techniques for evaluation will help in better understanding the nature of the summarization problem, for example testing the system's performance in accomplishing another task such as question answering or document classification.

Moreover, we plan to use and customize the same system for different domains and study the effect of this on the recommended features and overall system performance. Using the word stem (root + form) instead of the root only may enhance the results.

References

Attia, M. (2005). Theory and Implementation of a Large-Scale Arabic Phonetic Transcriptor, and Applications, PhD thesis, Faculty of Engineering, Dept. of Electronics and Electrical Communications, Cairo University. http://www.rdi-eg.com/rdi/technologies/paper.htm

Barzilay, R., and Elhadad, M. (1997). "Using lexical chains for text summarization", in Proceedings of the ACL Intelligent Scalable Text Summarization Workshop (ISTS), 86-90.

Carbonell, J., and Goldstein, J. (1998). "The use of MMR, diversity-based reranking for reordering documents and producing summaries", in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-98), 335-336, Melbourne, Australia, August.

Cramer, N.L. (1985). "A representation for the adaptive generation of simple sequential programs", in Proceedings of an International Conference on Genetic Algorithms and the Applications, 183-187, Carnegie-Mellon University, Pittsburgh, PA.

Edmundson, H.P. (1969). "New Methods in Automatic Extracting". Journal of the ACM, 16(2): 264-285.

Edmundson, H.P. and R.E. Wyllys. (1961). "Automatic Abstracting and Indexing-Survey and Recommendations". Communications of the ACM, 4(5): 226-234.

Evans, D.K., McKeown, K., Klavans, J.L. (2005). Similarity-based Multilingual Multidocument Summarization, Technical Report CUCS-014-05, Department of Computer Science, Columbia University.

Gong, Y. and Liu, X. (2001). "Generic text summarization using relevance measure and latent semantic analysis", in Proceedings of Special Interest Group on Information Retrieval, SIGIR, ACM, 19-25.

Holland, J. (1975). "Adaptation in natural and artificial systems", MIT Press, Cambridge, MA.

Hovy, E.H., and Chin-Yew Lin. (1999). "Automated text summarization in SUMMARIST", in ACL/EACL summarization workshop, 18-24, Madrid, Spain.

Jing, H. and McKeown, K.R. (1999). "The decomposition of human-written summary sentences", in Proceedings of Special Interest Group on Information Retrieval, SIGIR, ACM, 129-136.

Koza, J.R. (1992). "Genetic Programming: On the Programming of Computers by Means of Natural Selection", MIT Press, Cambridge, MA.

Koza, J.R. (1989). "Hierarchical genetic algorithms operating on populations of computer programs", in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence IJCAI, 768-774. Morgan Kaufmann, San Francisco, CA.

Kupiec, J., Pedersen, J.O., Chen, F. (1995). "A Trainable Document Summarizer", in Proceedings of Special Interest Group on Information Retrieval, SIGIR, ACM, 68-73.

Luhn, H. (1958). "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development (9): 159-165.

Marcu, D. (1998). "Improving Summarization through Rhetorical Parsing Tuning", in Proceedings of the COLING-ACL Workshop on Very Large Corpora. Montreal, Canada.

Maryland, CS Dept. and Inst. for Advanced Computer Studies, College Park, USA. Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 568-581.

Miller, G. (1995). "WordNet: A Lexical Database for English." Communications of the Association for Computing Machinery (CACM) 38(11): 39-41.

Paice, C. (1990). "Constructing literature abstracts by computer: techniques and prospects", Information Processing and Management, 26: 171-186.

Sobh, I., Darwish, N., Fayek, M. (2007). "An Optimized Dual Classification System for Arabic Extractive Generic Text Summarization", in Proceedings of the Seventh Conference on Language Engineering, ESLEC. http://www.rdi-eg.com/rdi/technologies/paper.htm

Steve, J., Stephen, L., and Gordon, W. (2002). "Interactive Document Summarization Using Automatically Extracted Key phrases", in Proceedings of the 35th Annual Hawaii International Conference on System Sciences, HICSS-35.

Turney, P.D. (2000). "Learning Algorithms for Keyphrase Extraction", Information Retrieval, 2(4), 303-336 (National Research Council 44105, Canada).

Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., and Nevill-Manning, C.G. (1999). "KEA: Practical Automatic Keyphrase Extraction", in Proceedings of ACM Digital Libraries Conference, 254-255.