Corpus-Based Terminology Extraction

ALEXANDRE PATRY AND PHILIPPE LANGLAIS

Terminology management is a key component of many natural language processing activities such as machine translation (Langlais and Carl, 2004), text summarization and text indexing. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications.

While the automatic identification of terms from texts has been the focus of past studies (Jacquemin, 2001) (Castellví et al., 2001), the current trend in Terminology Management (TM) has shifted to the issue of term networking (Kageura et al., 2004). A possible explanation of this shift may lie in the fact that Terminology Extraction (TE), although a noisy activity, encompasses well-established techniques that seem difficult to improve upon significantly. Despite this shift, we do believe that better extraction of terms could carry over to the subsequent steps of TM.

A traditional TE system usually involves a subtle mixture of linguistic rules and statistical metrics in order to identify a list of candidate terms where it is hoped that terms are ranked first. We distinguish our approach to TE from traditional ones in two different ways. First, we give back to the user an active role in the extraction process. That is, instead of encoding a static definition of what might or might not be a term, we let the user specify his own. We do so by asking him to set up a training corpus (a corpus where the terms have been identified by a human) from which our extractor will learn how to define a term. Second, our approach is completely automatic and is readily adapted to the tools (part-of-speech tagger, lemmatizer) and metrics of the user.

One might object that requiring a training corpus is asking the user to do a part of the job the machine is supposed to do, but we see it in a different way. We consider that a little help from the user could pay back in flexibility.

The structure of our paper outlines the three steps involved in our approach. In the following section, we describe our algorithm to identify candidate terms. In the third section, we introduce the different metrics we compute to score them. The fourth section explains how we applied AdaBoost (Freund and Schapire, 1999), a machine learning algorithm, to rank and identify a list of terms. We then evaluate our approach on a corpus which was set up by the Office québécois de la langue française to evaluate commercially available term extractors. We show that our classifier outperforms the individual metrics used in this study. Finally, we discuss some limitations of our approach and propose future work.

Extraction of candidate terms

It is a common practice to extract candidate terms using a part-of-speech (POS) tagger and an automaton (a program extracting word sequences corresponding to predefined POS patterns). Usually, those patterns are manually handcrafted and target noun phrases, since most of the terms of interest are noun phrases (Justeson and Katz, 1995). Typical examples of such patterns can be found in (Jacquemin, 2001). As pointed out in (Justeson and Katz, 1995), relying on a POS tagger and legitimate pattern recognition is error prone, since taggers are not perfect. This might be especially true for very domain-specific texts where a tagger is likely to be more erratic.

To overcome this problem without giving up the use of POS patterns (since they are easy to design and to use), we propose a way to use a training corpus in order to automate the creation of an automaton. There are many potential advantages with this approach.
First, the POS tagger and the tagging errors, to the extent that they are consistent, will be automatically assimilated by the automaton. Second, this gives the user the opportunity to specify the terms that are of interest to him. If many terms involving verbs are found in the training corpus, the automaton will reflect that interest as well. We also observed in informal experiments that widespread patterns often fail to extract many terms found in our training corpus.

Several approaches can be applied when generating an automaton from sequences of POS tags encountered in a training corpus. A straightforward approach is to memorize all the sequences seen in the training corpus. A sequence of words is thus a candidate term only if its sequence of POS tags has been seen before. This approach is simple but naive. It cannot generate new patterns that are slight variations of the ones seen at training time, and an isolated tagging error can lead to a bad pattern.

To avoid those problems, we propose to generate the patterns using a language model trained on the POS tags of the terms found in the training corpus. A language model is a function computing the probability that a sequence of words has been generated by a certain language. In our case, the words are POS tags and the language is the one recognizing the sequences of tags corresponding to terms. Our language model can be described as follows:

$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid H_i)$$

where $w_1^n$ is a sequence of POS tags and $H_i$ is called the history, which summarizes the information of the $i-1$ previous tags.

To build an automaton, we only have to set a threshold and generate all the patterns whose probability is higher than it. An excerpt of such an automaton is given in Figure 1.

    Probability   Pattern
    0.538         NomC AdjQ
    0.293         NomC Prep NomC
    0.032         NomC Dete-dart-ddef NomC
    0.0311        NomC Verb-ParPas
    0.0311        NomC Prep Dete-dart-ddef NomC

Figure 1. Excerpt of an automatically generated automaton.

Another advantage of such an automaton is that all patterns are associated with a probability, giving more information than a binary value (legitimate or not). Indeed, the POS pattern probability is one of the numerous metrics that we feed our classifier with.

Scoring the candidate terms

In the previous section, we showed a way to generate an automaton that extracts a set of candidate terms that we now want to rank and/or filter. Following many other works on term extraction, we score each candidate using various metrics. Many different ones have been identified in (Daille, 1994) and (Castellví et al., 2001). We do not believe that a single metric is sufficient, but instead think that it is more fruitful to use several of them and train a classifier to learn how to take advantage of each of them. Because we think they are interesting for the task, we retained the following metrics: the frequency, the length, the log-likelihood, the entropy, tf·idf and the POS pattern probabilities discussed in the previous section. Recall however that our approach is not restricted to these metrics, but instead can benefit from any other one that can be computed automatically.

Alone, the frequency is not a robust metric to assess the terminological property of a candidate, but it does carry useful information, as does the length of terms.

In (Dunning, 1993), Dunning advocates the use of log-likelihood to measure whether two events that occur together do so as a coincidence or not. In our case, we want to measure the cohesion of a complex candidate term (a candidate term composed of two words or more) by verifying if its words occur together as a coincidence or not. The log-likelihood ratio of two adjacent words (u and v) can be computed with the following formula (Daille, 1994):

$$\begin{aligned}
\mathrm{loglike}_{uv} ={} & a \log a + b \log b + c \log c + d \log d + N \log N \\
& - (a + c) \log (a + c) - (a + b) \log (a + b) \\
& - (c + d) \log (c + d) - (d + b) \log (d + b)
\end{aligned}$$

where a is the number of times uv appears in the document, b the number of times u appears not followed by v, c the number of times v appears not preceded by u, N the corpus size and d the number of candidate terms that do not involve u or v.

Following (Russell, 1998), to compute log-likelihood on candidate terms involving more than two words, we keep the minimum value among the log-likelihoods of each possible split in the candidate term.

With the intuition that terms are coherent units that can appear surrounded by various different words, we also use the entropy to rate a candidate term.
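The log-likelihood score described earlier in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, N is taken to be a + b + c + d, 0 log 0 is treated as 0, and the contingency counts of each split of a longer candidate are supplied by a caller-provided function (`counts_for_split`), since the paper does not say how they are gathered.

```python
import math
from typing import Callable, Sequence, Tuple

Counts = Tuple[float, float, float, float]  # (a, b, c, d) as defined in the text


def xlogx(x: float) -> float:
    """Return x * log(x), using the convention 0 * log(0) = 0 (assumed here)."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(a: float, b: float, c: float, d: float) -> float:
    """Log-likelihood of a pair (u, v) from its four contingency counts.

    a: occurrences of "u v"; b: u not followed by v;
    c: v not preceded by u; d: cases involving neither u nor v.
    """
    n = a + b + c + d  # assumption: N is the total of the four counts
    return (
        xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) + xlogx(n)
        - xlogx(a + c) - xlogx(a + b) - xlogx(c + d) - xlogx(d + b)
    )


def candidate_log_likelihood(
    words: Sequence[str],
    counts_for_split: Callable[[Sequence[str], Sequence[str]], Counts],
) -> float:
    """Minimum log-likelihood over every two-way split of a complex candidate."""
    if len(words) < 2:
        raise ValueError("log-likelihood is only defined for complex candidates")
    return min(
        log_likelihood(*counts_for_split(words[:i], words[i:]))
        for i in range(1, len(words))
    )
```

Taking the minimum over splits means that a candidate is only as cohesive as its weakest junction, which matches the intuition that every part of a complex term should belong with the rest.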
The entropy of a candidate is computed by averaging its left and right entropy:

$$e(w_1^n) = \frac{e_{left}(w_1^n) + e_{right}(w_1^n)}{2}$$

$$e_{left}(s) = \sum_{\{u \,:\, us \in C\}} h\!\left(\frac{us}{s}\right)$$

$$e_{right}(s) = \sum_{\{u \,:\, su \in C\}} h\!\left(\frac{su}{s}\right)$$

$$h(x) = x \log x$$

where $w_1^n$ is the candidate term and $C$ is the corpus from which we are extracting the terms.

Finally, to weight the salience of a candidate term, we also use tf·idf. This metric is based on the idea that terms describing a document should appear often in it but should not appear in many other documents. It is computed by dividing the frequency of a candidate term by the number of documents in an out-of-domain corpus that contain it. Because tf·idf is usually computed on one word, when we evaluated complex candidate terms, we computed tf·idf on each of their words and kept five values: the first, the last, the minimum, the maximum and the average. In our experiments, the out-of-domain corpus was composed of texts taken from the French Canadian parliamentary debates (the so-called Hansard), totaling 1.4 million sentences.

Identifying terms among candidates

Once each candidate term is scored, we must decide which ones should finally be elected as terms. To accomplish this task, we train a binary classifier (a function which qualifies a candidate as a term or not) on the basis of the scores we computed for a candidate. We use the AdaBoost learning algorithm (Freund and Schapire, 1999) to build this classifier. AdaBoost is a simple but efficient learning technique that combines many weak classifiers (a weak classifier must be right more than half of the time) into a stronger one. To achieve this, it trains them successively, each time focusing on examples that have been hard to classify correctly by the previous weak classifiers. In our experiments, the weak classifiers were binary stumps (binary classifiers that compare one of the scores to a given threshold to classify a candidate term) and we limited their number to 50. An example of such a classifier is presented in Figure 2.

Experiments

Our community lacks a common benchmark on which we could compare our results with others. In this work, we applied our approach to a corpus called EAU.
It is composed of six texts dealing with water supply. Its complex terms have been listed by some members of the Office québécois de la langue française for a project called ATTRAIT (Atelier de Travail Informatisé du Terminologue) whose main objective was to evaluate existing software solutions for the terminologist (1).

    Input: a scored candidate term c
    score = 0
    if entropy(c) > 1.6 then score = score + 0.26 else score = score - 0.26
    if length(c) > 1.6 then score = score + 0.08 else score = score - 0.08
    if score > 0 then return term else return not-term

Figure 2. An excerpt from a classifier generated by the AdaBoost learning algorithm.

In our experiments, we kept the preprocessing stage as simple as possible. The corpus and the list of terms were automatically tokenized, lemmatized and POS-tagged with an in-house package (Foster, 1991). Once preprocessed, the EAU corpus is composed of 12 492 words and 208 terms. Of these 208 terms, 186 appear without syntactic variation (as they were listed) a total of 400 times.

Since the terms of our evaluation corpus are already identified, it is straightforward to compute the precision and the recall of our system. Precision (resp. recall) is the ratio of terms correctly identified by the system over the total number of terms identified as such (resp. over the total number of terms manually identified in the list).

We evaluated our system using five-fold cross-validation. This means that the corpus was partitioned into five subsets and that five experiments were run, each time testing with a different subset and training the automaton and the classifier with the four others. Each training set (resp. testing set) was composed of about 12 000 (resp. 3000) words containing an average of about 150 (resp. 50) terms.

Because only complex terms are listed and because we do not consider term variations, our results only consider complex terms that appear without variation. Also, after informal experiments, we set the minimum probability of a pattern to be accepted by our automaton to 0.005. The performance of our system, averaged over the five folds of the cross-validation, can be found in Table 1.
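Before turning to the results, the decision rule of Figure 2 and the precision and recall measures defined above can be made concrete with a small Python sketch. The two stumps and their weights are copied from Figure 2; everything else (the names `Stump`, `vote`, `is_term` and `precision_recall`, and the toy candidates with their scores) is hypothetical and is not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Set, Tuple


@dataclass(frozen=True)
class Stump:
    metric: str       # name of the score the stump looks at
    threshold: float  # value the score is compared to
    weight: float     # added when the score exceeds the threshold, subtracted otherwise


# The two stumps shown in Figure 2 (the real classifier could hold up to 50).
FIGURE_2_STUMPS = [Stump("entropy", 1.6, 0.26), Stump("length", 1.6, 0.08)]


def vote(scores: Dict[str, float], stumps: Sequence[Stump]) -> float:
    """Sum the signed votes of all stumps for one scored candidate."""
    return sum(
        s.weight if scores[s.metric] > s.threshold else -s.weight for s in stumps
    )


def is_term(scores: Dict[str, float], stumps: Sequence[Stump]) -> bool:
    """Accept the candidate when the summed vote is strictly positive."""
    return vote(scores, stumps) > 0


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted terms against a manually built list."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy data, invented for illustration only.
candidates = {
    "water supply": {"entropy": 2.1, "length": 2},
    "of the": {"entropy": 0.4, "length": 2},
}
reference_terms = {"water supply", "drinking water"}
predicted_terms = {c for c, s in candidates.items() if is_term(s, FIGURE_2_STUMPS)}
print(precision_recall(predicted_terms, reference_terms))
```

Returning the value of `vote` rather than the boolean decision is what allows the same classifier to rank candidates instead of merely accepting or rejecting them.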
From the results, we can see that the automaton has a high recall but a low precision, which was to be expected. Indeed, the automaton is only a rough filter that eliminates easy-to-eliminate word sequences, but keeps as many terms as possible. On the other hand, the selection did not perform as well as we expected. Its low recall and precision could be explained by the metrics, which are not as expressive as we thought, and by the fact that 75% of the terms in our test corpora appear only once. When a term appears only once, its frequency and entropy become useless. The results presented in Table 2 seem to confirm our hypothesis.

(1) See http://www.rit.org for more details on the ATTRAIT project.

    Part             Measure     µ      σ
    Extraction       Precision   0.14   0.05
                     Recall      0.94   0.03
    Identification   Precision   0.45   0.19
                     Recall      0.41   0.20
    Overall system   Precision   0.43   0.18
                     Recall      0.38   0.18

Table 1. Mean (µ) and standard deviation (σ) of the precision and recall of the different parts of our system.

    Criteria                                  Measure     µ      σ
    Candidates appearing one time             Precision   0.39   0.16
                                              Recall      0.33   0.22
    Candidates appearing at least two times   Precision   0.73   0.14
                                              Recall      0.85   0.09

Table 2. Comparison of the performance of the term identification part for candidates appearing with different frequencies.

Because we wanted to compare our system with the individual metrics that it uses, we had to modify it such that it ranks the candidate terms instead of simply accepting or rejecting them. To do so, we made our system return its accumulated score (the value compared to 0 in the last line of Figure 2) instead of term or not-term. We then sorted the candidate terms in decreasing order of this value.

A common practice when comparing ranking algorithms is to build their ROC (receiver operating characteristic) curve, which shows the ratio of good identifications (y axis) against the ratio of bad identifications (x axis) for all the acceptance thresholds. The best curve rises faster in y than in x, and so has a greater area under it. We can see in Figure 3 that our system performs better than entropy or log-likelihood alone. This leads us to believe that different scores carry different information and that combining them, as we did, is fruitful.

Figure 3. The ROC of our system (AdaBoost) against two other scores when we trained our system on one half of our corpus and tested on the other. A greater area under the curve is better.

Discussion and future work

In this paper, we presented an approach to automatically generate an end-to-end term extractor from a training corpus. We also proposed a way to combine many statistical scores in order to extract terms more efficiently than when each score is used in isolation.

Because of the nature of the training algorithm, we can easily extend the set of metrics we considered here. Even a priori knowledge could be integrated by specifying keywords before the extraction and setting a score to one when a candidate term contains a keyword or zero otherwise. The same flexibility is achieved when the automaton is created. By generating it directly from the output of the POS tagger, our solution does not depend on a particular tagger and is tolerant to consistent tagging errors.

A shortcoming of this work is that we did not treat term variations. Terminology variation is a well-known phenomenon, whose amount is estimated, according to (Kageura et al., 2004), at 15% to 35%. We think that the best way to deal with variations in our framework would be to introduce a preprocessing stage where they are normalized to a canonical form. Term variations have been extensively studied in (Jacquemin, 2001) and (Daille, 2003).

In our experiments, we focused on complex terms. Because some scores do not apply to simple terms (e.g. log-likelihood and length), we think that the best way to extract simple terms would be to train a dedicated classifier.

Acknowledgements

We would like to thank Hugo Larochelle who found the corpus we used in our experiments and Elliott Macklovitch who made some useful comments on the first draft of this document. This work has been subsidized by NSERC and FQRNT.

References

Castellví, M. Teresa Cabré; Bagot, Rosa Estopà; Palastresi, Jordi Vivaldi; Automatic Term Detection: A Review of Current Systems in Recent Advances in Computational Terminology. John Benjamins, 2001.

Daille, Béatrice; Study and Implementation of Combined Techniques for Automatic Extraction of Terminology in The Balancing Act: Combining Symbolic and Statistical Approaches to Language. New Mexico State University, Las Cruces, 1994.

Daille, Béatrice; Conceptual structuring through term variations in Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. 2003.

Dunning, Ted; Accurate Methods for the Statistics of Surprise and Coincidence. 1993.

Foster, George; Statistical lexical disambiguation, Master Thesis. McGill University, Montreal, 1991.

Freund, Y.; Schapire, R.E.; A Short Introduction to Boosting in Journal of Japanese Society for Artificial Intelligence. 1999.

Jacquemin, Christian; Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001.

Justeson, John S.; Katz, Slava M.; Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text in Natural Language Engineering. 1995.

Kageura, Kyo; Daille, Béatrice; Nakagawa, Hiroshi; Chien, Lee-Feng; Recent Trends in Computational Terminology in Terminology. John Benjamins, 2004.

Langlais, Philippe; Carl, Michael; General-purpose statistical translation engine and domain specific texts: Would it work? in Terminology. John Benjamins, 2004.