Corpus-Based Terminology Extraction

Alexandre Patry and Philippe Langlais

Terminology management is a key component of many natural language processing activities, such as machine translation (Langlais and Carl, 2004), text summarization and text indexation. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications. While the automatic identification of terms from texts has been the focus of past studies (Jacquemin, 2001; Castellví et al., 2001), the current trend in Terminology Management (TM) has shifted to the issue of term networking (Kageura et al., 2004). A possible explanation for this shift may lie in the fact that Terminology Extraction (TE), although a noisy activity, encompasses well-established techniques that seem difficult to improve upon significantly. Despite this shift, we believe that better extraction of terms could carry over to the subsequent steps of TM.

A traditional TE system usually involves a subtle mixture of linguistic rules and statistical metrics in order to identify a list of candidate terms, where it is hoped that terms are ranked first. We distinguish our approach to TE from traditional ones in two ways. First, we give the user back an active role in the extraction process: instead of encoding a static definition of what might or might not be a term, we let users specify their own. We do so by asking them to set up a training corpus (a corpus where the terms have been identified by a human) from which our extractor will learn how to define a term. Second, our approach is completely automatic and is readily adapted to the tools (part-of-speech tagger, lemmatizer) and metrics of the user. One might object that requiring a training corpus is asking the user to do part of the job the machine is supposed to do, but we see it differently: a little help from the user can pay back in flexibility.

The structure of our paper follows the three steps involved in our approach. In the following section, we describe our algorithm for identifying candidate terms. In the third section, we introduce the different metrics we compute to score them. The fourth section explains how we applied AdaBoost (Freund and Schapire, 1999), a machine learning algorithm, to rank and identify a list of terms. We then evaluate our approach on a corpus which was set up by the Office québécois de la langue française to evaluate commercially available term extractors, and show that our classifier outperforms the individual metrics used in this study. Finally, we discuss some limitations of our approach and propose future work.

Extraction of candidate terms

It is common practice to extract candidate terms using a part-of-speech (POS) tagger and an automaton (a program extracting word sequences corresponding to predefined POS patterns). Usually, those patterns are manually handcrafted and target noun phrases, since most of the terms of interest are noun phrases (Justeson and Katz, 1995). Typical examples of such patterns can be found in (Jacquemin, 2001). As pointed out in (Justeson and Katz, 1995), relying on a POS tagger and legitimate pattern recognition is error-prone, since taggers are not perfect. This might be especially true for very domain-specific texts, where a tagger is likely to be more erratic. To overcome this problem without giving up the use of POS patterns (since they are easy to design and to use), we propose a way to use a training corpus to automate the creation of the automaton. There are many potential advantages to this approach.
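The conventional pattern-matching step described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the tag set and the noun-phrase patterns are invented for the example.

```python
# Slide hand-crafted POS patterns over a POS-tagged sentence and collect
# the matching word sequences as candidate terms. Tags and patterns are
# illustrative only.
PATTERNS = [
    ("NOUN", "NOUN"),          # e.g. "water supply"
    ("ADJ", "NOUN"),           # e.g. "technical term"
    ("NOUN", "ADP", "NOUN"),   # e.g. "extraction of terms"
]

def extract_candidates(tagged_sentence):
    """tagged_sentence: list of (word, POS tag) pairs."""
    words = [w for w, _ in tagged_sentence]
    tags = tuple(t for _, t in tagged_sentence)
    candidates = []
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == pattern:
                candidates.append(" ".join(words[i:i + n]))
    return candidates

sentence = [("automatic", "ADJ"), ("extraction", "NOUN"),
            ("of", "ADP"), ("terms", "NOUN")]
print(extract_candidates(sentence))  # ['automatic extraction', 'extraction of terms']
```

A learned automaton replaces the hard-coded `PATTERNS` list with patterns generated from the training corpus, as described next.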
First, the POS tagger and its tagging errors, to the extent that they are consistent, will be automatically assimilated by the automaton. Second, this gives the user the opportunity to specify the terms that are of interest to him: if many terms involving verbs are found in the training corpus, the automaton will reflect that interest as well. We also observed in informal experiments that widespread patterns often fail to extract many terms found in our training corpus.

Several approaches can be applied when generating an automaton from the sequences of POS tags encountered in a training corpus. A straightforward approach is to memorize all the sequences seen in the training corpus; a sequence of words is then a candidate term only if its sequence of POS tags has been seen before. This approach is simple but naive: it cannot generate new patterns that are slight variations of the ones seen at training time, and an isolated tagging error can lead to a bad pattern. To avoid those problems, we propose to generate the patterns using a language model trained on the POS tags of the terms found in the training corpus. A language model is a function computing the probability that a sequence of words has been generated by a certain language. In our case, the words are POS tags and the language is the one recognizing the sequences of tags corresponding to terms. Our language model can be described as follows:

P(w_1…n) = Π_{i=1…n} P(w_i | H_i)

where w_1…n is a sequence of POS tags and H_i, called the history, summarizes the information of the i−1 previous tags. To build an automaton, we only have to set a threshold and generate all the patterns whose probability is higher than it. An excerpt of such an automaton is given in Figure 1.

Probability  Pattern
0.538        NomC AdjQ
0.293        NomC Prep NomC
0.032        NomC Dete-dart-ddef NomC
0.0311       NomC Verb-ParPas
0.0311       NomC Prep Dete-dart-ddef NomC

Figure 1  Excerpt of an automatically generated automaton.

Another advantage of such an automaton is that all patterns are associated with a probability, giving more information than a binary value (legitimate or not). Indeed, the POS pattern probability is one of the numerous metrics that we feed our classifier with.

Scoring the candidate terms

In the previous section, we showed a way to generate an automaton that extracts a set of candidate terms that we now want to rank and/or filter. Following many other works on term extraction, we score each candidate using various metrics. Many different ones have been identified in (Daille, 1994) and (Castellví et al., 2001). We do not believe that a single metric is sufficient; instead, we think it more fruitful to use several of them and train a classifier to learn how to take benefit of each. Because we think they are interesting for the task, we retained the following metrics: the frequency, the length, the log-likelihood, the entropy, tf·idf and the POS pattern probabilities discussed in the previous section. Recall, however, that our approach is not restricted to these metrics, but can benefit from any other one that can be computed automatically.

Alone, the frequency is not a robust metric to assess the terminological property of a candidate, but it does carry useful information, as does the length of terms. In (Dunning, 1993), Dunning advocates the use of the log-likelihood ratio to measure whether two events that occur together do so by coincidence or not. In our case, we want to measure the cohesion of a complex candidate term (a candidate term composed of two words or more) by verifying whether its words occur together by coincidence or not. The log-likelihood ratio of two adjacent words u and v can be computed with the following formula (Daille, 1994):

loglike(uv) = a log a + b log b + c log c + d log d + N log N
              − (a+c) log(a+c) − (a+b) log(a+b)
              − (c+d) log(c+d) − (d+b) log(d+b)

where a is the number of times uv appears in the document, b the number of times u appears not followed by v, c the number of times v appears not preceded by u, N the corpus size and d the number of candidate terms that involve neither u nor v. Following (Russell, 1998), to compute the log-likelihood of candidate terms involving more than two words, we keep the minimum value among the log-likelihoods of each possible split of the candidate term.

With the intuition that terms are coherent units that can appear surrounded by various different words, we also use the entropy to rate a candidate term.
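The log-likelihood ratio above can be computed directly from the four counts. A small sketch follows; note that it assumes N = a + b + c + d (the chapter defines N as the corpus size, so this is an approximation for the example), and uses the usual convention 0·log 0 = 0.

```python
import math

def xlogx(x):
    # Convention: 0 * log(0) = 0, so zero counts do not break the formula.
    return x * math.log(x) if x > 0 else 0.0

def loglike(a, b, c, d):
    """Log-likelihood ratio for an adjacent word pair uv, per the formula above.
    a: occurrences of uv; b: u not followed by v; c: v not preceded by u;
    d: pairs involving neither u nor v. Assumption: N = a + b + c + d."""
    n = a + b + c + d
    return (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) + xlogx(n)
            - xlogx(a + c) - xlogx(a + b) - xlogx(c + d) - xlogx(d + b))

# Words that always co-occur cohere strongly; independent words score near 0.
print(loglike(10, 0, 0, 10))   # clearly positive
print(loglike(5, 5, 5, 5))     # essentially zero
```

The score is zero exactly when the contingency table is consistent with independence, which is what makes it a cohesion measure.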
The entropy of a candidate is computed by averaging its left and right entropy (writing s for the candidate w_1…n):

e(s) = (e_left(s) + e_right(s)) / 2
e_left(s)  = Σ_{u : us ∈ C} h( freq(us) / freq(s) )
e_right(s) = Σ_{u : su ∈ C} h( freq(su) / freq(s) )
h(x) = −x log x

where w_1…n is the candidate term and C is the corpus from which we are extracting the terms.

Finally, to weight the salience of a candidate term, we also use tf·idf. This metric is based on the idea that terms describing a document should appear often in it but should not appear in many other documents. It is computed by dividing the frequency of a candidate term by the number of documents in an out-of-domain corpus that contain it. Because tf·idf is usually computed on one word, when we evaluated complex candidate terms we computed tf·idf on each of their words and kept five values: the first, the last, the minimum, the maximum and the average. In our experiments, the out-of-domain corpus was composed of texts taken from the French Canadian parliamentary debates (the so-called Hansard), totalling 1.4 million sentences.

Identifying terms among candidates

Once each candidate term is scored, we must decide which ones should finally be elected as terms. To accomplish this task, we train a binary classifier (a function which qualifies a candidate as a term or not) on the scores we computed for each candidate. We use the AdaBoost learning algorithm (Freund and Schapire, 1999) to build this classifier. AdaBoost is a simple but efficient learning technique that combines many weak classifiers (a weak classifier must be right more than half of the time) into a stronger one. To achieve this, it trains them successively, each time focusing on the examples that have been hard to classify correctly for the previous weak classifiers. In our experiments, the weak classifiers were binary stumps (binary classifiers that compare one of the scores to a given threshold to classify a candidate term) and we limited their number to 50. An example of such a classifier is presented in Figure 2.

Experiments

Our community lacks a common benchmark on which we could compare our results with others. In this work, we applied our approach to a corpus called EAU.
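The left/right entropy metric introduced in the scoring section can be sketched as follows. This is an illustrative implementation only: it estimates the neighbour probabilities by relative frequency over the candidate's occurrences, which is one plausible reading of the formula above.

```python
import math
from collections import Counter

def entropy_score(candidate, tokens):
    """Average of the left and right entropy of `candidate` (a tuple of
    words) in the token list `tokens`, with h(x) = -x log x.
    Neighbour probabilities are relative frequencies (a sketch assumption)."""
    n = len(candidate)
    left, right = Counter(), Counter()
    occ = 0
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == tuple(candidate):
            occ += 1
            if i > 0:
                left[tokens[i - 1]] += 1      # word just before the candidate
            if i + n < len(tokens):
                right[tokens[i + n]] += 1     # word just after the candidate
    if occ == 0:
        return 0.0
    h = lambda x: -x * math.log(x) if x > 0 else 0.0
    e_left = sum(h(c / occ) for c in left.values())
    e_right = sum(h(c / occ) for c in right.values())
    return 0.5 * (e_left + e_right)

corpus = "the water supply in the water supply of a water supply".split()
print(entropy_score(("water", "supply"), corpus))
```

A candidate seen in varied contexts gets a high score; one that always appears with the same neighbours (or only once) scores at or near zero, which matches the intuition that terms are coherent units surrounded by varied words.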
It is composed of six texts dealing with water supply. Its complex terms have been listed by some members of the Office québécois de la langue française for a project called ATTRAIT (Atelier de Travail Informatisé du Terminologue), whose main objective was to evaluate existing software solutions for the terminologist.[1]

Input: a scored candidate term c
α = 0
if entropy(c) > 1.6 then α = α + 0.26 else α = α − 0.26
if length(c) > 1.6 then α = α + 0.08 else α = α − 0.08
if α > 0 then return term else return not-term

Figure 2  An excerpt from a classifier generated by the AdaBoost learning algorithm.

In our experiments, we kept the preprocessing stage as simple as possible. The corpus and the list of terms were automatically tokenized, lemmatized and POS-tagged with an in-house package (Foster, 1991). Once preprocessed, the EAU corpus is composed of 12,492 words and 208 terms. Of these 208 terms, 186 appear without syntactic variation (as they were listed) a total of 400 times. Since the terms of our evaluation corpus are already identified, it is straightforward to compute the precision and the recall of our system. Precision (resp. recall) is the ratio of terms correctly identified by the system over the total number of terms identified as such (resp. over the total number of terms manually identified in the list).

We evaluated our system using five-fold cross-validation. This means that the corpus was partitioned into five subsets and that five experiments were run, each time testing on a different subset and training the automaton and the classifier on the four others. Each training set (resp. testing set) was composed of about 12,000 (resp. 3,000) words containing an average of about 150 (resp. 50) terms. Because only complex terms are listed and because we do not consider term variations, our results only consider complex terms that appear without variation. Also, after informal experiments, we set the minimum probability for a pattern to be accepted by our automaton to 0.005. The performance of our system, averaged over the five folds of the cross-validation, can be found in Table 1.
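The classifier excerpt in Figure 2 amounts to summing signed, weighted votes from decision stumps. A sketch of that decision rule follows; the two stumps mirror the shape of Figure 2, but a trained classifier would have up to 50 of them, learned by AdaBoost rather than written by hand.

```python
# Each stump compares one score to a threshold and casts a signed, weighted
# vote; the sign of the total alpha decides. Thresholds and weights below
# follow Figure 2's excerpt and are illustrative.
STUMPS = [
    # (feature, threshold, weight)
    ("entropy", 1.6, 0.26),
    ("length", 1.6, 0.08),
]

def classify(scores):
    """scores: dict mapping feature name -> value for one candidate term.
    Returns alpha; alpha > 0 means 'term'."""
    alpha = 0.0
    for feature, threshold, weight in STUMPS:
        alpha += weight if scores[feature] > threshold else -weight
    return alpha

print("term" if classify({"entropy": 2.1, "length": 2.0}) > 0 else "not-term")
print("term" if classify({"entropy": 0.5, "length": 1.0}) > 0 else "not-term")
```

Returning α itself, rather than the thresholded label, is what later allows the candidates to be ranked and compared against individual metrics with ROC curves.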
From the results, we can see that the automaton has a high recall but a low precision, which was to be expected: the automaton is only a rough filter that eliminates easy-to-eliminate word sequences but keeps as many terms as possible. On the other hand, the selection did not perform as well as we expected. Its low recall and precision could be explained by metrics that are not as expressive as we thought, and by the fact that 75% of the terms in our test corpora appear only one time. When a term appears only one time, its frequency and entropy become useless. The results presented in Table 2 seem to confirm this hypothesis.

Part            Measure    µ     σ
Extraction      Precision  0.14  0.05
                Recall     0.94  0.03
Identification  Precision  0.45  0.19
                Recall     0.41  0.20
Overall system  Precision  0.43  0.18
                Recall     0.38  0.18

Table 1  Mean (µ) and standard deviation (σ) of the precision and recall of the different parts of our system.

Because we wanted to compare our system with the individual metrics that it uses, we had to modify it so that it ranks the candidate terms instead of simply accepting or rejecting them. To do so, we made our system return α instead of term or not-term (see Figure 2). We then sorted the candidate terms in decreasing order of their α value. A common practice when comparing ranking algorithms is to build their ROC (receiver operating characteristic) curves, which show the ratio of good identifications (y axis) against the ratio of bad identifications (x axis) for all acceptance thresholds. The best curve rises faster in y than in x, and so has the greater area under it. We can see in Figure 3 that our system performs better than entropy or log-likelihood alone. This leads us to believe that different scores carry different information and that combining them, as we did, is fruitful.

Discussion and future work

In this paper, we presented an approach to automatically generate an end-to-end term extractor from a training corpus. We also proposed a way to combine many statistical scores in order to extract terms more efficiently than when each score is used in isolation. Because of the nature of the training algorithm, we can easily extend the set of metrics we considered here. Even a priori knowledge could be integrated, by specifying keywords before the extraction and setting a score to one when a candidate term contains a keyword and to zero otherwise. The same flexibility is achieved when the automaton is created: by generating it directly from the output of the POS tagger, our solution does not depend on a particular tagger and is tolerant to consistent tagging errors.

Criteria                                 Measure    µ     σ
Candidates appearing one time            Precision  0.39  0.16
                                         Recall     0.33  0.22
Candidates appearing at least two times  Precision  0.73  0.14
                                         Recall     0.85  0.09

Table 2  Comparison of the performance of the term identification part for candidates appearing with different frequencies.

Figure 3  The ROC of our system (AdaBoost) against two other scores, when we trained our system on one half of our corpus and tested on the other. A greater area under the curve is better.

A shortcoming of this work is that we did not treat term variations. Terminology variation is a well-known phenomenon, whose amount is estimated at between 15% and 35% according to (Kageura et al., 2004). We think that the best way to deal with variations in our framework would be to introduce a preprocessing stage where they are normalized to a canonical form. Term variations have been extensively studied in (Jacquemin, 2001) and (Daille, 2003).

In our experiments, we focused on complex terms. Because some scores do not apply to simple terms (e.g. log-likelihood and length), we think that the best way to extract simple terms would be to train a dedicated classifier.

Acknowledgements

We would like to thank Hugo Larochelle, who found the corpus we used in our experiments, and Elliott Macklovitch, who made some useful comments on the first draft of this document. This work has been subsidized by NSERC and FQRNT.

1. See http://www.rit.org for more details on this project.

References

Castellví, M. Teresa Cabré; Bagot, Rosa Estopà; Palastresi, Jordi Vivaldi. Automatic Term Detection: A Review of Current Systems. In Recent Advances in Computational Terminology. John Benjamins, 2001.

Daille, Béatrice. Study and Implementation of Combined Techniques for Automatic Extraction of Terminology. In The Balancing Act: Combining Symbolic and Statistical Approaches to Language. New Mexico State University, Las Cruces, 1994.

Daille, Béatrice. Conceptual Structuring through Term Variations. In Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003.

Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. 1993.

Foster, George. Statistical Lexical Disambiguation. Master's thesis, McGill University, Montreal, 1991.

Freund, Y.; Schapire, R. E. A Short Introduction to Boosting. Journal of the Japanese Society for Artificial Intelligence, 1999.

Jacquemin, Christian. Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001.

Justeson, John S.; Katz, Slava M. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural Language Engineering, 1995.

Kageura, Kyo; Daille, Béatrice; Nakagawa, Hiroshi; Chien, Lee-Feng. Recent Trends in Computational Terminology. Terminology. John Benjamins, 2004.

Langlais, Philippe; Carl, Michael. General-Purpose Statistical Translation Engine and Domain Specific Texts: Would It Work? Terminology. John Benjamins, 2004.