Domain term relevance through tf-dcf

Lucelene Lopes, Paulo Fernandes, Renata Vieira
PPGCC - FACIN, PUCRS University, Porto Alegre - Brazil
lucelene.lopes@pucrs.br, paulo.fernandes@pucrs.br, renata.vieira@pucrs.br

Abstract

This paper proposes a new index for the relevance of terms extracted from domain corpora. We call it term frequency, disjoint corpora frequency (tf-dcf), and it is based on the absolute frequency of each term tempered by its frequency in other (contrasting) corpora. The conceptual differences and the mathematical computation of the proposed index are discussed with respect to other similar approaches that also take the frequency in contrasting corpora into account. To illustrate the efficiency of the tf-dcf index, this paper evaluates the application of this index and of other similar approaches.

I. INTRODUCTION

The automatic extraction of terms from texts is a well mapped task, but the automatic choice of which extracted terms are relevant for a specific domain is a much more challenging one. Finding the most relevant terms for a domain, i.e., the domain concepts, is an important step for knowledge engineering tasks such as ontology learning from texts [1]. Some classical linguistic-based work in this area suggests the use of distributional analysis [2] to associate terms and then establish which of them are good concept candidates. A different approach, yet following the same idea of inferring concepts from term association, is taken by Chemudugunta et al. [3], where the identification of concepts is made through pure statistical measures tempered by previously inserted human information. The work of Titov and Kozhevnikov [4] also follows this line of research by inferring semantic relations among terms in order to identify different terms representing the same concept in sets of small documents (weather forecasts) with no linguistic annotation. The work of Bosma and Vossen [5] presents a similar effort to establish term relevance measures considering a multiple corpora resource.
This work proposes different relevance measures of terms to each corpus, but Bosma and Vossen's relevance measure of a term in a given corpus does not affect the relevance of this same term in other corpora. In fact, the methodology proposed in their work accesses WORDNET [6] not only to validate the term candidates according to their measures, but also to establish relations (hypernym, hyponym, meronym, etc.) among them.

In opposition to these efforts, this paper proposes an approach that is not linguistic-based; it relies only on the statistical information gathered from the domain corpus to establish a numerical measure of term relevance in this corpus. Therefore, this paper's approach is aligned with works that take into account the term frequency in documents to compute a relevance index that establishes how representative a term extracted from a corpus will be for the domain represented by this corpus. Some examples of such statistical-based approaches are the work of Dunning in 1993 [7], which proposes the use of the log likelihood ratio, the work of Manning and Schütze in 1999 [8], which proposes a composition of tf-idf (term frequency, inverse document frequency [9], adapted for term relevance in a corpus), and other initiatives based on computing indices from one specific corpus only. However, our claim is that those typical indices fail to rule out terms which are not particularly relevant to a target domain.

The basic idea behind approaches like the one in our paper is the assumption that a term's relevance to a specific domain can only be established by comparison with corpora from other domains, called contrasting corpora. One of the first examples of previous work similar to our own was the work of Chung in 2003 [10]. More recently, more sophisticated versions were proposed by Park et al. in 2008 [11] with the domain specificity index, by Kit and Liu in 2008 with the termhood index [12], and by Kim et al. in 2009 with the term frequency, inverse domain frequency index [13]. These approaches brought some quality to term extraction, as was verified by the works of Teixeira et al. [14] and Rose et al. [15].
Similar to our proposal, all these previous works followed the same principle: to compute a relevance index that is directly proportional to the term's absolute frequency in the corpus and inversely proportional to the term's absolute frequency in other corpora. The main difference among these similar previous works [11], [12], [13] and our own is the specific formula used to weigh the influence of the frequency in other corpora.

This paper's first contribution resides in drawing a panorama of options of indices to express the relevance of terms extracted from a domain corpus, focusing on indices that also take into account corpora of other domains (contrasting corpora). Some experiments illustrate the benefits of approaches using contrasting corpora over traditional indices. Secondly, and most important, this paper contributes with the proposal of a new relevance index, called tf-dcf, that is, according to our experiments, superior to the other indices based on contrasting corpora. This contribution is enhanced by the analysis of the tf-dcf behavior against different options of contrasting corpora.

It is not the goal of this paper to analyze techniques to improve the quality of term extraction itself, since we assume that a previously performed extraction provides a set of extracted terms. It is also out of the scope of this paper to analyze how many terms should be considered concepts of a domain. Our purpose is to present arguments and experiments showing that the proposed index is effective to rank extracted terms according to their relevance for the domain, thus allowing the identification of domain concept candidates.

This paper is organized as follows: Section II describes the existing statistical measures that are compared to our proposed tf-dcf index; Section III presents our paper's main contribution, which is the proposal of a term relevance index based on the inclusion of a disjoint corpora frequency (dcf) component; Section IV evaluates the existing and proposed indices. Finally, the Conclusion stresses the contributions and limitations of this paper, leading to the proposition of future works.

II. EXISTING MEASURES FOR RELEVANCE ESTIMATION

The most elementary way to establish the statistical relevance of terms extracted from a domain specific corpus is to compute the absolute frequency of terms, i.e., how many times each term occurs in the corpus. Obviously, this simple approach is very fragile, since a very frequent term is not necessarily relevant for the domain. This fact is especially noticeable with simple extraction methods, although even sophisticated linguistic-based methods also suffer from using such a simple criterion. For example, pure statistical methods require the adoption of a list of highly frequent grammatical words (stop list). Without a stop list, any pure statistical method delivers terms with very low significance, such as prepositions and usual expressions. However, it might be very difficult to establish an exhaustive stop list in advance for different domains and genres. The use of term frequency as a relevance measure is a little less harmful for extraction methods that take linguistic information into account.
For example, the syntactic annotation of a corpus allows the extraction procedure to avoid terms that are unsuitable for concept names, such as verbs and pronouns. In fact, more sophisticated linguistic analysis, such as the identification of noun phrases, may significantly improve the quality of extraction, but even in these cases the use of term frequency does not prevent the incorrect extraction of common expressions which are not domain specific. For example, the quite common expression "future work" may be found in several academic texts, but it is hardly considered a defining concept of any scientific domain.

Nevertheless, the starting point of all sophisticated indices is the simple absolute frequency. Assuming $tf_{t,d}$ as the number of occurrences of term $t$ in document $d$, and $D^{(c)}$ the set of all documents belonging to the corpus $c$ referring to a specific domain, the absolute term frequency of a term $t$ in corpus $c$ is expressed by:

$$tf_t^{(c)} = \sum_{d \in D^{(c)}} tf_{t,d} \tag{1}$$

A. Term frequency and inverse document frequency - tf-idf

An alternative to plain term frequency is to take into account the frequency of the term among documents. The seminal work of Spärck-Jones [9] shows the importance of considering frequent terms, but also non-frequent ones, in order to retrieve documents. These ideas led to the well-known Robertson and Spärck-Jones probabilistic model of term relevance to specific documents [16]. Croft and Harper [17], and later Robertson and Walker [18], proposed formulations of a popular index that takes positively into account the term frequency (tf), i.e., the number of occurrences of a given term $t$ in a document $d$, and also considers negatively the number of documents of the corpus where term $t$ appears at least once, i.e., the inverse document frequency (idf). This index, called tf-idf, has many formulations, e.g., [19], [20], [8], but in this paper we will consider the formulation adopted by Bell et al. [21].
The tf-idf index is mathematically defined for each term $t$ and each document $d$ belonging to a corpus $c$ that has at least one occurrence of $t$ as follows:

$$\text{tf-idf}_{t,d} = \underbrace{\left(1 + \log(tf_{t,d})\right)}_{\text{tf part}} \times \underbrace{\log\left(1 + \frac{|D^{(c)}|}{|D_t^{(c)}|}\right)}_{\text{idf part}} \tag{2}$$

where $tf_{t,d}$ is the number of occurrences of term $t$ in document $d$; $D^{(c)}$ is the set of all documents of a given corpus $c$; and $D_t^{(c)}$ is the subset of these documents where $t$ appears at least once.

In Eq. 2 it is possible to identify the term frequency (tf) and the inverse document frequency (idf) parts. The tf part considers the logarithm of the frequency of the term, since the variation of term occurrences approaches an exponential distribution, i.e., a term $t_1$ that occurs 10 times is not 10 times more important than a term $t_2$ that appears only once. Nevertheless, term $t_1$ is an order of magnitude more important than term $t_2$. The idf part represents a value that varies from $\log(2)$, for a term that appears in all documents, to $\log(1 + |D^{(c)}|)$, for a term that appears in only one document.

The idea behind the tf-idf formulation is that a term $t$ is more relevant as a keyword for a document $d$ if it appears many times in this document and very few times (or ideally none) in other documents. This is an important distinction for information retrieval. The popularity of this index is justified mostly because it prevents frequent terms spread over many documents from being considered more relevant than they should be. Indeed, tf-idf is an effective measure to identify the defining terms of documents, because it spots terms that are good for document indexation.

The use of tf-idf to establish the relevance of terms to domain corpora was proposed by Manning and Schütze [8]. According to these authors, a possible index to express the relevance of a term $t$ in a corpus $c$ is expressed by:

$$\text{tf-idf}_t^{(c)} = \sum_{d \in D_t^{(c)}} \text{tf-idf}_{t,d} \tag{3}$$
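As a minimal sketch of Eqs. 1 to 3, the Python snippet below computes the absolute frequency and the corpus-level tf-idf for a toy corpus. The function names, the representation of a document as a list of already extracted terms, and the use of base-2 logarithms are assumptions made for illustration only; the text does not state the base of the logarithm in Eq. 2.

```python
import math
from collections import Counter


def corpus_tf(docs):
    """Eq. 1: absolute frequency of each term, summed over all documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return total


def corpus_tf_idf(docs, log=math.log2):
    """Eq. 3: for each term, the sum of its per-document tf-idf (Eq. 2)."""
    per_doc = [Counter(doc) for doc in docs]
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())
    scores = {}
    for counts in per_doc:
        for term, tf in counts.items():
            tf_part = 1 + log(tf)
            idf_part = log(1 + len(docs) / doc_freq[term])
            scores[term] = scores.get(term, 0.0) + tf_part * idf_part
    return scores


# Hypothetical toy corpus: three documents given as lists of extracted terms.
docs = [["a", "a", "b"], ["a", "c"], ["a"]]
print(corpus_tf(docs))
print(corpus_tf_idf(docs))
```

In this toy corpus the term "a" occurs in every document, so its idf part takes the minimum value log(2) in each of them, while "b" and "c" occur in a single document and receive the maximum idf part.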

B. Term domain specificity - tds

The first initiatives to consider the relevance of terms to a domain corpus taking into account a contrastive generic corpus, or corpora, include the works of Chung in 2003 [10] and Drouin in 2004 [22]. However, to the best of the authors' knowledge, the work of Park et al. [11], in 2008, offers one of the first formulations of an index to express term relevance to a specific domain. In that work, such index is called domain specificity, and it is expressed as the ratio between the probability of occurrence of a term $t$ in a domain corpus $c$ and the probability of this same term in a generic corpus. Park et al.'s definition of term domain specificity for a specific domain corpus $c$, considering a generic domain corpus $g$, was expressed as:

$$tds_t^{(c)} = \frac{p_t^{(c)}}{p_t^{(g)}} = \frac{tf_t^{(c)} / N^{(c)}}{tf_t^{(g)} / N^{(g)}} \tag{4}$$

where $p_t^{(c)}$ expresses the probability of occurrence of term $t$ in corpus $c$; and $N^{(c)}$ is the total number of terms in corpus $c$, i.e., $N^{(c)} = \sum_{t'} tf_{t'}^{(c)}$.

C. Termhood - thd

Following the approach of considering, besides the domain corpus of interest, a contrasting corpus, the work of Kit and Liu in 2008 [12] proposes an index called termhood. This index, like Park et al.'s term domain specificity, follows the idea that a term relevant to a domain is more frequent in the domain corpus than in other corpora. The main difference brought by this work is that it considers the term rank in the corpus vocabulary (the set of all terms in the corpus), instead of the term absolute frequency. Kit and Liu's definition of the termhood index of a term $t$ for a corpus $c$, considering a generic domain corpus $g$ (called background corpus by them), was expressed by:

$$thd_t^{(c)} = \underbrace{\frac{r_t^{(c)}}{|V^{(c)}|}}_{\text{norm. rank value in } c} - \underbrace{\frac{r_t^{(g)}}{|V^{(g)}|}}_{\text{norm. rank value in } g} \tag{5}$$

where $V^{(c)}$ is the vocabulary of corpus $c$, i.e., $|V^{(c)}|$ is the cardinality of the set of all terms in the corpus $c$, and $r_t^{(c)}$ is the rank value of term $t$, expressed as $|V^{(c)}|$ for the most frequent term, $|V^{(c)}| - 1$ for the second most frequent, and so on until the least frequent term, with $r_t^{(c)} = 1$.
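The two contrastive indices of Eqs. 4 and 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the corpora are hypothetical dictionaries mapping terms to absolute frequencies, the function names are invented here, ties in frequency are broken alphabetically, and returning None when a term is absent from the generic corpus is a choice of this sketch, since Eq. 4 is undefined in that case.

```python
def tds(term, tf_c, tf_g):
    """Eq. 4: relative frequency in c divided by relative frequency in g."""
    if tf_g.get(term, 0) == 0:
        return None  # Eq. 4 is undefined here; None is this sketch's convention
    p_c = tf_c.get(term, 0) / sum(tf_c.values())
    p_g = tf_g[term] / sum(tf_g.values())
    return p_c / p_g


def normalized_rank(term, tf):
    """Rank value (|V| for the most frequent term, 1 for the least) over |V|."""
    if term not in tf:
        return 0.0  # a term outside the vocabulary contributes nothing
    # Ascending order of frequency; ties broken alphabetically (assumption).
    ordered = sorted(tf, key=lambda t: (tf[t], t))
    return (ordered.index(term) + 1) / len(tf)


def thd(term, tf_c, tf_g):
    """Eq. 5: normalized rank value in c minus normalized rank value in g."""
    return normalized_rank(term, tf_c) - normalized_rank(term, tf_g)


# Hypothetical frequencies for a domain corpus c and a generic corpus g.
tf_c = {"breast feeding": 30, "current study": 20, "new born": 10}
tf_g = {"current study": 40, "data set": 60}

for term in ("breast feeding", "current study", "data set"):
    print(term, tds(term, tf_c, tf_g), thd(term, tf_c, tf_g))
```

With these toy values, the most frequent domain term that is absent from the generic corpus gets a termhood of 1, and the most frequent generic term that is absent from the domain corpus gets a termhood of -1.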
Observing the termhood index, we can see it as the difference between the normalized rank value of the term in the domain corpus $c$ and in the generic domain corpus $g$. Actually, the division of the rank value by the vocabulary size is intended to keep the normalized rank value within the interval $(0, 1]$, with a value equal to 1 for the most frequent term, and the other terms decaying, according to their frequency, asymptotically toward 0. As a result, the termhood index will be within the interval $[-1, 1]$, ranging from a value equal to 1 for the most frequent term in $c$, if it does not belong to vocabulary $V^{(g)}$, to a value equal to $-1$ for the most frequent term in $g$, if it does not belong to vocabulary $V^{(c)}$.

D. Term frequency, inverse domain frequency - TF-IDF

Recently, Kim et al. [13] proposed, in 2009, another index to rank term relevance considering the original idea of the tf-idf index, which was to identify whether a term is suitable to represent a document. In such way, Kim et al. did not actually propose a new index; instead, they proposed the use of the same tf-idf formulation, but considering the set of documents of a corpus as a single document. To avoid confusion, we will refer to this index with the acronym TF-IDF in uppercase, to differentiate it from the term frequency, inverse document frequency (tf-idf). The TF-IDF index for term $t$ in a corpus $c$, considering a set of corpora $G$, as proposed by Kim et al., is numerically expressed by:

$$\text{TF-IDF}_t^{(c)} = \underbrace{\frac{tf_t^{(c)}}{\sum_{t'} tf_{t'}^{(c)}}}_{\text{TF part}} \times \underbrace{\log\left(\frac{|G|}{|G_t|}\right)}_{\text{IDF part}} \tag{6}$$

where $tf_t^{(c)}$ is the term frequency of term $t$ in corpus $c$; $G$ is the set of all domain corpora; and $G_t$ is the subset of $G$ where the term $t$ appears at least once.

It is important to notice that the basic formulation of tf-idf used as inspiration by Kim et al.'s proposal is not as robust as the one of Bell et al. (Eq. 3). For instance, if a term appears in all corpora, the IDF part of Eq. 6 will become 0, and therefore, such term will have a TF-IDF index also equal to 0, i.e., it will be considered less relevant than any other term, regardless of its number of occurrences.
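The zero-valued IDF part described above can be observed with a small Python sketch of Eq. 6. The corpus labels, the term frequencies and the function name are hypothetical, and the base-2 logarithm is an assumption, since the base is not stated for this index.

```python
import math


def tf_idf_domain(term, domain, corpora, log=math.log2):
    """Eq. 6: relative frequency in the domain corpus times log(|G| / |G_t|)."""
    tf = corpora[domain]
    if tf.get(term, 0) == 0:
        raise ValueError("the term must occur in the domain corpus")
    tf_part = tf[term] / sum(tf.values())
    # |G_t|: number of corpora in which the term occurs at least once.
    n_with_term = sum(1 for freqs in corpora.values() if freqs.get(term, 0) > 0)
    idf_part = log(len(corpora) / n_with_term)
    return tf_part * idf_part


# Hypothetical set G of three corpora, each mapping terms to frequencies.
corpora = {
    "PED": {"breast feeding": 3, "current study": 1},
    "SM": {"current study": 5},
    "DM": {"current study": 2, "data set": 1},
}

print(tf_idf_domain("breast feeding", "PED", corpora))
print(tf_idf_domain("current study", "PED", corpora))
```

Here "current study" occurs in all three corpora, so its score is 0 no matter how often it occurs in the domain corpus, whereas "breast feeding" is unique to the domain corpus and receives the full reward log(|G|).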
Another important difference between Equations 3 and 6 is that Bell et al.'s formulation (Eq. 3) uses the log of the absolute term frequency in the tf part, while Kim et al.'s (Eq. 6) directly considers a relative term frequency.

III. PROPOSED INDEX

The goal of all indices presented in the previous section is to obtain higher numeric values for terms that are relevant to a given domain or, for more recent knowledge engineering tasks [14], [15], terms that are suitable candidates for concepts of an ontology. The raw term absolute frequency (Eq. 1) obviously indicates relevance, since a term that is very frequent is likely to be important to the domain. The tf-idf index (Eq. 3) can also be indicative of relevance, since terms that are very distinctive of some documents of the corpus are also likely to be representative of the domain.

The tds (Eq. 4), thd (Eq. 5) and TF-IDF (Eq. 6) indices have a better chance of identifying concepts of a domain because they use contrasting corpora. Nevertheless, these indices adopt different approaches that reveal distinct empirical initiatives to tackle the concept identification problem. The first difference is how these indices take the occurrences of terms in the domain corpus into account. The tds (Eq. 4) and TF-IDF (Eq. 6) indices compute a relative frequency of the term, since the term probability ($p_t^{(c)}$) for tds and the TF part for TF-IDF are computed as the absolute frequency divided by the total number of terms in the domain corpus. The thd (Eq. 5) index, however, computes a normalized rank value that, even though computed according to the absolute frequency, delivers a linear relation among all terms (see footnote 1).

The second difference resides in the effect brought by the occurrence of terms in contrasting corpora. The tds (Eq. 4) index penalizes the terms that occur in the contrasting corpora by dividing their probability in the domain corpus by their probability in the contrasting corpora. The thd (Eq. 5) index also penalizes the terms that occur in the contrasting corpora, but in this case it subtracts the normalized rank value in the contrasting corpora from the normalized rank value in the domain corpus. The approach of the TF-IDF (Eq. 6) index is quite different, since it rewards the terms that are unique to the domain corpus by multiplying the relative frequency by the log of the number of corpora. Such reward decreases as the term appears in other contrasting corpora, until it drops to 0 when the term appears in all corpora. It is important to notice that this reward decreases with the number of corpora in which the term appears, but it is independent of the number of term occurrences in contrasting corpora.

We propose a new index to estimate the term relevance to a domain following the same idea of contrasting corpora, but we propose differences in the way term occurrences in the domain corpus are taken into account and, most of all, in the effect brought by occurrences in the contrasting corpora. Specifically, we propose a representation of this effect called disjoint corpora frequency (dcf), which is a mathematical way to penalize terms that appear in contrasting corpora proportionally to their number of occurrences, as well as to the number of contrasting corpora in which the term appears.

A. Term frequency, disjoint corpora frequency - tf-dcf

Our proposal, like other contrasting corpora approaches, is based on a primary indication of term relevance and a reward/penalization mechanism. The basis of the tf-dcf index is to consider the absolute frequency as the primary indication of term relevance.
Then, we choose to penalize terms that appear in the contrasting corpora by dividing their absolute frequency in the domain corpus by a geometric composition of their absolute frequency in each of the contrasting corpora. The tf-dcf index is mathematically expressed, for term $t$ in corpus $c$, considering a set of contrasting corpora $G$, as:

$$\text{tf-dcf}_t^{(c)} = \frac{tf_t^{(c)}}{\prod_{g \in G} \left(1 + \log\left(1 + tf_t^{(g)}\right)\right)} \tag{7}$$

The choice of absolute frequency as the primary indication of term relevance for corpus $c$, instead of a relative frequency (like tds and TF-IDF) or a term rank (like thd), aims at the simplicity of the measure for two main reasons:

- We do not consider that there is a need for the linearization brought by the use of the term rank, as for the thd index, nor a need to make explicit the normalization according to the corpus size, as for tds and TF-IDF; in fact, any normalization according to the corpus size still remains possible after the tf-dcf computation;
- We consider that keeping a relation with the absolute term frequency preserves the intuitive comprehension of the index, since the tf-dcf numeric value will be smaller than the absolute frequency tf (if the term appears in the contrasting corpora) or equal to it (if the term does not appear in the contrasting corpora).

Footnote 1: It is important to recall that the distribution of absolute frequency values is likely to follow a Zipf law [23], i.e., the most frequent term is likely to have twice the number of occurrences of the second, three times the number of occurrences of the third, and so on.

The geometric composition of absolute frequencies in the contrasting corpora chosen to express the penalization, i.e., the divisor in Eq. 7, tries to encompass the following assumptions:

- The number of occurrences of a term in each of the contrasting corpora is distributed according to a Zipf law [23], and to correctly estimate its importance, a linearization of this number of occurrences must be made;
- A term that appears only in the domain corpus should not be penalized at all, i.e., terms that do not occur in the contrasting corpora must have the divisor equal to 1; and
- A term that appears in many corpora is more likely to be irrelevant to the domain corpus than terms that appear in fewer corpora.

Because of the first assumption, we choose to apply a log function to the absolute frequency in each contrasting corpus ($tf_t^{(g)}$). This decision follows the same principle adopted in the original proposition of the tf-idf measure by Robertson and Spärck-Jones [16]. The second assumption made us adapt this log function with the addition of the value 1 inside and outside the log function, in order to deliver a value equal to 1 when the number of occurrences of a term in a contrasting corpus is equal to 0. This decision follows the same principle adopted by Bell et al. [21] to express their formulation of the tf-idf measure. Finally, the third assumption led us to employ the product of the log of occurrences in each contrasting corpus. The product represents that the importance of occurrences grows geometrically as the term appears in other corpora. In fact, according to our formulation, a term is more likely to be irrelevant for a domain corpus when it appears a few times in many contrasting corpora than when it appears many times in just a few contrasting corpora. Additionally, the product is compatible with the idea of having a divisor equal to 1 when a term appears only in the domain corpus.

IV. PRACTICAL RESULTS

The practical application of the proposed index is meant to illustrate its effectiveness and some basic characteristics of tf-dcf according to the contrasting corpora used. The experiments were conducted over Brazilian Portuguese corpora, using a linguistic-based term extraction tool to provide terms and their number of occurrences.
Nevertheless, corpora in any language submitted to any kind of extraction could be employed without any loss of generality.

A. The chosen corpora

The chosen test bed was one corpus from the Pediatrics domain [24], with 281 documents from The Brazilian Journal on Pediatrics. This corpus (PED) was chosen because of the availability of reference lists of relevant terms. Four other scientific corpora were used as support for the definition of specific Pediatrics terms. These corpora have approximately 1 million words each and their domains are: Stochastic modeling (SM), Data mining (DM), Parallel processing (PP) and Geology (GEO) [25]. Tab. I summarizes the information about these corpora.

Table I
CORPORA CHARACTERISTICS.

corpus                      documents   sentences   words
Pediatrics (PED)            281         27,724      835,412
Stochastic Modeling (SM)    88          44,222      1,173,401
Data Mining (DM)            53          42,932      1,127,816
Parallel Processing (PP)    62          40,928      1,086,771
Geology (GEO)               234         69,461      2,010,527

B. Extraction tools

The extraction of terms and their frequencies was made by a two-step process. First, the documents were annotated by the Portuguese parser PALAVRAS [26]. Then the PALAVRAS output, i.e., a set of TigerXML files, was submitted to the ExATOlp term extractor [27]. The joint application of PALAVRAS and ExATOlp delivers high quality term lists, since the extracted terms are noun phrases found in the corpus, with their frequencies. The extracted noun phrases were filtered according to ExATOlp heuristic rules aiming at an output of noun phrases as meaningful as possible. These heuristics range from the simple exclusion of articles to quite ingenious ones like the detection of implicit noun phrases (see footnote 2) [28].

C. Extracted terms and reference lists

The extracted terms were divided in two lists, bigrams and trigrams. Single terms and those with more than three words were not considered in the evaluation, since they were not included in the hand-made reference list constructed by the terminology laboratory TEXTECC (http://www6.ufrgs.br/textecc/). The reference lists were produced by a careful and laborious process that involved terminologists, domain specialists (Pediatricians) and academic students.
These lists are available for download at the TEXTECC website and they have been used for practical applications including glossary construction, translation aid, and even ontology construction. These reference lists are composed of 1,534 bigrams and 2,660 trigrams and they can also be consulted at http://ontolp.inf.pucrs.br/ontolp/downloads-ontolplista.php.

The full extracted term lists delivered by PALAVRAS and ExATOlp for the Pediatrics corpus were composed of 15,483 distinct bigrams and 18,171 distinct trigrams. For each of these lists the computed indices were:

- tf, the absolute term frequency (Eq. 1);
- tf-idf, the term frequency, inverse document frequency (Eq. 3), with the basic formulation from Bell et al. [21] aggregated with the sum proposed by Manning and Schütze [8], used as an example of an index that does not use contrasting corpora;
- tds, the term domain specificity (Eq. 4) proposed by Park et al. [11];
- thd, the termhood (Eq. 5) proposed by Kit and Liu [12];
- TF-IDF, the term frequency, inverse domain frequency (Eq. 6) proposed by Kim et al. [13]; and
- tf-dcf, the term frequency, disjoint corpora frequency (Eq. 7) proposed in the previous section of this paper.

Footnote 2: Implicit noun phrases are, for example, "sick children" and "healthy children", which can be extracted from the sentence "Sick and healthy children can be treated."

D. The impact of different measures on frequent terms

Observing in detail some terms in the extracted lists, it is possible to better understand the effect of each index and, therefore, the benefits brought by tf-dcf as a relevance index. Tab. II presents the ten most frequent terms, i.e., the ten terms with the most absolute occurrences in the Pediatrics corpus. The table shows the number of occurrences of each term in each corpus, i.e., Pediatrics (PED), Stochastic modeling (SM), Data mining (DM), Parallel processing (PP) and Geology (GEO). Additionally, the last column (ref. list) indicates whether the term belongs (IN) or not (OUT) to the reference list.

Table II: OCCURRENCES FOR FREQUENT TERMS FROM PEDIATRICS CORPUS.

| term in Portuguese (translation) | PED | SM | DM | PP | GEO | ref. list |
|---|---|---|---|---|---|---|
| aleitamento materno (breast feeding) | 306 | 0 | 0 | 0 | 0 | IN |
| recém nascido (new born) | 299 | 0 | 0 | 0 | 0 | IN |
| faixa etária (age slot) | 234 | 0 | 6 | 0 | 0 | IN |
| presente estudo (current study) | 188 | 4 | 1 | 0 | 67 | OUT |
| leite materno (mother's milk) | 163 | 0 | 0 | 0 | 0 | IN |
| idade gestacional (gestational age) | 144 | 0 | 0 | 0 | 0 | IN |
| ventilação mecânica (mechanical ventilation) | 138 | 0 | 0 | 0 | 0 | IN |
| via aérea (airway) | 120 | 0 | 0 | 0 | 0 | IN |
| pressão arterial (blood pressure) | 112 | 0 | 0 | 0 | 0 | IN |
| sexo masculino (male sex) | 109 | 7 | 8 | 0 | 0 | OUT |

The same ten most frequent terms are also shown in Tab. III with the values for the six presented indices, as well as their rank according to each of them. For example, in the third row of Tab. III, the term faixa etária ("age slot" in English) belongs to the reference list and it is ranked as the third term in the lists sorted with the term frequency (tf - Eq. 1) and with the term frequency, inverse document frequency (tf-idf - Eq. 3). In the lists sorted with the other indices this term is ranked as the 13,281st (for tds - Eq. 4), the fourth (for thd - Eq. 5), the sixth (for TF-IDF - Eq. 6), and the fifteenth (for tf-dcf - Eq. 7).

Table III: ANALYSIS OF FREQUENT TERMS FROM PEDIATRICS CORPUS. Each cell gives the index value followed by the rank of the term in the list sorted by that index.

| term in Portuguese (translation) | tf (Eq. 1) | tf-idf (Eq. 3) | tds (Eq. 4) | thd (Eq. 5) | TF-IDF (Eq. 6) | tf-dcf (Eq. 7) |
|---|---|---|---|---|---|---|
| aleitamento materno (breast feeding) | 306 (1st) | 199.18 (1st) | 1.00 (1st) | 1.00 (1st) | 0.0027 (1st) | 306.00 (1st) |
| recém nascido (new born) | 299 (2nd) | 184.98 (2nd) | 1.00 (1st) | 0.99 (2nd) | 0.0027 (2nd) | 299.00 (2nd) |
| faixa etária (age slot) | 234 (3rd) | 169.18 (3rd) | 0.98 (13,281st) | 0.93 (4th) | 0.0012 (6th) | 61.46 (15th) |
| presente estudo (current study) | 188 (4th) | 167.78 (4th) | 0.73 (13,429th) | 0.50 (42nd) | 0.0002 (57th) | 3.99 (1,276th) |
| leite materno (mother's milk) | 163 (5th) | 143.23 (5th) | 1.00 (1st) | 0.94 (3rd) | 0.0015 (3rd) | 163.00 (3rd) |
| idade gestacional (gestational age) | 144 (6th) | 135.60 (7th) | 1.00 (1st) | 0.93 (5th) | 0.0013 (4th) | 144.00 (4th) |
| ventilação mecânica (mechanical ventilation) | 138 (7th) | 140.85 (6th) | 1.00 (1st) | 0.91 (6th) | 0.0012 (5th) | 138.00 (5th) |
| via aérea (airway) | 120 (8th) | 132.72 (8th) | 1.00 (1st) | 0.90 (7th) | 0.0011 (7th) | 120.00 (6th) |
| pressão arterial (blood pressure) | 112 (9th) | 93.27 (19th) | 1.00 (1st) | 0.88 (8th) | 0.0010 (8th) | 112.00 (7th) |
| sexo masculino (male sex) | 109 (10th) | 125.70 (9th) | 0.88 (13,318th) | 0.77 (14th) | 0.0003 (35th) | 6.53 (543rd) |

Observing the rank differences between the lists sorted with the term frequency (tf - Eq. 1) and the term frequency, inverse document frequency (tf-idf - Eq. 3), we noticed an important similarity. The only significant change occurs for the term pressão arterial ("blood pressure"), which drops from the 9th to the 19th position. However, this change does not correspond to a meaningful downgrade, since this term seems to be as relevant to Pediatrics as, for instance, via aérea ("airway"). In contrast, the quite generic term presente estudo ("current study") is not affected at all by tf-idf.

Observing the effect brought by the term domain specificity index (tds - Eq. 4), we notice its lack of precision, since it assigns the same top rank to all terms that are exclusive to the Pediatrics corpus. Consequently, the terms that appear in other corpora are cast out of any list of relevant terms, since, given the contrasting corpora (SM, DM, PP and GEO), there are more than 13,000 terms appearing only in the Pediatrics corpus.
The terms faixa etária ("age slot"), presente estudo ("current study") and sexo masculino ("male sex") are all ranked beyond the 13,000th position.

The list sorted with the termhood index (thd - Eq. 5) shows the downgrade effect on the three terms appearing in the contrasting corpora (faixa etária, presente estudo and sexo masculino in Tabs. II and III). However, these terms are not sent very low, since even the term presente estudo ("current study"), which is very frequent in the contrasting corpora (72 occurrences), is downgraded only to the 42nd position.

The list sorted according to the term frequency, inverse domain frequency index (TF-IDF - Eq. 6) shows a stronger effect than the termhood (thd - Eq. 5), since it is based on the number of contrasting corpora in which the term appears. In consequence, the term faixa etária ("age slot") drops to the sixth position because it also appears in the Data Mining corpus, while the term presente estudo ("current study") drops to the 57th position because it appears in all corpora but one (Parallel Processing, according to Tab. II).

It is important to call the reader's attention to the fact that our proposed index (tf-dcf - Eq. 7) is the only one that takes into account both the number of occurrences in the contrasting corpora (as termhood and term domain specificity do), and the number of corpora in which the term appears (as term frequency, inverse domain frequency does). For that reason, the downgrade effect in the list sorted according to our index is the strongest one. Our index casts out the term presente estudo ("current study") to the 1,276th position, while it significantly downgrades the term sexo masculino ("male sex") to the 543rd position. In contrast, the term faixa etária ("age slot") is mildly downgraded from the third to the fifteenth position.

V. CONCLUSION

This paper presented a novel numerical index to estimate the relevance of extracted terms with respect to a specific domain.
The inclusion of the disjoint corpora frequency (dcf) component successfully improved the precision of extracted lists in comparison not only with the traditional tf and tf-idf, but also with other indices based on comparison with contrasting corpora, namely term domain specificity [11], termhood [12] and term frequency, inverse domain frequency [13]. The proposed dcf approach was described here in composition with the absolute frequency (tf) and it has the advantage of keeping semantics analogous to those of the original absolute frequency index. If a given term does not appear in other corpora, its tf-dcf index will be equal to the term frequency, i.e., only terms appearing in other corpora will be numerically downgraded. This is not the case for any of the other pre-existing measures.

Our proposal is the follow-up to initial studies based on the comparison with contrasting corpora. This intuitive idea has been explored over the last 10 years [10], [22], [29], [11], [12], [13], [15], but, to the best of the authors' knowledge, our proposal is the first one to pay attention to a correct weighting of the influence of occurrences of terms in contrasting corpora. Specifically, our tf-dcf index formulation uses the product of the logs of the numbers of occurrences in the other corpora as a reductive factor for the absolute term frequency in the domain corpus. This choice is justified by the fact that term occurrences are likely to follow a Zipf law [23]. In Park et al. [11] this fact was ignored. In Kit and Liu [12] this fact was approached by the rank difference. In Kim et al. [13] this fact was approached by the term relative frequency and the logarithm in the IDF part. Therefore, our formulation seems to be mathematically more robust.

The main limitation of the current study is the lack of thorough experiments with other corpora. We chose to limit our experiments to the studied corpora because there was no sign of availability of the data sets previously employed by other authors.
Nevertheless, since the objective of this paper is to propose the tf-dcf index, a natural future work is the experimentation of our proposal on a statistically significant set of corpora. Such future work will demand the analysis of the proposed tf-dcf index, in comparison with other indices, in terms of numerical measures such as precision, and the gathering of corpora and corresponding reference lists.

Another valid future work is the study of heuristics to choose a good cut-off point to apply to the extracted term lists. With a simple index of relevance, like the absolute term frequency, the cut-off point choice seems simple, since it is enough to define a minimum number of term occurrences. However, with a more sophisticated one, such as the tf-dcf index proposed here, it is a little less obvious how to define a meaningful and effective cut-off point [30].
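
As a numerical illustration of the properties stated above (terms absent from the contrasting corpora keep their tf, while the others are divided by a product of logarithmic factors), the following minimal sketch recomputes tf-dcf for four terms of Tab. II. It assumes that Eq. 7 has the form tf-dcf = tf / prod_g (1 + log2(1 + tf_g)), where tf_g is the number of occurrences in contrasting corpus g. The base-2 logarithm and the names OCCURRENCES and tf_dcf are assumptions of this sketch, not taken from the text; under these assumptions the sketch reproduces the tf-dcf values reported in Tab. III for these terms.

```python
import math

# Occurrence counts copied from Tab. II: (PED, [SM, DM, PP, GEO]).
OCCURRENCES = {
    "aleitamento materno": (306, [0, 0, 0, 0]),
    "faixa etária": (234, [0, 6, 0, 0]),
    "presente estudo": (188, [4, 1, 0, 67]),
    "sexo masculino": (109, [7, 8, 0, 0]),
}


def tf_dcf(tf_domain, tf_contrasting):
    """Assumed form of Eq. 7: tf divided by the product of 1 + log2(1 + tf_g)."""
    divisor = 1.0
    for tf_g in tf_contrasting:
        # A corpus with zero occurrences contributes a factor of exactly 1.
        divisor *= 1.0 + math.log2(1.0 + tf_g)
    return tf_domain / divisor


scores = {term: tf_dcf(tf, others) for term, (tf, others) in OCCURRENCES.items()}

# Print the terms from the most to the least relevant according to tf-dcf.
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.2f}")
```

Under this assumed formulation, a single contrasting corpus with a few occurrences (faixa etária) already divides the frequency by almost four, whereas occurrences spread over three corpora (presente estudo) multiply three such factors, which is consistent with the much stronger downgrade discussed in Section IV-D.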

REFERENCES

[1] P. Cimiano, Ontology learning and population from text: algorithms, evaluation and applications. Springer, 2006.
[2] D. Bourigault and G. Lame, "Analyse distributionnelle et structuration de terminologie. Application à la construction d'une ontologie documentaire du droit," Traitement automatique des langues, vol. 43, no. 1, 2002.
[3] C. Chemudugunta, A. Holloway, P. Smyth, and M. Steyvers, "Modeling documents by combining semantic concepts with unsupervised statistical learning," in The Semantic Web - ISWC 2008, ser. Lecture Notes in Computer Science, A. Sheth, S. Staab, M. Dean, M. Paolucci, D. Maynard, T. Finin, and K. Thirunarayan, Eds. Springer Berlin / Heidelberg, 2008, vol. 5318, pp. 229-244.
[4] I. Titov and M. Kozhevnikov, "Bootstrapping semantic analyzers from non-contradictory texts," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ser. ACL '10. Morristown, NJ, USA: Association for Computational Linguistics, 2010, pp. 958-967.
[5] W. Bosma and P. Vossen, "Bootstrapping language neutral term extraction," in Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, Eds. Valletta, Malta: European Language Resources Association (ELRA), May 2010.
[6] C. Fellbaum, "WordNet," in Theory and Applications of Ontology: Computer Applications, R. Poli, M. Healy, and A. Kameas, Eds. Springer Netherlands, 2010, pp. 231-243.
[7] T. Dunning, "Accurate methods for the statistics of surprise and coincidence," Computational Linguistics, vol. 19, pp. 61-74, March 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972450.972454
[8] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. MIT Press, 1999.
[9] K. Spärck-Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11-21, 1972. [Online]. Available: http://www.emeraldinsight.com/journals.htm?articleid=1649768&show=abstract
[10] T. Chung, "A corpus comparison approach for terminology extraction," Terminology, vol. 9, pp. 221-246, 2003. [Online]. Available: http://www.ingentaconnect.com/content/jbp/term/2003/00000009/00000002/art00004
[11] Y. Park, S. Patwardhan, K. Visweswariah, and S. C. Gates, "An empirical analysis of word error rate and keyword error rate," in INTERSPEECH, 2008, pp. 2070-2073.
[12] C. Kit and X. Liu, "Measuring mono-word termhood by rank difference via corpus comparison," Terminology, vol. 14, no. 2, pp. 204-229, 2008.
[13] S. N. Kim, T. Baldwin, and M.-Y. Kan, "Extracting domain-specific words - a statistical approach," in Proceedings of the 2009 Australasian Language Technology Association Workshop, L. Pizzato and R. Schwitter, Eds. Sydney, Australia: Australasian Language Technology Association, December 2009, pp. 94-98. [Online]. Available: www.alta.asn.au/events/alta2009/proceedings/pdf/ALTA2009_12.pdf
[14] L. Teixeira, G. Lopes, and R. Ribeiro, "Automatic extraction of document topics," in Technological Innovation for Sustainability, ser. IFIP Advances in Information and Communication Technology, L. Camarinha-Matos, Ed. Springer Boston, 2011, vol. 349, pp. 101-108. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-19170-1_11
[15] G. Rose, M. Holland, S. Larocca, and R. Winkler, "Semi-automated methods for refining a domain-specific terminology base," U. S. Army Research Laboratory, Adelphi, MD, USA, Tech. Rep. ARL-RP-0311, 2011.
[16] S. Robertson and K. Spärck-Jones, "Relevance weighting of search terms," Journal of American Society for Information Science, vol. 27, no. 3, pp. 129-146, 1976.
[17] W. B. Croft and D. J. Harper, "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, vol. 35, no. 4, pp. 285-295, 1979.
[18] S. E. Robertson and S. Walker, "On relevance weights with little relevance information," SIGIR Forum, vol. 31, pp. 16-24, July 1997. [Online]. Available: http://doi.acm.org/10.1145/278459.258529
[19] A. Lavelli, F. Sebastiani, and R. Zanoli, "Distributional term representations: an experimental comparison," in CIKM, 2004, pp. 615-624.
[20] A. Maedche and S. Staab, "Learning ontologies for the semantic web," in SemWeb, 2001.
[21] T. Bell, I. Witten, and A. Moffat, Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco: Morgan Kaufmann, 1999. [Online]. Available: http://ontology.csse.uwa.edu.au/reference/browse_paper.php?pid=233281449
[22] P. Drouin, "Detection of domain specific terminology using corpora comparison," in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC) 2004, M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, and R. Silva, Eds., ELRA. Lisbon, Portugal: European Language Resources Association, May 2004, pp. 79-82.
[23] G. K. Zipf, The Psycho-Biology of Language - An Introduction to Dynamic Philology. Boston, USA: Houghton-Mifflin Company, 1935.
[24] R. J. Coulthard, "The application of Corpus Methodology to Translation: the JPED parallel corpus and the Pediatrics comparable corpus," Ph.D. dissertation, UFSC, 2005.
[25] L. Lopes and R. Vieira, "Building Domain Specific Corpora in Portuguese Language," Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brasil, Tech. Rep. TR 062, Dezembro 2010.
[26] E. Bick, "The parsing system PALAVRAS: automatic grammatical analysis of Portuguese in constraint grammar framework," Ph.D. dissertation, Arhus University, 2000.
[27] L. Lopes, P. Fernandes, R. Vieira, and G. Fedrizzi, "ExATOlp - An Automatic Tool for Term Extraction from Portuguese Language Corpora," in Proceedings of the 4th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC '09). Faculty of Mathematics and Computer Science of Adam Mickiewicz University, November 2009, pp. 427-431.
[28] L. Lopes and R. Vieira, "Heuristics to improve ontology term extraction," in PROPOR 2012 - International Conference on Computational Processing of Portuguese Language, 2012, submitted.
[29] J. Wermter and U. Hahn, "You can't beat frequency (unless you use linguistic knowledge): a qualitative evaluation of association measures for collocation and term extraction," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ser. ACL-44. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 785-792.
[30] L. Lopes, R. Vieira, M. Finatto, and D. Martins, "Extracting compound terms from domain corpora," Journal of the Brazilian Computer Society, vol. 16, pp. 247-259, 2010, 10.1007/s13173-010-0020-4. [Online]. Available: http://dx.doi.org/10.1007/s13173-010-0020-4