Domain term relevance through tf-dcf

Lucelene Lopes, PPGCC - FACIN, PUCRS University, Porto Alegre - Brazil, lucelene.lopes@pucrs.br
Paulo Fernandes, PPGCC - FACIN, PUCRS University, Porto Alegre - Brazil, paulo.fernandes@pucrs.br
Renata Vieira, PPGCC - FACIN, PUCRS University, Porto Alegre - Brazil, renata.vieira@pucrs.br

Abstract

This paper proposes a new index for the relevance of terms extracted from domain corpora. We call it term frequency, disjoint corpora frequency (tf-dcf), and it is based on the absolute term frequency of each term tempered by its frequency in other (contrasting) corpora. Conceptual differences and the mathematical computation of the proposed index are discussed with respect to other similar approaches that also take the frequency in contrasting corpora into account. To illustrate the efficiency of the tf-dcf index, this paper evaluates the application of this index and of other similar approaches.

I. INTRODUCTION

The automatic extraction of terms from texts is a well mapped task, but the automatic choice of which extracted terms are relevant for a specific domain is a much more challenging task. Finding the most relevant terms for a domain, i.e., the domain concepts, is an important step for knowledge engineering tasks such as ontology learning from texts [1]. Some classical linguistic-based work in this area suggests the use of distributional analysis [2] to associate terms and then establish which of them are good concept candidates. A different approach, yet following the same idea of inferring concepts from term association, is taken by Chemudugunta et al. [3], where the identification of concepts is made through pure statistical measures tempered by previously inserted human information. The work of Titov and Kozhevnikov [4] also follows this line of research by inferring semantic relations among terms in order to identify different terms representing the same concept in sets of small documents (weather forecasts) with no linguistic annotation. The work of Bosma and Vossen [5] presents a similar effort to establish term relevance measures considering a multiple corpora resource.
This work proposes different relevance measures of terms for each corpus, but Bosma and Vossen's relevance measure of a term in a given corpus does not affect the relevance of this same term in other corpora. In fact, the methodology proposed in their work accesses WORDNET [6] in order to validate the term candidates according to their measures, but also to establish relations (hypernym, hyponym, meronym, etc.) among them.

In opposition to these efforts, this paper proposes an approach that is not linguistic-based; it relies only on the statistical information gathered from the domain corpus to establish a numerical measure of term relevance in this corpus. Therefore, this paper's approach is aligned with works that take into account the term frequency in documents to compute a relevance index that establishes how representative a term extracted from a corpus is of the domain represented by this corpus. Some examples of such statistical-based approaches are the works of Dunning in 1993 [7], which proposes the use of the log-likelihood ratio, Manning and Schütze in 1999 [8], which proposes a composition of tf-idf (term frequency, inverse document frequency [9], adapted for term relevance in a corpus), and other initiatives based on computing indices from one specific corpus only. However, our claim is that those typical indices fail to rule out terms which are not particularly relevant to a target domain.

The basic idea behind approaches like the one in our paper is the assumption that a term's relevance to a specific domain can only be established by comparison with corpora from other domains, called contrasting corpora. One of the first examples of previous work similar to our own was the work of Chung in 2003 [10]. More recently, more sophisticated versions were proposed by Park et al. in 2008 [11] with the domain specificity index, by Kit and Liu in 2008 with the termhood index [12], and by Kim et al. in 2009 with the term frequency, inverse domain frequency index [13]. These approaches brought some quality to term extraction, as was verified by the works of Teixeira et al. [14] and Rose et al. [15].
Similar to our proposal, all these previous works followed the same principle: to compute a relevance index that is directly proportional to the term's absolute frequency in the corpus and inversely proportional to the term's absolute frequency in other corpora. The main difference between these similar previous works [11], [12], [13] and our own is the specific formula used to weigh the influence of the frequency in other corpora.

This paper's first contribution resides in drawing a panorama of options of indices to express the relevance of terms extracted from a domain corpus, focusing on indices that also take into account corpora of other domains (contrasting corpora). Some experiments illustrate the benefits of approaches using contrasting corpora over traditional indices. Secondly, and most important, this paper contributes the proposal of a new relevance index, called tf-dcf, that is, according to our experiments, superior to the other indices based on contrasting corpora. This contribution is enhanced by the analysis of the tf-dcf behavior against different options of contrasting corpora.

It is not the goal of this paper to analyze techniques to improve the quality of term extraction itself, since we assume that a previously performed extraction provides a set of extracted terms. It is also out of the scope of this paper to analyze how many terms should be considered concepts of a domain. Our purpose is to present arguments and experiments showing that the proposed index is effective in ranking extracted terms according to their relevance for the domain, thus allowing the identification of domain concept candidates.

This paper is organized as follows: Section II describes the existing statistical measures that are compared to our proposed tf-dcf index; Section III presents our paper's main contribution, which is the proposal of a term relevance index based on the inclusion of a disjoint corpora frequency (dcf) component; Section IV evaluates the existing and proposed indices. Finally, the Conclusion stresses the contributions and limitations of this paper, leading to the proposition of future works.

II. EXISTING MEASURES FOR RELEVANCE ESTIMATION

The most elementary way to establish the statistical relevance of terms extracted from a domain specific corpus is to compute the absolute frequency of terms, i.e., how many times each term occurs in the corpus. Obviously, this simple approach is very fragile, since a very frequent term is not necessarily relevant for the domain. This fact is especially noticeable with simple extraction methods, although even sophisticated linguistic-based methods suffer from using such a simple criterion. For example, pure statistical methods require the adoption of a list of highly frequent grammatical words (stop list). Without a stop list, any pure statistical method delivers terms with very low significance, such as prepositions and usual expressions. However, it might be very difficult to establish an exhaustive stop list in advance for different domains and genres. The use of term frequency as relevance measure is a little less harmful for extraction methods taking into account linguistic information.
For example, the syntactic annotation of a corpus allows the extraction procedure to avoid terms that are unsuitable for concept names, such as verbs and pronouns. In fact, more sophisticated linguistic analysis, such as the identification of noun phrases, may significantly improve the quality of extraction, but even in these cases the use of term frequency does not prevent the incorrect extraction of common expressions which are not domain specific. For example, the quite common expression "future work" may be found in several academic texts, but it is hardly considered a defining concept of any scientific domain.

Nevertheless, the starting point of all sophisticated indices is the simple absolute frequency. Assuming $tf_{t,d}$ as the number of occurrences of term $t$ in document $d$, and $D^{(c)}$ the set of all documents belonging to the corpus $c$ referring to a specific domain, the absolute term frequency of a term $t$ in corpus $c$ is expressed by:

$$ tf_t^{(c)} = \sum_{d \in D^{(c)}} tf_{t,d} \tag{1} $$

A. Term frequency and inverse document frequency - tf-idf

An alternative to plain term frequency is to take into account the frequency of the term among documents. The seminal work of Spärck-Jones [9] shows the importance of considering frequent terms, but also non-frequent ones, in order to retrieve documents. These ideas lead to the well-known Robertson and Spärck-Jones probabilistic model of term relevance to specific documents [16]. Croft and Harper [17], and later Robertson and Walker [18], proposed formulations of a popular index that takes positively into account the term frequency (tf), i.e., the number of occurrences of a given term $t$ in a document $d$, and also considers negatively the number of documents of the corpus where term $t$ appears at least once, i.e., the inverse document frequency (idf). This index, called tf-idf, has many formulations, e.g., [19], [20], [8], but in this paper we will consider the formulation adopted by Bell et al. [21].
The tf-idf index is mathematically defined, for each term $t$ and each document $d$ belonging to a corpus $c$ that has at least one occurrence of $t$, as follows:

$$ \text{tf-idf}_{t,d} = \underbrace{\left(1 + \log(tf_{t,d})\right)}_{\text{tf part}} \; \underbrace{\log\left(1 + \frac{|D^{(c)}|}{|D_t^{(c)}|}\right)}_{\text{idf part}} \tag{2} $$

where $tf_{t,d}$ is the number of occurrences of term $t$ in document $d$; $D^{(c)}$ is the set of all documents of a given corpus $c$; and $D_t^{(c)}$ is the subset of these documents where $t$ appears at least once.

Equation (2) exposes the term frequency (tf) and the inverse document frequency (idf) parts. The tf part considers the logarithmic frequency of the term, since the variation of term occurrences approaches an exponential distribution, i.e., a term that occurs 10 times is not 10 times more important than a term that appears only once. Nevertheless, the former is an order of magnitude more important than the latter. The idf part represents a value that varies from $\log(2)$, for a term that appears in all documents, to $\log(1 + |D^{(c)}|)$, for a term that appears in only one document.

The idea behind the tf-idf formulation is that a term is more relevant as a keyword for a document $d$ if it appears many times in this document and very few times (or ideally none) in other documents. This is an important distinction for information retrieval. The popularity of this index is justified mostly because it prevents frequent terms spread over many documents from being considered more relevant than they should be. Indeed, tf-idf is an effective measure to identify the defining terms of documents, because it spots terms that are good for document indexation.

The use of tf-idf to establish the relevance of terms to domain corpora was proposed by Manning and Schütze [8]. According to these authors, a possible index to express the relevance of a term $t$ in a corpus $c$ is expressed by:

$$ \text{tf-idf}_t^{(c)} = \sum_{d \in D_t^{(c)}} \text{tf-idf}_{t,d} \tag{3} $$
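As an illustration of Eqs. 1 to 3, the following minimal Python sketch computes the absolute frequency and the aggregated tf-idf of each term. It is not the authors' implementation: the function names (`tf_corpus`, `tf_idf_corpus`), the representation of a document as a dictionary of term counts, the use of the natural logarithm (the paper does not state the base), and the toy counts are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def tf_corpus(docs):
    """Eq. 1: absolute frequency of each term, summed over all documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)  # adds the per-document counts term by term
    return total

def tf_idf_corpus(docs):
    """Eqs. 2 and 3: per-document tf-idf, summed over documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # number of documents in which each term occurs
    scores = defaultdict(float)
    for doc in docs:
        for term, count in doc.items():
            tf_part = 1 + math.log(count)                     # natural log (assumption)
            idf_part = math.log(1 + n_docs / doc_freq[term])
            scores[term] += tf_part * idf_part
    return dict(scores)

# Toy corpus with invented counts: one dictionary of term counts per document.
docs = [
    {"breast feeding": 10, "current study": 1},
    {"current study": 1},
]
tf = tf_corpus(docs)
tfidf = tf_idf_corpus(docs)
print(tf["breast feeding"], tf["current study"])
print(round(tfidf["breast feeding"], 3), round(tfidf["current study"], 3))
```

In this toy corpus the term that occurs in every document receives the minimum idf part, log(2), in each document, which matches the lower bound discussed for Eq. 2.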

B. Term domain specificity - tds

The first initiatives to consider the relevance of terms to a domain corpus taking into account a contrastive generic corpus, or corpora, include the works of Chung in 2003 [10] and Drouin in 2004 [22]. However, to the best of the authors' knowledge, the work of Park et al. [11], in 2008, offers one of the first formulations of an index to express term relevance to a specific domain. In that work, such index is called domain specificity, and it is expressed as the ratio between the probability of occurrence of a term in a domain corpus $c$ and the probability of this same term in a generic corpus. Park et al.'s definition of the domain specificity of a term $t$ for a specific domain corpus $c$, considering a generic domain corpus $g$, was expressed as:

$$ tds_t^{(c)} = \frac{p_t^{(c)}}{p_t^{(g)}} = \frac{tf_t^{(c)} / N^{(c)}}{tf_t^{(g)} / N^{(g)}} \tag{4} $$

where $p_t^{(c)}$ expresses the probability of occurrence of term $t$ in corpus $c$; and $N^{(c)}$ is the total number of terms in corpus $c$, i.e., $N^{(c)} = \sum_t tf_t^{(c)}$.

C. Termhood - thd

Following the approach of considering, besides the domain corpus of interest, a contrasting corpus, the work of Kit and Liu in 2008 [12] proposes an index called termhood. This index, like Park et al.'s term domain specificity, follows the idea that a term relevant to a domain is more frequent in the domain corpus than in other corpora. The main difference brought by this work is to consider the term rank in the corpus vocabulary (the set of all terms in the corpus), instead of the term absolute frequency. Kit and Liu's definition of the termhood index of a term $t$ for a corpus $c$, considering a generic domain corpus $g$ (called background corpus by them), was expressed by:

$$ thd_t^{(c)} = \underbrace{\frac{r_t^{(c)}}{|V^{(c)}|}}_{\text{norm. rank value in } c} - \underbrace{\frac{r_t^{(g)}}{|V^{(g)}|}}_{\text{norm. rank value in } g} \tag{5} $$

where $V^{(c)}$ is the vocabulary of corpus $c$, i.e., $|V^{(c)}|$ is the cardinality of the set of all terms in the corpus $c$, and $r_t^{(c)}$ is the rank value of term $t$, expressed as $|V^{(c)}|$ for the most frequent term, $|V^{(c)}| - 1$ for the second most frequent, and so on until the least frequent term, for which $r_t^{(c)} = 1$.
The termhood index can be seen as the difference between the normalized rank value of the term in the domain corpus $c$ and in the generic domain corpus $g$. Actually, the division of the rank value by the vocabulary size is intended to keep the normalized rank value within the interval (0, 1], with a value equal to 1 for the most frequent term, and the other terms decaying, according to their frequency, toward 0. As a result, the termhood index will be within the interval [-1, 1], ranging from a value equal to 1 for the most frequent term in $c$, if it does not belong to vocabulary $V^{(g)}$, to a value of -1 for the most frequent term in $g$, if it does not belong to vocabulary $V^{(c)}$.

D. Term frequency, inverse domain frequency - TF-IDF

Recently, Kim et al. [13] proposed in 2009 another index to rank term relevance considering the original idea of the tf-idf index, which was to identify whether a term is suitable to represent a document. In this way, Kim et al. did not actually propose a new index; instead, they proposed the use of the same tf-idf formulation, but considering the set of documents of a corpus as a single document. To avoid confusion, we will refer to this index with the acronym TF-IDF in uppercase, to differentiate it from the term frequency, inverse document frequency (tf-idf). The TF-IDF index for term $t$ in a corpus $c$, considering a set of corpora $G$, as proposed by Kim et al., is numerically expressed by:

$$ \text{TF-IDF}_t^{(c)} = \underbrace{\frac{tf_t^{(c)}}{\sum_{t'} tf_{t'}^{(c)}}}_{\text{TF part}} \; \underbrace{\log\left(\frac{|G|}{|G_t|}\right)}_{\text{IDF part}} \tag{6} $$

where $tf_t^{(c)}$ is the term frequency of term $t$ in corpus $c$; $G$ is the set of all domain corpora; and $G_t$ is the subset of $G$ where the term $t$ appears at least once.

It is important to notice that the basic formulation of tf-idf used as inspiration by Kim et al.'s proposal is not as robust as the one of Bell et al. (Eq. 3). For instance, if a term appears in all corpora, the IDF part of Eq. 6 will become 0, and therefore, such a term will have a TF-IDF index also equal to 0, i.e., it will be considered less relevant than any other term, regardless of its number of occurrences.
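The three contrastive indices of Eqs. 4 to 6 can be sketched in Python as follows. This is only a hedged illustration: the function names, the dictionaries of term counts, the toy numbers, and the natural logarithm are assumptions. Two further choices are not specified by the formulas and are made here only for the example: `tds` returns infinity for a term absent from the generic corpus (Eq. 4 is undefined in that case), and `thd` uses a normalized rank of 0 for a term absent from a vocabulary and breaks frequency ties alphabetically.

```python
import math

def tds(term, tf_c, tf_g):
    """Eq. 4: ratio of the relative frequencies in domain corpus c and generic corpus g."""
    p_c = tf_c[term] / sum(tf_c.values())
    p_g = tf_g.get(term, 0) / sum(tf_g.values())
    return math.inf if p_g == 0 else p_c / p_g  # infinity for exclusive terms (assumption)

def rank_values(tf):
    """Rank value r: |V| for the most frequent term down to 1 for the least frequent."""
    ordered = sorted(tf, key=lambda t: (tf[t], t))  # ascending; ties broken by name (assumption)
    return {term: i + 1 for i, term in enumerate(ordered)}

def thd(term, tf_c, tf_g):
    """Eq. 5: difference of the normalized rank values in c and g."""
    r_c = rank_values(tf_c).get(term, 0) / len(tf_c)  # 0 if absent (assumption)
    r_g = rank_values(tf_g).get(term, 0) / len(tf_g)
    return r_c - r_g

def tf_idf_domain(term, tf_c, corpora):
    """Eq. 6: relative frequency in c times log(|G| / |G_t|); corpora is G and includes c."""
    tf_part = tf_c[term] / sum(tf_c.values())
    containing = sum(1 for tf in corpora if tf.get(term, 0) > 0)
    return tf_part * math.log(len(corpora) / containing)

# Toy term counts (invented): one domain corpus and two contrasting corpora.
ped = {"breast feeding": 8, "current study": 4}
sm = {"current study": 3, "markov chain": 9}
dm = {"current study": 2, "decision tree": 6}

for term in ped:
    print(term, tds(term, ped, sm), thd(term, ped, sm),
          tf_idf_domain(term, ped, [ped, sm, dm]))
```

With these toy counts, the term that occurs in every corpus receives a TF-IDF of exactly 0 whatever its frequency in the domain corpus, which is the robustness issue pointed out for Eq. 6.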
Another important difference between Equations 3 and 6 is that Bell et al.'s formulation (Eq. 3) uses the log of the absolute term frequency in the tf part, while Kim et al.'s (Eq. 6) considers directly a relative term frequency.

III. PROPOSED INDEX

The goal of all indices presented in the previous section is to obtain higher numeric values for terms that are relevant to a given domain, or, for more recent knowledge engineering tasks [14], [15], terms that are suitable candidates for concepts of an ontology. The raw term absolute frequency (Eq. 1) obviously indicates relevance, since a term that is very frequent is likely to be important to the domain. The tf-idf index (Eq. 3) can also be indicative of relevance, since terms that are very distinctive of some documents of the corpus are also likely to be representative of the domain.

The tds (Eq. 4), thd (Eq. 5) and TF-IDF (Eq. 6) indices have a better chance of identifying concepts of a domain because they use contrasting corpora. Nevertheless, these indices adopt different approaches that reveal distinct empirical initiatives to tackle the concept identification problem.

The first difference is how these indices take the occurrences of terms in the domain corpus into account. The tds (Eq. 4) and TF-IDF (Eq. 6) indices compute a relative frequency of the term, since the term probability ($p_t^{(c)}$) for tds and the TF part for TF-IDF are computed as the absolute frequency divided

by the total number of terms in the domain corpus. The thd (Eq. 5) index, however, computes a normalized rank value that, even though computed according to the absolute frequency, delivers a linear relation (footnote 1) among all terms.

The second difference resides in the effect brought by the occurrence of terms in contrasting corpora. The tds (Eq. 4) index penalizes the terms that occur in the contrasting corpora by dividing their probability in the domain corpus by their probability in the contrasting corpora. The thd (Eq. 5) index also penalizes the terms that occur in the contrasting corpora, but in this case it subtracts the normalized rank value in the contrasting corpora from the normalized rank value in the domain corpus. The approach of the TF-IDF (Eq. 6) index is quite different, since it rewards the terms that are unique to the domain corpus by multiplying the relative frequency by the log of the number of corpora. Such reward decreases as the term appears in other contrasting corpora, until it drops to 0 when the term appears in all corpora. It is important to notice that this reward decreases with the number of corpora in which the term appears, but it is independent of the number of term occurrences in contrasting corpora.

We propose a new index to estimate the term relevance to a domain following the same idea of contrasting corpora, but we propose differences in the way term occurrences in the domain corpus are taken into account, and most of all, in the effect brought by occurrences in the contrasting corpora. Specifically, we propose a representation of this effect called disjoint corpora frequency (dcf), which is a mathematical way to penalize terms that appear in contrasting corpora proportionally to their number of occurrences, as well as to the number of contrasting corpora in which the term appears.

A. Term frequency, disjoint corpora frequency - tf-dcf

Our proposal, like other contrasting corpora approaches, is based on a primary indication of term relevance and a reward/penalization mechanism. The basis of the tf-dcf index is to consider the absolute frequency as the primary indication of term relevance.
Then, we choose to penalize terms that appear in the contrasting corpora by dividing their absolute frequency in the domain corpus by a geometric composition of their absolute frequency in each of the contrasting corpora. The tf-dcf index is mathematically expressed, for term $t$ in corpus $c$, considering a set of contrasting corpora $G$, as:

$$ \text{tf-dcf}_t^{(c)} = \frac{tf_t^{(c)}}{\prod_{g \in G} \left(1 + \log\left(1 + tf_t^{(g)}\right)\right)} \tag{7} $$

The choice of absolute frequency as primary indication of term relevance for corpus $c$, instead of using a relative frequency (like tds and TF-IDF) or term rank (like thd), aims at the simplicity of the measure for two main reasons:

- We do not consider that there is a need for the linearization brought by the use of the term rank, as in the thd index, nor is there a need to make explicit the normalization according to the corpus size, as in tds and TF-IDF; in fact, any normalization according to the corpus size still remains possible after the tf-dcf computation;
- We consider that keeping a relation with the absolute term frequency preserves the intuitive comprehension of the index, since the tf-dcf numeric value will be smaller than tf (if the term appears in the contrasting corpora) or equal to tf (if the term does not appear in the contrasting corpora).

Footnote 1: It is important to recall that the distribution of absolute frequency values is likely to follow a Zipf law [23], i.e., the most frequent term is likely to have twice the number of occurrences of the second, three times the number of occurrences of the third, and so on.

The geometric composition of absolute frequencies in the contrasting corpora chosen to express the penalization, i.e., the divisor in Eq.
7, tries to encompass the following assumptions:

- The number of occurrences of a term in each of the contrasting corpora is distributed according to a Zipf law [23], and to correctly estimate its importance, a linearization of this number of occurrences must be made;
- A term that appears only in the domain corpus should not be penalized at all, i.e., terms that do not occur in the contrasting corpora must have the divisor equal to 1; and
- A term that appears in many corpora is more likely to be irrelevant to the domain corpus than terms that appear in fewer corpora.

Because of the first assumption, we choose to apply a log function to the absolute frequency in each contrasting corpus ($tf_t^{(g)}$). This decision follows the same principle adopted in the original proposition of the tf-idf measure by Robertson and Spärck-Jones [16]. The second assumption made us adapt this log function with the addition of the value 1 inside and outside the log function, in order to deliver a value equal to 1 when the number of occurrences of a term in a contrasting corpus is equal to 0. This decision follows the same principle adopted by Bell et al. [21] to express their formulation of the tf-idf measure. Finally, the third assumption led us to employ the product of the log of occurrences in each contrasting corpus. The product represents that the importance of occurrences grows geometrically as the term appears in other corpora. In fact, according to our formulation, a term is more likely to be irrelevant for a domain corpus when it appears a few times in many contrasting corpora than when it appears many times in just a few contrasting corpora. Additionally, the product is compatible with the idea of having a divisor equal to 1 when a term appears only in the domain corpus.

IV. PRACTICAL RESULTS

The practical application of the proposed index is meant to illustrate its effectiveness and some basic characteristics of tf-dcf according to the contrasting corpora used. The experiments were conducted over Brazilian Portuguese corpora, using a linguistic-based term extraction tool to provide terms and their number of occurrences.
Nevertheless, corpora in any language, submitted to any kind of extraction, could be employed without any loss of generality.
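Since the index only needs terms and their numbers of occurrences, Eq. 7 is simple to compute. The following Python sketch is a minimal illustration under stated assumptions, not the authors' code: the function name `tf_dcf`, the dictionaries of term counts, the invented toy counts, and the natural logarithm are all choices made for the example.

```python
import math

def tf_dcf(term, tf_c, contrasting):
    """Eq. 7: absolute frequency in c divided by the product, over the contrasting
    corpora, of 1 + log(1 + frequency of the term in that corpus)."""
    divisor = 1.0
    for tf_g in contrasting:
        divisor *= 1 + math.log(1 + tf_g.get(term, 0))  # equals 1 when the term is absent
    return tf_c[term] / divisor

# Toy counts (invented). "spread" and "concentrated" both occur 8 times in total
# in the contrasting corpora, but distributed differently.
domain = {"exclusive": 50, "spread": 100, "concentrated": 100}
contrasting = [
    {"spread": 2},
    {"spread": 2},
    {"spread": 2},
    {"spread": 2, "concentrated": 8},
]

for term in domain:
    print(term, round(tf_dcf(term, domain, contrasting), 2))
```

The toy data exercises the assumptions behind the divisor: the term found only in the domain corpus keeps its absolute frequency, and a term occurring a few times in every contrasting corpus is penalized more heavily than a term with the same total number of occurrences concentrated in a single contrasting corpus.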

A. The chosen corpora

The chosen test bed was one corpus from the Pediatrics domain [24] with 281 documents from The Brazilian Journal on Pediatrics. This corpus (PED) was chosen because of the availability of reference lists of relevant terms. Four other scientific corpora were used as support for the definition of specific Pediatrics terms. These corpora have approximately 1 million words each and their domains are: Stochastic modeling (SM), Data mining (DM), Parallel processing (PP) and Geology (GEO) [25]. Tab. I summarizes the information about these corpora.

Table I
CORPORA CHARACTERISTICS.

corpus | documents | sentences | words
Pediatrics (PED) | 281 | | ,412
Stochastic Modeling (SM) | 88 | 44,222 | 1,173,401
Data Mining (DM) | 53 | 42,932 | 1,127,816
Parallel Processing (PP) | 62 | 40,928 | 1,086,771
Geology (GEO) | | ,461 | 2,010,527

B. Extraction tools

The extraction of terms and their frequencies was made by a two-step process. First, the documents were annotated by the Portuguese parser PALAVRAS [26]. Then the PALAVRAS output, i.e., a set of TigerXML files, was submitted to the ExATOlp term extractor [27]. The joint application of PALAVRAS and ExATOlp delivers high quality term lists, since the extracted terms are noun phrases found in the corpus, together with their frequencies. The extracted noun phrases were filtered according to ExATOlp heuristic rules aiming at the output of noun phrases as meaningful as possible. These heuristics range from the simple exclusion of articles to quite ingenious ones like the detection of implicit noun phrases (footnote 2) [28].

C. Extracted terms and reference lists

The extracted terms were divided into two lists, bigrams and trigrams. Single terms and those with more than three words were not considered in the evaluation, since they were not included in the hand-made reference list constructed by the terminology laboratory TEXTECC (http://www6.ufrgs.br/textecc/). The reference lists were produced by a careful and laborious process that involved terminologists, domain specialists (Pediatricians) and academic students.
These lists are available for download at the TEXTECC website and they have been used for practical applications including glossary construction, translation aid, and even ontology construction. These reference lists are composed of 1,534 bigrams and 2,660 trigrams and they can also be consulted at http://ontolp.inf.pucrs.br/ontolp/downloads-ontolplista.php.

The full extracted term lists delivered by PALAVRAS and ExATOlp for the Pediatrics corpus were composed of 15,483 distinct bigrams and 18,171 distinct trigrams.

Footnote 2: Implicit noun phrases are, for example, "sick children" and "healthy children", which can be extracted from the sentence "Sick and healthy children can be treated."

For each of these lists the computed indices were:

- tf, the absolute term frequency (Eq. 1);
- tf-idf, the term frequency, inverse document frequency (Eq. 3), with the basic formulation from Bell et al. [21] aggregated with the sum proposed by Manning and Schütze [8], used as an example of an index not using contrasting corpora;
- tds, the term domain specificity (Eq. 4) proposed by Park et al. [11];
- thd, the termhood (Eq. 5) proposed by Kit and Liu [12];
- TF-IDF, the term frequency, inverse domain frequency (Eq. 6) proposed by Kim et al. [13]; and
- tf-dcf, the term frequency, disjoint corpora frequency (Eq. 7) proposed in the previous section of this paper.

D. The impact of different measures on frequent terms

Observing in detail some terms in the extracted lists, it is possible to have a better understanding of the effect of each index, and, therefore, of the benefits brought by tf-dcf as relevance index. Tab. II presents the top ten frequent terms, i.e., the ten terms with the most absolute occurrences in the Pediatrics corpus. This table shows the number of occurrences of the term in each corpus, i.e., Pediatrics (PED), Stochastic modeling (SM), Data mining (DM), Parallel processing (PP) and Geology (GEO). Additionally, the last column (ref. list) indicates whether the term belongs (IN) or not (OUT) to the reference list.

Table II
OCCURRENCES FOR FREQUENT TERMS FROM PEDIATRICS CORPUS.

term in Portuguese (translation) | PED | SM | DM | PP | GEO | ref. list
aleitamento materno (breast feeding) | | | | | | IN
recém nascido (new born) | | | | | | IN
faixa etária (age slot) | | | | | | IN
presente estudo (current study) | | | | | | OUT
leite materno (mother's milk) | | | | | | IN
idade gestacional (gestational age) | | | | | | IN
ventilação mecânica (mechanical ventilation) | | | | | | IN
via aérea (airway) | | | | | | IN
pressão arterial (blood pressure) | | | | | | IN
sexo masculino (male sex) | | | | | | OUT

The same ten most frequent terms are also shown in Tab. III with their rank according to each of the six presented indices. For example, in the third row of Tab. III, the term faixa etária ("age slot" in English) belongs to the reference list and it is ranked as the third term in the lists sorted by the term frequency (tf - Eq. 1) and by the term frequency, inverse document frequency (tf-idf - Eq. 3). In the lists sorted by the other indices this term is ranked as the 13,281st (for tds - Eq. 4), the fourth (for thd - Eq. 5), the sixth (for TF-IDF - Eq. 6), and the fifteenth (for tf-dcf - Eq. 7).

Table III
ANALYSIS OF FREQUENT TERMS FROM PEDIATRICS CORPUS.

term in Portuguese (translation) | tf (Eq. 1) | tf-idf (Eq. 3) | tds (Eq. 4) | thd (Eq. 5) | TF-IDF (Eq. 6) | tf-dcf (Eq. 7)
aleitamento materno (breast feeding) | 1st | 1st | 1st | 1st | 1st | 1st
recém nascido (new born) | 2nd | 2nd | 1st | 2nd | 2nd | 2nd
faixa etária (age slot) | 3rd | 3rd | 13,281st | 4th | 6th | 15th
presente estudo (current study) | 4th | 4th | 13,429th | 42nd | 57th | 1,276th
leite materno (mother's milk) | 5th | 5th | 1st | 3rd | 3rd | 3rd
idade gestacional (gestational age) | 6th | 7th | 1st | 5th | 4th | 4th
ventilação mecânica (mechanical ventilation) | 7th | 6th | 1st | 6th | 5th | 5th
via aérea (airway) | 8th | 8th | 1st | 7th | 7th | 6th
pressão arterial (blood pressure) | 9th | 19th | 1st | 8th | 8th | 7th
sexo masculino (male sex) | 10th | 9th | 13,318th | 14th | 35th | 543rd

Observing the rank differences between the lists sorted by the term frequency (tf - Eq. 1) and by the term frequency, inverse document frequency (tf-idf - Eq. 3), we noticed an important similarity. The only significant change occurs for the term pressão arterial ("blood pressure"), which drops from the 9th to the 19th position. However, this change does not correspond to a meaningful downgrade, since this term seems to be as relevant to Pediatrics as, for instance, via aérea ("airway"). In contrast, the quite generic term presente estudo ("current study") is not affected at all by tf-idf.

Observing the effect brought by the term domain specificity index (tds - Eq. 4), we realize its lack of precision, since it assigns the same top rank to all terms that are exclusive to the Pediatrics corpus. Consequently, the terms that appear in other corpora are cast out of any list of relevant terms, since, given the contrasting corpora (SM, DM, PP and GEO), there are more than 13,000 terms appearing only in the Pediatrics corpus. The terms faixa etária ("age slot"), presente estudo ("current study") and sexo masculino ("male sex") are all ranked beyond the 13,000th position.

The list sorted by the termhood index (thd - Eq. 5) shows the downgrade effect on the three terms appearing in the contrasting corpora (faixa etária, presente estudo and sexo masculino, the grey rows in Tabs. II and III).
However, these terms are not sent very low, since even the term presente estudo ("current study"), which is very frequent in the contrasting corpora (72 occurrences), is downgraded only to the 42nd position.

The list sorted according to the term frequency, inverse domain frequency index (TF-IDF - Eq. 6) shows a stronger effect than the termhood (thd - Eq. 5), since it is based on the number of contrasting corpora in which the term appears. In consequence, the term faixa etária ("age slot") drops to the sixth position because it also appears in the Data Mining corpus, while the term presente estudo ("current study") drops to the 57th position because it appears in all corpora but Geology.

It is important to call the reader's attention to the fact that our proposed index (tf-dcf - Eq. 7) is the only one that takes into account both the number of occurrences in the contrasting corpora (as termhood and term domain specificity do) and the number of corpora in which the term appears (as term frequency, inverse domain frequency does). For that reason, the downgrade effect in the list sorted according to our index is the strongest one. Our index casts out the term presente estudo ("current study") to the 1,276th position, while it significantly downgrades the term sexo masculino ("male sex") to the 543rd position. In contrast, the term faixa etária ("age slot") is mildly downgraded from the third to the fifteenth position.

V. CONCLUSION

This paper presented a novel numerical index to estimate the relevance of extracted terms with respect to a specific domain. The inclusion of the disjoint corpora frequency (dcf) component successfully improved the precision of extracted lists in comparison with the traditional tf and tf-idf, but also with other indices based on comparison with contrasting corpora, namely term domain specificity [11], termhood [12] and term frequency, inverse domain frequency [13]. The proposed dcf approach was described here in composition with the absolute frequency (tf), and it has the advantage of keeping semantics analogous to those of the original absolute frequency index.
If a given term does not appear in other corpora, its tf-dcf index will be equal to the term frequency, i.e., only terms appearing in other corpora will be numerically downgraded. This is not the case for any of the other pre-existing measures.

Our proposal is the follow-up to initial studies based on the comparison with contrasting corpora. This intuitive idea was initially proposed during the last 10 years [10], [22], [29], [11], [12], [13], [15], but, to the best of the authors' knowledge, our proposal is the first one to pay attention to a correct weighting of the influence of occurrences of terms in contrasting corpora. Specifically, our tf-dcf index formulation considers the product of the log of the number of occurrences in other corpora as a reductive factor for the absolute term frequency in the domain corpus. This choice is justified by the fact that term occurrences are likely to be distributed according to a Zipf law [23]. In Park et al. [11] this fact was ignored. In Kit and Liu [12] this fact was approached by the rank difference. In Kim et al. [13] this fact was approached by the relative term frequency and the logarithm in the IDF part. Therefore, our formulation seems to be mathematically more robust.

The main limitation of the current study is the lack of thorough experiments with other corpora. We chose to limit our experiments to the studied corpora because there was no sign of availability of the data sets previously employed by other authors. Nevertheless, since the objective of this paper is to propose the tf-dcf index, the experimentation of our proposal on a statistically significant set of corpora remains as natural future work. Such future work will demand the analysis of the proposed tf-dcf index, in comparison with other indices, in terms of numerical measures, such as precision, and the gathering of corpora and corresponding reference lists.

Another valid future work is the study of heuristics to choose a good cut-off point to apply to the extracted term lists.
With the use of a simple index of relevance, like the absolute term frequency, the cut-off point choice seems simple, since it is enough to define a minimum number of term occurrences. However, with a more sophisticated one, such as the tf-dcf index proposed here, it is a little less obvious how to define a meaningful and effective cut-off point [30].

REFERENCES

[1] P. Cimiano, Ontology learning and population from text: algorithms, evaluation and applications. Springer.
[2] D. Bourigault and G. Lame, Analyse distributionnelle et structuration de terminologie. Application à la construction d'une ontologie documentaire du droit, Traitement automatique des langues, vol. 43, no. 1.
[3] C. Chemudugunta, A. Holloway, P. Smyth, and M. Steyvers, Modeling documents by combining semantic concepts with unsupervised statistical learning, in The Semantic Web - ISWC 2008, ser. Lecture Notes in Computer Science, A. Sheth, S. Staab, M. Dean, M. Paolucci, D. Maynard, T. Finin, and K. Thirunarayan, Eds. Springer Berlin / Heidelberg, 2008, vol. 5318.
[4] I. Titov and M. Kozhevnikov, Bootstrapping semantic analyzers from non-contradictory texts, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ser. ACL '10. Morristown, NJ, USA: Association for Computational Linguistics, 2010.
[5] W. Bosma and P. Vossen, Bootstrapping language neutral term extraction, in Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, Eds. Valletta, Malta: European Language Resources Association (ELRA), May.
[6] C. Fellbaum, WordNet, in Theory and Applications of Ontology: Computer Applications, R. Poli, M. Healy, and A. Kameas, Eds. Springer Netherlands, 2010.
[7] T. Dunning, Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, vol. 19, March. [Online]. Available: http://dl.acm.org/citation.cfm?id=
[8] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. MIT Press.
[9] K. Spärck-Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, vol. 28, no. 1. [Online]. Available: http:// journals.htm?articleid= &show=abstract
[10] T. Chung, A corpus comparison approach for terminology extraction, Terminology, vol. 9. [Online]. Available: http:// /art00004
[11] Y. Park, S. Patwardhan, K. Visweswariah, and S. C. Gates, An empirical analysis of word error rate and keyword error rate, in INTERSPEECH, 2008.
[12] C. Kit and X. Liu, Measuring mono-word termhood by rank difference via corpus comparison, Terminology, vol. 14, no. 2.
[13] S. N. Kim, T. Baldwin, and M.-Y. Kan, Extracting domain-specific words - a statistical approach, in Proceedings of the 2009 Australasian Language Technology Association Workshop, L. Pizzato and R. Schwitter, Eds. Sydney, Australia: Australasian Language Technology Association, December 2009. [Online]. Available: ALTA2009_12.pdf
[14] L. Teixeira, G. Lopes, and R. Ribeiro, Automatic extraction of document topics, in Technological Innovation for Sustainability, ser. IFIP Advances in Information and Communication Technology, L. Camarinha-Matos, Ed. Springer Boston, 2011, vol. 349. [Online]. Available: http://dx.doi.org/ / _11
[15] G. Rose, M. Holland, S. Larocca, and R. Winkler, Semi-automated methods for refining a domain-specific terminology base, U. S. Army Research Laboratory, Adelphi, MD, USA, Tech. Rep. ARL-RP-0311.
[16] S. Robertson and K. Spärck-Jones, Relevance weighting of search terms, Journal of American Society for Information Science, vol. 27, no. 3.
[17] W. B. Croft and D. J. Harper, Using probabilistic models of document retrieval without relevance information, Journal of Documentation, vol. 35, no. 4.
[18] S. E. Robertson and S. Walker, On relevance weights with little relevance information, SIGIR Forum, vol. 31, July. [Online]. Available: http://doi.acm.org/ /
[19] A. Lavelli, F. Sebastiani, and R. Zanoli, Distributional term representations: an experimental comparison, in CIKM, 2004.
[20] A. Maedche and S. Staab, Learning ontologies for the semantic web, in SemWeb.
[21] T. Bell, I. Witten, and A. Moffat, Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco: Morgan Kaufmann. [Online]. Available: http://ontology.csse.uwa.edu.au/reference/browse_paper.php?pid=
[22] P. Drouin, Detection of domain specific terminology using corpora comparison, in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC) 2004, M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, and R. Silva, Eds., ELRA. Lisbon, Portugal: European Language Resources Association, May 2004.
[23] G. K. Zipf, The Psycho-Biology of Language - An Introduction to Dynamic Philology. Boston, USA: Houghton-Mifflin Company.
[24] R. J. Coulthard, The application of Corpus Methodology to Translation: the JPED parallel corpus and the Pediatrics comparable corpus, Ph.D. dissertation, UFSC.
[25] L. Lopes and R. Vieira, Building Domain Specific Corpora in Portuguese Language, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brasil, Tech. Rep. TR 062, Dezembro.
[26] E. Bick, The parsing system PALAVRAS: automatic grammatical analysis of Portuguese in constraint grammar framework, Ph.D. dissertation, Arhus University.
[27] L. Lopes, P. Fernandes, R. Vieira, and G. Fedrizzi, ExATOlp - An Automatic Tool for Term Extraction from Portuguese Language Corpora, in Proceedings of the 4th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 09). Faculty of Mathematics and Computer Science of Adam Mickiewicz University, November 2009.
[28] L. Lopes and R. Vieira, Heuristics to improve ontology term extraction, in PROPOR 2012 International Conference on Computational Processing of Portuguese Language, 2012, submitted.
[29] J. Wermter and U. Hahn, You can't beat frequency (unless you use linguistic knowledge): a qualitative evaluation of association measures for collocation and term extraction, in Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ser. ACL-44. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006.
[30] L. Lopes, R. Vieira, M. Finatto, and D. Martins, Extracting compound terms from domain corpora, Journal of the Brazilian Computer Society, vol. 16, 2010. [Online]. Available: http://dx.doi.org/ /s


More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

Integration of ICT in Teaching and Learning

Integration of ICT in Teaching and Learning Integration of ICT in Teaching and Learning Dr. Pooja Malhotra Assistant Professor, Dept of Commerce, Dyal Singh College, Karnal, India Email: pkwatra@gmail.com. INTRODUCTION 2 st century is an era of

More information

A Case-Based Approach To Imitation Learning in Robotic Agents

A Case-Based Approach To Imitation Learning in Robotic Agents A Case-Based Approach To Imitation Learning in Robotic Agents Tesca Fitzgerald, Ashok Goel School of Interactive Computing Georgia Institute of Technology, Atlanta, GA 30332, USA {tesca.fitzgerald,goel}@cc.gatech.edu

More information

Pre-vocational training. Unit 2. Being a fitness instructor

Pre-vocational training. Unit 2. Being a fitness instructor Pre-vocational training Unit 2 Being a fitness instructor 1 Contents Unit 2 Working as a fitness instructor: teachers notes Unit 2 Working as a fitness instructor: answers Unit 2 Working as a fitness instructor:

More information

Mandarin Lexical Tone Recognition: The Gating Paradigm

Mandarin Lexical Tone Recognition: The Gating Paradigm Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition

More information

PowerTeacher Gradebook User Guide PowerSchool Student Information System

PowerTeacher Gradebook User Guide PowerSchool Student Information System PowerSchool Student Information System Document Properties Copyright Owner Copyright 2007 Pearson Education, Inc. or its affiliates. All rights reserved. This document is the property of Pearson Education,

More information

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011 CAAP Content Analysis Report Institution Code: 911 Institution Type: 4-Year Normative Group: 4-year Colleges Introduction This report provides information intended to help postsecondary institutions better

More information

LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume 11 : 12 December 2011 ISSN

LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume 11 : 12 December 2011 ISSN LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume ISSN 1930-2940 Managing Editor: M. S. Thirumalai, Ph.D. Editors: B. Mallikarjun, Ph.D. Sam Mohanlal, Ph.D. B. A. Sharada, Ph.D.

More information

Proof Theory for Syntacticians

Proof Theory for Syntacticians Department of Linguistics Ohio State University Syntax 2 (Linguistics 602.02) January 5, 2012 Logics for Linguistics Many different kinds of logic are directly applicable to formalizing theories in syntax

More information

Project in the framework of the AIM-WEST project Annotation of MWEs for translation

Project in the framework of the AIM-WEST project Annotation of MWEs for translation Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment

More information

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

More information

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer

More information

Written by: YULI AMRIA (RRA1B210085) ABSTRACT. Key words: ability, possessive pronouns, and possessive adjectives INTRODUCTION

Written by: YULI AMRIA (RRA1B210085) ABSTRACT. Key words: ability, possessive pronouns, and possessive adjectives INTRODUCTION STUDYING GRAMMAR OF ENGLISH AS A FOREIGN LANGUAGE: STUDENTS ABILITY IN USING POSSESSIVE PRONOUNS AND POSSESSIVE ADJECTIVES IN ONE JUNIOR HIGH SCHOOL IN JAMBI CITY Written by: YULI AMRIA (RRA1B210085) ABSTRACT

More information

Welcome to the Purdue OWL. Where do I begin? General Strategies. Personalizing Proofreading

Welcome to the Purdue OWL. Where do I begin? General Strategies. Personalizing Proofreading Welcome to the Purdue OWL This page is brought to you by the OWL at Purdue (http://owl.english.purdue.edu/). When printing this page, you must include the entire legal notice at bottom. Where do I begin?

More information

Grade 2 Unit 2 Working Together

Grade 2 Unit 2 Working Together Grade 2 Unit 2 Working Together Content Area: Language Arts Course(s): Time Period: Generic Time Period Length: November 13-January 26 Status: Published Stage 1: Desired Results Students will be able to

More information

Best Practices in Internet Ministry Released November 7, 2008

Best Practices in Internet Ministry Released November 7, 2008 Best Practices in Internet Ministry Released November 7, 2008 David T. Bourgeois, Ph.D. Associate Professor of Information Systems Crowell School of Business Biola University Best Practices in Internet

More information

Efficient Online Summarization of Microblogging Streams

Efficient Online Summarization of Microblogging Streams Efficient Online Summarization of Microblogging Streams Andrei Olariu Faculty of Mathematics and Computer Science University of Bucharest andrei@olariu.org Abstract The large amounts of data generated

More information

J j W w. Write. Name. Max Takes the Train. Handwriting Letters Jj, Ww: Words with j, w 321

J j W w. Write. Name. Max Takes the Train. Handwriting Letters Jj, Ww: Words with j, w 321 Write J j W w Jen Will Directions Have children write a row of each letter and then write the words. Home Activity Ask your child to write each letter and tell you how to make the letter. Handwriting Letters

More information

ENGLISH. Progression Chart YEAR 8

ENGLISH. Progression Chart YEAR 8 YEAR 8 Progression Chart ENGLISH Autumn Term 1 Reading Modern Novel Explore how the writer creates characterisation. Some specific, information recalled e.g. names of character. Limited engagement with

More information

Association Between Categorical Variables

Association Between Categorical Variables Student Outcomes Students use row relative frequencies or column relative frequencies to informally determine whether there is an association between two categorical variables. Lesson Notes In this lesson,

More information