Corpus-Based Terminology Extraction
Alexandre Patry and Philippe Langlais

Terminology management is a key component of many natural language processing activities such as machine translation (Langlais and Carl, 2004), text summarization and text indexation. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications.

While the automatic identification of terms from texts has been the focus of past studies (Jacquemin, 2001) (Castellví et al., 2001), the current trend in Terminology Management (TM) has shifted to the issue of term networking (Kageura et al., 2004). A possible explanation of this shift may lie in the fact that Terminology Extraction (TE), although a noisy activity, encompasses well-established techniques that seem difficult to improve upon significantly. Despite this shift, we do believe that better extraction of terms could carry over to the subsequent steps of TM.

A traditional TE system usually involves a subtle mixture of linguistic rules and statistical metrics in order to identify a list of candidate terms in which, it is hoped, the terms are ranked first. We distinguish our approach to TE from traditional ones in two different ways. First, we give back to the user an active role in the extraction process. That is, instead of encoding a static definition of what might or might not be a term, we let the user specify their own. We do so by asking the user to set up a training corpus (a corpus where the terms have been identified by a human) from which our extractor will learn how to define a term. Second, our approach is completely automatic and is readily adapted to the tools (part-of-speech tagger, lemmatizer) and metrics of the user.

One might object that requiring a training corpus is asking the user to do a part of the job the machine is supposed to do, but we see it in a different way. We consider that a little help from the user could pay back in flexibility.

The structure of our paper outlines the three steps involved in our approach. In the following section, we describe our algorithm to identify candidate terms. In the third section, we introduce the different metrics we compute to score them. The fourth section explains how we applied AdaBoost (Freund and Schapire, 1999), a machine learning algorithm, to rank and identify a list of terms. We then evaluate our approach on a corpus which was set up by the Office québécois de la langue française to evaluate commercially available term extractors. We show that our classifier outperforms the individual metrics used in this study. Finally, we discuss some limitations of our approach and propose future work.

Extraction of candidate terms

It is common practice to extract candidate terms using a part-of-speech (POS) tagger and an automaton (a program extracting word sequences corresponding to predefined POS patterns). Usually, those patterns are manually handcrafted and target noun phrases, since most of the terms of interest are noun phrases (Justeson and Katz, 1995). Typical examples of such patterns can be found in (Jacquemin, 2001). As pointed out in (Justeson and Katz, 1995), relying on a POS tagger and legitimate pattern recognition is error-prone, since taggers are not perfect. This might be especially true for very domain-specific texts, where a tagger is likely to be more erratic.

To overcome this problem without giving up the use of POS patterns (since they are easy to design and to use), we propose a way to use a training corpus in order to automate the creation of an automaton. There are many potential advantages with this approach.
First, the POS tagger and its tagging errors, to the extent that they are consistent, will be automatically assimilated by the automaton. Second, this gives the user the opportunity to specify the terms that are of interest to them. If many terms involving verbs are found in the training corpus, the automaton will reflect that interest as well. We also observed in informal experiments that widespread patterns often fail to extract many terms found in our training corpus.

Several approaches can be applied when generating an automaton from sequences of POS tags encountered in a training corpus. A straightforward approach is to memorize all the sequences seen in the training corpus. A sequence of words is thus a candidate term only if its sequence of POS tags has been seen before. This approach is simple but naive. It cannot generate new patterns that are slight variations of the ones seen at training time, and an isolated tagging error can lead to a bad pattern.

To avoid those problems, we propose to generate the patterns using a language model trained on the POS tags of the terms found in the training corpus. A language model is a function computing the probability that a sequence of words has been generated by a certain language. In our case, the words are POS tags and the language is the one recognizing the sequences of tags corresponding to terms. Our language model can be described as follows:

$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid H_i)$$

where $w_1^n$ is a sequence of POS tags and $H_i$ is called the history, which summarizes the information of the $i-1$ previous tags. To build an automaton, we only have to set a threshold and generate all the patterns whose probability is higher than it. An excerpt of such an automaton is given in Figure 1.

Figure 1 Excerpt of an automatically generated automaton. (Columns: Probability, Pattern. The probability values and the row boundaries are missing; the surviving tag sequence reads: NomC AdjQ NomC Prep NomC NomC Dete-dart-ddef NomC NomC Verb-ParPas NomC Prep Dete-dart-ddef NomC.)

Another advantage of such an automaton is that all patterns are associated with a probability, giving more information than a binary value (legitimate or not). Indeed, the POS pattern probability is one of the numerous metrics that we feed our classifier with.

Scoring the candidate terms

In the previous section, we showed a way to generate an automaton that extracts a set of candidate terms that we now want to rank and/or filter. Following many other works on term extraction, we score each candidate using various metrics. Many different ones have been identified in (Daille, 1994) and (Castellví et al., 2001). We do not believe that a single metric is sufficient, but instead think that it is more fruitful to use several of them and train a classifier to learn how to take advantage of each of them. Because we think they are interesting for the task, we retained the following metrics: the frequency, the length, the log-likelihood, the entropy, tf·idf and the POS pattern probabilities discussed in the previous section. Recall however that our approach is not restricted to these metrics, but can instead benefit from any other one that can be computed automatically.

Alone, the frequency is not a robust metric to assess the terminological property of a candidate, but it does carry useful information, as does the length of terms.

In (Dunning, 1993), Dunning advocates the use of log-likelihood to measure whether two events that occur together do so as a coincidence or not. In our case, we want to measure the cohesion of a complex candidate term (a candidate term composed of two words or more) by verifying whether its words occur together as a coincidence or not. The log-likelihood ratio of two adjacent words (u and v) can be computed with the following formula (Daille, 1994):

$$\mathrm{loglike}(u, v) = a \log a + b \log b + c \log c + d \log d + N \log N - (a + c) \log(a + c) - (a + b) \log(a + b) - (c + d) \log(c + d) - (d + b) \log(d + b)$$

where a is the number of times uv appears in the document, b the number of times u appears not followed by v, c the number of times v appears not preceded by u, N the corpus size and d the number of candidate terms that do not involve u or v. Following (Russell, 1998), to compute log-likelihood on candidate terms involving more than two words, we keep the minimum value among the log-likelihoods of each possible split in the candidate term.

With the intuition that terms are coherent units that can appear surrounded by various different words, we also use the entropy to rate a candidate term.
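As a minimal sketch of the log-likelihood computation described above (not the authors' implementation), the following Python function evaluates the formula from the four counts a, b, c and d. Two points are assumptions made for illustration: N is taken to be a + b + c + d, and 0 log 0 is treated as 0. The helper `min_split_log_likelihood` and its `table_for_split` callback are hypothetical names; the callback is expected to return the counts (a, b, c, d) for a given left/right split of a candidate.

```python
import math
from typing import Callable, Sequence, Tuple

Counts = Tuple[float, float, float, float]


def xlogx(x: float) -> float:
    """Return x * log(x), using the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(a: float, b: float, c: float, d: float) -> float:
    """Log-likelihood ratio for two adjacent units u and v.

    a: occurrences of "u v"
    b: occurrences of u not followed by v
    c: occurrences of v not preceded by u
    d: occurrences involving neither u nor v
    Assumption: N = a + b + c + d.
    """
    n = a + b + c + d
    return (
        xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) + xlogx(n)
        - xlogx(a + c) - xlogx(a + b) - xlogx(c + d) - xlogx(d + b)
    )


def min_split_log_likelihood(
    words: Sequence[str],
    table_for_split: Callable[[Tuple[str, ...], Tuple[str, ...]], Counts],
) -> float:
    """Score a candidate of two or more words by its weakest split.

    Every split point divides the candidate into a left and a right part;
    the minimum log-likelihood over all splits is kept.
    """
    if len(words) < 2:
        raise ValueError("log-likelihood needs at least two words")
    scores = []
    for k in range(1, len(words)):
        left, right = tuple(words[:k]), tuple(words[k:])
        scores.append(log_likelihood(*table_for_split(left, right)))
    return min(scores)
```

With counts that are exactly independent (for instance a = b = c = d = 1) the ratio is 0, and it grows as u and v co-occur more often than chance would predict. Taking the minimum over splits means a long candidate is only as cohesive as its weakest internal boundary.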
The entropy of a candidate is computed by averaging its left and right entropy:

$$e(w_1^n) = \frac{e_{left}(w_1^n) + e_{right}(w_1^n)}{2}$$

$$e_{left}(s) = \sum_{\{u \,:\, us \in C\}} h\left(\frac{|us|}{|s|}\right)$$

$$e_{right}(s) = \sum_{\{u \,:\, su \in C\}} h\left(\frac{|su|}{|s|}\right)$$

$$h(x) = x \log x$$

where $w_1^n$ is the candidate term, C is the corpus from which we are extracting the terms, and $|x|$ is taken to denote the number of occurrences of the sequence $x$ in C.

Finally, to weight the salience of a candidate term, we also use tf·idf. This metric is based on the idea that terms describing a document should appear often in it but should not appear in many other documents. It is computed by dividing the frequency of a candidate term by the number of documents in an out-of-domain corpus that contain it. Because tf·idf is usually computed on one word, when we evaluated complex candidate terms, we computed tf·idf on each of their words and kept five values: the first, the last, the minimum, the maximum and the average. In our experiments, the out-of-domain corpus was composed of texts taken from the French Canadian parliamentary debates (the so-called Hansard), totaling 1.4 million sentences.

Identifying terms among candidates

Once each candidate term is scored, we must decide which ones should finally be elected as terms. To accomplish this task, we train a binary classifier (a function which qualifies a candidate as a term or not) on the basis of the scores we computed for a candidate. We use the AdaBoost learning algorithm (Freund and Schapire, 1999) to build this classifier.

AdaBoost is a simple but efficient learning technique that combines many weak classifiers (a weak classifier must be right more than half of the time) into a stronger one. To achieve this, it trains them successively, each time focusing on examples that have been hard to classify correctly by the previous weak classifiers. In our experiments, the weak classifiers were binary stumps (binary classifiers that compare one of the scores to a given threshold to classify a candidate term) and we limited their number to 50. An example of such a classifier is presented in Figure 2.

Experiments

Our community lacks a common benchmark on which we could compare our results with others. In this work, we applied our approach to a corpus called EAU.
It is composed of six texts dealing with water supply. Its complex terms have been listed by some members of the Office québécois de la langue française for a project called ATTRAIT (Atelier de Travail Informatisé du Terminologue) whose main objective was to evaluate existing software solutions for the terminologist [1].

    Input: A scored candidate term c
    α = 0
    if entropy(c) > 1.6 then α = α [value missing] else α = α [value missing]
    if length(c) > 1.6 then α = α [value missing] else α = α [value missing]
    if α > 0 then return term else return not-term

Figure 2 An excerpt from a classifier generated by the AdaBoost learning algorithm.

In our experiments, we kept the preprocessing stage as simple as possible. The corpus and the list of terms were automatically tokenized, lemmatized and POS-tagged with an in-house package (Foster, 1991). Once preprocessed, the EAU corpus is composed of [number missing] words and 208 terms. Of these 208 terms, 186 appear without syntactic variation (as they were listed), for a total of 400 occurrences.

Since the terms of our evaluation corpus are already identified, it is straightforward to compute the precision and the recall of our system. Precision (resp. recall) is the ratio of terms correctly identified by the system over the total number of terms identified as such (resp. over the total number of terms manually identified in the list).

We evaluated our system using five-fold cross-validation. This means that the corpus was partitioned into five subsets and that five experiments were run, each time testing with a different subset and training the automaton and the classifier with the four others. Each training set (resp. testing set) was composed of about [number missing] (resp. 3000) words containing an average of about 150 (resp. 50) terms. Because only complex terms are listed and because we do not consider term variations, our results only consider complex terms that appear without variation. Also, after informal experiments, we set the minimum probability of a pattern to be accepted by our automaton to [value missing]. The performance of our system, averaged over the five folds of the cross-validation, can be found in Table 1.
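The boosted classifier illustrated in Figure 2 and the precision and recall measures defined above can be sketched as follows. This is a minimal, illustrative implementation of discrete AdaBoost over threshold stumps, not the authors' code. The function names (`train_adaboost`, `score`, `is_term`, `precision_recall`) are hypothetical, labels are assumed to be +1 (term) and -1 (not a term), and each candidate is assumed to be a list of numeric scores such as entropy or length.

```python
import math
from typing import List, Sequence, Set, Tuple

# A stump is (feature index, threshold, polarity, weight alpha).
Stump = Tuple[int, float, int, float]


def stump_predict(x: Sequence[float], feature: int,
                  threshold: float, polarity: int) -> int:
    """Vote `polarity` if the chosen score exceeds the threshold, else the opposite."""
    return polarity if x[feature] > threshold else -polarity


def train_adaboost(X: Sequence[Sequence[float]], y: Sequence[int],
                   rounds: int = 50) -> List[Stump]:
    """Train at most `rounds` weighted stumps on labels in {+1, -1}."""
    n = len(X)
    weights = [1.0 / n] * n
    model: List[Stump] = []
    for _ in range(rounds):
        best = None
        # Exhaustively search for the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(w for w, x, label in zip(weights, X, y)
                              if stump_predict(x, f, t, pol) != label)
                    if best is None or err < best[0]:
                        best = (err, f, t, pol)
        err, f, t, pol = best
        if err >= 0.5:  # no stump is better than chance: stop
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((f, t, pol, alpha))
        if err <= 1e-10:  # training data already separated
            break
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_predict(x, f, t, pol))
                   for w, x, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def score(model: List[Stump], x: Sequence[float]) -> float:
    """Weighted sum of stump votes; usable directly for ranking candidates."""
    return sum(alpha * stump_predict(x, f, t, pol) for f, t, pol, alpha in model)


def is_term(model: List[Stump], x: Sequence[float]) -> bool:
    """Accept a candidate when its score is positive."""
    return score(model, x) > 0


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted terms against the manually listed terms."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

In a five-fold setting, one would call `train_adaboost` on four subsets, apply `is_term` to the candidates of the fifth, and pass the accepted candidates and the reference list to `precision_recall`. Returning `score` rather than the binary decision gives a value by which candidates can be sorted.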
From the results, we can see that the automaton has a high recall but a low precision, which was to be expected. Indeed, the automaton is only a rough filter that eliminates easy-to-eliminate word sequences but keeps as many terms as possible. On the other hand, the selection did not perform as well as we expected. Its low recall and precision could be explained by the metrics not being as expressive as we thought and by the fact that 75% of the terms in our test corpora appear only once. When a term appears only once, its frequency and entropy become useless. The results presented in Table 2 seem to confirm our hypothesis.

[1] See [URL missing] for more details on this project.

Table 1 Mean (µ) and standard deviation (σ) of the precision and recall of the different parts of our system. (Rows: precision and recall for Extraction, for Identification and for the Overall system; columns: µ and σ. The numeric values are missing.)

Because we wanted to compare our system with the individual metrics that it uses, we had to modify it such that it ranks the candidate terms instead of simply accepting or rejecting them. To do so, we made our system return α instead of term or not-term (see Figure 2). We then sorted the candidate terms in decreasing order of their α value.

A common practice when comparing ranking algorithms is to build their ROC (receiver operating characteristic) curve, which shows the ratio of good identifications (y axis) against the ratio of bad identifications (x axis) for all the acceptance thresholds. The best curve rises faster in y than in x, and so has a greater area under it.

We can see in Figure 3 that our system performs better than entropy or log-likelihood alone. This leads us to believe that different scores carry different information and that combining them, as we did, is fruitful.

Figure 3 The ROC of our system (AdaBoost) against two other scores when we trained our system on one half of our corpus and tested on the other. A greater area under the curve is better.

Discussion and future work

In this paper, we presented an approach to automatically generate an end-to-end term extractor from a training corpus. We also proposed a way to combine many statistical scores in order to extract terms more efficiently than when each score is used in isolation.

Because of the nature of the training algorithm, we can easily extend the set of metrics we considered here. Even a priori knowledge could be integrated by specifying keywords before the extraction and setting a score to one when a candidate term contains a keyword or zero otherwise. The same flexibility is achieved when the automaton is created. By generating it directly from the output of the POS tagger, our solution does not depend on a particular tagger and is tolerant to consistent tagging errors.

Table 2 Comparison of the performance of the term identification part for candidates appearing with different frequencies. (Rows: precision and recall for candidates appearing once and for candidates appearing at least twice; columns: µ and σ. The numeric values are missing.)

A shortcoming of this work is that we did not treat term variations. Terminology variation is a well-known phenomenon, whose extent is estimated at 15% to 35% according to (Kageura et al., 2004). We think that the best way to deal with variations in our framework would be to introduce a preprocessing stage where they are normalized to a canonical form. Term variations have been extensively studied in (Jacquemin, 2001) and (Daille, 2003).

In our experiments, we focused on complex terms. Because some scores do not apply to simple terms (e.g. log-likelihood and length), we think that the best way to extract simple terms would be to train a dedicated classifier.

Acknowledgements

We would like to thank Hugo Larochelle, who found the corpus we used in our experiments, and Elliott Macklovitch, who made some useful comments on the first draft of this document. This work has been subsidized by NSERC and FQRNT.

References

Castellví, M. Teresa Cabré; Bagot, Rosa Estopà; Palastresi, Jordi Vivaldi; Automatic Term Detection: A Review of Current Systems in Recent Advances in Computational Terminology. John Benjamins, 2001.

Daille, Béatrice; Study and Implementation of Combined Techniques for Automatic Extraction of Terminology in The Balancing Act: Combining Symbolic and Statistical Approaches to Language. New Mexico State University, Las Cruces, 1994.

Daille, Béatrice; Conceptual structuring through term variations in Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003.

Dunning, Ted; Accurate Methods for the Statistics of Surprise and Coincidence, 1993.

Foster, George; Statistical lexical disambiguation, Master Thesis. McGill University, Montreal, 1991.

Freund, Y.; Schapire, R.E.; A Short Introduction to Boosting in Journal of Japanese Society for Artificial Intelligence, 1999.

Jacquemin, Christian; Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001.

Justeson, John S.; Katz, Slava M.; Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text in Natural Language Engineering, 1995.

Kageura, Kyo; Daille, Béatrice; Nakagawa, Hiroshi; Chien, Lee-Feng; Recent Trends in Computational Terminology in Terminology. John Benjamins, 2004.
Langlais, Philippe; Carl, Michael; General-purpose statistical translation engine and domain specific texts: Would it work? in Terminology. John Benjamins, 2004.
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationParsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationVariations of the Similarity Function of TextRank for Automated Summarization
Variations of the Similarity Function of TextRank for Automated Summarization Federico Barrios 1, Federico López 1, Luis Argerich 1, Rosita Wachenchauzer 12 1 Facultad de Ingeniería, Universidad de Buenos
More informationCalibration of Confidence Measures in Speech Recognition
Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationCorpus Linguistics (L615)
(L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationTask Tolerance of MT Output in Integrated Text Processes
Task Tolerance of MT Output in Integrated Text Processes John S. White, Jennifer B. Doyon, and Susan W. Talbott Litton PRC 1500 PRC Drive McLean, VA 22102, USA {white_john, doyon jennifer, talbott_susan}@prc.com
More informationIntroduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition
Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Todd Holloway Two Lecture Series for B551 November 20 & 27, 2007 Indiana University Outline Introduction Bias and
More informationMaximizing Learning Through Course Alignment and Experience with Different Types of Knowledge
Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February
More informationExperiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling
Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad
More informationSwitchboard Language Model Improvement with Conversational Data from Gigaword
Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword
More informationCourse Outline. Course Grading. Where to go for help. Academic Integrity. EE-589 Introduction to Neural Networks NN 1 EE
EE-589 Introduction to Neural Assistant Prof. Dr. Turgay IBRIKCI Room # 305 (322) 338 6868 / 139 Wensdays 9:00-12:00 Course Outline The course is divided in two parts: theory and practice. 1. Theory covers
More informationWhat s in a Step? Toward General, Abstract Representations of Tutoring System Log Data
What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data Kurt VanLehn 1, Kenneth R. Koedinger 2, Alida Skogsholm 2, Adaeze Nwaigwe 2, Robert G.M. Hausmann 1, Anders Weinstein
More informationLecture 1: Machine Learning Basics
1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3
More informationBridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models
Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &
More informationBANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.
More informationA Bayesian Learning Approach to Concept-Based Document Classification
Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors
More informationAccuracy (%) # features
Question Terminology and Representation for Question Type Classication Noriko Tomuro DePaul University School of Computer Science, Telecommunications and Information Systems 243 S. Wabash Ave. Chicago,
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationHow to read a Paper ISMLL. Dr. Josif Grabocka, Carlotta Schatten
How to read a Paper ISMLL Dr. Josif Grabocka, Carlotta Schatten Hildesheim, April 2017 1 / 30 Outline How to read a paper Finding additional material Hildesheim, April 2017 2 / 30 How to read a paper How
More informationMulti-Lingual Text Leveling
Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency
More informationTowards a MWE-driven A* parsing with LTAGs [WG2,WG3]
Towards a MWE-driven A* parsing with LTAGs [WG2,WG3] Jakub Waszczuk, Agata Savary To cite this version: Jakub Waszczuk, Agata Savary. Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]. PARSEME 6th general
More informationRadius STEM Readiness TM
Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and
More informationPrediction of Maximal Projection for Semantic Role Labeling
Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba
More informationTest Effort Estimation Using Neural Network
J. Software Engineering & Applications, 2010, 3: 331-340 doi:10.4236/jsea.2010.34038 Published Online April 2010 (http://www.scirp.org/journal/jsea) 331 Chintala Abhishek*, Veginati Pavan Kumar, Harish
More informationLife and career planning
Paper 30-1 PAPER 30 Life and career planning Bob Dick (1983) Life and career planning: a workbook exercise. Brisbane: Department of Psychology, University of Queensland. A workbook for class use. Introduction
More informationIntegrating Semantic Knowledge into Text Similarity and Information Retrieval
Integrating Semantic Knowledge into Text Similarity and Information Retrieval Christof Müller, Iryna Gurevych Max Mühlhäuser Ubiquitous Knowledge Processing Lab Telecooperation Darmstadt University of
More informationNumeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C
Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C Using and applying mathematics objectives (Problem solving, Communicating and Reasoning) Select the maths to use in some classroom
More informationA New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation
A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation SLSP-2016 October 11-12 Natalia Tomashenko 1,2,3 natalia.tomashenko@univ-lemans.fr Yuri Khokhlov 3 khokhlov@speechpro.com Yannick
More information*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN
From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,
More informationEvaluation of Usage Patterns for Web-based Educational Systems using Web Mining
Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl