Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University.
1 Using Web Data for Linguistic Purposes. NCLT, Dublin City University
2 Outline
- Corpora as linguistic tools
- Limitations of web data
- Strategies to enhance web data
3 Corpora as linguistic tools
"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list." (Chomsky 1959, 159)
"What do you think of corpus linguistics?" "It doesn't exist." (Chomsky answering a question by Bas Aarts, reported in a talk at the Corpus Linguistics conference, Freiburg 2001)
5 Corpora as linguistic tools
"Corpora crashed into computational linguistics at the 1989 ACL meeting in Vancouver: but they were large, messy, ugly objects clearly lacking in theoretical integrity in all sorts of ways..." (Kilgarriff, 2003)
The Special Issue of Computational Linguistics on Using Large Corpora (Church and Mercer, 1993) changed the role of corpora in computational linguistics.
7 Web as corpus
First publications at ACL 1999. Since then, the web has been used as a data source for:
- Word Sense Disambiguation (Rigau et al., 2002)
- Machine Translation (Way and Gough, 2003)
- Overcoming data sparseness in Language Modeling (Volk, 2001; Lapata and Keller, 2003)
- Answers for Question-Answering applications (Dumais et al., 2002; Zheng, 2002)
- New instances for Ontologies (Agirre et al., 2000)
- Sublanguage corpora for Translation (Varantola, 2000)
- Language Teaching (Fletcher, 2002)
8 What is a corpus?
- McEnery and Wilson (1996): sampling and representativeness; finite (and fixed) size; machine-readable; standard reference
- Manning and Schütze (1999): a certain amount of data from a certain domain of interest
- Kilgarriff (2003): a collection of texts
Is the Web a corpus?
10 Requirements for corpus design
- Standardisation: comparison and exchange with respect to other corpora
- Flexibility: adding new layers of annotation, multimodality
- Detailed linguistic annotation with good search facilities: consistency in annotation
- Import/Export: add new data, create subcorpora, export search results
11 Issues in corpus creation
- Where to get the data? Accessibility, data sparseness
- How to digitise the data? Time-consuming, costly
- How to annotate the data? Time-consuming, linguistic decisions, inter-annotator agreement
- How to guarantee representativity and reliability? The philologist's dilemma, God's truth fallacy, mystery of vanishing reliability (Rissanen, 1989)
- How to get enough data? "There's no data like more data"
13 Web as Solution for Sparse Data Problems?
Advantages:
- Lots of data
- Freely available
- Already digitised
Disadvantages:
- No (reliable) meta-information
- No annotation, no control of the search tool
- No control of precision and recall of search results (essential for quantitative studies)
- No control of contents
- No stability: results cannot be replicated
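To make the point about precision and recall concrete, here is a minimal sketch of how both are computed when a complete gold set of true instances is available, as it can be for a controlled corpus. On the web the gold set is unknown, which is why recall cannot be controlled there. The function name and the sentence IDs are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved items that are real instances.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of real instances that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sentence IDs, made up for this example.
gold_hits = {1, 2, 3, 4, 5, 6, 7, 8}   # all true instances in a controlled corpus
tool_hits = {1, 2, 3, 99}              # what a search tool returned
p, r = precision_recall(tool_hits, gold_hits)
print(f"precision={p} recall={r}")
```

Without a known `gold_hits` set, only precision can be estimated (by checking a sample of the returned hits by hand).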
15 No control of the search tool
Problem: no control of indexing and search strategies.
Found on Jean Véronis' blog in Feb 2005:
"If you type Chirac OR Sarkozy, you get half the number of results of Chirac alone, which may have a political explanation... but is a weird approach to boolean logic."
"If you search 'the' in the English pages, you get 1% of the number you get for all languages together. Does this mean that 'the' is 99 times more frequent in languages other than English?"
16 No control of the search tool
Indexing and search strategies of a commercial search engine may be modified at any time without notice.
Google: index update with in-depth correction of extrapolation routines and boolean logic (Mar 2005)
17 No control of the search tool
Hit counts compared for Google IE and Google ALL (Google in November 2006); the figures are not preserved in this transcription. Queries:
- cat
- cat OR cat
- dog
- dog OR dog
- cat OR dog
- dog OR cat
- the
- the OR the
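Query pairs such as "cat" versus "cat OR cat" are useful because Boolean logic predicts how their hit counts must relate. The sketch below checks reported counts against three such expectations: A OR A should equal A, A OR B should equal B OR A, and A OR B should lie between max(A, B) and A + B. The function name is hypothetical and the counts are invented for illustration; they are not the figures from the slide.

```python
def check_boolean_consistency(counts):
    """Return a list of violations found in reported hit counts.

    `counts` maps a query string such as "cat" or "cat OR dog" to a hit count.
    """
    problems = []
    for query, n in counts.items():
        if " OR " not in query:
            continue
        left, right = query.split(" OR ", 1)
        if left not in counts or right not in counts:
            continue  # cannot check without the single-term counts
        a, b = counts[left], counts[right]
        if left == right:
            # Idempotence: "A OR A" should return exactly the hits of "A".
            if n != a:
                problems.append(f"{query!r}: expected {a}, got {n}")
            continue
        # Bounds: a union is at least as large as its largest part
        # and at most as large as the sum of its parts.
        if n < max(a, b):
            problems.append(f"{query!r}: {n} is below max({a}, {b})")
        if n > a + b:
            problems.append(f"{query!r}: {n} is above {a} + {b}")
        # Commutativity: report each unordered pair only once.
        swapped = f"{right} OR {left}"
        if left < right and swapped in counts and counts[swapped] != n:
            problems.append(f"{query!r} and {swapped!r} differ: {n} vs {counts[swapped]}")
    return problems


# Invented counts, for illustration only.
reported = {
    "cat": 100,
    "dog": 80,
    "cat OR cat": 60,
    "cat OR dog": 150,
    "dog OR cat": 120,
}
for line in check_boolean_consistency(reported):
    print(line)
```

A check like this only detects inconsistency; it cannot tell which of the reported counts, if any, is the correct one.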
18 Lots of problems with web data...
- Can we use it at all for linguistic purposes?
- What type of research questions can be answered by using web data?
19 Example: Productivity of non-medical -itis (Lüdeling and Evert, 2004)
medical -itis:
- Combines with neoclassical stems denoting body parts
- Semantics: inflammation of X (arthritis, appendicitis)
non-medical -itis:
- Derived from medical -itis
- Semantics: hysteria or excessively doing something
"Possibly they are apt to become too ambitious - they rarely succumb to the disease of fontitis but are only too apt to have bad attacks of linkitis and activitis." (BNC, CG9:500)
21 Example: Productivity of non-medical -itis
- Quantitative: Is word formation with non-medical -itis productive?
- Qualitative: With which bases does non-medical -itis combine?
- Distributional: In which contexts are the resulting complex words used?
- Comparative: What are the differences between the English and the German affix? Is one of them more productive than the other?
- Diachronic: When did non-medical -itis start to appear and what is its development?
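The quantitative question can be illustrated with a small sketch that counts tokens, types and hapax legomena (types occurring once) among -itis words in a text. It also reports the hapax-to-token ratio, a common productivity measure due to Baayen; whether this is the measure used in the cited study is not stated in the slides, so treat it as an assumption. The exclusion list of medical terms is a tiny hypothetical stand-in for the manual filtering a real study would need.

```python
import re
from collections import Counter

# Tiny illustrative exclusion list; a real study needs a checked, far larger one.
MEDICAL = {"arthritis", "appendicitis", "bronchitis"}


def itis_profile(text):
    """Count tokens, types and hapaxes of non-medical -itis words in `text`."""
    words = re.findall(r"\b[a-z]+itis\b", text.lower())
    freq = Counter(w for w in words if w not in MEDICAL)
    tokens = sum(freq.values())
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens": tokens,
        "types": len(freq),
        "hapaxes": hapaxes,
        # Hapax-to-token ratio; 0.0 when no candidate was found.
        "productivity": hapaxes / tokens if tokens else 0.0,
    }


sample = ("They rarely succumb to the disease of fontitis but are only too apt "
          "to have bad attacks of linkitis and activitis.")
print(itis_profile(sample))
```

Counts like these are only meaningful when the corpus size is fixed and known, which is one reason the slides rate search-engine data as unsuitable for quantitative studies.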
22 Example: Productivity of non-medical -itis
Type of study                                        BNC   DWDS  Google
quantitative (find new types)                        yes   yes   no
qualitative (find new tokens)                        yes   yes   yes
distributional (look at context)                     yes   yes   yes
comparative (meta-data, number of tokens/category)   yes   no    no
diachronic (date of origin)                          no    yes   no
Notes:
- BNC: not diachronic, too old
- DWDS: not (yet) stable enough, only accessible through web interface
- Web: no meta-data, no annotation, not stable
24 How to overcome the limitations of web data?
Two strategies:
1 Edit data from the search engine
- WebCorp (Kehoe and Renouf, 2002)
- KWiCFinder (Fletcher, 2001)
- The Linguist's Search Engine (Elkiss and Resnik, 2004)
2 Create your own corpus from the web
- BootCaT (Baroni and Bernardini, 2004)
- Do it yourself: crawling, post-processing, annotating and indexing web data
25 WebCorp (Kehoe and Renouf, 2002)
- Web-based interface to commercial search engines
- More powerful query syntax (wildcards)
- Output: keyword in context, word frequency lists, collocation statistics, source document
Limitations:
- Same as the original search engine (normalisations, stability, lack of control, no meta-information, no linguistic annotation)
- High precision, but low recall (for "I like *ing": 10 hits, fewer than the 295 in the BNC)
- Results are not a random subset but depend on search engine ranking (popularity, ...)
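As an illustration of what a wildcard query with keyword-in-context output involves, here is a minimal sketch over a local string. It is not WebCorp's implementation or query syntax; the function names and the rule that `*` matches any run of word characters inside one token are assumptions made for this example.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a query such as 'I like *ing' into a compiled regex.

    Assumption: '*' stands for any run of word characters within one token.
    """
    parts = [re.escape(tok).replace(r"\*", r"\w*") for tok in pattern.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every match of `pattern` in `text`."""
    regex = wildcard_to_regex(pattern)
    lines = []
    for m in regex.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


text = "Sometimes I like swimming, but I like to read. I like running."
for line in kwic(text, "I like *ing"):
    print(line)
```

Run over a fixed local corpus, a search like this returns every match, so recall is under the researcher's control, unlike hits filtered through a search engine's ranking.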
26 BootCaT (Baroni and Bernardini, 2004)
- Create specialised language corpora for terminographical work
- Build general corpora in the size of the BNC (Sharoff, submitted)
Procedure: select initial seeds, run Google queries, retrieve corpus, extract seeds (unigram terms), extract multi-word terms
Properties:
- No meta-information
- Linguistic annotation, control of search results
- Stability, replicability
- Limited in size
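The seed-and-query steps can be sketched roughly as follows. This is a simplified illustration, not the BootCaT toolkit's code: the function names, the tuple size, the add-one smoothing and the frequency-ratio score are all assumptions, and the search-engine and download steps are left out entirely.

```python
import random
from collections import Counter


def build_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seed terms into random tuples, one search query per tuple."""
    rng = rng or random.Random(0)  # fixed default seed keeps runs repeatable
    pool = sorted(set(seeds))
    return [" ".join(rng.sample(pool, tuple_size)) for _ in range(n_queries)]


def extract_seeds(corpus_tokens, reference_freq, top_k=5, min_count=2):
    """Rank unigrams by how much more frequent they are in the retrieved
    corpus than in a reference frequency list (hypothetical scoring)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    ref_total = sum(reference_freq.values())

    def score(word):
        rel = counts[word] / total
        # Add-one smoothing so unseen reference words do not divide by zero.
        ref_rel = (reference_freq.get(word, 0) + 1) / (ref_total + 1)
        return rel / ref_rel

    candidates = [w for w, c in counts.items() if c >= min_count]
    return sorted(candidates, key=lambda w: (-score(w), w))[:top_k]


# Toy data, invented for illustration.
print(build_queries(["corpus", "crawler", "annotation", "token", "tagger"]))
retrieved = "the corpus of the web corpus crawler the crawler".split()
reference = {"the": 1000, "of": 500, "corpus": 1}
print(extract_seeds(retrieved, reference, top_k=2))
```

In a full loop, the terms returned by `extract_seeds` would feed the next round of `build_queries`, which is what makes the procedure a bootstrap.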
27 WaCky: kool ynitiative
Informal initiative to rapidly build 1-billion-token proof-of-concept web corpora in 3 languages, and a toolkit to collect, process and exploit such large corpora
28 Corpora as linguistic tools
- Corpora are a useful tool for linguistics but have to follow certain design criteria
- Linguistic studies based on web corpora are highly problematic
- But: simple algorithms using web data often outperform more sophisticated methods based on smaller, but controlled data sets
- Use the web where it makes sense, but keep the pitfalls in mind!
29 Thank You! Questions?
30 References (1)
Baroni, Marco and Silvia Bernardini (2004). BootCaT: Bootstrapping corpora and terms from the Web. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon.
BNC:
Chomsky, Noam (1957). Syntactic structures. The Hague, 159.
Church, Kenneth W. and Robert L. Mercer (1993). Introduction to the special issue on Computational Linguistics using large corpora. Computational Linguistics, 19(1).
DWDS:
Elkiss, Aaron and Philip Resnik (2004). The Linguist's Search Engine User's Guide. Available at: (March 29, 2005).
31 References (2)
Fletcher, William H. (2001). Concordancing the Web with KWiCFinder. In: Proceedings of the 3rd North American Symposium on Corpus Linguistics and Language Teaching, Boston. Draft version: (March 22, 2005).
Google:
Kehoe, Andrew and Antoinette Renouf (2002). WebCorp: Applying the Web to linguistics and linguistics to the Web. In: Proceedings of the WWW 2002 Conference, Honolulu.
Kilgarriff, Adam and Gregory Grefenstette (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3).
Lüdeling, Evert, and Baroni (to appear). Using Web Data for Linguistic Purposes.
32 References (3)
Lüdeling, Anke and Stefan Evert (2004). The emergence of productive non-medical -itis: corpus evidence and qualitative analysis. In: Proceedings of the First International Conference on Linguistic Evidence, Tübingen, Germany.
Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
McEnery, Tony and Andrew Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Rissanen, M. (1989). Three problems connected with the use of diachronic corpora. ICAME Journal 13.
Sharoff, Serge (submitted). Open-source Corpora: using the net to fish for linguistic data.
WaCky:
Way, A. and N. Gough (2003). Developing and Validating an Example-Based Machine Translation System using the World Wide Web. Computational Linguistics: special issue on the Web as Corpus.
More informationThe Ups and Downs of Preposition Error Detection in ESL Writing
The Ups and Downs of Preposition Error Detection in ESL Writing Joel R. Tetreault Educational Testing Service 660 Rosedale Road Princeton, NJ, USA JTetreault@ets.org Martin Chodorow Hunter College of CUNY
More informationRadius STEM Readiness TM
Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and
More informationPredicting Students Performance with SimStudent: Learning Cognitive Skills from Observation
School of Computer Science Human-Computer Interaction Institute Carnegie Mellon University Year 2007 Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation Noboru Matsuda
More informationGACE Computer Science Assessment Test at a Glance
GACE Computer Science Assessment Test at a Glance Updated May 2017 See the GACE Computer Science Assessment Study Companion for practice questions and preparation resources. Assessment Name Computer Science
More informationIntra-talker Variation: Audience Design Factors Affecting Lexical Selections
Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationOn document relevance and lexical cohesion between query terms
Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,
More informationA Statistical Approach to the Semantics of Verb-Particles
A Statistical Approach to the Semantics of Verb-Particles Colin Bannard School of Informatics University of Edinburgh 2 Buccleuch Place Edinburgh EH8 9LW, UK c.j.bannard@ed.ac.uk Timothy Baldwin CSLI Stanford
More informationUnsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model
Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.
More informationScienceDirect. Malayalam question answering system
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam
More informationConcepts and Properties in Word Spaces
Concepts and Properties in Word Spaces Marco Baroni 1 and Alessandro Lenci 2 1 University of Trento, CIMeC 2 University of Pisa, Department of Linguistics Abstract Properties play a central role in most
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationThe Use of Statistical, Computational and Modelling Tools in Higher Learning Institutions: A Case Study of the University of Dodoma
International Journal of Computer Applications (975 8887) The Use of Statistical, Computational and Modelling Tools in Higher Learning Institutions: A Case Study of the University of Dodoma Gilbert M.
More informationMULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY
MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract
More informationNew Ways of Connecting Reading and Writing
Sanchez, P., & Salazar, M. (2012). Transnational computer use in urban Latino immigrant communities: Implications for schooling. Urban Education, 47(1), 90 116. doi:10.1177/0042085911427740 Smith, N. (1993).
More informationDeveloping a TT-MCTAG for German with an RCG-based Parser
Developing a TT-MCTAG for German with an RCG-based Parser Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert University of Tübingen, Germany CNRS-LORIA, France LREC 2008,
More informationText-mining the Estonian National Electronic Health Record
Text-mining the Estonian National Electronic Health Record Raul Sirel rsirel@ut.ee 13.11.2015 Outline Electronic Health Records & Text Mining De-identifying the Texts Resolving the Abbreviations Terminology
More informationCross-lingual Text Fragment Alignment using Divergence from Randomness
Cross-lingual Text Fragment Alignment using Divergence from Randomness Sirvan Yahyaei, Marco Bonzanini, and Thomas Roelleke Queen Mary, University of London Mile End Road, E1 4NS London, UK {sirvan,marcob,thor}@eecs.qmul.ac.uk
More informationGetting Started with Deliberate Practice
Getting Started with Deliberate Practice Most of the implementation guides so far in Learning on Steroids have focused on conceptual skills. Things like being able to form mental images, remembering facts
More informationAn Introduction to the Minimalist Program
An Introduction to the Minimalist Program Luke Smith University of Arizona Summer 2016 Some findings of traditional syntax Human languages vary greatly, but digging deeper, they all have distinct commonalities:
More informationCross-Lingual Text Categorization
Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationOCR for Arabic using SIFT Descriptors With Online Failure Prediction
OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,
More informationFinding Translations in Scanned Book Collections
Finding Translations in Scanned Book Collections Ismet Zeki Yalniz Dept. of Computer Science University of Massachusetts Amherst, MA, 01003 zeki@cs.umass.edu R. Manmatha Dept. of Computer Science University
More informationUCEAS: User-centred Evaluations of Adaptive Systems
UCEAS: User-centred Evaluations of Adaptive Systems Catherine Mulwa, Séamus Lawless, Mary Sharp, Vincent Wade Knowledge and Data Engineering Group School of Computer Science and Statistics Trinity College,
More information