Web as Corpus: Using Web Data for Linguistic Purposes
Ines Rehbein
NCLT, Dublin City University

Outline
  1 Corpora as linguistic tools
  2 Limitations of web data
  3 Strategies to enhance web data

Corpora as linguistic tools
  "Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list." (Chomsky 1959, 159)
  "What do you think of corpus linguistics?" "It doesn't exist." (Chomsky answering a question by Bas Aarts, reported in a talk at the Corpus Linguistics conference, Freiburg 2001)

Corpora as linguistic tools
  "Corpora crashed into computational linguistics at the 1989 ACL meeting in Vancouver: but they were large, messy, ugly objects clearly lacking in theoretical integrity in all sorts of ways..." (Kilgarriff, 2003)
  The Special Issue of Computational Linguistics on Using Large Corpora (Church and Mercer, 1993) changed the role of corpora in computational linguistics.

Web as corpus
  First publications at ACL 1999
  Since then, the web has been used as a data source for:
    Word Sense Disambiguation (Rigau et al., 2002)
    Machine Translation (Way and Gough, 2003)
    Overcoming data sparseness in Language Modeling (Volk, 2001; Lapata and Keller, 2003)
    Answers for Question-Answering applications (Dumais et al., 2002; Zheng, 2002)
    New instances for Ontologies (Agirre et al., 2000)
    Sublanguage corpora for Translation (Varantola, 2000)
    Language Teaching (Fletcher, 2002)

What is a corpus?
  McEnery and Wilson (1996)
    Sampling and representativeness
    Finite (and fixed) size
    Machine-readable
    Standard reference
  Manning and Schütze (1999)
    A certain amount of data from a certain domain of interest
  Kilgarriff (2003)
    A collection of texts
  Is the Web a corpus?

Requirements for corpus design
  Standardisation
    Comparison/exchange with respect to other corpora
  Flexibility
    Adding new layers of annotation, multimodality
  Detailed linguistic annotation with good search facilities
    Consistency in annotation
  Import/Export
    Add new data, create subcorpora, export search results

Issues in corpus creation
  Where to get the data?
    Accessibility, data sparseness
  How to digitise the data?
    Time-consuming, costly
  How to annotate the data?
    Time-consuming, linguistic decisions, inter-annotator agreement
  How to guarantee representativity and reliability? (Rissanen, 1989)
    The philologist's dilemma
    God's truth fallacy
    Mystery of vanishing reliability
  How to get enough data?
    "There's no data like more data"

Limitations of web data

Web as Solution for Sparse Data Problems?
  Advantages
    Lots of data
    Freely available
    Already digitised
  Disadvantages
    No (reliable) meta-information
    No annotation, no control of the search tool
    No control of precision and recall of search results (essential for quantitative studies)
    No control of contents
    No stability: results cannot be replicated

No control of the search tool
  Problem: no control of indexing and search strategies
  Found on Jean Véronis's blog in Feb 2005:
    "If you type Chirac OR Sarkozy, you get half the number of results of Chirac alone, which may have a political explanation... but is a weird approach to boolean logic."
    "If you search 'the' in the English pages, you get 1% of the number you get for all the languages together. Does this mean that 'the' is 99 times more frequent in languages other than English?"
  (http://aixtal.blogspot.com/2005/02/web-googles-missing-pagesmystery.html)

No control of the search tool
  Indexing and search strategies of a commercial search engine may be modified at any time without notice
  Google: index update with in-depth correction of extrapolation routines and boolean logic (Mar 2005)
  (http://aixtal.blogspot.com/2005/03/google-snapshot-ofupdate.html)

No control of the search tool
  Query          Google IE      Google ALL
  cat            1 190 000     389 000 000
  cat OR cat     1 190 000     465 000 000
  dog              854 000     275 000 000
  dog OR dog       850 000     353 000 000
  cat OR dog     1 400 000     448 000 000
  dog OR cat     1 360 000     454 000 000
  the           15 000 000   5 380 000 000
  the OR the    15 000 000   9 190 000 000
  (Google in November 2006)
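The counts above can be checked against what boolean OR should guarantee. The following is a minimal sketch, not part of the original talk: the function names and the 5% tolerance are assumptions chosen for illustration. It tests that "X OR X" matches "X", that OR is symmetric, and that "cat OR dog" lies between the larger single count and the sum of both.

```python
# Hit counts copied from the table above (Google, November 2006).
COUNTS = {
    "Google IE": {
        "cat": 1_190_000, "cat OR cat": 1_190_000,
        "dog": 854_000, "dog OR dog": 850_000,
        "cat OR dog": 1_400_000, "dog OR cat": 1_360_000,
        "the": 15_000_000, "the OR the": 15_000_000,
    },
    "Google ALL": {
        "cat": 389_000_000, "cat OR cat": 465_000_000,
        "dog": 275_000_000, "dog OR dog": 353_000_000,
        "cat OR dog": 448_000_000, "dog OR cat": 454_000_000,
        "the": 5_380_000_000, "the OR the": 9_190_000_000,
    },
}


def relative_gap(a, b):
    """Difference between two counts as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


def boolean_violations(counts, tolerance=0.05):
    """List the expectations of boolean OR that the counts violate.

    The tolerance is an arbitrary illustrative value (assumption); it
    allows for the rounding that search engines apply to hit estimates.
    """
    problems = []
    # "X OR X" should return the same number of pages as "X".
    for term in ("cat", "dog", "the"):
        if relative_gap(counts[term], counts[f"{term} OR {term}"]) > tolerance:
            problems.append(f"{term} OR {term} != {term}")
    # OR is symmetric.
    if relative_gap(counts["cat OR dog"], counts["dog OR cat"]) > tolerance:
        problems.append("cat OR dog != dog OR cat")
    # A union is at least as large as its larger part and at most the sum.
    lower = max(counts["cat"], counts["dog"])
    upper = counts["cat"] + counts["dog"]
    for query in ("cat OR dog", "dog OR cat"):
        if not lower <= counts[query] <= upper:
            problems.append(f"{query} outside union bounds")
    return problems


if __name__ == "__main__":
    for engine, counts in COUNTS.items():
        print(engine, boolean_violations(counts))
```

Under these assumptions the Google IE column passes every check, while the Google ALL column fails the "X OR X" check for all three words.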

Lots of problems with web data...
  Can we use it at all for linguistic purposes?
  What type of research questions can be answered by using web data?

Example: Productivity of non-medical -itis (Lüdeling and Evert, 2004)
  medical -itis:
    Combines with neoclassical stems denoting body parts
    Semantics: inflammation of X (arthritis, appendicitis)
  non-medical -itis:
    Derived from medical -itis
    Semantics: hysteria or excessively doing something
  "Possibly they are apt to become too ambitious - they rarely succumb to the disease of fontitis but are only too apt to have bad attacks of linkitis and activitis." (BNC, CG9:500)

Example: Productivity of non-medical -itis
  Quantitative: Is word formation with non-medical -itis productive?
  Qualitative: With which bases does non-medical -itis combine?
  Distributional: In which contexts are the resulting complex words used?
  Comparative: What are the differences between the English and the German affix? Is one of them more productive than the other?
  Diachronic: When did non-medical -itis start to appear and what is its development?
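To make the quantitative and qualitative questions concrete, here is a minimal sketch of how -itis candidates might be pulled out of raw text and counted. It is not the procedure of Lüdeling and Evert: the regular expression, the two-word medical stop list and the function names are assumptions for illustration, and a real study would need a proper medical lexicon plus manual checking. Counts of types and of words seen only once (hapaxes) are the usual raw ingredients for productivity estimates.

```python
import re
from collections import Counter

# Any lowercase word ending in -itis (assumption: a plain suffix match).
ITIS_PATTERN = re.compile(r"\b[a-z]+itis\b")

# Tiny illustrative stop list taken from the slide's own examples;
# a real study would need a full list of medical terms.
MEDICAL = {"arthritis", "appendicitis"}

# The BNC example quoted in the talk (BNC, CG9:500).
SAMPLE = (
    "Possibly they are apt to become too ambitious - they rarely succumb "
    "to the disease of fontitis but are only too apt to have bad attacks "
    "of linkitis and activitis."
)


def itis_candidates(text):
    """Count non-medical -itis candidates in a text."""
    words = ITIS_PATTERN.findall(text.lower())
    return Counter(w for w in words if w not in MEDICAL)


def summarise(freqs):
    """Return token, type and hapax counts for a frequency table."""
    return {
        "tokens": sum(freqs.values()),
        "types": len(freqs),
        "hapaxes": sum(1 for n in freqs.values() if n == 1),
    }


if __name__ == "__main__":
    freqs = itis_candidates(SAMPLE)
    print(sorted(freqs))
    print(summarise(freqs))
```

Counts of this kind are only meaningful when the underlying corpus is fixed and its size is known.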

Example: Productivity of non-medical -itis
  Type of study                                        BNC   DWDS   Google
  quantitative (find new types)                        yes   yes    no
  qualitative (find new tokens)                        yes   yes    yes
  distributional (look at context)                     yes   yes    yes
  comparative (meta-data, number of tokens/category)   yes   no     no
  diachronic (date of origin)                          no    yes    no

  BNC: not diachronic, too old
  DWDS: not (yet) stable enough, only accessible through web interface
  Web: no meta-data, no annotation, not stable

Strategies to enhance web data

How to overcome the limitations of web data?
  Two strategies:
  1 Edit the data returned by a search engine
      WebCorp (Kehoe and Renouf, 2002)
      KWiCFinder (Fletcher, 2001)
      The Linguist's Search Engine (Elkiss and Resnik, 2004)
  2 Create your own corpus from the web
      BootCaT (Baroni and Bernardini, 2004)
      Do it yourself: crawling, post-processing, annotating and indexing web data

WebCorp (Kehoe and Renouf, 2002)
  Web-based interface to commercial search engines
  More powerful query syntax (wildcards)
  Output:
    keyword in context
    word frequency lists
    collocation statistics
    source document
  Limitations
    Same as the original search engine (normalisations, stability, lack of control, no meta-information, no linguistic annotation)
    High precision but low recall (the query "I like *ing" returns fewer hits (10) than the BNC (295))
    No random subset of results: the selection depends on search engine ranking (popularity, ...)
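The keyword-in-context output and the wildcard query mentioned above can be illustrated with a small sketch. This is not WebCorp's implementation or query syntax: the function names, the context width and the reading of `*` as "any run of word characters" are assumptions made for the example.

```python
import re


def wildcard_to_regex(query):
    """Turn a query such as 'I like *ing' into a regular expression.

    Assumption: '*' stands for any run of word characters inside one word.
    """
    parts = [re.escape(part) for part in query.split("*")]
    return r"\b" + r"\w*".join(parts) + r"\b"


def kwic(text, query, width=30):
    """Return (left context, match, right context) for every hit."""
    rows = []
    for match in re.finditer(wildcard_to_regex(query), text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


if __name__ == "__main__":
    text = "I like swimming but I like to read. She said I like running."
    for left, hit, right in kwic(text, "I like *ing"):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>30} [{hit}] {right}")
```

Run over a fixed local text, the same query always returns the same hits in document order, which is the property that results taken from a search engine ranking lack.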

BootCaT (Baroni and Bernardini, 2004)
  Create specialised language corpora for terminographical work
  Build general corpora in the size of the BNC (Sharoff, submitted; http://corpus.leeds.ac.uk/internet.html)
  Procedure:
    1 Select initial seeds
    2 Run Google queries
    3 Retrieve corpus
    4 Extract seeds (unigram terms)
    5 Extract multi-word terms
  Properties:
    No meta-information
    Linguistic annotation, control of search results
    Stability, replicability
    Limited in size
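Two of the steps above, turning seeds into queries and extracting new unigram seeds from the retrieved corpus, can be sketched as follows. This is an illustration under stated assumptions, not BootCaT's actual code: the function names, the tuple size, the add-one smoothed frequency ratio and the toy token lists are all hypothetical, and the retrieval step itself (querying a search engine and downloading pages) is left out.

```python
import itertools
import random
from collections import Counter


def seed_queries(seeds, tuple_size=3, n_queries=5, seed=0):
    """Combine seed terms into random tuples, one query string per tuple."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


def extract_seeds(domain_tokens, reference_tokens, top_n=3):
    """Rank words that are relatively more frequent in the retrieved corpus.

    Assumption: a simple relative-frequency ratio with add-one smoothing
    on the reference side; ties are broken alphabetically.
    """
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = len(domain_tokens)
    ref_total = len(reference_tokens)

    def score(word):
        return (dom[word] / dom_total) / ((ref[word] + 1) / (ref_total + 1))

    ranked = sorted(dom, key=lambda word: (-score(word), word))
    return ranked[:top_n]


if __name__ == "__main__":
    print(seed_queries(["corpus", "annotation", "token", "lemma"], tuple_size=2))
    domain = "corpus corpus annotation the the the query".split()
    reference = "the the the the a of in".split()
    print(extract_seeds(domain, reference))
```

The extracted terms can serve as seeds for the next round of queries, so the corpus grows step by step while every query and every retrieved page stays under the user's control.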

WaCky: kool ynitiative
  Informal initiative to rapidly build 1-billion-token proof-of-concept web corpora in 3 languages, and a toolkit to collect, process and exploit such large corpora

Corpora are a useful tool for linguistics, but they have to follow certain design criteria
Linguistic studies based on web corpora are highly problematic
But: simple algorithms using web data often outperform more sophisticated methods based on smaller but controlled data sets
Use the web where it makes sense, but keep the pitfalls in mind!

Thank You! Questions?

References
Baroni, Marco and Silvia Bernardini (2004). BootCaT: Bootstrapping corpora and terms from the Web. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon.
BNC: http://www.natcorp.ox.ac.uk/
Chomsky, Noam (1957). Syntactic structures. The Hague, 159.
Church, Kenneth W. and Robert L. Mercer (1993). Introduction to the special issue on Computational Linguistics using large corpora. Computational Linguistics, 19(1), 1-24.
DWDS: http://www.dwds.de
Elkiss, Aaron and Philip Resnik (2004). The Linguist's Search Engine User's Guide. Available at: http://lse.umiacs.umd.edu:8080/lseuser (March 29, 2005).
Fletcher, William H. (2001). Concordancing the Web with KWiCFinder. In: Proceedings of the 3rd North American Symposium on Corpus Linguistics and Language Teaching, Boston. Draft version: http://kwicfinder.com/fletchercllt2001.pdf (March 22, 2005).
Google: http://www.google.com
Kehoe, Andrew and Antoinette Renouf (2002). WebCorp: Applying the Web to linguistics and linguistics to the Web. In: Proceedings of the WWW 2002 Conference, Honolulu.
Kilgarriff, Adam and Gregory Grefenstette (2003). Introduction to the Special Issue on the, Computational Linguistics Volume 29, Number 3.
Lüdeling, Evert, and Baroni (to appear). Using Web Data for Linguistic Purposes.
Lüdeling, Anke and Stefan Evert (2004). The emergence of productive non-medical -itis: corpus evidence and qualitative analysis. In: Proceedings of the First International Conference on Linguistic Evidence, Tübingen, Germany.
Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
McEnery, Tony and Andrew Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Rissanen, M. (1989). Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16-19.
Sharoff, Serge (submitted). Open-source Corpora: using the net to fish for linguistic data.
WaCky: http://wacky.sslmit.unibo.it/doku.php
Way, A. and N. Gough (2003). Developing and Validating an Example-Based Machine Translation System using the World Wide Web. Computational Linguistics: special issue on.