Content

1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
   6.1 Type-token ratio
   6.2 Corpus analysis software III: Corpus Browser
   6.3 Frequency classes
7. Application II: Compounds
8. Co-occurrence analysis
9. Application III: Word senses in lexicography
10. Keyword analysis

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Slide 1]

Frequencies

6.1 Type-token ratio

Lexical frequency counts
In lexical frequency counts, the number of occurrences of particular lexemes, word forms or word groups is computed.

Type-token ratio
The term type-token ratio refers to the quotient of the number of different linguistic entities (types) in a given corpus and the total number of occurrences (tokens) of these types in the corpus.

Type-token ratio (lexemes): number of different lexemes / number of all realizations of the word forms belonging to these lexemes.
Type-token ratio (word forms): number of different word forms / number of all realizations of these word forms.

[Slide 2]
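As a rough illustration of the word-form type-token ratio defined above, the following Python sketch counts types and tokens in a short sample sentence. The tokenizer (lowercasing plus a simple regular expression), the function names and the sample text are assumptions made for this example only; corpus tools may tokenize differently and will therefore report different ratios.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-form tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Return (number of types, number of tokens, types / tokens)."""
    counts = Counter(tokens)
    n_types = len(counts)            # different word forms
    n_tokens = sum(counts.values())  # all occurrences of these word forms
    if n_tokens == 0:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return n_types, n_tokens, n_types / n_tokens


# Invented sample sentence, for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug"
n_types, n_tokens, ttr = type_token_ratio(tokenize(sample))
print(n_types, n_tokens, round(ttr, 3))
```

Note that the ratio depends strongly on text length: as a corpus grows, new types appear ever more slowly while tokens keep accumulating, so ratios are only comparable between samples of similar size.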
Search: frequency list of all word forms and type-token ratio in part of the English corpus of the LCC (newspapers)
Start (no search term)
Sort (here: according to frequency)
Word list (with rank and frequency)
Type-token ratio (here: 69,451 : 2,132,747 ≈ 0.033; word form types)

[Slide 3]

6.2 Software III: Corpus Browser

Corpus analysis software III: Corpus Browser

Developer: Volker Boehlke (University of Leipzig)
Version: 1.00 (Windows)
Search: offline
Software: locally installed
Access: free download
Corpora: integrated into the program; own corpora can be created
Languages: 14 languages (see next slide)
URL: http://corpora.informatik.uni-leipzig.de/download.html

[Slide 4]
6.2 Software III: Corpus Browser

The corpus size is measured by the number of sentences included in the corpus.
When downloaded as plain text files, the corpora can also be used with AntConc.

[Slide 5]

6.3 Frequency classes

Online dictionary Wortschatz (Uni Leipzig)
Frequency classes are determined relative to the frequency of the most frequent word in a corpus.

[Slide 6]
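The slide does not spell out the formula. One common formulation, assumed here, assigns a word to class N when the most frequent word of the corpus is roughly 2^N times more frequent than that word, i.e. N = floor(log2(f_max / f_word)). The rounding convention, the function name and the toy frequencies below are assumptions for illustration, not values taken from the Wortschatz data.

```python
import math


def frequency_class(word_freq, max_freq):
    """Frequency class relative to the most frequent word.

    Assumed formula: floor(log2(max_freq / word_freq)).
    Class 0 is the most frequent word itself; higher classes are rarer words.
    """
    if word_freq <= 0:
        raise ValueError("word frequency must be positive")
    if word_freq > max_freq:
        raise ValueError("word frequency cannot exceed the maximum frequency")
    return math.floor(math.log2(max_freq / word_freq))


# Invented toy frequency list, for illustration only.
freqs = {"the": 1000, "of": 500, "house": 60, "tailcoat": 1}
max_freq = max(freqs.values())
classes = {word: frequency_class(f, max_freq) for word, f in freqs.items()}
print(classes)
```

Because the scale is logarithmic, each step up in class corresponds to roughly halving the frequency, which makes classes easier to compare across corpora of different sizes than raw counts.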
6.3 Frequency classes

French corpus from the Leipzig Corpus Collection
Search term (here: vite)
Results:
   absolute frequency
   frequency class
   corpus examples
   significant left and right neighbors
   co-occurrences

[Slide 7]

Variation-in-time diagrams

Variation-in-time diagrams I: DWDS
Size of phases: 10 years
Corpus size per phase: same size for all phases (10 million running words)
Frequency information: absolute (hits per decade)
Accessibility: via http://www.dwds.de
Example: Frack 'tailcoat'

[Slide 8]
Variation-in-time diagrams II: IDS
Size of phases: 1 year
Corpus size per phase: differs (but very large)
Frequency information: relative to the frequency that would be expected if all hits were distributed evenly over the whole span of time (0-line; computed relative to the corpus size in every phase)
Accessibility: soon via http://www.owid.de/pls/db/p4_module.woerterbuch
Example: Bildschirmschoner 'screensaver'

[Slide 9]

Types of usage gradients
Internet 'internet'
Wellness 'wellness'
Medaillenspiegel 'medal table'
Wiedereinrichter 'farmer' (East G.)

[Slide 10]
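The normalization described for the IDS variation-in-time diagrams above can be sketched as follows. The expected number of hits in a year is taken to be the total number of hits multiplied by that year's share of the whole corpus; the plotted value is then the deviation from this expectation, with 0 meaning "exactly as expected". Expressing the deviation as (observed - expected) / expected is an assumption of this sketch, as are the function name and the invented figures; the slides do not state the exact formula.

```python
def deviation_from_expected(hits, corpus_sizes):
    """Relative deviation of observed hits from an even distribution.

    hits:         {year: number of hits for the search word}
    corpus_sizes: {year: number of running words in that year's subcorpus}
    Returns {year: (observed - expected) / expected}, where 0.0 is the 0-line.
    """
    if set(hits) != set(corpus_sizes):
        raise ValueError("hits and corpus_sizes must cover the same years")
    total_hits = sum(hits.values())
    total_size = sum(corpus_sizes.values())
    if total_hits == 0 or total_size == 0:
        raise ValueError("need at least one hit and a non-empty corpus")

    result = {}
    for year in sorted(hits):
        if corpus_sizes[year] <= 0:
            raise ValueError(f"empty subcorpus for {year}")
        # Hits this year would receive if all hits were spread evenly,
        # weighted by the size of this year's subcorpus.
        expected = total_hits * corpus_sizes[year] / total_size
        result[year] = (hits[year] - expected) / expected
    return result


# Invented figures, for illustration only.
hits = {2000: 10, 2001: 30, 2002: 60}
sizes = {2000: 1_000_000, 2001: 1_000_000, 2002: 2_000_000}
deviation = deviation_from_expected(hits, sizes)
print(deviation)
```

In this toy example 2002 has the most hits in absolute terms, but because its subcorpus is twice as large it ends up only as far above the 0-line as 2001. This is the effect that the DWDS diagrams avoid by keeping the corpus size per phase constant.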
Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Digitalkamera 'digital camera'

[Slide 11]

Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Download 'download'

[Slide 12]
Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Einheitswährung 'uniform currency'

[Slide 13]