Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008: 6.1 Type-token ratio


Content

1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
   6.1 Type-token ratio
   6.2 Corpus analysis software III: Corpus Browser
   6.3 Frequency classes
7. Application II: Compounds
8. Co-occurrence analysis
9. Application III: Word senses in lexicography
10. Keyword analysis

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Slide 1]

Frequencies
6.1 Type-token ratio

Lexical frequency counts
In lexical frequency counts, the number of occurrences of particular lexemes, word forms or word groups is computed.

Type-token ratio
The term type-token ratio refers to the quotient of the number of different linguistic entities (types) in a given corpus and the total number of occurrences of these types (tokens) in the corpus.

Type-token ratio (lexemes): number of different lexemes / number of occurrences of all word forms belonging to these lexemes.
Type-token ratio (word forms): number of different word forms / number of occurrences of all these word forms.

[Slide 2]
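The two ratios defined on slide 2 can be tried out on a toy text. The following is a minimal sketch, not the procedure used in the workshop: the tokenizer is deliberately naive, and the lemma table `TOY_LEMMAS` and the sample sentence are invented for illustration (a real study would use a lemmatizer).

```python
import re

def tokenize(text):
    """Split text into lowercase word-form tokens (naive: runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(tokens):
    """Number of distinct items (types) divided by the number of occurrences (tokens)."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty corpus")
    return len(set(tokens)) / len(tokens)

# Hypothetical toy lemma table mapping word forms to lexemes.
TOY_LEMMAS = {"sings": "sing", "sang": "sing", "birds": "bird"}

def lemmatize(tokens, lemma_table=TOY_LEMMAS):
    """Replace each word form by its lexeme where the table knows one."""
    return [lemma_table.get(token, token) for token in tokens]

sample = "The bird sings. The birds sang."
word_forms = tokenize(sample)    # 6 tokens, 5 different word forms
lexemes = lemmatize(word_forms)  # 6 tokens, 3 different lexemes

ttr_word_forms = type_token_ratio(word_forms)  # 5 / 6
ttr_lexemes = type_token_ratio(lexemes)        # 3 / 6

print(f"TTR (word forms): {ttr_word_forms:.3f}")
print(f"TTR (lexemes):    {ttr_lexemes:.3f}")
```

The lexeme-based ratio is never higher than the word-form ratio for the same text, because several word forms collapse into one lexeme while the token count stays the same.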

Type-token ratio (here: 69451 : 2132747 ≈ 0.033; word form types)

Search: frequency list of all word forms and type-token ratio in part of the English corpus of the LCC (newspapers)
Start (no search term)
Sort (here: according to frequency)
Word list (with rank and frequency)

[Slide 3]

6.2 Corpus analysis software III: Corpus Browser

Corpus Browser
Developer: Volker Boehlke (University of Leipzig)
Version: 1.00 (Windows)
Search: offline
Software: locally installed
Access: free download
Corpora: integrated into the program; own corpora can be created
Languages: 14 languages (see next slide)
URL: http://corpora.informatik.uni-leipzig.de/download.html

[Slide 4]
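A word list with rank and frequency, as shown on slide 3, can be approximated outside the Corpus Browser. The sketch below is an illustration only and does not reproduce the program's own implementation; the function name `frequency_list` and the sample sentence are assumptions. The last lines recompute the ratio quoted on the slide from its two counts.

```python
from collections import Counter

def frequency_list(tokens):
    """Return (rank, word form, frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, form, freq) for rank, (form, freq) in enumerate(ordered, start=1)]

# Invented sample text; a real run would read the tokens from a corpus file.
tokens = "the cat saw the dog and the dog saw the cat".split()
rows = frequency_list(tokens)

for rank, form, freq in rows:
    print(f"{rank:>4}  {form:<10} {freq}")

# Type-token ratio of the sample: number of rows (types) / number of tokens.
sample_ttr = len(rows) / len(tokens)

# The figures from slide 3: 69451 word form types, 2132747 tokens.
slide_ttr = 69451 / 2132747
print(f"slide 3 ratio: {slide_ttr:.3f}")
```

Ties are broken alphabetically here; the ranks of equally frequent word forms are therefore arbitrary and should not be over-interpreted.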

6.2 Corpus analysis software III: Corpus Browser

The corpus size is measured by the number of sentences included in the corpus.
When downloaded as plain text files, the corpora can also be used with AntConc.

[Slide 5]

Frequency classes
6.3 Frequency classes

Online dictionary: Wortschatz (Uni Leipzig)
Frequency classes are determined relative to the frequency of the most frequent word in a corpus.

[Slide 6]
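Slide 6 only states that frequency classes are defined relative to the most frequent word. One common way to make this concrete is a base-2 logarithmic scale: a word falls into class N if the most frequent word is roughly 2^N times as frequent. The sketch below uses that scale; the base-2 logarithm and the rounding rule are assumptions for illustration and are not taken from the slide, and the frequencies are invented.

```python
import math

def frequency_class(word_freq, max_freq):
    """Frequency class on an assumed base-2 scale.

    Class N means the most frequent word is about 2**N times as frequent
    as the given word. Rounding to the nearest integer is an assumption.
    """
    if word_freq <= 0:
        raise ValueError("word frequency must be positive")
    if max_freq < word_freq:
        raise ValueError("max_freq must be the frequency of the most frequent word")
    return math.floor(math.log2(max_freq / word_freq) + 0.5)

# Invented frequencies: the most frequent word occurs 1000 times.
MAX_FREQ = 1000
classes = {freq: frequency_class(freq, MAX_FREQ) for freq in (1000, 500, 250, 10, 1)}

for freq, cls in classes.items():
    print(f"frequency {freq:>5} -> class {cls}")
```

On such a scale the most frequent word is always in class 0, and rarer words receive higher class numbers, so classes stay comparable across corpora of different sizes.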

6.3 Frequency classes

French corpus from the Leipzig Corpus Collection
Search term (here: vite)
Results: absolute frequency, frequency class, corpus examples, significant left and right neighbors, co-occurrences

[Slide 7]

Variation-in-time diagrams

Variation-in-time diagrams I: DWDS
Size of phases: 10 years
Corpus size per phase: same size for all phases (10 million running words)
Frequency information: absolute (hits per decade)
Accessibility: via http://www.dwds.de
Example: Frack 'tailcoat'

[Slide 8]
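Because every DWDS phase has the same corpus size, absolute hits per decade can be compared directly. A minimal sketch of the bucketing step follows; the function name `hits_per_decade` and the hit years are invented for illustration and are not DWDS data.

```python
from collections import Counter

def hits_per_decade(years, first_decade, last_decade):
    """Count hits per 10-year phase, including phases without any hit."""
    counts = Counter((year // 10) * 10 for year in years)
    return {decade: counts.get(decade, 0)
            for decade in range(first_decade, last_decade + 1, 10)}

# Invented publication years of corpus hits for some search word.
hit_years = [1903, 1907, 1912, 1925, 1928, 1929, 1951]
per_decade = hits_per_decade(hit_years, 1900, 1950)

for decade, hits in per_decade.items():
    print(f"{decade}s: {hits}")
```

With phases of unequal size, the same counts would have to be normalized by the number of running words per phase before they could be compared.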

Variation-in-time diagrams II: IDS
Size of phases: 1 year
Corpus size per phase: differs (but very large)
Frequency information: relative to the frequency that would be expected if all hits were distributed evenly over the whole time span (0-line; computed relative to the corpus size in every phase)
Accessibility: soon via http://www.owid.de/pls/db/p4_module.woerterbuch
Example: Bildschirmschoner 'screensaver'

[Slide 9]

Types of usage gradients
Internet 'internet'
Wellness 'wellness'
Medaillenspiegel 'medal table'
Wiedereinrichter 'farmer (East G.)'

[Slide 10]
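The 0-line described for the IDS diagrams on slide 9 can be illustrated as follows: if all hits were spread evenly over the whole time span, each phase would receive a share of the hits proportional to its corpus size. The sketch below expresses the observed hits as a relative deviation from that expectation. The exact scaling used in the IDS diagrams is not stated on the slide, so the relative deviation is an assumption, and all numbers are invented.

```python
def deviation_from_even_distribution(hits, corpus_sizes):
    """Relative deviation of observed hits from the evenly distributed expectation.

    0.0 corresponds to the 0-line; positive values mean more hits than
    expected for the size of that phase, negative values mean fewer.
    """
    total_hits = sum(hits.get(phase, 0) for phase in corpus_sizes)
    total_size = sum(corpus_sizes.values())
    if total_hits == 0:
        raise ValueError("no hits at all: the expectation is undefined")
    deviations = {}
    for phase, size in corpus_sizes.items():
        if size <= 0:
            raise ValueError(f"corpus size for phase {phase} must be positive")
        expected = total_hits * size / total_size
        observed = hits.get(phase, 0)
        deviations[phase] = (observed - expected) / expected
    return deviations

# Invented data: hits per year and running words per year.
hits = {2000: 10, 2001: 30, 2002: 60}
corpus_sizes = {2000: 1_000_000, 2001: 1_000_000, 2002: 2_000_000}
deviations = deviation_from_even_distribution(hits, corpus_sizes)

for year, value in deviations.items():
    print(f"{year}: {value:+.2f}")
```

In this toy example 2002 has twice as many hits as 2001 but also twice the corpus size, so both years end up equally far above the 0-line.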

Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Digitalkamera 'digital camera'

[Slide 11]

Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Download 'download'

[Slide 12]

Usage gradients 1900-2000 (DWDS) and 1990-2008 (IDS)
Einheitswährung 'uniform currency'

[Slide 13]