Harnessing Keyness: Corpus-based Approach to ESP Material Development


John Blake
Japan Advanced Institute of Science and Technology

Concordancers often provide an option to generate lists of keywords. Keywords are the words that occur disproportionately more frequently in a particular text type (e.g. business English) compared to another text type (e.g. general English). This is one way of distinguishing technical or domain-specific words from general words. Novice users of concordancers tend to expect every tool to produce the same keyword list, yet the lists generated differ significantly. This paper shows how keyword lists are affected by the choices of concordancer, reference corpus and statistical test. ESP materials developers can use this knowledge to make a more informed choice of the variables so that the most appropriate keyword list for the target audience can be created.

Introduction

The identification of words that deserve inclusion in teaching materials is a difficulty that many materials developers face. There are many factors to consider in the selection of vocabulary, such as frequency, appropriacy, expediency, need and level. The most frequent words in a text are relatively easy to identify, but are not necessarily the most useful words to highlight in ESP materials. This is because grammatical words and high frequency general words are likely to occupy the top positions. Words that are key, however, are likely to merit inclusion.

Concordancers can be harnessed to identify the frequency and keyness of vocabulary. Simply put, keyness measures how disproportionately often a word occurs in a particular text type. Keyness is assessed by comparing the relative frequency of a word in a focus corpus with its relative frequency in a reference corpus using a statistical test. Words that are key are called keywords (Scott, 1997). Novice users may expect all concordancers to produce the same keyword list for a text. However, this is not the case. Different concordancers, reference corpora and statistical tests result in radically different keyword lists.

Concordancers can be classified into four generations (McEnery and Hardie, 2012), although the first two generations are now obsolete. Fourth generation concordancers can deal with large corpora and are far more powerful than third generation concordancers (Table 1). Some concordancers provide options to upload a reference corpus to which the focus corpus can be compared, while others provide a range of corpora from which the user can select. Concordancers may have a default statistical test (e.g. chi-squared in AntConc) or provide alternatives for the user to select from.

Keyword list generation is underpinned by comparing the ratios of words occurring in the focus and reference corpora using statistical tests. Kilgarriff (2012, p.5) highlights two statistical problems. The first is how to handle division by zero when a word does not occur in the reference corpus. The second is how to prevent words that occur rarely in the reference corpus from dominating the list. Different tests use different methods to address these issues.

Insert Table 1 about here
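The two problems noted by Kilgarriff can be illustrated with a minimal sketch of the "simple maths" ratio (Kilgarriff, 2009), in which a constant n is added to the per-million frequency in both corpora before dividing. The function names, the corpus sizes and the word counts below are hypothetical, and the values of n are chosen only for illustration; the paper does not state which values correspond to the named settings in Sketch Engine.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def simple_maths_keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Ratio of smoothed per-million frequencies: (f_focus + n) / (f_ref + n)."""
    if n <= 0:
        raise ValueError("n must be positive so the denominator is never zero")
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + n) / (ref_fpm + n)


# Hypothetical corpus sizes and counts, not taken from the paper's data.
FOCUS_SIZE = 2_000_000
REF_SIZE = 1_000_000
RARE_WORD = (40, 0)            # (focus count, reference count): absent from the reference corpus
COMMON_WORD = (20_000, 2_000)  # frequent in both corpora

for n in (1, 100, 1000):
    rare = simple_maths_keyness(RARE_WORD[0], FOCUS_SIZE, RARE_WORD[1], REF_SIZE, n)
    common = simple_maths_keyness(COMMON_WORD[0], FOCUS_SIZE, COMMON_WORD[1], REF_SIZE, n)
    print(f"n={n}: rare word {rare:.2f}, common word {common:.2f}")
```

Because n is always added to the denominator, a word that is absent from the reference corpus no longer causes division by zero. With a small n the rare word receives the higher score, while with a large n the common word does, which is one way in which a single parameter can shift a keyword list towards rarer or more common vocabulary.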

This paper explores how the choice of concordancer, reference corpus and statistical test generates different lists of keywords. Materials developers can use this knowledge to make more informed choices of which vocabulary to focus on in their tailor-made materials.

Method

A corpus of texts comprising all the research articles published in the International Business Review (IBR) from February 2010 to October 2013 was manually collected and concatenated into a single text file. Table 2 shows the composition of this focus corpus.

Insert Table 2 about here

The three variables (concordancers, reference corpora and statistical tests) were each tested in turn. A popular third generation concordancer, AntConc 3.2.4w (Anthony, 2012), and a popular fourth generation concordancer, Sketch Engine (Kilgarriff et al., 2014), were selected for comparison. The raw frequency word count for each concordancer was first calculated. Keyword lists were generated using the British Academic Written English (BAWE) corpus and the Brown corpus in Sketch Engine (Table 3). A keyword list was then generated using the Brown corpus in AntConc. This was undertaken using three different statistical tests in Sketch Engine and two different tests in AntConc. The keyword lists were then evaluated from the perspective of an ESP materials developer.

Findings

Insert Table 3 about here

Findings regarding each of the three variables are described, interpreted and evaluated in the following sections.

Concordancers

Raw frequency counts in AntConc and Sketch Engine rank the top ten words in the same order, yet only the count for "that" is identical (Table 4). This raw word count difference can be accounted for by differences in the operational definition of a word and the process of tokenization. Anthony (2013) notes that Wordsmith Tools and AntConc count contractions differently, e.g. "we'll" is counted as one word in Wordsmith, but two words in AntConc.
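The effect of the operational definition of a word on raw counts can be seen in a small sketch. The two tokenizers below are hypothetical simplifications written for illustration; they are not the tokenization rules actually used by Wordsmith Tools, AntConc or Sketch Engine.

```python
import re


def tokenize_keep_contractions(text):
    """Treat an apostrophe inside a word as part of the token, so "we'll" is one token."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def tokenize_split_contractions(text):
    """Treat the apostrophe as a word boundary, so "we'll" becomes "we" and "ll"."""
    return re.findall(r"[A-Za-z]+", text)


# A made-up sentence, used only to show that the same text yields different counts.
sample = "We'll show that firms don't export."

kept = tokenize_keep_contractions(sample)
split = tokenize_split_contractions(sample)

print(len(kept), kept)
print(len(split), split)
```

The same sentence yields a different number of tokens under each definition, and over a corpus of millions of words such small differences accumulate into the kind of discrepancies reported in Table 4.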
Word count is just one variable in the calculation of keyness. Since results differ at the level of raw word count, this difference may be exacerbated when other variables are added.

Insert Table 4 about here

Each concordancer offers different functionality with regard to calculating keyness. For example, AntConc allows users to upload their own reference corpus and provides the choice of either chi-squared or log-likelihood for the statistical test, while a Sketch Engine subscription incorporates access to numerous reference corpora and 4 different statistical tests. For most materials developers, the functionality of the concordancer is likely to matter more than a thorough understanding of its definition of a word and its tokenization process.

Reference corpora

Table 5 shows the keyword lists created in Sketch Engine using the same statistical test (Midway) but with different reference corpora. Keyword lists created when using the BAWE corpus and Brown corpus shared five of the top ten results. The remaining five words in the BAWE list appear more specialized than those in the Brown list. The BAWE keyword list, therefore, appears more appropriate for learners with a stronger vocabulary base. Scott (2009) claims that there is no bad reference corpus. However, different reference corpora yield radically different keyword lists. The genre and diachrony of a corpus are found to significantly affect keyness (Goh, 2010). Given that different reference corpora impact the generated keyword lists, materials developers would be well advised to compare the results using different reference corpora.

Insert Table 5 about here

Statistical tests

As shown in Table 6, selecting the log-likelihood and chi-squared tests in AntConc using the Brown corpus produced the same first eight keywords, with only the order of "et" and "al" reversed. Simple ratios, such as log-likelihood and chi-squared (Kilgarriff, 2012, p.5), produce keyword lists dominated by rare words. Gabrielatos and Marchi (2012) oppose the use of log-likelihood and chi-squared to calculate keyness due to frequency bias and assumptions on the random nature of language.

Insert Table 6 about here

Table 7 shows the keyword lists generated in Sketch Engine using the BAWE corpus, but selecting different statistical tests. The simple maths version (Kilgarriff, 2009) in Sketch Engine names the tests clearly (e.g. Common, Rare) and is not based on the assumption that language is random (Kilgarriff, 2005). Rare resulted in a higher occurrence of rare words, while Common resulted in a skew to more common words. When selecting vocabulary for less proficient students, it may be prudent to use a keyword list generated using Common.

Insert Table 7 about here

Conclusion

The three variables of concordancer, reference corpus and statistical test greatly affect the keyword lists generated. Although AntConc has many advantages, particularly in classroom-based data-driven learning, fourth generation concordancers that can deal with larger corpora and provide reference corpora could save materials developers a great deal of time. Sketch Engine provides an easy, quick and affordable way to calculate a variety of keyword lists. The availability of 20 reference corpora and 4 appropriately-named statistical tests makes it easy to tailor keyword lists to the intended learners. Selecting a general English reference corpus and the Common statistical test in Sketch Engine is likely to generate keyword lists that are more suitable for lower level students.

References

Anthony, L. (2012). AntConc (Version 3.2.4) [Computer Software]. Tokyo, Japan: Waseda University.
Anthony, L. (2013). A critical look at software tools in corpus linguistics. Linguistic Research, 30 (2), 141-161.
Gabrielatos, C. and Marchi, A. (2012). Keyness: Appropriate metrics and practical issues. Paper presented at Corpus-assisted Discourse Studies International Conference 2012. University of Bologna, Italy. 13-14 September 2012.
Goh, G-Y. (2010). Choosing a reference corpus for keyword extraction. Linguistic Research, 28 (1), 239-256.
Hardie, A. (2012). CQPweb: combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17 (3), 380-409.
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1 (2), 263-276.
Kilgarriff, A. (2009). Simple maths for keywords. In Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), Proceedings of the Corpus Linguistics Conference CL2009. University of Liverpool, UK, 20-23 July 2009.
Kilgarriff, A. (2012). Getting to know your corpus. Text, Speech and Dialogue, 7499, 3-15.
Kilgarriff, A. et al. (2014). The Sketch Engine: Ten years on. Lexicography, 1 (1), 7-36.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.
O'Donnell, M. (2013). UAM Corpus Tool (Versions 2.8 & 3.1) [Computer Software]. Wagsoft Systems.
Rayson, P. (2008). W-matrix corpus analysis and comparison tool. Lancaster University.
Scott, M. (1997). PC analysis of key words and key key words. System, 25 (1), 1-13.
Scott, M. (2009). In search of a bad reference corpus. In D. Archer (ed.), What's in a Word-list? Investigating Word Frequency and Keyword Extraction (pp. 79-92). Oxford: Ashgate.
Scott, M. (2012). WordSmith Tools (Version 6) [Computer Software]. Liverpool: Lexical Analysis Software.

Biodata

John Blake is a research lecturer at the Japan Advanced Institute of Science and Technology. He has taught English at universities and schools for over 20 years in Japan, Thailand, Hong Kong and the UK. His current research interest is corpus analysis of scientific research articles.

johnb@jaist.ac.jp

Table 1
Current Generations of Concordancers

                   3rd generation                       4th generation
Location           personal computers                   web servers
Size of corpora    Small corpora - low millions         Large corpora - 100 million+
Examples           AntConc (Anthony, 2012)              CQPweb (Hardie, 2012)
                   UAM Corpus Tool (O'Donnell, 2013)    Sketch Engine (Kilgarriff et al., 2014)
                   Wordsmith Tools (Scott, 2012)        W-matrix (Rayson, 2008)

Table 2
IBR Focus Corpus Count (made in AntConc 3.2.4w)

Tokens       2,516,051
Words        1,966,650
Sentences    77,547

Table 3
Outline of Reference Corpora Used

                   BAWE corpus    Brown corpus
Date created       2000s          1960s
Type of corpus     Academic       General
Type of English    British        American
Words              6,506,995      1,000,000

Table 4
Raw Frequency Results

Rank    Word    Sketch Engine    AntConc
1       the     106,022          106,064
2       and     77,508           77,542
3       of      72,733           72,990
4       to      47,454           47,834
5       in      41,791           42,056
6       a       32,007           32,336
7       that    23,092           23,092
8       is      21,249           21,245
9       for     17,293           17,303
10      as      14,309           14,329
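Corpus sizes such as those in Tables 2 and 3 are one of the inputs to a keyness statistic. As a rough illustration, the sketch below computes a log-likelihood value for a single word using the two-term form that is commonly used in corpus tools. Whether AntConc 3.2.4w computes the statistic in exactly this form is an assumption, the function name is hypothetical, and the word counts are invented for illustration; only the two corpus sizes come from the tables.

```python
import math


def log_likelihood(focus_count, focus_size, ref_count, ref_size):
    """Two-term log-likelihood: 2 * sum(observed * ln(observed / expected))."""
    total = focus_count + ref_count
    if total == 0:
        return 0.0
    # Expected counts if the word were spread evenly across both corpora.
    expected_focus = focus_size * total / (focus_size + ref_size)
    expected_ref = ref_size * total / (focus_size + ref_size)
    score = 0.0
    for observed, expected in ((focus_count, expected_focus), (ref_count, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


FOCUS_WORDS = 1_966_650  # IBR focus corpus (Table 2)
BROWN_WORDS = 1_000_000  # Brown corpus (Table 3)

# Hypothetical counts for one word; these are not figures from the paper.
print(round(log_likelihood(5_000, FOCUS_WORDS, 100, BROWN_WORDS), 2))
```

The statistic grows as the observed counts depart from the expected counts, but it does not indicate direction by itself, so the relative frequencies still need to be compared to see in which corpus the word is overused.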

Table 5
Keyword Lists using BAWE and Brown in Sketch Engine with Midway Test

Rank    BAWE                    Brown
1       firms                   firms
2       firm                    firm
3       export                  export
4       foreign                 Table
5       subsidiary              variables
6       internationalization    international
7       FDI                     markets
8       subsidiaries            knowledge
9       markets                 foreign
10      MNEs                    market

Table 6
Keyword Lists using Log-likelihood and Chi-squared Tests in AntConc with Brown Corpus

Rank    Log-likelihood    Chi-squared
1       the               the
2       firms             firms
3       firm              firm
4       al                et
5       et                al
6       in                In
7       knowledge         knowledge
8       market            market
9       this              international
10      table             foreign

Table 7
Keyword Lists using Three Statistical Tests in Sketch Engine with BAWE Corpus

Rank    Rare                Midway                  Common
1       OFDI                firms                   and
2       offshoring          firm                    firms
3       Vahlne              export                  firm
4       multinationality    foreign                 foreign
5       Full-size           subsidiary              knowledge
6       MathML              internationalization    international
7       Kogut               FDI                     market
8       BOP                 subsidiaries            country
9       MathJax             markets                 Table
10      Ghoshal             MNEs                    performance
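The comparison of keyword lists generated with different reference corpora can also be carried out programmatically. The sketch below uses the two top-ten lists from Table 5 and reports the words they share; the helper name shared_keywords is hypothetical, and matching is case-insensitive by assumption.

```python
# Top-ten keyword lists from Table 5 (Sketch Engine, Midway test).
bawe = ["firms", "firm", "export", "foreign", "subsidiary",
        "internationalization", "FDI", "subsidiaries", "markets", "MNEs"]
brown = ["firms", "firm", "export", "Table", "variables",
         "international", "markets", "knowledge", "foreign", "market"]


def shared_keywords(first, second):
    """Return the words of `first` that also occur in `second`, ignoring case."""
    second_lower = {word.lower() for word in second}
    return [word for word in first if word.lower() in second_lower]


shared = shared_keywords(bawe, brown)
only_bawe = [word for word in bawe if word not in shared]

print("Shared:", shared)
print("Only with BAWE as reference:", only_bawe)
```

Running the comparison on these two lists returns five shared words, in line with the figure reported in the Reference corpora section, and the words unique to the BAWE list are the more specialized items. The same helper could be applied to any pair of keyword lists a materials developer wishes to compare.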