Dan Joyce, Eiken Foundation of Japan
Fumiyo Nakatsuhara, CRELLA, University of Bedfordshire


1 What is TEAP
2 TEAP Speaking Test: Development
3 TEAP Speaking Test: Structure
4 Validation Studies
5 Study 1 and Study 2
6 Results
7 Summary/Conclusion

TEAP (Test of English for Academic Purposes): a new test of academic English for university entrance purposes in Japan
Key terms:
- Benchmark
- University level
- EFL context of Japan
- Model for revision
- Four skills
- Japanese high school students
- Japanese Ministry of Education Course of Study for HS

Eiken Foundation of Japan
Sophia University
CRELLA (Centre for Research in English Language Learning and Assessment)

CRELLA's role (Dr. Fumiyo Nakatsuhara)
- Literature review
- Language function surveys of 167 high school and 24 Sophia University teachers (using O'Sullivan et al.'s (2002) checklist)
- Exchanging ideas and information
- Developing draft test specifications, draft rating scales, draft examiner frame, etc.
- Mini-trial
- Trial: Study 1
- Pilot: Study 2
- A priori validation studies

Why use the CEFR?
- To make a positive contribution to English-language learning and teaching in Japan by providing useful feedback to test takers beyond the usual pass/fail decisions associated with Japanese university entrance exams
- To facilitate stakeholders' understanding of test scores and task requirements
- To provide scores that indicate test takers' approximate level in terms of a well-known external criterion

The role of the CEFR
- TEAP has used the CEFR as a reference point for defining relevant levels of proficiency.
- TEAP has used relevant descriptors from the scales of the CEFR as a springboard from which TEAP-specific descriptors were developed.

CEFR Levels targeted by TEAP
- Takes account of the levels of English proficiency that we can legitimately expect high school students to display (A2-B1).
- Looks forward to a higher level of proficiency beyond high school (B1-B2).
- Acts as a bridge between high school and the TLU domain of the academic context of learning at Japanese universities.
- As a minimum level of proficiency to access the language used in first-year university classrooms, TEAP focuses on B1-B2.
- In order to provide meaningful feedback to as wide a range of test takers as possible, TEAP takes account of the A2 level of proficiency.

Part, task, level and language functions (cognitive demands: grammatical encoding)
Part 1: Interview (A2)
- Providing specific personal information
Part 2: Role play (A2/B1)
- Initiating interaction
- Asking for information/opinions
- Commenting
Part 3: Monologue (B1/B2)
- Agreeing/disagreeing
- Justifying opinions
- Elaborating
Part 4: Extended interview (B2)
- Expressing opinions
- Justifying opinions
- Comparing
- Speculating
- Elaborating

1. Task levels were designed to increase across the test (A2 to B2).
2. Tasks were designed to reflect language functions considered important by high school and university teachers.
3. Part 2 (role play) was designed to operationalise asking for information and opinions. Because this task type was expected to be unfamiliar in Japan, it was an important part of the a priori validation studies.

- Relevant CEFR scales and other rating scales consulted
- A number of focus group discussions
- Modifications based on the mini-trial test results
The analytic rating scales used by raters have five categories:
1. Pronunciation
2. Grammatical Range and Accuracy
3. Lexical Range and Accuracy
4. Fluency
5. Interactional Effectiveness
Score bands: B2, B1, A2, Below A2

[Diagram: sources of evidence and the aspects of validity they address (context validity, cognitive validity, scoring validity)]
Sources of evidence:
- Test-taker and examiner feedback questionnaires
- Language functions of test-taker speech samples
- Linguistic and discourse features of test-taker speech samples
- Rating scores
- Rater feedback questionnaire and post-marking focus group discussion
The studies collected various sources of empirical evidence that offered useful information to verify or modify the draft test materials and rating scales.

Research Questions
- RQ1: To what extent does the test elicit intended language functions in each task? (Study 1)
- RQ2: Is there any evidence from test-takers' output language that validates the descriptors used to define the levels on each rating scale? (Study 1)
- RQ3: What are the participating examiners' and students' perceptions of the testing procedures? (Study 1)
- RQ4: What are the participating raters' perceptions of the testing and rating procedures? (Studies 1 and 2)
- RQ5: How well does the test function in terms of scoring validity, after incorporating modifications suggested in Study 1? (Study 2)

Study 1
Participants
- 23 first-year university students
- 3 trained examiners and 3 trained raters
Data collection
- Speaking test sessions were video-recorded and transcribed
- Rating of the video-recorded performances by 3 raters, using the draft rating scales
- Examiner, student and rater feedback questionnaires
- Raters' focus group discussion
Data analysis
- Language function analysis (RQ1)
- Linguistic and discourse analysis of students' speech samples (RQ2)
- Analysis of questionnaires and focus group discussion data (RQ3 & RQ4)
- Modifying test materials, scales, etc.

Study 2
Participants
- 120 third-year high school students
- 5 trained examiners and 6 trained raters
Data collection
- Video-recorded speaking test performances were rated by 6 trained raters, using the modified rating scales
- Rater feedback questionnaires
Data analysis
- Analysis of rater feedback questionnaire (RQ4)
- Analysis of rating scores (RQ5): FACETS analysis

Transcripts were analysed, and instances of functions from the list based on O'Sullivan et al. (2002) were counted.
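The counting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data format (a list of part/function pairs), the function labels and the helper name `count_functions` are hypothetical and are not taken from the study, which does not report how its coding was stored or tallied.

```python
from collections import Counter

# Hypothetical coded transcript: each item is (test part, function label).
# The labels are illustrative, loosely modelled on a function checklist.
coded_turns = [
    (1, "giving information"),
    (1, "elaborating"),
    (2, "asking for information"),
    (2, "asking for opinions"),
    (2, "asking for information"),
    (3, "justifying opinions"),
    (4, "speculating"),
]


def count_functions(coded):
    """Return {part: Counter of function labels} for the coded turns."""
    counts = {}
    for part, function in coded:
        counts.setdefault(part, Counter())[function] += 1
    return counts


function_counts = count_functions(coded_turns)
for part in sorted(function_counts):
    print(part, dict(function_counts[part]))
```

Tallies of this kind make it straightforward to check whether each part elicited the functions it was designed to target.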

[Four charts of elicited language functions, each grouped into informational, interactional and managing-interaction functions. Functions highlighted on each chart, in slide order:]
- Expressing opinions, giving information, elaborating
- Asking for information, initiating, asking for opinions, commenting, reciprocating
- Elaborating, justifying opinions, agreeing/disagreeing
- Elaborating, expressing opinions, justifying opinions, comparing, speculating

Targeted functions were elicited by the relevant parts of the test as intended.
[Suggestions for modifications]
- Part 3 (Monologue): Limit the examiner's contribution to non-verbalised response tokens (such as nodding, smiling)
- Part 4 (Extended Interview): Standardise the way that examiners end the test

(Following Brown's (2006a) methodology)
- List key assessment areas specified in each rating category
- Identify linguistic and discourse features that could quantify the key areas
- Analyse candidate output language for these features
- Compare the results across different proficiency groups to see to what extent each of these features differs between adjacent levels of the rating scales

[Three charts by proficiency level (Levels 1 to 3): number of utterance-initial unfilled pauses per 50 words; ratio of repair, false starts and repetition to AS units; articulation rate]
Note: No inferential statistics due to the small sample size
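As a rough illustration of how the three fluency measures named above might be computed from hand-counted transcript data, consider the sketch below. The function names, arguments and sample counts are all hypothetical. The slides do not define articulation rate; the sketch assumes the common definition of syllables per second of speaking time with pause time excluded.

```python
def pauses_per_50_words(n_pauses, n_words):
    """Utterance-initial unfilled pauses, normalised per 50 words."""
    return 50 * n_pauses / n_words


def dysfluency_ratio(n_repairs, n_false_starts, n_repetitions, n_as_units):
    """Repairs, false starts and repetitions per AS-unit."""
    return (n_repairs + n_false_starts + n_repetitions) / n_as_units


def articulation_rate(n_syllables, total_seconds, pause_seconds):
    """Syllables per second of speaking time (assumed: pauses excluded)."""
    return n_syllables / (total_seconds - pause_seconds)


# Invented counts for one speech sample, for illustration only.
pause_measure = pauses_per_50_words(n_pauses=6, n_words=150)
repair_measure = dysfluency_ratio(3, 2, 5, n_as_units=20)
rate_measure = articulation_rate(n_syllables=240, total_seconds=100, pause_seconds=20)
print(pause_measure, repair_measure, rate_measure)
```

Normalising by words, AS-units or speaking time keeps the measures comparable across samples of different lengths.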

- All examined features broadly exhibited changes in the expected direction across the 3 levels.
- The rating scales are in general differentiating test-takers' performance in a way congruent with the test designers' intention.
- For some scales, the differences between levels were greater at one boundary than the other. This accords with previous research (e.g. Brown, 2006a) indicating that specific aspects of performance are probably more relevant to differentiating particular levels.
- Worth following up in a larger-scale study
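A descriptive comparison of adjacent levels, of the kind summarised above, could look like the following sketch. The values are invented for illustration and are not the study's data; the helper names are hypothetical. In keeping with the small sample, the sketch computes only group means and the difference at each level boundary, with no inferential statistics.

```python
from statistics import mean

# Invented per-speaker values of one feature (e.g. pauses per 50 words),
# grouped by awarded level. These are NOT the study's data.
pauses_by_level = {
    1: [4.0, 5.0, 6.0],
    2: [3.0, 3.5, 4.0],
    3: [2.5, 3.0, 3.5],
}


def level_means(values_by_level):
    """Mean feature value for each level, in ascending level order."""
    return {level: mean(values) for level, values in sorted(values_by_level.items())}


def adjacent_differences(means):
    """Difference in means at each boundary between adjacent levels."""
    levels = sorted(means)
    return {(lo, hi): means[hi] - means[lo] for lo, hi in zip(levels, levels[1:])}


means = level_means(pauses_by_level)
diffs = adjacent_differences(means)
print(means)
print(diffs)
```

In this invented example the drop between Levels 1 and 2 is larger than between Levels 2 and 3, which is the kind of uneven boundary pattern the slide describes.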

Examiner training/post-interviewing questionnaires
- The training session was useful
- The test timings, instructions, questions and general test administration were appropriate
[Suggestions for modification]
- The test instructions could be clearer
- Examiners need guidelines for what to do when they feel a need to deviate from the interlocutor frame
Student feedback questionnaire
- The role-play task especially was received positively, which confirmed the use of this innovative task in the Japanese context

Rater training/post-marking questionnaires and post-marking rater discussion + Test score analysis
- The training session was useful and effective
[Suggestions for modifications]
- Some adjustments to the wording of descriptors in the Interactional Effectiveness scale (too easy in the draft scales)
- Raters need to be more explicitly instructed that an overall impression should not influence their individual analytic scores, especially on the Pronunciation scale
All modifications suggested in Study 1 were discussed by the project team, and revised rating scales and test materials were prepared for Study 2.

Score analysis
- Multi-faceted Rasch analysis with 3 facets: examinees (N=120), raters (N=6) and rating categories (N=5)
- The scoring system generally worked well
- No misfitting rater or rating category
- All raters behaved with an acceptable level of consistency
- For all rating categories, the rating scale steps progressed in the expected way
Rater questionnaire analysis
- The revised rating scales worked better
Study 2 results demonstrated that changes made after Study 1 functioned in ways that test designers intended.
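For readers unfamiliar with the analysis, a three-facet Rasch model of the kind named above is commonly written in the rating-scale form below. The slides do not state which parameterisation was used, so this is the textbook form given as an assumption, and the symbols are the conventional ones rather than the authors' notation.

```latex
\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
\]
% P_{nijk}     : probability that examinee n is awarded band k by rater j on rating category i
% P_{nij(k-1)} : probability of band k-1 under the same conditions
% B_n          : ability of examinee n
% C_j          : severity of rater j
% D_i          : difficulty of rating category i
% F_k          : difficulty of the step from band k-1 to band k
```

Under this form, rater severity and category difficulty are estimated on the same logit scale as examinee ability, which is what allows fit to be checked separately for each rater and each rating category.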

- When coupled with thorough validation studies to guide its use, the CEFR can become a useful tool in test development.
- We can be confident that the TEAP Speaking Test is operationalising the test construct which the test was designed to measure.
- But on-going validation studies are as important as a priori validation!

For a full validation report, see Nakatsuhara (forthcoming, online).
For more information about TEAP, see the following URL: https://www.eiken.or.jp/teap/
Thank You!
d-joyce@eiken.or.jp

Brown, A. (2006a). Candidate discourse in the revised IELTS Speaking Test. In P. McGovern & S. Walsh (Eds.), IELTS Research Report, Vol. 6, 71-89. Canberra: British Council & IDP Australia.
Brown, A. (2006b). An examination of the rating process in the revised IELTS Speaking Test. In P. McGovern & S. Walsh (Eds.), IELTS Research Report, Vol. 6, 41-69. Canberra: British Council & IDP Australia.
MEXT (2008). The course of study for upper secondary school. Retrieved May 1, 2010 from http://www.mext.go.jp/a_menu/shotou/new-cs/index.htm
O'Sullivan, B., Weir, C. J., & Saville, N. (2002). Using observation checklists to validate speaking-test tasks. Language Testing, 19(1), 33-56.
Taylor, L. (Ed.) (2011). Examining Speaking: Research and practice in second language speaking. Cambridge: Cambridge University Press.
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. Basingstoke: Palgrave Macmillan.