Introduction

Context of the study

The Common European Framework of Reference for Languages: Learning, teaching, assessment (Council of Europe, 2001), abbreviated to CEFR, has become the Council of Europe's most influential project on second language education. Alderson notes: "Clearly the influence of the Framework has been widespread and deep, impacting on curricula, syllabuses, teaching materials, tests and assessment systems and the development of scales of language proficiency geared to the six main levels of the CEFR" (Alderson, 2002b: 8). The main purpose of the CEFR is to provide a common basis for the elaboration of language syllabuses, examinations and textbooks, by describing in a comprehensive way what language learners have to learn to do in order to use a language effectively for communication (Council of Europe, 2001: 1).

The CEFR language proficiency scales, probably the best known part of the 2001 volume (Little, 2006: 184), are the result of a large research project in Switzerland (North, 2000a; North & Schneider, 1998). They comprise statements called descriptors, which were designed following an action-oriented approach: language users are seen as members of a society who have tasks to accomplish, including tasks that are not language-related (Council of Europe, 2001: 9). The CEFR scales describe what learners can do to accomplish such tasks at a number of levels, thus falling into the category of behavioural scales (Brindley, 1998: 112). These behavioural scales describe language proficiency at six common reference levels: A1 (the lowest) through A2, B1, B2 and C1 to C2 (the highest).

The sets of language descriptors are central to the CEFR's descriptive scheme of language use (Little, 2006: 169). They serve one of the main aims of the Council of Europe as described in Chapter 3 of the CEFR volume: "to help partners to describe the levels of proficiency required by existing standards, tests and examinations in order to facilitate comparisons between different systems of qualifications" (Council of Europe, 2001: 21). Such comparability of language qualifications in Europe was difficult to achieve prior to the CEFR because of the plethora of diverse educational systems and traditions.

The present study is concerned with this comparability of language qualifications, using the CEFR as the common reference point. The CEFR makes such comparability plausible, since language tests can be compared relatively easily by describing the learners expected to sit them as B1, B2 and so on. Language examination providers appeared very eager to claim linkage to the CEFR by placing their tests at one of the six levels of proficiency described in the 2001 volume.

As a result, calls were made for the Council of Europe to actively help examination providers link their tests to the Framework, as well as to validate such linkage claims (Council of Europe, 2003: 5). The response of the Council of Europe to that need was the publication of a pilot version of the Manual for relating language examinations to the CEFR (Council of Europe, 2003; Figueras et al., 2005), henceforth "the Manual". It was accompanied by a Reference Supplement (Takala, 2004) providing further guidance on conducting standard setting (Kaftandjieva, 2004) and on qualitative (Banerjee, 2004) and quantitative analysis (Verhelst, 2004a, 2004b, 2004c, 2004d).

The linking process set out in the Manual comprises a set of activities to be followed by examination providers. First, a panel of judges needs to be recruited and trained in using the CEFR to achieve adequate familiarisation with its content. After training, these judges are required to analyse test content and examinee performance in relation to the CEFR. The linking process, which will be discussed in detail in Section 2.2.2, can be summarised as follows:

1. Familiarisation: This stage is meant to ensure that the judges are familiar with the content of the CEFR and its scales before proceeding further in the linking process. The Manual recommends that Familiarisation be repeated before each of the next two stages and suggests a number of familiarisation tasks.

2. Specification: This stage involves describing the content of the test to be related to the CEFR, first in its own right and then in relation to the levels and categories of the CEFR. Forms for the mapping of the test are provided in the Manual. The outcome of this stage is a claim regarding the content of the test in relation to the CEFR.

3. Standardisation: This stage examines the performance of test-takers and relates this performance to the CEFR. Much of the process suggested in the Manual comes from the educational measurement literature, in particular research on setting performance standards and cut-off scores (e.g. Cizek, 2001), which is further discussed in the Reference Supplement; the arithmetic of one common standard-setting method is sketched after this list.

4. Empirical Validation: This stage introduces two categories of empirical validation: internal validation, aiming at establishing the quality of the test in its own right, and external validation, aiming at confirming the linking claim either by using an anchor test properly calibrated to the CEFR or by using judgements of teachers familiar with the CEFR. The outcome of this stage is the confirmation or rejection of the claims made in stages 2 and 3, using analysed test data.
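To make the standard-setting element of the Standardisation stage concrete, the following minimal sketch illustrates the arithmetic of a simple Angoff-style procedure, one of the methods surveyed in Cizek (2001). All numbers, the panel size and the B1 boundary are invented for illustration; the Manual does not prescribe this particular recipe.

```python
# Illustrative arithmetic of a simple Angoff-style standard-setting
# procedure (cf. Cizek, 2001). All numbers are hypothetical.

# Each judge estimates, for every item, the probability that a
# "minimally competent" B1 learner would answer it correctly.
# Rows are judges, columns are items (3 judges, 5 items).
judgements = [
    [0.6, 0.4, 0.8, 0.5, 0.7],  # judge 1
    [0.5, 0.5, 0.9, 0.4, 0.6],  # judge 2
    [0.7, 0.3, 0.8, 0.6, 0.7],  # judge 3
]

n_judges = len(judgements)
n_items = len(judgements[0])

# Average the probability estimates for each item across judges.
item_means = [
    sum(judge[i] for judge in judgements) / n_judges
    for i in range(n_items)
]

# The cut score is the sum of the per-item means: the raw score a
# borderline B1 candidate would be expected to obtain on the test.
cut_score = sum(item_means)
print(f"Minimum raw score for a B1 classification: {cut_score:.1f} / {n_items}")
# -> Minimum raw score for a B1 classification: 3.0 / 5
```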

Despite the wealth of information offered to carry out the proposed linking process, the role of the judges is not widely discussed in the Manual, even though it is the judges' decisions that produce the claim about how test content and examinee performance relate to the CEFR. This book addresses this issue by aiming to answer two general questions:

How are judgements made in the CEFR linking context?

How do participants in this context interact with the CEFR scales when describing test content and examinee performance?

The following section explains why I chose to conduct research into the linking process set out in the Manual and into how judges make decisions during the first three stages of that process.

Soon after the publication of the 2001 volume, the impact of the CEFR on language education was strong, as revealed by a number of case studies (Alderson, 2002c; Morrow, 2004). Local and international examination providers also started claiming linkage of their tests to the CEFR; however, it was not always clear how such linkage claims were built. In addition, there was a trend to refer to the CEFR in the field of language teaching, where, for example, EFL textbook authors claimed linkage of their materials to the CEFR. Closer investigation of such claims by Tsagari (2006b) revealed that it was unclear how authors decided on the relevance of their textbooks to the CEFR. In the language assessment field, debates on the two language testing discussion lists (EALTA and LTEST-L), the topics of papers at language testing conferences (inter alia Carlsen & Moe, 2005; Figueras, 2006) and a symposium during the 2007 EALTA conference [1] revealed growing interest in the use of the CEFR and in the process of relating test content and examinee performance to it.

[1] See http://www.ealta.eu.org/conference/2007/programme.htm

The decision to conduct the research reported in this book was made because the CEFR linking process attracted even more attention when, following the publication of the Manual in 2003, the Council of Europe invited examination providers to pilot it. The Council intended to collect feedback on the linking process, aiming to revise the Manual in 2008, produce samples calibrated to the CEFR and publish a book of case studies (Council of Europe, 2003: x). By 2006, 40 institutions from 20 countries had responded to the invitation of the Council of Europe and participated in the piloting of the Manual (Martyniuk, 2006).

The experience of using the CEFR in contexts other than linking was not trouble-free. The CEFR was found inadequate for designing test specifications for receptive skills (Alderson et al., 2006), for measuring grammar-based progression of learners (Keddle, 2004), for describing the construct of vocabulary (Huhta & Figueras, 2004) and for designing proficiency scales (Generalitat de Catalunya, 2006). Criticism also emerged regarding the nature of the CEFR linking process itself. The appropriacy of using the CEFR to claim equivalence of tests constructed for different audiences and purposes was disputed (Fulcher, 2004b: 261; Weir, 2005: 282). Moreover, the intentions of examination providers claiming linkage to the CEFR were questioned as being driven by commercial interests (Fulcher, 2004b: 260).

At the same time, the danger of a test being viewed as invalid simply because it does not claim CEFR linkage was also pointed out (Fulcher, 2004a). In a similar vein, the authors of the Manual caution against interpreting heterogeneous tests that claim linkage to the same CEFR levels as equivalent (Figueras et al., 2005: 273). They acknowledge that two tests can be linked to each other through the CEFR proficiency scales without being claimed to be equivalent, because they will tend to assess different constructs, based on different specifications (Council of Europe, 2003: 20).

The above made it clear that this book should address the emerging need of the testing community to better understand the nature of the CEFR linking process and, by extension, to systematically research how judges, the main participants in the linking process, make decisions.

Investigating judgements during the linking process is fundamental to the validity of the CEFR linkage. This is because the CEFR linking process, having standard setting at its core, depends on human judgement (Kaftandjieva, 2004: 4). Because of this dependence, standard setting specialists in the US have started investigating the thought processes and experiences of judges in non-CEFR-related situations as part of validating the outcome of a standard setting meeting (McGinty, 2005). Such investigation has employed qualitative methods (Buckendahl, 2005) and has not been attempted in the CEFR linking context to date. Consequently, this aspect of the validity of the CEFR linkage remains unexplored. Even though the Manual offers useful guidance on how to build and validate a CEFR linkage claim, such validation does not include suggestions for investigating the judges' perspective; it focuses primarily on quantitative analysis of judgements (in the form of ratings) of examinee performance in relation to the CEFR. As a result, it is far from clear how participants in a CEFR linking study make decisions and how their decision-making process affects the resulting CEFR claim. Based on the US standard setting literature mentioned earlier, it is reasonable to argue that systematic investigation into the nature of such judgement-making in the CEFR linking context could contribute to a better understanding of how a valid CEFR linkage claim is built. It was exactly this need to better understand the judgemental aspect of the process of relating an examination to the CEFR that shaped the research agenda of this study.

Systematic investigation into how judgements and decision-making take place in the CEFR linking context is important for a number of other reasons, apart from its contribution to the validity of the CEFR linkage claim. As mentioned at the beginning of this section, the scales of the Framework had already proven problematic in a number of contexts (e.g. Alderson et al., 2006; Generalitat de Catalunya, 2006). This could also be the case during the CEFR linking process. Therefore, there is a need for research into how the users of the CEFR interact with its scales in the linking context and what problems they face during this process.

In addition, such a research enquiry could potentially address the issue of the validity of the CEFR scales in the linking context, because a scale, like a test, has validity in relation to contexts in which it has been shown to work (Council of Europe, 2001: 22). As Kaftandjieva (2004: 17) notes, there is some evidence that the CEFR scales are valid as standards for judging performance (Kaftandjieva & Takala, 2002; North, 2002b), but this does not guarantee that the scales will be validly interpreted as standards in every context in which they are used. Hence, it could be argued that more research is needed to establish the validity of the CEFR scales in the linking context, when they are used to judge test content and examinee performance.

An additional area to address when investigating judgements during the CEFR linking process is the degree to which expert judges agree when judging test content and examinee performance. The level of agreement reached by expert judges has been debated in the language testing literature. Some researchers provide evidence that expert judges might disagree about what test items examine, as well as about their difficulty level (Alderson, 1993; Buck, 1991), whereas others suggest that if judges are adequately trained, agreement can be achieved (Lumley, 1993; Weir, Hughes, & Porter, 1990). Because the outcome of a CEFR linking project is based on expert judgements, it is important for the validity of a linking claim to examine whether expert judges agree about the language ability depicted in the CEFR scales and about their judgements when they use the scales to rate test content and examinee performance. Since problems have been identified in the literature regarding uses of the CEFR scales in particular contexts, as I will discuss in detail in Chapter 2, similar problems might arise in the linking context, hindering expert judgements and affecting the validity of the linking claim.
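To illustrate what agreement between judges can mean in practice, the sketch below computes two simple indices, exact and adjacent agreement, for two judges rating the same performances. The judges and their ratings are invented for illustration, and the studies reported later in this book rely on many-facet Rasch measurement rather than such raw proportions.

```python
# Illustrative inter-judge agreement indices on CEFR-level ratings.
# The judges and ratings below are hypothetical.

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Two hypothetical judges rating the same ten examinee performances.
judge_a = ["B1", "B2", "A2", "B1", "C1", "B2", "B1", "A2", "B2", "C1"]
judge_b = ["B1", "B1", "A2", "B2", "C1", "B2", "B1", "B1", "B2", "C2"]

def agreement(ratings_1, ratings_2):
    """Return (exact, adjacent): the proportion of performances rated
    identically, and the proportion rated within one CEFR level."""
    pairs = list(zip(ratings_1, ratings_2))
    exact = sum(r1 == r2 for r1, r2 in pairs) / len(pairs)
    adjacent = sum(
        abs(LEVELS.index(r1) - LEVELS.index(r2)) <= 1 for r1, r2 in pairs
    ) / len(pairs)
    return exact, adjacent

exact, adjacent = agreement(judge_a, judge_b)
print(f"Exact agreement:    {exact:.0%}")    # -> 60%
print(f"Adjacent agreement: {adjacent:.0%}")  # -> 100%
```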

Finally, it should be stressed that researching how judgements are made during the process of relating examinations to the CEFR is important because of the consequences of a linking claim for various test users. Uses of tests and interpretations of results might be based on the CEFR level claimed by a test provider, for example when teachers choose examinations for their students, when universities admit students and when employers recruit new staff. Inevitably, invalid claims about the CEFR level of a test might affect the appropriateness of such decisions.

In the preceding sections I described the context of this book and discussed the importance of investigating the CEFR linking process set out in the Manual. In Chapter 2 I describe types of linking results from different tests and discuss the CEFR linking process in detail. Because the CEFR scales are central to this linking process, I examine the development of the CEFR scales and the claims made by the authors of the 2001 volume regarding their purposes and functions. I also relate the main issues discussed in language testing research about proficiency scales to the CEFR scales. I then review studies on the CEFR scales to identify problems with the use of the descriptors which could potentially appear in the linking context. I finally present the many-facet Rasch model (Linacre, 1994), because it was used in the development of the CEFR scales and is also used in this book.

In Chapter 3 I report on the preliminary study conducted prior to the main research project. In this study I examine whether participants are primarily affected by the wording of the CEFR scales when they sort descriptors into the six levels, and also whether they understand the wording of the descriptors in the same way. In Chapter 4 I explain how the issues discussed in the relevant literature and the findings of the preliminary study helped identify areas to investigate during the main research project.

I conducted the main body of the research by organising three studies (Chapters 5-7), one for each stage involving judgements by a panel (Familiarisation, Specification and Standardisation). I carried out these studies during a research project at Trinity College London (henceforth "Trinity"), a UK-based EFL examination provider and one of the 40 institutions piloting the Manual with the aim of relating two language examinations to the CEFR.

Chapter 5 investigates judgements made during the Familiarisation stage, employing many-facet Rasch measurement. In this chapter I examine whether the judges scale the descriptors from the lowest to the highest level as they appear in the CEFR volume. This is crucial for the subsequent linking stages, for which an in-depth understanding of the CEFR levels is a prerequisite. In Chapter 6 I investigate group dynamics during the Specification stage. I analyse the group discussions of the Trinity judges qualitatively, employing a coding frame based on Davidson and Lynch (2002: Ch. 6). Apart from group dynamics, I also investigate problems the judges faced with the CEFR scales while working on the description of test content.

In Chapter 7 I explore judgements made during the Standardisation stage. I adopt a mixed-methods approach, combining quantitative and qualitative analyses. I analyse the judges' ratings of examinee performance in relation to the CEFR quantitatively, to establish whether the judges are consistent when using the CEFR scales as their marking criteria. I also analyse the group discussions of the Trinity judges qualitatively, employing a coding frame based on the educational measurement literature mentioned in Section 1.2, to examine the factors that affect decision-making when judging examinee performance in relation to the CEFR. I also discuss problems faced by the judges during the Standardisation stage when using the CEFR scales.

In Chapter 8 I summarise the findings of this research study and its contribution. I also discuss the implications for the CEFR linking process set out in the Manual and for the use of the CEFR in the linking context. I finally point out the limitations of the study and recommend directions for future research.
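For readers unfamiliar with the many-facet Rasch model referred to above, the standard formulation from the Rasch literature (following Linacre, 1994) may be a useful orientation; the notation here is the conventional one, not a quotation from the CEFR volume. For examinee $n$ rated on item $i$ by judge $j$:

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

where $P_{nijk}$ is the probability of receiving rating category $k$ rather than $k-1$, $B_n$ is the ability of examinee $n$, $D_i$ the difficulty of item $i$, $C_j$ the severity of judge $j$, and $F_k$ the difficulty of the step from category $k-1$ to $k$ on the rating scale. Estimates of judge severity and fit derived from this model underpin the consistency analyses reported later in the book.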