Calibration of teachers' scores

Bruce Brown & Anthony Kuk
Department of Statistics & Applied Probability

1. Introduction.

In the ranking of the teaching effectiveness of staff members through their student feedback scores, there can be a problem if it is perceived to be more difficult to obtain high scores in some courses than in others. The result would be that staff taking difficult courses are disadvantaged. For example, it is often claimed that GEM modules and courses at the 1000 level are more difficult to score well in than courses at higher levels. Another criterion sometimes suggested as a classifier for difficulty of courses is class size, it being easier to obtain good student scores for smaller classes.

The problem of finding a suitable adjustment to feedback scores, in order to correct for the difficulty of the courses taught, is not as easy as one might think. On the surface, based on Faculty of Science data for AY2003/04, there is not a great deal of difference between the average feedback score of 3.934 for 1000 level modules and the average scores for other levels, which range from 3.901 to 4.144. However, this direct comparison is valid only if the allocation of lecturers to modules is done at random, which is certainly not the case. To see why a direct comparison is not valid, consider the extreme case where all the best teachers are assigned to teach 1000 level modules and the worst teachers to teach another level, resulting in an inflated score for 1000 level modules and a deflated score for the other level, which leads to no significant difference between the two. (This is similar to a situation in clinical trials where the stronger patients are assigned to the control group and the frail patients to the treatment group, with the result that the treatment effect cannot be established.) This situation is not as far-fetched as one might think, because there may be a tendency, perhaps out of strategic considerations, for individual departments to assign their better teachers to teach 1000 level modules. If this is the case, and there is some evidence of this from Table 1, then the naive calibration method of subtracting the level average from the individual teacher scores would be misleading. Because of the selection effect, the 1000 level average would be artificially inflated relative to the scenario of random allocation of lecturers, and subtracting this inflated average (over-subtraction) would not do justice to those lecturers teaching 1000 level modules. It is as a safeguard against selection bias of this kind that randomized allocation is the gold standard in clinical trials. Unfortunately, randomization is not an option in the assignment of teachers to modules, so we have to think of ways of calibrating feedback scores that remain valid under non-random assignment.

It is possible to calculate suitable corrections in an equitable manner, using some simple principles of statistical experimental design and analysis. The purpose of the present document is to outline how this can be done, using a matched-pairs design. Correction formulas are derived for stratification in terms of course level, but can be applied equally

well, with obvious modifications, to any other desired stratification, such as class size. To be concrete, we focus on the student response to one particular question in the teacher evaluation form, namely the question on the overall effectiveness of the lecturer. The same procedure could be applied to the other questions as well. The focus of this document is on the calibration of student feedback scores; student comments and peer reviews are other important indicators that should be considered to give a more balanced view.

Expressed in mathematical terms, when staff member j teaches a course at level i, an average student assessment score y_{ij.} is recorded. The dot notation signifies averaging over the student responses in the class. A basic premise is that the score y_{ij.} depends in an additive way on two factors: an intrinsic, unobservable teaching ability score µ_j for the jth staff member, and another factor α_i associated with the level of the module taught, so that

    y_{ij.} = µ_j + α_i + error.     (1)

The error term has zero mean, an approximately normal distribution due to the central limit effect of averaging, and a variance proportional to the reciprocal of the class size. The goal is to estimate the {µ_j} terms for individual staff members, and to create a ranking based on these estimated values, but in order to do this it is necessary to estimate, and hence eliminate, the level effects {α_i} attributable to teaching courses at different levels. Using the hat notation to denote a statistical estimate: if estimates {α̂_i} were available, then the teacher effectiveness scores {µ_j} could be estimated by µ̂_j = y_{ij.} - α̂_i, and an overall estimate of µ_j would be the average of these µ̂_j terms over the various courses taught by staff member j during an academic year. But at first sight there is no obvious way to estimate the level effects {α_i}, because of the non-random allocation of modules to lecturers.

2. A matched-pairs design.

However, an approach which provides an effective way to estimate the level effects is made possible by the fact that there are many teachers who have taught modules at two or more different levels during the same academic year. If staff member j teaches courses at levels i_1, i_2 in semesters 1, 2 respectively, then the average student assessments are modelled as

    y_{i_1 j.} = µ_j + α_{i_1} + error,
    y_{i_2 j.} = µ_j + α_{i_2} + error,

and the difference is

    d_j = y_{i_1 j.} - y_{i_2 j.} = α_{i_1} - α_{i_2} + error.     (2)

Note how the lecturer effect µ_j cancels, regardless of how the teaching allocation was done, because we are taking the difference of two scores obtained by the same lecturer; what is left is an unbiased estimate of the level difference. We can set α_1 = 0, making 1000 level modules the baseline for comparisons; this will not affect the estimates of differences between the various {α_i}. The estimate of α_i for i > 1 becomes an adjustment to be subtracted from each feedback average score for a level i module. Because there will be many staff members providing an observed difference d_j in any academic year, all the {α_i} terms are estimable. Intuitively, a kind of overall repeated averaging, based upon the observed differences {d_j}, will provide the estimates {α̂_i}. In practice, the estimates are obtained by fitting the linear additive model (1) to the student feedback data using the method of unweighted or weighted least squares. To obtain the calculated estimates, the data are coded in a systematic way and entered into any standard statistical computer package for analysis.

3. Example: analysis of Faculty of Science feedback data

The method proposed is illustrated by an analysis of the Faculty of Science feedback data for AY2003/04. First, the data were organized by deleting entries for teachers who did not teach at different levels in AY2003/04. Then the frequency of courses at different levels was examined. There were only two cases at level 6000 and one at 8000, so these were combined with the 5000 level cases. There were still some small numbers, so the three USP groups were combined, as were the two GEM groups. After these combining steps, some teachers were teaching modules all at the one level (for example, all USP modules, or all GEM modules), so their entries had to be deleted as well. After all deletions and grouping, there was a total of seven levels, from 1000 to 5000, plus USP and GEM, with the frequencies below. Note that the omission of those lecturers who did not teach modules at different levels is for expository purposes only, to highlight the fact that the level effects are estimated from the matched-pairs data. In actual fact, the same estimates of level effects (relative to a baseline) would be obtained if we kept all the data when fitting the additive model (1), because the scores from lecturers who have taught at only one level contribute only to the estimation of individual lecturer effects, not to the differences of level effects.

Tally for Discrete Variables: MODULE LEVEL (n = 482)

    Module level   1000   2000   3000   4000   5000   GEM   USP
    Count            64     71    108    114     79    29    17
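As a concrete illustration of the fitting step just described, the sketch below shows one way the additive model (1) could be coded and fitted in a general-purpose statistical package. It is only a sketch: the file name and the column names (score, lecturer, level, class_size) are hypothetical, not the actual coding used for the Faculty of Science data.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical layout: one row per (lecturer, module) with the class-average score.
    data = pd.read_csv("feedback_ay0304.csv", dtype={"level": str})

    # Additive model (1): class-average score = lecturer effect + level effect + error.
    # Treatment coding with the 1000 level as reference fixes alpha_1000 = 0 as the baseline.
    unweighted = smf.ols(
        "score ~ C(lecturer) + C(level, Treatment(reference='1000'))", data=data
    ).fit()

    # Collect the estimated level effects alpha_hat relative to the 1000 level baseline;
    # patsy labels these coefficients "C(level, Treatment(reference='1000'))[T.<level>]".
    prefix = "C(level, Treatment(reference='1000'))[T."
    level_effects = {
        name[len(prefix):-1]: coef
        for name, coef in unweighted.params.items()
        if name.startswith(prefix)
    }
    print(level_effects)

The adjustment for a level i module is then the estimated effect for level i in this dictionary, to be subtracted from the module's average score.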

(i) Unweighted analysis.

The first analysis to be discussed treats all the observed differences {d_j} as being of equal accuracy; that is, the fact that feedback averages based on large class sizes are more accurate is ignored. The matched-pairs model, which eliminates all but the module effect, was then analysed. Residual plots were made for cautionary diagnostic purposes. The first plot, in Fig 1, of residuals versus fitted values, reveals no severe deficiency and shows no apparent large outliers, though there is a slight reduction of variability for higher scores. This is consistent with the compression of scores near the upper bound of 5, and may suggest the application of a transformation to remedy this effect, though it is unlikely that such a transformation would lead to drastic changes in the conclusions. The second plot, in Fig 2, is a normal probability plot of the residuals; it is close to a straight line, providing a general confirmation of normality. There are a small number of residuals which are larger than expected, but not large or frequent enough to influence the results unduly. An ANOVA shows that there are highly significant differences attributable to the variable module level. The fit indicated by R-squared is good.

Analysis of Variance for RATING, using Adjusted SS for Tests

    Source         DF    Seq SS    Adj SS   Adj MS      F      P
    LECTURER      163  39.82645  41.41942  0.25411   3.31  0.000
    MODULE LEVEL    6   6.01376   6.01376  1.00229  13.04  0.000
    Error         312  23.97701  23.97701  0.07685
    Total         481  69.81722

    S = 0.277217   R-Sq = 65.66%   R-Sq(adj) = 47.06%
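For reference, an analysis-of-variance table analogous to the output above (sequential and adjusted sums of squares) can be produced from the fitted model in the earlier sketch; again this is only a sketch, reusing the hypothetical 'unweighted' fit.

    import statsmodels.api as sm

    # Sequential (cf. Seq SS) and adjusted (cf. Adj SS) ANOVA tables for the unweighted fit;
    # 'unweighted' is the fitted model object from the earlier sketch.
    print(sm.stats.anova_lm(unweighted, typ=1))
    print(sm.stats.anova_lm(unweighted, typ=2))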

Fig 1: residuals plotted against fitted values (response is RATING).

Fig 2: normal probability plot of the residuals (response is RATING).
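The two diagnostic plots could be reproduced along the following lines; this is a minimal sketch, again assuming the hypothetical 'unweighted' fit from the earlier sketch.

    import matplotlib.pyplot as plt
    from scipy import stats

    # Residuals versus fitted values (cf. Fig 1) and a normal probability plot of the
    # residuals (cf. Fig 2).
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(unweighted.fittedvalues, unweighted.resid, s=10)
    ax1.axhline(0.0, linestyle="--")
    ax1.set_xlabel("Fitted value")
    ax1.set_ylabel("Residual")
    stats.probplot(unweighted.resid, dist="norm", plot=ax2)
    ax2.set_title("Normal probability plot of the residuals")
    plt.tight_layout()
    plt.show()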

Estimated module level effects, standard errors, and adjustments

    Module level   Raw average   Estimated effect   Estimated adjustment   Standard error
    1000                 3.934              3.767                      0                0
    2000                 3.901              3.858                 -0.092            0.063
    3000                 3.947              3.979                 -0.212            0.0
    4000                 4.02               4.039                 -0.272            0.04
    5000                 4.144              4.214                 -0.447            0.08
    GEM                  4.143              3.847                 -0.080            0.073
    USP                  3.942              4.141                 -0.375            0.104

Table 1: fitted module level effects in the unweighted model.

The estimated effects due to the different module levels are listed in Table 1 above. Several of the estimated adjustments for different module levels are significant, or nearly so. An analysis including weighting proportional to class sizes would make better use of the available information, and could be expected to yield more strongly significant results. There is, apparently, a definite trend whereby it is more difficult to obtain high student feedback scores for lower level or GEM modules. It is easier to get higher scores where the students are at a higher level, or are more motivated, as in USP modules; there is a general trend for higher level modules to lead to higher feedback scores. Note that the raw averages of scores for GEM and 1000 level modules are substantially higher than the module level effects estimated unbiasedly using the additive model (1). This lends support to the theory that there is a tendency to allocate the better teachers to teach 1000 level and GEM modules rather than modules at other levels. Since the feedback scores for GEM and 1000 level modules are artificially inflated, the naive method of adjustment based on raw level averages will under-adjust for this group of modules. The method that we propose is able to circumvent this complication of non-random allocation of teachers, by creating matched pairs of teacher scores which differ only in module level. Another way to see this is that model (1) allows arbitrary lecturer effects explicitly and so is general enough to cover the case of non-random assignment of teaching duties.

(ii) Weighted analysis.

In least-squares statistical analyses, the optimal form of weighting is for weights to be chosen proportional to the reciprocal of the observational variance. In the present analysis the variance of the average feedback score for an individual module should be proportional to the reciprocal of the class size if students respond independently, so the correct weights are the class sizes themselves. Carrying out a weighted version of the analysis just described, and applying it to all the Science faculty teachers in AY 2003/2004, i.e., not just those involved in the matched-pairs part of the design, gives the results described in the following output. Table 2 is an ANOVA which tests for the significance of the effects of the different module levels.
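Before turning to that output, a minimal sketch of how the weighted fit could be specified, assuming the same hypothetical data frame and column names as in the earlier sketch (class_size holding the number of respondents for each module):

    # Weighted least squares with weights proportional to class size.
    weighted = smf.wls(
        "score ~ C(lecturer) + C(level, Treatment(reference='1000'))",
        data=data,
        weights=data["class_size"],
    ).fit()
    print(weighted.params.filter(like="C(level"))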

Tests of between-module effects (b)

    Source             Type III Sum of Squares    df   Mean Square       F   Sig.
    lecturer                          6419.663   267        24.044   9.781   .000
    module                             186.31      6        31.08   12.63    .000
    Error                              882.47    359         2.458
    Corrected Total                   7471.479   632

    (a) R Squared = .882 (Adjusted R Squared = .792)
    (b) Weighted Least Squares Regression - Weighted by class size

Table 2: ANOVA to test for the effects of different module levels.

The adjustments, with the exception of that for USP, now have smaller standard errors, and are all significant. The estimated pattern of adjustments rises smoothly as the level increases, with the exception of the 5000 level modules, whose adjustment term is high, at 0.433. The difficulty of scoring highly in GEM modules is roughly the same as for 2000 level modules, and in the same respect, USP modules are roughly equivalent to 5000 level modules. This may reflect a high level of interest and motivation among students in the USP programme.

Parameter Estimates (b)

    Module level   Estimated effect   Estimated adjustment   Standard error
    1000                      3.787                    0.0              0.0
    2000                      3.870                 -0.083            0.044
    3000                      3.934                 -0.147            0.036
    4000                      3.980                 -0.193            0.04
    5000                      4.220                 -0.433            0.0
    GEM                       3.864                 -0.077            0.038
    USP                       4.154                 -0.367            0.11

    (b) Weighted Least Squares Regression - Weighted by class size

Table 3: fitted module level effects in the weighted analysis.

Rankings within DSAP.

Finally, the effect upon within-department rankings of teachers, due to adjustment of scores using the estimated module level effects, is shown in Figures 3 and 4 below, for the Department of Statistics & Applied Probability, in AY 2003/04. Note that rank = 1 is best and rank = 27 is worst. The rankings before and after adjustment are broadly similar, while displaying some distinct changes. The personnel in the top one-third (9 out of 27) remain unchanged by both rankings. These nine teachers

form two clusters. The first cluster consists of the top four teachers according to raw scores. Their average raw scores range from 4.458 to 4.531, a very narrow range of 0.073. After adjustment, the teacher who was ranked first dropped to third position, because one of the modules that he taught was a 5000 level module, which resulted in a downward adjustment. The score of the teacher originally ranked second was adjusted even more, because the only module that he had taught was a 5000 level module. The teacher originally ranked fourth was ranked first after adjustment because he taught only 1000 level modules. While this may sound drastic, one should keep in mind that their raw scores differ by only 0.073, and so the levels of the modules taught should, and are expected to, have a bearing on the final rankings. The second cluster of teachers has average raw scores ranging from 4.211 to 4.277, a difference of 0.066. Again, because of this narrow range, there are interchanges of positions within this group of teachers after adjustments that take into account the levels of the modules taught.

Fig 3: Original and adjusted scores for teaching staff in DSAP, AY 2003/2004 (lines show raw scores, 50% adjustment, and fully adjusted scores).

Fig 4: Original and adjusted ranks for teaching staff in DSAP, AY 2003/2004 (adjusted ranks plotted against original ranks).

There is no doubt that making the indicated adjustments, by subtracting the estimated confounding terms for module levels from the individual average feedback scores, will substantially remedy any perceived difficulty of scoring highly on lower level courses. There is a small but noticeable effect upon the resulting rank-ordering of teachers, which is shown, for the present example from DSAP, in the graph above. The proposed adjustment method could be somewhat controversial, because it seems to favour lecturers teaching lower level modules and to penalize those who teach higher level modules. It should be stressed, however, that the required adjustments are estimated from the differences in scores received by the same lecturer, and so should be credible. By taking the difference, the effects of lecturers are removed, and the remaining systematic effects can reasonably be attributed to the levels of the modules alone. We believe that the changed rankings are made upon an equitable basis. Certainly, calibration by subtracting the level-specific average without making allowance for the selection effect in the assignment of teaching duties will result in under-adjustment, whereas the adjustment proposed in this working paper appears to be somewhat drastic, particularly when it is applied to feedback scores for 5000 level modules. The truth may lie somewhere in between. One possibility is to average the raw score with the adjusted score, or equivalently to adjust by 50%; the resulting scores can be read off the broken line in Figure 3. Moving this line up and down corresponds to less than and more than 50% adjustment.
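To make this last step concrete, the sketch below computes fully and partially adjusted scores and the corresponding within-department rankings, reusing the hypothetical data frame and the level_effects dictionary from the earlier sketches; lam = 1 gives the full adjustment and lam = 0.5 the 50% adjustment just discussed.

    # Partially adjusted score for a level-i module: raw score minus lam times alpha_hat_i.
    def lecturer_ranks(data, level_effects, lam=1.0):
        alpha = data["level"].map(level_effects).fillna(0.0)  # baseline 1000 level: alpha = 0
        adjusted = data["score"] - lam * alpha
        # Average each lecturer's adjusted scores over the modules taught, then rank
        # (rank 1 = best, i.e. the highest average adjusted score).
        return adjusted.groupby(data["lecturer"]).mean().rank(ascending=False)

    full_ranks = lecturer_ranks(data, level_effects, lam=1.0)   # fully adjusted
    half_ranks = lecturer_ranks(data, level_effects, lam=0.5)   # 50% adjustment
    raw_ranks  = lecturer_ranks(data, level_effects, lam=0.0)   # no adjustment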

4. More sophisticated analyses

We have tried two more sophisticated analyses. For the first, we took note of the fact that the individual student responses for the high scoring modules tend to exhibit less variability. We performed a new weighted analysis that took this into account, but the conclusions were not changed by much and so will not be reported here. Our second analysis was an attempt to remove the compression or ceiling effect at both ends of the 1 to 5 scale by applying the logistic transformation to the average scores, i.e., we fitted model (1) to the transformed average scores. Again, we obtained more or less the same results, which will not be repeated here. These findings lend support to the robustness of the simple analyses that we performed in section 3.
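For completeness, one plausible form of the logistic transformation for scores on the 1 to 5 scale is sketched below; the exact transformation used in the second analysis is not spelled out above, so this is an assumption rather than a description of the implementation.

    import numpy as np

    # Logit transform of an average score y on the 1 to 5 scale; eps keeps scores that sit
    # exactly at the endpoints away from plus or minus infinity. Model (1) would then be
    # fitted to the transformed averages instead of the raw ones.
    def logit_1_to_5(y, eps=1e-3):
        p = np.clip((np.asarray(y, dtype=float) - 1.0) / 4.0, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))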