
Welcome to this presentation about Smarter Balanced Assessment scores. This presentation is brought to you by the Assessment Development team at OSPI. My name is Kara Todd, and I'll be your guide today. My contact information is included on the last slide. Please contact me if you have questions or feedback after viewing this information. 1

The purpose of this presentation is to help answer some basic questions about the test scores that we have received from the field. We won't go very far into the statistics, and we'll keep our examples as simple as we can, but we hope you'll come away with a better understanding of how to look at these scores. We've included links to resources throughout the presentation, and in the FAQ document posted next to this video, for those of you who wish to delve deeper into the topics covered here. The intended audience for this information includes District Assessment Coordinators, principals, data or instructional coaches, and teachers, since those are the folks who regularly contact OSPI with these types of questions. This information can also be shared with parents. If you've ever asked or ever been asked how the Interim Assessment Block scores or the Claim score categories are determined, this presentation is for you! We'll even answer a few other questions along the way. You are welcome to share this presentation and the information contained in it with educators and parents within the state of Washington. It might be a great addition to professional learning community discussions about Smarter Balanced testing, especially if your school or district is using the Interim Assessments and looking at data in AIR Ways. Educators outside of Washington who also administer the Smarter Balanced Assessments are welcome to view this information; just know that your state may have different reporting systems and may not have all the vendor systems mentioned. Let's get started! First we will discuss the summative scores, and then we will discuss the interim test scores. 2

For each content area, Washington reports three things: an overall scale score, an achievement level, and a claim achievement category. These scores are what show up in the Score Reports section of the Online Reporting System. 3

These scores are also what is shown on the Family Report that is mailed home to families each fall. This report is also referred to as the Individual Student Report. The first page has the ELA/Literacy results, [CLICK] then the second page has the Mathematics results. There are 3 main parts to this report: on the left-hand side is the scale score [CLICK], then the achievement levels [CLICK] are on the right, and then the claim scores [CLICK] are in a table at the bottom. 4

First let's talk about the scale score. The scores are a 4-digit number on a scale from about 2000 up to about 3000. Smarter Balanced decided to use this 4-digit number range to make it different from the 3-digit numbers that most states had previously used for reporting state test scores. They chose a range of about one thousand to ensure a statistically acceptable level of precision for student scores. The decision to use a scale score is related to the fact that the Smarter Balanced Assessment is an adaptive test. Students who are performing well will begin to see harder items as they progress through the test, and students who are not performing well will begin to see some easier items. As a result, not all students in a grade level see the exact same set of items. Having a scale score lets us put all the results from a group of students onto the same measuring stick. So, how do the scores get onto that stick? 5

They do NOT get there by using only a simple calculation of raw-points-earned divided by total-points-possible. The points earned and the points possible are involved, but Smarter also takes into consideration the difficulty level of the items that the student saw. Using the item difficulty is what helps us get scores onto the same scale for students who saw different items. 6

I am going to simplify things here in order to give you an example. This grey line is the scale from 2000 to 3000 that scores end up on. I'm going to pretend that items on a test are really spread out across the scale, and that they start at the very easy end of the scale and get harder with each item. Let's say that two students, Anya and Beatrice, take a math test. They both answer all of the addition questions correctly, then Anya answers none of the multiplication questions correctly, but Beatrice answers most of the multiplication questions correctly. [CLICK] So, here are the 5 addition questions Anya answered correctly in blue, with the same 5 questions answered by Beatrice just above in green, and then the 4 multiplication questions Beatrice answered correctly, also in green. The multiplication questions are harder than the addition questions, so Beatrice ends up with a higher scale score. Now let's add a third student, Calli, in dark orange. [CLICK] Calli takes a similar math test. The questions are still about addition and multiplication, but the digits are different. Because all the math questions have a difficulty value, Calli's score can be put on the same scale with Anya and Beatrice. Notice that Calli gets the same number of items correct as Beatrice, but since two of her questions were harder, she gets a higher scale score.

To summarize: the overall scale score for Math or ELA uses all the items that a student saw on the test, information about the difficulty of those items, and whether or not the student answered them correctly. Then some elaborate statistics calculations are done to put the results onto the scale of 2000 to 3000. (If you want to know more about the elaborate statistics calculations, we've got some resources for you listed in the FAQ document.) 7
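(For those who want a concrete feel for how "item difficulty plus correctness" can place students who saw different items onto one scale, here is a minimal sketch in Python. It uses a simple one-parameter, Rasch-style model and a grid search for the most likely ability, then a made-up linear conversion to a 2000-3000 style number. The real Smarter Balanced scoring uses a more elaborate item response theory procedure and its own official scale transformation, so every number below is illustrative only.)

```python
import math

def rasch_prob(theta, difficulty):
    """Probability of a correct answer under a simple Rasch (one-parameter) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_theta(responses):
    """Grid-search the ability value that makes the observed responses most likely.

    responses: list of (difficulty, answered_correctly) pairs for the items this student saw.
    """
    grid = [i / 100.0 for i in range(-400, 401)]  # candidate abilities from -4.0 to 4.0
    best_theta, best_loglik = grid[0], -math.inf
    for theta in grid:
        loglik = 0.0
        for difficulty, correct in responses:
            p = rasch_prob(theta, difficulty)
            loglik += math.log(p if correct else 1.0 - p)
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta

def to_scale_score(theta):
    """Illustrative linear conversion onto a 2000-3000 style scale (NOT the official one)."""
    return round(2500 + 125 * theta)

# Anya: five easy addition items correct, four harder multiplication items incorrect.
anya = [(-2.0, True)] * 5 + [(1.0, False)] * 4
# Beatrice: the same five easy items correct, plus three of the four harder items correct.
beatrice = [(-2.0, True)] * 5 + [(1.0, True)] * 3 + [(1.0, False)]

print(to_scale_score(estimate_theta(anya)))      # lower scale score
print(to_scale_score(estimate_theta(beatrice)))  # higher, because harder items were answered correctly
```

Because the estimate depends only on the difficulties of whatever items a student happened to see, a student like Calli who saw a different mix of items still lands on the same scale, which is the point of the measuring-stick analogy.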

Now we will turn our attention to the achievement levels. They are shown on the report as part of this thermometer-type chart, with a statement about each level. 8

The scale scores were divided into four categories of performance, which WA reports as Levels 1, 2, 3 and 4. The statements are called Achievement Level Descriptors, or ALDs. The ALDs are statements about what knowledge and skills students display at each level. The statements shown on the Individual Student Report are high-level summaries of a very detailed set of information. [CLICK] You can find a link to the detailed ALDs on our website. [CLICK] The score that divides between two levels is called a cut score. Educators used the ALDs to determine the places to make 3 cuts, resulting in the 4 levels. The cut score that divides achievement Level 2 from Level 3 is referred to as the standard. Students with scores above this cut have met the achievement standard. I'll keep referring back to this standard cut score throughout the rest of this presentation, so take note that it's the cut between Level 2 and Level 3 for the overall score. The thermometer graphic on the Individual Student Report might lead one to think that the scale from 2000 to 3000 is just broken into 4 equal parts. It's not. 9

Here is how the ELA grade 7 scale is broken up: Orange is Level 1, yellow is Level 2, green is Level 3, and blue is Level 4. The standard cut is at the point between 2551 and 2552, not exactly at 2500. Note that there are different ranges of points in each of the levels: you can see how the orange and blue are much bigger than the yellow and green, and there is a range of 72 points in the yellow, and 88 points in the green. Also note that these are the specific cuts for 7th grade ELA. The 7th grade math cuts are in different spots than these, and the ELA cuts for 6th and 8th grade are also at different places on the scale. The range of scale scores for each achievement level [CLICK] can be found on this Smarter Balanced webpage. 10
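(To make the cut-score idea concrete, here is a tiny sketch of how a scale score maps to a level. The 2552 standard cut is the value quoted above for 7th grade ELA; the Level 2 and Level 4 starting points are only reconstructed from the band widths mentioned on this slide, about 72 points of yellow and 88 points of green, so treat them as approximations and check the official Smarter Balanced table for the real values.)

```python
# Illustrative grade 7 ELA cut scores: 2552 is quoted in this presentation,
# the other two are approximations reconstructed from the band widths above.
LEVEL2_STARTS = 2480   # approximate
LEVEL3_STARTS = 2552   # the "standard" cut between Level 2 and Level 3
LEVEL4_STARTS = 2640   # approximate

def achievement_level(scale_score):
    """Map an overall scale score to achievement Level 1-4 using the cut scores."""
    if scale_score >= LEVEL4_STARTS:
        return 4
    if scale_score >= LEVEL3_STARTS:
        return 3
    if scale_score >= LEVEL2_STARTS:
        return 2
    return 1

print(achievement_level(2551))  # 2 -- one point below the standard cut
print(achievement_level(2552))  # 3 -- has met the achievement standard
```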

The specific cuts for the four levels were set back in 2014 and 2015, during a series of meetings starting with classroom teachers and other educators from all of the Smarter Balanced states, and ending with our WA State Board of Education. You can find more details about the process for setting those cut scores at the sites posted here. A special note about the high school scores: in August 2015, WA's State Board of Education established a cut score for high school students that is called the graduation cut score. This score is lower than the standard cut between Levels 2 and 3 for high school, and is only used for meeting assessment requirements for graduation from high school in Washington state. It is not used for anything that we will be discussing further in this presentation. 11

Before we talk about Claim Scores, we need to discuss a part of the score report that may not get much attention, but plays an important role: the Standard Error of Measurement, or SEM. [CLICK] Each student scale score is reported along with an SEM value. This is the ± value displayed next to the 4-digit scale score. Each student score can have a different SEM value. This value, much like the scale score itself, depends on the items a student saw, which ones the student answered correctly, and which ones the student did not. [CLICK] The SEM is a statistical value that indicates a range where the student's score would be likely to fall if they took the test several times. You might hear it referred to as the margin of error or as the standard error. 12

Here is where the SEM shows up on the Individual Score Report: by the thermometer with a plus/minus value and an error bar, and then the range [CLICK] is given in a paragraph under the scale score. The paragraph explains that if the test is taken several times, it's likely that the score could fluctuate. In this example, Jane's score could be anywhere between 2670 and 2690, which are plus and minus 10 away from her score of 2680 this time. 13

An SEM is also included with group average scores in the Online Reporting System. An important thing to know about SEM is that the smaller the group of students, the larger the SEM. Thus the individual student level is where we see the largest SEM values. 14

This slide shows some images from the Online Reporting System. The SEMs are displayed in grey next to the scale scores. The average comparisons column on the left shows that as the number of students in the group decreases, the SEM number increases. When there were 306 students included in the group at the district level, the SEM was 5. The SEM increased to 8 at the school level, and then to 16 when looking at 28 students in the classroom. [CLICK] The column of Students on the right shows scores for 4 students from this school (in no particular order) and how each student can have a different SEM. Students can have different SEMs because they saw different test items, and got different combinations of items correct. Even students with the same scale score can have different SEM numbers because they got different combinations of items correct. Now let's talk about how SEM plays a role in determining the next part of student scores... 15

...the Claim Achievement Category. Washington reports claim scores as one of three claim achievement categories: Above Standard, At/Near Standard, or Below Standard. 16

Claim achievement category results come from comparing the student's claim scale score to the standard cut score, and incorporating the claim SEM. This is a multi-step process, so I'm going to break it down into its parts over the next few slides and give you an example to track as we go through the steps. [CLICK] The first step of the process is to break the items on a student's test up into their claim categories. I'm going to use an imaginary student named David for our examples. He's in 7th grade, and we have his ELA exam to use. There are four claims in ELA: Reading, Writing, Listening, and Research. There are about 45 items on the average ELA exam. Let's say that David answered 15 Reading items, 13 Writing items, 8 Listening items, and 9 Research items on his Summative ELA assessment. 17

The 2nd step is to find the claim scale score. The sets of items from a single claim are put through the same statistical process as for the overall score, using the item difficulty value and whether or not the questions were answered correctly, and put on the same 2000 to 3000 scale. In this graphic I've laid out the items David took into the 4 claim areas, and this time I've used dots for items he answered correctly and Xs for the items he answered incorrectly, and then arranged them by difficulty value. [CLICK] Notice that the dots for the most difficult item answered correctly in each claim are in different spots on the scale. This is to show that the different claims can have different scale scores. Also know that the claim scale scores may be higher and/or lower than the student's overall scale score for the content area. Let's say that David got a Reading claim scale score of 2780 [CLICK]. 18

The 3rd step is that the SEM value for the claim scale score is calculated using the same statistical process as was used to find the SEM value for the overall score. [CLICK] So we'll say that David's SEM for his Reading claim is a plus or minus of 50. [CLICK] The claim scale scores and associated SEMs can be found in score files downloaded from WAMS and/or from the Retrieve Student Results section in ORS. They do not show up on the Individual Student Reports. Please note that the statistical principle that causes smaller groups of students to have higher average SEMs also means that smaller numbers of items produce higher SEMs. Therefore, the SEMs for the claim scale scores are likely to be larger numbers than the SEM for the overall scale score. 19

The 4th step is to use the standard cut score for the particular grade level and content area, along with the Claim SEM, to do some math. [CLICK] The claim scale score is set over to the side for this step. First we will multiply the SEM by one and a half (or 1.5). Multiplying by 1.5 is a statistics procedure that has to do with the accuracy of the prediction being made about student ability. That multiplication will give us what I'm going to call our adjusted SEM. Then we add and subtract the adjusted SEM from the standard cut score. This will give us a high number and a low number. [CLICK] So, for our imaginary student David, we use the standard cut score for 7th grade ELA, which is 2552. Then we take his claim SEM of 50 and multiply it by 1.5, which equals 75. [CLICK] Then we add and subtract, [CLICK] and get 2627 as our highest number and 2477 as our lowest number for David. [PAUSE] An important thing to remember is that the standard cut score used here is the cut between Levels 2 and 3 for that grade level and content area. And for scores on high school level tests in WA, this step still uses the Smarter Balanced standard cut score, NOT the graduation cut score, which is lower. 20

Then our 5th step is to compare the high and low numbers to the student's claim scale score. If the claim scale score is lower than the low number, then the claim achievement category is Below Standard. [CLICK] This means that the student has a weakness in this claim category. If the claim scale score is higher than the high number, [CLICK] then the claim achievement category is Above Standard. This means that the student has a strength in this claim category. If it falls in between, [CLICK] then a determination of strength or weakness cannot be made and the claim achievement category is At or Near Standard. So, David's high and low numbers from step 4 [CLICK] are 2477 and 2627. And his claim scale score from step #2 was 2780, [CLICK] which is larger than both of these numbers, so... 21

...it falls in the Above Standard area, and our imaginary student David has a strength in the area of Reading. 22

Here is another way to illustrate it. Remember the 7th grade ELA scale? Let's concentrate on the standard cut of 2552 [CLICK]. Then we'll add [CLICK] our claim SEM times 1.5 to get our high number, and subtract [CLICK] it to get our low number. If the student's claim scale score is anywhere in here [CLICK], then they will get the At/Near Standard claim category. If the claim scale score is up here [CLICK], then they will get the Above Standard claim category. And if the claim scale score is down here [CLICK], then they will get the Below Standard claim category. The most important thing to note here is that the At/Near range is dependent on the size of the standard error. The larger the SEM for the claim scale score, the larger the At/Near range. A bigger SEM value is going to get even bigger when you do the math in Step #4, resulting in a bigger At/Near area for a student's claim scale score to fall into. 23

Here is the illustration with a claim scale score that has a larger SEM than the one in the graphic on the previous slide. Notice that the At/Near range got bigger. Remember when I first mentioned SEM, and I told you that the SEM is different for each student since they saw different items and got different combinations of items correct? Remember that? This is where that fact plays out in an even stronger fashion than it does for the overall scale score. The different SEMs result in different At/Near ranges for each student. Let me repeat: The different SEMs result in different At/Near ranges for each student. This is why 2 students can have the same claim scale score, but have different claim categories reported. This is also why there are no cut scores for the claims. 24

This is how the process can be described or summarized in words. I put the graphic here to help you connect. A claim score of Above happens when the student's claim scale score is more than 1.5 SEMs above the standard cut score. A claim of Above means that this is an area of strength for the student. A claim score of Below happens when the student's claim scale score is more than 1.5 SEMs below the standard cut score. A claim of Below means that this is an area of weakness for the student. And a claim score of At/Near happens when the student's claim scale score is within 1.5 SEMs of the standard cut score. A claim of At/Near means that this area is neither a strength nor a weakness for the student. 25
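(Pulling steps 4 and 5 together, here is a minimal sketch of that comparison in Python. The function name is mine; the logic is simply the multiply-the-SEM-by-1.5, add-and-subtract, then compare procedure described above.)

```python
def claim_category(claim_scale_score, claim_sem, standard_cut):
    """Classify a claim score as Above, At/Near, or Below the standard.

    standard_cut is the Level 2 / Level 3 cut for that grade and content area.
    """
    adjusted_sem = 1.5 * claim_sem        # step 4: widen the SEM by one and a half
    high = standard_cut + adjusted_sem    # upper edge of the At/Near band
    low = standard_cut - adjusted_sem     # lower edge of the At/Near band
    if claim_scale_score > high:
        return "Above Standard"           # a strength in this claim
    if claim_scale_score < low:
        return "Below Standard"           # a weakness in this claim
    return "At/Near Standard"             # neither a strength nor a weakness

# David's Reading claim: scale score 2780, SEM 50, grade 7 ELA standard cut 2552.
print(claim_category(2780, 50, 2552))    # Above Standard (band is 2477-2627)

# Two students with the SAME claim scale score but different SEMs can land in
# different categories, because a larger SEM widens the At/Near band.
print(claim_category(2640, 50, 2552))    # Above Standard   (band: 2477-2627)
print(claim_category(2640, 70, 2552))    # At/Near Standard (band: 2447-2657)
```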

We've reached the end of our journey through the summative test scores, so I'd like to summarize just a few points. The first is a reminder that scores are not determined by a calculation of raw-points-earned divided by total-points-possible. The second is that scores are related to: the particular set of items the student was presented with, the difficulty of those items, which of those items the student answered correctly, the SEM of the scale score, and the distance between the scale score and the standard cut score. The third is that the 4 achievement levels are defined by the Achievement Level Descriptors. We encourage educators to dig into the ALDs to get a better understanding of what the Achievement Level numbers mean for their work with students. Slide 9 has the link to that document, and the link is also included in the FAQ document. 26

Now we can turn our attention to the Interim scores. First, a very short explanation of the Interim Comprehensive Assessment scores. [CLICK] Even though the ICAs are currently not adaptive (so every kid DOES see the same exact items), the same statistical procedures are used to calculate scores as are used for the Summative assessment. They are also reported with the same three types of scores. And, like the summative results, [CLICK] the claim scale scores and associated SEMs can be found in ICA score files downloaded from the Retrieve Student Results section in ORS. Another effect of students viewing the same exact items on the ICAs is that you are more likely to see similar scale scores and similar SEMs among groups of students than you would expect to see on the Summative assessments. 27

Now we can turn our attention to the IAB scores. I said at the beginning of this presentation that we developed this training as a result of questions from the field; most of the questions from the field were about the IAB results and how to sort students based on these scores. First, [CLICK] the IABs are reported using what is called a performance category, and you'll notice that the categories have the same names as the summative claim achievement categories. So, hopefully it won't surprise you that we use the same process [CLICK] to determine these performance categories as was used to determine the claim scores. [CLICK] A combination of raw score and item difficulty results in [CLICK] a scale score and a standard error. Then we go through that SEM-times-1.5, add/subtract process to [CLICK] compare to the standard cut for that grade level and content area, which [CLICK] then gives us the performance category. 28

What is different for the IABs is that the IAB scale scores and associated SEMs are [CLICK] NOT available from the Retrieve Student Results section in ORS. Those are calculated behind the scenes by the scoring/data engine at AIR, so there is not a place for districts and schools (or for us here at the state level) to see the information contained in the purple arrow like there is for the Summative and ICA results. The parts of the process that we can see are [CLICK] the raw score and the performance category using AIR Ways and ORS, and the [CLICK] standard cut scores from Smarter Balanced. 29

Based on questions that we've received, we have a couple of other thoughts for you... The first is about the percentage of points earned. I'm confident that by now you know that the scores are not determined by percentages. But we're educators, we like to do math with student test results, and the AIR Ways layout of student scores makes these kinds of calculations possible. So I'd like to remind you of the impact of item difficulty on the scale scores. Let's return to Anya and Beatrice [CLICK] and give them the same IAB. The little diamonds are the 12 items they answered correctly, and the Xs are the 3 items they answered incorrectly. They answered different questions wrong, but they will have the same raw points of 12 out of 15, or 80%. It is possible that Anya and Beatrice will have a different performance category assigned to them. Beatrice is more likely to be Above Standard because she answered the hardest items correctly while Anya did not, so it's possible that Anya could be At/Near Standard. This is because it's the combination of the difficulty of the items they each answered correctly that impacts their scale score and SEM value, which could then result in being assigned different performance categories. So, be careful, very, very careful, in using percentages to look at IAB results. 30

The second thought is about item difficulty across the content. Remember that the IABs are groups of items about similar topics. Some IABs have content that is harder (in general) for students than other topics. This means that IABs with harder content will have items with higher difficulty levels... which leads to higher scale scores. [CLICK] Remember Anya and Beatrice's math test with the addition and multiplication items? What if we only had addition items on one test [CLICK] and had a separate test with the multiplication items? [CLICK] Beatrice got all 6 addition questions correct, which is 100%. Then she got 5 out of 6 multiplication questions correct, which is 83%. But the addition items are easier, so the maximum possible scale score will be lower for those addition items than for the multiplication items. She'll get a higher scale score for the multiplication test, even though she earned fewer raw points (and had a lower percentage) on that test. [CLICK] This is why raw points earned on one IAB cannot be compared to raw points on another IAB. Not within a grade level, not across grade levels. You can only compare the results of one IAB to the same IAB. 31

The third thought we have for you is about that Standard Error of Measurement value again... Remember that the smaller the number of items on a test, the greater the SEM. For example, I know more about Beatrice's ability in multiplication if I ask her 100 questions than if I ask her only 5 questions. The SEM on the 5-question test is much higher than the SEM on the 100-question test. As a result of this, the scale scores for IABs, which have from 5 to 15 items each, have larger SEMs than the overall scores for the Summative or ICAs, which have 40-47 items each. And the IABs with the smallest number of items, like the ELA Brief Writes (6 items), the ELA Performance Tasks (5-6 items), or the Math Performance Tasks (6 items), have the largest SEMs. Here's the implication of this. [CLICK] When we take the large SEM of the IABs, then do the math represented in the green boxes, the range for the At/Near Standard can get really big. REALLY big. That range can get so big that a student can earn as little as 1 raw point on a Performance Task IAB but be in the performance category of At/Near Standard. It's possible that the difficulty level of that 1 raw point is high enough to result in a pretty high scale score. Combine that with a large SEM value, and the likelihood of the performance being in the At/Near range is pretty high. So, yes, it is possible for a student to earn only 1 point on an IAB and get the At/Near performance category. 32
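(To see how that can happen numerically, here is a hedged example. The scale score and SEM below are invented purely for illustration; the point is only that 1.5 times a large SEM creates a very wide At/Near band around the standard cut.)

```python
# Hypothetical short Performance Task IAB: one raw point earned, but on the
# hardest item, giving a modest scale score with a very large standard error.
# Both numbers are made up for illustration; real values vary by test and student.
scale_score = 2450
sem = 150
standard_cut = 2552                  # grade 7 ELA standard cut from earlier slides

low = standard_cut - 1.5 * sem       # 2327.0
high = standard_cut + 1.5 * sem      # 2777.0
print(low, high)                     # the At/Near band is 450 points wide
print(low <= scale_score <= high)    # True: this 1-raw-point result reports as At/Near Standard
```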

Given all of that information, you may be asking yourself, "Now what? What should I concentrate on when looking at information in AIR Ways?" We'd like to encourage you to look past the colors and the category labels, and really dig into the items themselves. Here are some questions that could be asked: Were there items that all your students struggled with? Were there items that all of your students did well on? Are there any patterns in the ways they responded? Where are the outliers, items that all but a few students did well on? What instruction would benefit those few students? Were there trends in answers based on particular types of items? Does it look like students know what to do with technology-enhanced items like graphing or drag-and-drop? How are they handling the multi-select items, which allow them to choose more than one correct answer? What did you notice while hand scoring constructed response items in the Teacher Hand Scoring System? 33

Keep this in mind. The IABs are one tool that can be used to gather evidence about a student's understanding. The teacher should also gather additional evidence through formative practices. The IABs are not intended to be all the evidence that is collected, but they can be used to inform an educator's thoughts about a student's understanding. Several data points (including the IAB) should be used to formalize a conclusion about a student's understanding. 34

We've reached the end of our journey through the interim assessment scores, so I'll summarize just a few key points: The first is that the same statistical procedures are used to calculate scores for the ICAs as are used for the Summative Assessment. The second is that the same process is used to determine performance categories on the IABs as is used for calculating the claim scores on the Summative Assessment. The third is that we encourage teachers to focus on the items and how students responded to them when planning next steps after administering an IAB. 35

The FAQ document posted alongside this video on the WCAP Portal has: all the links included in this presentation, organized by slide number; answers to some frequently asked questions; and links to documents for further, more technical reading about the statistics involved in these processes. We also posted the original PowerPoint presentation so that you can incorporate the information into your own trainings. Please just remember to attribute the original information to OSPI and to inform your audience of any alterations or additions you may make. And just a reminder of our intended audience and places where you are welcome to share and incorporate this information. 36

Thank you for viewing this presentation. If you have questions or comments about this presentation, please email me, Kara Todd, and I'll respond directly. I will also add common questions and responses to the FAQ document. If you have content-specific questions, like about the items on the IABs or next steps to take with students, please contact the Assessment Specialist for the content area, either Anton Jackson for Mathematics or Shelley O'Dell for ELA. 37