
SCHOLASTIC PHONICS INVENTORY
Technical Guide
A breakthrough tool for assessing foundational reading skills for students in grades 3-12
Dr. Richard K. Wagner, in partnership with the Scholastic Research & Validation Department

Copyright 2011 by Scholastic Inc. All rights reserved. Published by Scholastic Inc. SCHOLASTIC, SPI SCHOLASTIC PHONICS INVENTORY, SRI SCHOLASTIC READING INVENTORY, SYSTEM 44, READ 180, and associated logos are trademarks and/or registered trademarks of Scholastic Inc. LEXILE and LEXILE FRAMEWORK are registered trademarks of MetaMetrics, Inc. Other company names, brand names, and product names are the property and/or trademarks of their respective owners.

Table of Contents

Introduction
  Overview of the SPI
  Uses of the SPI
  Rationale for the SPI
Administration and Scoring
  Administration
  Scoring: Fluency and Accuracy
Score Reporting and Interpretation
  Screening Report
  Reading Program Placement and Decoding Description
Development of the Scholastic Phonics Inventory
  Development of the SPI Item Bank
  Scoring and Cross-Validation Samples
  SPI Scoring Algorithm
    Item-Level Fluency Thresholds
    Combining Accuracy and Latency Into Fluency Scores
    Summary of the Development of the SPI Scores
  Form Equivalence
Reliability of Scholastic Phonics Inventory Scores
  Internal Consistency Reliability
  Alternate-Form Immediate Reliability
  Standard Error of Measurement
Validity of the Scholastic Phonics Inventory Fluency Scores
  Content-Description Validity
  Criterion-Prediction Validity
  Construct-Identification Validity
Accommodated Administration of the Scholastic Phonics Inventory
  Description of Accommodated Administration Procedures
  Form Equivalence
  Reliability and Validity of Accommodated Administration Scores
  Standard Error of Measurement
  Summary of the Reliability Analyses
  Summary of the Reliability and Validity Analyses
References

List of Tables

Table 1: Reading Program Placement and Criteria to Establish Decoding Status
Table 2: Descriptive Statistics of Standard Scores (SS) for Scoring Sample on Word-Level Criterion Measures
Table 3: Descriptive Statistics of Standard Scores (SS) for Cross-Validation Sample on Word-Level Criterion Measures
Table 4: Example of Sensitivity and Specificity Calculations
Table 5: Combining Accuracy and Latency Into Fluency Scores: Four Possible Response Patterns
Table 6: Mean Accuracy Scores for SPI Items FIND and PROTUND
Table 7: Mean Fluency Scores for SPI Items FIND and PROTUND
Table 8: Comparing Means and Standard Deviations of SPI Fluency Scores for Scoring Sample
Table 9: Comparing Means and Standard Deviations of SPI Fluency Scores for Cross-Validation Sample
Table 10: Internal Consistency Reliability Coefficients (Coefficient Alpha) for SPI Fluency Scores
Table 11: Alternate-Form Immediate Reliability Coefficients for SPI Fluency Scores
Table 12: Observed Validity Coefficients (and Coefficients Corrected for Range Restriction for Criterion Measures) for SPI Fluency Scores as Predictors of Reading Criterion Scores for the Scoring Sample (N = 91)
Table 13: Observed Validity Coefficients (and Coefficients Corrected for Range Restriction for Criterion Measures) for SPI Fluency Scores as Predictors of Reading Criterion Scores for the Cross-Validation Sample (N = 91)
Table 14: Descriptive Statistics, Tests of Group Differences, and Effect Sizes for Scoring Sample
Table 15: Descriptive Statistics, Tests of Group Differences, and Effect Sizes for Cross-Validation Sample
Table 16: Validity Coefficients (Point-Biserial Correlation Coefficients) for SPI Fluency Scores as Predictors of Poor Versus Adequate Decoding Group Membership

Table 17: Example of Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value Calculations
Table 18: Levels of Acceptability for Classification Statistics Proposed by Hammill et al. (2006)
Table 19: Classification Statistics for Predicting Decoding Status Using SPI Fluency Scores for the Scoring Sample (N = 91)
Table 20: Classification Statistics for Predicting Decoding Status Using SPI Fluency Scores for the Cross-Validation Sample (N = 91)
Table 21: Comparing Means and Standard Deviations of SPI Accuracy Scores for Scoring and Cross-Validation Samples
Table 22: Internal Consistency Reliability Coefficients (Coefficient Alpha) for SPI Accuracy Scores
Table 23: Alternate-Form Immediate Reliability Coefficients for SPI Accuracy Scores
Table 24: Observed Validity Coefficients (and Coefficients Corrected for Range Restriction for Criterion Measures) for SPI Accuracy Scores as Predictors of Reading Criterion Scores for the Scoring Sample (N = 91)
Table 25: Observed Validity Coefficients (and Coefficients Corrected for Range Restriction for Criterion Measures) for SPI Accuracy Scores as Predictors of Reading Criterion Scores for the Cross-Validation Sample (N = 91)
Table 26: Descriptive Statistics, Tests of Group Differences, and Effect Sizes for Scoring and Cross-Validation Samples
Table 27: Validity Coefficients (Point-Biserial Correlation Coefficients) for SPI Accuracy Scores as Predictors of Poor Versus Adequate Level Decoding Group Membership for the Scoring and Cross-Validation Samples
Table 28: Classification Statistics for Predicting Decoding Status Using SPI Accuracy Scores (N = 91)

List of Figures

Figure 1: Screening and Placement Report
Figure 2: Receiver Operating Characteristic (ROC) Curve for an SPI Nonsense Word Item

INTRODUCTION

Overview

The Scholastic Phonics Inventory (SPI) is a computer-based test of letter recognition, word reading efficiency, and phonological decoding. The SPI measures the accuracy and fluency with which students identify individual letters and words and decode nonsense words. A response is scored as accurate if the student selects the correct answer; it is scored as fluent if the student selects the correct answer within the established time limit for the item, known as a fluency threshold. The SPI contains three equivalent test forms for screening and progress-monitoring purposes. The software selects the appropriate test form automatically; each time a student logs on to take a test, the software delivers a new form.

Uses

The SPI was developed to identify students in grades 3-12 who are poor decoders and/or unable to recognize sight words with fluency, and to differentiate these students from those who are adequate decoders and able to recognize sight words with fluency. Within the poor decoder category, the SPI further describes student performance as Pre-Decoder, Beginning Decoder, Developing Decoder, or Advancing Decoder. These decoding descriptions assist educators in proper placement within a Tier 2 Strategic Instructional Program or a Tier 3 Intensive Intervention.

Rationale

The SPI measures fluency for two word-level reading skills: phonological decoding and sight word reading. Phonological decoding at the word level is a building block upon which fluent single-word reading and fluent reading of connected text for comprehension are based, and it is an important predictor of reading comprehension. The SPI uses nonsense word reading fluency as an effective measure for evaluating phonological decoding. When presented with a nonsense word, readers must break it into parts, retrieve the sounds associated with the parts, and string them together to pronounce the unfamiliar word. This process is assessed in the SPI by presenting examinees with pronounceable nonsense words.

A related element that contributes to fluency is sight word knowledge. Skilled readers have a large vocabulary of sight words that can be recognized automatically. However, developing a large vocabulary of sight words is largely dependent on the reader's ability to decode efficiently. Skilled readers analyze unfamiliar words or nonsense words more fully than poor readers do. For example, some poor readers tend to use initial consonant cues to guess at the rest of the word. A full analysis of unfamiliar words contributes to their becoming sight words over time. With repeated, accurate reading of the same word, the word eventually becomes stored in memory as a sight word, one that is identified automatically and without conscious thought.

The more accurate and automatic readers become with these word-level reading processes, the more cognitive resources become available for comprehending strings of text. In fact, for elementary-age students, word-level reading has been found to be a major determinant of reading comprehension (Jenkins et al., 2003; Stanovich, 1991). Difficulties with word-level reading become increasingly problematic as students get older. Problems with phonological decoding and sight word fluency result in poor comprehension and lower motivation (Snow, Burns, & Griffin, 1998), and as texts become increasingly advanced with each grade, poor readers fall further and further behind. Recent studies of struggling adolescent readers in urban schools indicate that over half are deficient in word-level reading skills (Hock et al., 2009).

ADMINISTRATION AND SCORING

Administration

The SPI contains three equivalent test forms. Each test form includes (1) a practice test with 11 items; (2) 11 letter recognition items; (3) 30 sight word recognition items; and (4) 30 nonsense word decoding items. Each form of the SPI is administered individually via a personal computer in approximately 10 minutes. To log in, students enter their user name and password on the log-in screen and then click the Go On button to begin. Students follow the audio directions to complete each section of the SPI. During all sections of the assessment, students can access the Pause/Play and Speaker buttons, as applicable. Once a student answers the last SPI question, he/she is asked to click the Go On button to complete the test and exit.

Scoring: Fluency and Accuracy

With respect to scoring, both fluency (i.e., speed and accuracy) and accuracy are assessed for sight words and nonsense words. Fluency is important because it frees the reader to attend to comprehension. If a student is accurate but slow (non-fluent), it is likely that reinforcement of basic skills, along with ongoing practice and corrective feedback, will increase word-level fluency. If a student is fluent with nonsense words but not with sight words, a plausible explanation is that the student has adequate phonological decoding skills but limited knowledge of the English vocabulary being assessed. On the other hand, if a student is fluent with sight words but not with nonsense words, the explanation may be that the student has compensated with a large sight word vocabulary yet continues to struggle with basic phonological decoding skills.
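The interpretive logic above can be sketched as a small decision function. This is an illustration only; the function name and return strings are ours, not part of the SPI software:

```python
def interpret_profile(sight_word_fluent: bool, nonsense_word_fluent: bool) -> str:
    """Map the two subtest fluency outcomes to the interpretations in the text."""
    if sight_word_fluent and nonsense_word_fluent:
        return "fluent word-level reading"
    if nonsense_word_fluent:  # fluent nonsense words, non-fluent sight words
        return "adequate decoding, but limited knowledge of the vocabulary assessed"
    if sight_word_fluent:     # fluent sight words, non-fluent nonsense words
        return "compensating with a large sight word vocabulary; weak decoding"
    return "reinforce basic skills with ongoing practice and corrective feedback"

print(interpret_profile(True, False))
```

Note that an accurate-but-slow profile is not captured by two booleans alone; the SPI's actual reports combine accuracy and fluency per subtest.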

SCORE REPORTING AND INTERPRETATION

Screening Report

Among other reports(1), the SPI generates a Screening and Placement Report (see Figure 1) that includes the following information:

- Date and Form of the SPI Placement Test
- Percent Accurate and Fluent on SPI subtests
- SPI Fluency Score
- SPI Decoding Status

If available, a Lexile measure obtained from the Scholastic Reading Inventory (SRI) is also included in the report.

Reading Program Placement and Decoding Description

SPI reports describe students' foundational reading skills in terms of four levels of Decoding Status. Results are based on the accuracy of student responses on the Letter Names subtest, and on both the accuracy and fluency (i.e., speed) of student responses on the Sight Words and Nonsense Words subtests. In addition, an accuracy-only scoring system was developed for accommodated administration purposes. For more detail on the reliability and validity of the accommodated administration, see Section 7: Accommodated Administration of the Scholastic Phonics Inventory.

Table 1 details the criteria used to establish each Decoding Status for both the fluency and accuracy-only versions of the test. Total Fluency and Total Accuracy Scores for placement in or out of decoding intervention are based on empirical research, while the criteria used for further placement into the Pre-Decoder, Beginning Decoder, and Developing Decoder levels are based on performance on specific types of items, as described in Table 1. Further detail regarding the empirically based item-level fluency thresholds follows in Section 4C: SPI Scoring Algorithm.

(1) See the SPI Educator's Guide for a detailed review of the other five reports: (1) Summary Progress Report, (2) Student Report, (3) District/School Status Report, (4) District/School Growth Report, and (5) School-to-Home Report.

Table 1: Reading Program Placement and Criteria to Establish Decoding Status(1)

Pre-Decoder
  Description: A student with little or no knowledge of letter names or letter-sound correspondences.
  General Criteria: SPI Fluency Score 0-10; Letter Names: less than 70% accuracy; Nonsense Words: less than 50% accuracy on items that assess consonants and short vowels.
  Criteria for Accuracy-Only Scoring: SPI Accuracy Score 0-45; Letter Names: less than 70% accuracy; Nonsense Words: less than 50% accuracy on items that assess consonants and short vowels.
  Placement Should Include: System 44 or a Tier 3 foundational reading program involving the alphabetic principle and phonemic awareness.

Beginning Decoder
  Description: A student who can identify letter names but cannot decode fluently.
  General Criteria: SPI Fluency Score 0-10; Letter Names: at least 70% accuracy; Nonsense Words: less than 50% accuracy on items that assess consonants and short vowels.
  Criteria for Accuracy-Only Scoring: SPI Accuracy Score 0-45; Letter Names: at least 70% accuracy; Nonsense Words: less than 50% accuracy on items that assess consonants and short vowels.
  Placement Should Include: System 44 or Tier 3 explicit phonics instruction with simple consonant-vowel-consonant patterns.

Developing Decoder
  Description: A student who can fluently decode words with consonants and short vowels (CVCs) but cannot fluently decode more complex words.
  General Criteria: SPI Fluency Score 11-22.
  Criteria for Accuracy-Only Scoring: SPI Accuracy Score 46-49.
  Placement Should Include: System 44 or Tier 3 explicit phonics instruction starting with consonant blends.

Advancing Decoder
  Description: A student who can decode with adequate fluency.
  General Criteria: SPI Fluency Score 23-60.
  Criteria for Accuracy-Only Scoring: SPI Accuracy Score 50-60.
  Placement Should Include: READ 180 or a Tier 2 reading program with direct support in building vocabulary, reading comprehension, and fluency connected with text.

(1) Note: The above criteria are based on data collected to date and may be updated in future releases.
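The score bands in Table 1 can be sketched as a simple placement function. This is a minimal illustration of the fluency-based bands only; the function and parameter names are ours, and the full criteria also consult nonsense-word accuracy on consonant/short-vowel items:

```python
def decoding_status(fluency_score: int, letter_name_accuracy: float) -> str:
    """Sketch of Table 1's fluency-based Decoding Status bands.

    fluency_score: SPI Fluency Score (0-60)
    letter_name_accuracy: proportion correct on the Letter Names subtest
    """
    if fluency_score <= 10:
        # Pre- vs. Beginning Decoder is distinguished by letter-name knowledge
        if letter_name_accuracy < 0.70:
            return "Pre-Decoder"        # Tier 3 foundational reading program
        return "Beginning Decoder"      # Tier 3 explicit phonics (CVC patterns)
    if fluency_score <= 22:
        return "Developing Decoder"     # Tier 3 phonics starting with blends
    return "Advancing Decoder"          # READ 180 or a Tier 2 reading program

print(decoding_status(15, 0.90))  # Developing Decoder
```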

Figure 1. Screening and Placement Report

DEVELOPMENT OF THE SCHOLASTIC PHONICS INVENTORY

Development of the SPI Item Bank

Nonsense word items. Each form of the SPI contains 30 nonsense word items. Each item consists of a target and three distractors. The items were chosen to represent the full range of decoding skills, from consonants and short vowels to blends, digraphs, diphthongs, long vowels, final e, r-controlled vowels, vowel variants, and vowel teams. All targets and distractors are nonsense words or obscure English words (e.g., kens) that are unlikely to be known. The targets and distractors were chosen to avoid Spanish words, slang, and nonsense words that sound like real words.

Sight word items. Each form of the SPI contains 30 sight word items. The Sight Word Recognition section includes two types of items. In the first type, students hear a high-frequency word and must select it from a list of words. In the second type, students hear a high-frequency word and must select the correct spelling from a list of four choices. As was the case for nonsense word items, each sight word item consists of a target and three distractors. The targets were sampled from the first 300 words on the Dolch and Fry word lists and the first 5,000 words in the American Heritage Word Frequency Book. The distractors are relatively common words, orthographically similar to the target words. As the items become more difficult, the target sight word is presented with phonetically similar distractors that are nonsense words, therefore requiring orthographic (or spelling) knowledge to identify the correct sight word.
Scoring and Cross-Validation Samples

The Northeastern sample consisted of 202 middle-school poor readers who were nominated by school staff as either (a) having sufficient decoding skills to participate successfully in a reading intervention that does not expressly focus on systematic and explicit phonics instruction, but may include implicit phonics instruction and word study in addition to vocabulary and comprehension skills and strategies (N = 90), or (b) lacking the decoding skills necessary to participate in such an intervention (N = 112). We refer to these groups as the adequate decoders and poor decoders, respectively.

School staff used performance on the Connecticut Mastery Test Reading (CMT Reading) to place students into each group. The CMT Reading is scored on a scale from 1 to 5: Level 5 is considered advanced, Level 4 goal, Level 3 proficient, Level 2 basic, and Level 1 below basic. Students who scored at Level 1 on the CMT were expected to perform very poorly on reading measures of decoding. Students who scored at Level 2 were expected to perform adequately on decoding measures but continue to struggle with grade-level reading material.

Members of the sample ranged in age from 12 years 10 months to 16 years 6 months, with a median age of 13 years 8 months. The sample contained 90 7th-grade students and 112 8th-grade students, with somewhat more males (55 percent) than females (45 percent). English language learners made up 8 percent of the sample; students receiving special education services, 26 percent; and students eligible for free or reduced-price lunches, 34 percent. The sample was diverse, including Caucasian (41 percent), African-American (27 percent), Hispanic (30 percent), and Asian (3 percent) students.

Four students were dropped from the sample because of concerns noted during testing, including obvious illness or test anxiety. Five students were dropped because excessive variability in their performance across assessments suggested questionable motivation for performing to the best of their abilities. Three students were dropped because of missing data for reference measures of reading. After the 12 students were dropped, the total sample size was 190 students.

The sample was subdivided into adequate and poor decoder subgroups based on performance on the Phonemic Decoding Efficiency subtest from Forms A and B of the Test of Word Reading Efficiency (TOWRE; Torgesen, Wagner, & Rashotte, 1999). Standard scores from Forms A and B were averaged. The adequate decoder group (N = 90) achieved an average standard score greater than 90; the poor decoder group (N = 100) achieved an average standard score equal to or less than 90.

In addition to the SPI, four decoding subtests were administered to the sample: the Sight Word Efficiency and Phonemic Decoding Efficiency subtests of the TOWRE (Torgesen, Wagner, & Rashotte, 1999), and the Word Attack and Letter-Word Identification subtests from the Woodcock-Johnson III (Woodcock, McGrew, & Mather, 2001).

The sample was divided into a scoring sample and a cross-validation sample. A block randomization procedure was used to ensure that half of the adequate decoders and half of the poor decoders ended up in each of the two samples. At this point, both the scoring and cross-validation samples were inspected for outliers. Using an operational definition of a standard score lower than 40 on the TOWRE subtests, a standard score lower than 50 on the Word Attack subtest, or a total accuracy score on the SPI of less than 25, three students were dropped from the poor decoder group of the scoring sample, and five students were dropped from the poor decoder group of the cross-validation sample.

SPI Scoring Algorithm

Item-Level Fluency Thresholds. Fluency thresholds were determined empirically for each item. The data were provided by the scoring sample. Descriptive statistics for the scoring sample are presented in Table 2, and those for the cross-validation sample are presented in Table 3.

Table 2. Descriptive Statistics of Standard Scores (SS) for Scoring Sample on Word-Level Criterion Measures

Adequate Decoders (N = 44)
  Measure                                           Mean    SD
  TOWRE Sight Word Efficiency Form A (SS)           97.1    7.7
  TOWRE Sight Word Efficiency Form B (SS)           95.5    6.9
  TOWRE Phonemic Decoding Efficiency Form A (SS)    104.9   9.8
  TOWRE Phonemic Decoding Efficiency Form B (SS)    101.6   8.5
  Woodcock-Johnson III Word Attack (SS)             96.6    7.3
  Woodcock-Johnson III Letter-Word Ident. (SS)      93.8    7.3

Poor Decoders (N = 47)
  Measure                                           Mean    SD
  TOWRE Sight Word Efficiency Form A (SS)           82.8    6.9
  TOWRE Sight Word Efficiency Form B (SS)           82.7    7.2
  TOWRE Phonemic Decoding Efficiency Form A (SS)    79.2    8.4
  TOWRE Phonemic Decoding Efficiency Form B (SS)    78.3    8.7
  Woodcock-Johnson III Word Attack (SS)             84.7    8.4
  Woodcock-Johnson III Letter-Word Ident. (SS)      81.2    9.5

Table 3. Descriptive Statistics of Standard Scores (SS) for Cross-Validation Sample on Word-Level Criterion Measures

Adequate Decoders (N = 45)
  Measure                                           Mean    SD
  TOWRE Sight Word Efficiency Form A (SS)           97.9    6.7
  TOWRE Sight Word Efficiency Form B (SS)           97.0    6.3
  TOWRE Phonemic Decoding Efficiency Form A (SS)    104.1   9.3
  TOWRE Phonemic Decoding Efficiency Form B (SS)    101.3   6.0
  Woodcock-Johnson III Word Attack (SS)             96.9    5.8
  Woodcock-Johnson III Letter-Word Ident. (SS)      94.5    5.5

Poor Decoders (N = 46)
  Measure                                           Mean    SD
  TOWRE Sight Word Efficiency Form A (SS)           86.7    7.4
  TOWRE Sight Word Efficiency Form B (SS)           86.1    7.0
  TOWRE Phonemic Decoding Efficiency Form A (SS)    80.2    9.2
  TOWRE Phonemic Decoding Efficiency Form B (SS)    80.0    8.2
  Woodcock-Johnson III Word Attack (SS)             83.5    8.6
  Woodcock-Johnson III Letter-Word Ident. (SS)      84.8    7.4

These results indicate that the adequate decoders were average in decoding, with the poor decoders scoring approximately one standard deviation below the adequate decoders. The item fluency thresholds were set so as to differentiate poor and adequate decoders. For each item, a receiver operating characteristic (ROC) curve was generated. ROC curves are plots of sensitivity versus 1 - specificity for all potential fluency threshold values. In the present context, sensitivity is the proportion of poor decoders who are correctly categorized as inadequate decoders by the SPI. Specificity is the proportion of adequate decoders who are correctly categorized by the SPI as adequate decoders. An example illustrating sensitivity and specificity calculations is presented in Table 4:

Table 4. Example of Sensitivity and Specificity Calculations

                           Actual Level of Decoding
  SPI Performance          Poor    Adequate
  Inadequate Decoders      4       2
  Adequate Decoders        1       8

For this example, sensitivity (i.e., the proportion of poor decoders who are correctly categorized by the SPI as inadequate decoders) is 4 (the number of poor decoders correctly categorized) divided by 5 (the total number of poor decoders), or .80. Specificity (i.e., the proportion of adequate decoders who are correctly categorized by the SPI as adequate decoders) is 8 (the number of adequate decoders correctly categorized) divided by 10 (the total number of adequate decoders), or .80.

A ROC curve for an SPI nonsense word item is presented in Figure 2.

Figure 2. Receiver Operating Characteristic (ROC) Curve for an SPI Nonsense Word Item. [Plot of sensitivity (0.0 to 1.0) on the vertical axis against 1 - specificity (0.0 to 1.0) on the horizontal axis.]
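The calculations above reduce to two ratios from a 2x2 classification table. A minimal sketch (the function and argument names are ours, used only to reproduce the Table 4 arithmetic):

```python
def sensitivity_specificity(true_pos, false_neg, false_pos, true_neg):
    """Compute sensitivity and specificity from a 2x2 classification table.

    true_pos:  poor decoders flagged as inadequate by the SPI (hits)
    false_neg: poor decoders the SPI passed as adequate (misses)
    false_pos: adequate decoders flagged as inadequate
    true_neg:  adequate decoders correctly passed as adequate
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Reproducing the Table 4 example: 4 of 5 poor decoders and
# 8 of 10 adequate decoders are classified correctly.
sens, spec = sensitivity_specificity(true_pos=4, false_neg=1, false_pos=2, true_neg=8)
print(sens, spec)  # 0.8 0.8
```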

The strategy is to pick the threshold value that corresponds to the point on the curve closest to the upper left-hand corner. This maximizes sensitivity while minimizing 1 − specificity (i.e., maximizing specificity). In practice, a table that lists the values of sensitivity and specificity for all possible threshold values is used to identify the optimal item fluency threshold. This process was used to identify an optimal threshold value for each individual sight word and nonsense word item.

Combining Accuracy and Latency Into Fluency Scores. A fluent response must be accurate as well as sufficiently fast. To receive credit for a fluent response to an item, the response had to be accurate and the total response time (latency) could not exceed the threshold time. This method of scoring is represented in Table 5.

Table 5. Combining Accuracy and Latency Into Fluency Scores: Four Possible Response Patterns

  Pattern   Response Accurate?   Latency Below Threshold?   Fluency Score
  1         No                   No                         0
  2         No                   Yes                        0
  3         Yes                  No                         0
  4         Yes                  Yes                        1

This kind of scoring has a number of advantages. First, it produces hybrid scores that combine accuracy and speed of responding. Hybrid scores have proven effective on other reading measures such as the TOWRE and the Test of Silent Reading Efficiency and Comprehension (TOSREC) (Wagner, Torgesen, Rashotte, & Pearson, 2009). One reason hybrid scores are effective is that individual and developmental differences in underlying reading skill affect both accuracy and speed of response. A score that incorporates both speed and accuracy is therefore better than one based on speed or accuracy alone.

A second advantage of this method of scoring is that outlying response times are handled implicitly. If performance on an assessment is measured in terms of average response time, a practical problem that must be dealt with is what

to do about outlying response times. For example, an outlying response time of 20 seconds will have a large impact on the average response time for a set of responses that typically fall in the range of 1 to 2 seconds. The scoring method used on the SPI handles this potential problem: a response that exceeds the threshold value receives an item fluency score of 0 regardless of how slow it is.

A third advantage of this method of scoring is that it handles a practical problem that arises in the SPI. Because the mouse must be moved to select the correct response from a list of distractors, the amount of mouse movement required varies across items depending on the position of the target item in the list. This presumably affects response times. This potential unwanted source of variability is handled implicitly by the fact that item thresholds are determined empirically for each individual item. Differences in response time associated with differences in the amount of mouse movement required are reflected in the empirical distribution of response times that is the basis of the ROC curves used to identify the optimal item threshold.

A final advantage of this method of scoring is that it makes maximal use of the information gained from responses to all items, ranging from easy sight word items to difficult nonsense word items, for the task of differentiating adequate and inadequate decoders. Consider the following example of accuracy and fluency scores obtained for the easy sight word item FIND and the difficult nonsense word item PROTUND. The mean accuracy scores for these two items are presented in Table 6 for a previous sample and for poor decoders and adequate decoders separately.

Table 6.
Mean Accuracy Scores for SPI Items FIND and PROTUND

  AVERAGE ITEM DIFFICULTY (ACCURACY ONLY)
  Item      Entire Sample   Poor Level   Adequate Level
  FIND      1.00            1.00         1.00
  PROTUND   0.68            0.64         0.73

As expected, everyone is perfectly accurate on FIND, as indicated by the item difficulties of 1.00 for the entire sample, for poor decoders, and for adequate decoders. This item is not useful for differentiating poor and adequate decoders if we look at accuracy alone. For the much more difficult

PROTUND, only about two-thirds of the entire sample responds correctly (0.68), and performance is worse for poor decoders (0.64) than for adequate decoders (0.73). Now consider the mean fluency scores for these two items, presented in Table 7.

Table 7. Mean Fluency Scores for SPI Items FIND and PROTUND

  AVERAGE ITEM DIFFICULTY (FLUENCY)
  Item      Entire Sample   Poor Level   Adequate Level
  FIND      0.82            0.72         0.93
  PROTUND   0.42            0.34         0.50

These results are quite different. The FIND item now helps differentiate poor and adequate decoders, as indicated by the average difficulties of .72 for poor decoders and .93 for adequate decoders. It can help because, to receive credit for the item, a student must respond not only accurately but also quickly. The PROTUND item works in a similar fashion.

Summary of the Development of the SPI Scores. SPI Fluency scores are based on both the accuracy and the speed of responses. Response thresholds were established for each item using data from the scoring sample. With the fluency-based method of scoring, each item contributes to differentiating students who have decoding problems from students with adequate decoding skills.

Form Equivalence

The equivalence of the three forms of the SPI was examined by comparing the means and standard deviations of the alternate forms and by calculating correlations of performance among the three forms. The means and standard deviations are presented here; the correlations among the three forms are presented as alternate-form immediate reliability coefficients in the next chapter. Means and standard deviations of SPI Fluency Scores for the three forms of the SPI are presented for the scoring sample in Table 8 and for the cross-validation sample in Table 9.
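The threshold-selection and scoring rules described above can be sketched as follows. This is an illustrative reconstruction, not Scholastic's implementation: the function names, the closest-to-corner criterion, and the assumption that a poor decoder is flagged when a response exceeds the latency threshold are all assumptions drawn from the surrounding description.

```python
import math

def fluency_score(accurate, latency, threshold):
    """Table 5 scoring rule: 1 only if the response is both accurate
    and at or under the item's latency threshold, else 0."""
    return 1 if accurate and latency <= threshold else 0

def optimal_threshold(poor_latencies, adequate_latencies, candidates):
    """Pick the candidate threshold whose (1 - specificity, sensitivity)
    point lies closest to the ROC curve's upper left-hand corner (0, 1).

    Here a poor decoder is treated as correctly flagged (a true positive)
    when the response is NOT fluent, i.e., its latency exceeds the threshold.
    """
    best, best_dist = None, math.inf
    for t in candidates:
        sensitivity = sum(lat > t for lat in poor_latencies) / len(poor_latencies)
        specificity = sum(lat <= t for lat in adequate_latencies) / len(adequate_latencies)
        dist = math.hypot(1 - specificity, 1 - sensitivity)
        if dist < best_dist:
            best, best_dist = t, dist
    return best
```

In this sketch, a threshold that perfectly separates the two latency distributions sits at distance zero from the corner and is always chosen.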

Table 8. Comparing Means and Standard Deviations of SPI Fluency Scores for the Scoring Sample

Adequate Decoders (N = 44)
           Mean   Standard Deviation
  Form A   31.5   7.8
  Form B   31.8   7.7
  Form C   31.1   8.4

Poor Decoders (N = 47)
           Mean   Standard Deviation
  Form A   15.8   6.5
  Form B   15.4   7.5
  Form C   15.5   7.3

Table 9. Comparing Means and Standard Deviations of SPI Fluency Scores for the Cross-Validation Sample

Adequate Decoders (N = 45)
           Mean   Standard Deviation
  Form A   28.8   8.0
  Form B   30.3   7.9
  Form C   28.5   8.0

Poor Decoders (N = 46)
           Mean   Standard Deviation
  Form A   21.2   8.0
  Form B   21.6   8.2
  Form C   21.1   8.0

The comparability of the means and standard deviations supports the equivalence of the alternate forms of the SPI Fluency Scores.

RELIABILITY OF SCHOLASTIC PHONICS INVENTORY SCORES

Internal Consistency Reliability

Content-sampling error was estimated in two ways. First, internal consistency reliability coefficients (Coefficient Alpha) were calculated for SPI Fluency Scores. These coefficients are presented in Table 10.

Table 10. Internal Consistency Reliability Coefficients (Coefficient Alpha) for SPI Fluency Scores

Scoring Sample (N = 91)
                               COEFFICIENT ALPHA
  SPI Fluency Score: Form A    .90
  SPI Fluency Score: Form B    .91
  SPI Fluency Score: Form C    .91

Cross-Validation Sample (N = 91)
  SPI Fluency Score: Form A    .85
  SPI Fluency Score: Form B    .86
  SPI Fluency Score: Form C    .85

The magnitude of these coefficients supports the internal consistency reliability of each form of the SPI.

Alternate-Form Immediate Reliability

A second estimate of content-sampling error is provided by alternate-form immediate reliability coefficients, which were calculated by correlating performance on the three forms of the SPI. These reliability coefficients are presented in Table 11.

Table 11. Alternate-Form Immediate Reliability Coefficients for SPI Fluency Scores

Scoring Sample (N = 91)
            A      B      C
  Form A    1.00
  Form B    0.92   1.00
  Form C    0.91   0.91   1.00

Cross-Validation Sample (N = 91)
            A      B      C
  Form A    1.00
  Form B    0.86   1.00
  Form C    0.78   0.79   1.00

The magnitude of the alternate-form immediate reliability coefficients supports the reliability of the SPI.

Standard Error of Measurement

The standard error of measurement (SEM) is a measure of the amount of measurement error associated with SPI Fluency Scores. The SEM is calculated by multiplying the standard deviation of a test score by the square root of one minus the reliability of the test score. The SEM for SPI Fluency Scores is 4. This value allows us to put confidence intervals around SPI Fluency Scores. A 95% confidence interval extends approximately 2 SEMs on either side of the observed score. This means that, with 95% confidence, the true score falls within ±8 points of a student's observed score.

Summary of the Reliability Analyses. The reliability analyses indicate that the SPI Fluency Scores meet the highest standard of reliability. The standard error of measurement for SPI Fluency Scores is 4, which corresponds to a 95% confidence interval of plus or minus 8.
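The SEM arithmetic described above can be sketched in a few lines. The standard deviation and reliability values below are illustrative only (chosen to reproduce the SEM of 4 cited in the text), not values taken from the tables:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval_95(observed, sem):
    """Approximate 95% confidence interval: observed score +/- 2 SEM."""
    return observed - 2 * sem, observed + 2 * sem

# Illustrative inputs: an SD of 8 with reliability .75 yields an SEM of 4,
# and a 95% confidence interval of +/- 8 around the observed score.
sem = standard_error_of_measurement(sd=8.0, reliability=0.75)
low, high = confidence_interval_95(observed=30, sem=sem)
print(sem, low, high)  # 4.0 22.0 38.0
```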

VALIDITY OF THE SCHOLASTIC PHONICS INVENTORY FLUENCY SCORES

Content-Description Validity

Content-description validity refers to the examination of the content of the test to determine whether it is a representative sample of the behavior domain being assessed (Anastasi & Urbina, 1997). The traditional term for this kind of validity is content validity. The behavior domains assessed by the SPI are letter recognition, sight word reading fluency, and decoding fluency. The following is a description of the items in each form:

Letter Recognition. Students hear a letter name and must select the corresponding letter from a list of four choices. All 26 letters of the alphabet are represented, either as targets (correct answers) or as distractors (incorrect answer choices). Only lower-case letters are used, as they are generally considered more challenging than upper-case letters and more appropriate for assessing older readers.

Sight Word Reading. Students hear a high-frequency sight word and must select it from a list of four choices. In some items, the targets are selected from the first 300 words on the Dolch and Fry word lists, and the distractors are other common words that are orthographically similar to the target. In other items, the targets are selected from the first 5,000 words in the American Heritage Word Frequency Book, and the distractors are misspellings of the target.

Decoding. Students hear a nonsense word and must select the corresponding word from a list of four choices. All answer choices are nonsense words that follow the conventions of English, making them decodable while preventing them from being read from memory. The items represent the breadth of spelling patterns taught in most phonics programs, including consonants, short vowels, blends, digraphs, and r-controlled vowels, and align to

the scope and sequence of System 44, Scholastic's decoding intervention program. Targets and distractors work together to assess individual sound-spellings and require students to attend carefully to differences among spelling patterns. In generating the items, care was taken to avoid proper nouns, Spanish words, nonsense words that sound like real words, and items that may be difficult for speakers of certain dialects, including African American Vernacular English, to distinguish phonologically. Throughout the development process, items were reviewed by an esteemed group of reading and assessment experts, and replacements were made as necessary.

Criterion-Prediction Validity

Criterion-prediction validity refers to the extent to which a test predicts the performance it is intended to predict (Anastasi & Urbina, 1997). The traditional term for this kind of validity is criterion-related validity. Predictive validity coefficients were calculated by using SPI Fluency scores as a predictor of six criterion variables: the two forms each of the Sight Word Efficiency and Phonetic Decoding Efficiency subtests from the Test of Word Reading Efficiency (TOWRE) (Torgesen, Wagner, & Rashotte, 1999), and the Word Attack and Letter-Word Identification subtests from the Woodcock-Johnson III (Woodcock, McGrew, & Mather, 2001). Samples with restricted range provide biased estimates of validity; correction for range restriction provides an unbiased estimate. Observed predictive validity coefficients and coefficients corrected for range restriction are presented in Table 12 for the scoring sample.

Table 12. Observed Validity Coefficients (and Coefficients Corrected for Range Restriction) for SPI Fluency Scores as Predictors of Reading Criterion Scores for the Scoring Sample (N = 91)

                                           SPI FLUENCY SCORE
                                           Form A      Form B      Form C
  TOWRE Sight Word Efficiency (Form A)     .69 (.81)   .73 (.84)   .71 (.83)
  TOWRE Sight Word Efficiency (Form B)     .68 (.83)   .74 (.87)   .70 (.84)
  TOWRE Phon. Decod. Efficiency (Form A)   .73 (.71)   .72 (.70)   .71 (.69)
  TOWRE Phon. Decod. Efficiency (Form B)   .75 (.76)   .76 (.77)   .72 (.73)
  Letter-Word Identification (WJ-III)      .61 (.74)   .63 (.75)   .58 (.71)
  Word Analysis (WJ-III)                   .63 (.78)   .61 (.76)   .60 (.75)

Note: All coefficients are significant at p < .01.

Predictive validity coefficients are presented in Table 13 for the cross-validation sample.

Table 13. Observed Validity Coefficients (and Coefficients Corrected for Range Restriction) for SPI Fluency Scores as Predictors of Reading Criterion Scores for the Cross-Validation Sample (N = 91)

                                           SPI FLUENCY SCORE
                                           Form A      Form B      Form C
  TOWRE Sight Word Efficiency (Form A)     .70 (.85)   .69 (.85)   .61 (.79)
  TOWRE Sight Word Efficiency (Form B)     .64 (.82)   .65 (.83)   .58 (.78)
  TOWRE Phon. Decod. Efficiency (Form A)   .56 (.56)   .58 (.58)   .50 (.50)
  TOWRE Phon. Decod. Efficiency (Form B)   .61 (.67)   .64 (.70)   .53 (.59)
  Letter-Word Identification (WJ-III)      .50 (.70)   .62 (.79)   .52 (.71)
  Word Analysis (WJ-III)                   .55 (.74)   .62 (.79)   .51 (.70)

Note: All coefficients are significant at p < .01.

These validity coefficients were large in magnitude and support the criterion-prediction validity of the SPI scores.
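The report does not state which range-restriction formula was used. One common choice is Thorndike's Case 2 correction for direct range restriction, sketched here for illustration; the input values are hypothetical:

```python
import math

def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction.

    r: validity coefficient observed in the restricted sample
    sd_unrestricted: predictor SD in the reference population
    sd_restricted: predictor SD in the restricted sample
    """
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# With no restriction (u = 1) the coefficient is unchanged;
# with restriction (u > 1) the corrected coefficient is larger.
print(correct_for_range_restriction(0.69, 15.0, 10.0))
```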

Construct-Identification Validity

Construct-identification validity refers to the extent to which a test measures the target theoretical construct or trait (Anastasi & Urbina, 1997). The previous term for this type of validity is construct validity. Construct-identification validity is a global form of validity that encompasses the evidence provided about the content-description validity and criterion-prediction validity of a test, but it includes other evidence as well. For the SPI, construct-identification validity is supported if groups known to differ in level of decoding can be shown to differ in performance on the SPI. Using the scoring sample, the SPI scores of the poor decoders were compared to those of the adequate decoders. These results are presented in Table 14.

Table 14. Descriptive Statistics, Tests of Group Differences, and Effect Sizes for the Scoring Sample

                                          Poor Decoders   Adequate Decoders
                                          (N = 47)        (N = 44)
  Measure                                 Mean (SD)       Mean (SD)          t      Cohen's d
  SPI Fluency Scores
    Form A                                15.8 (6.5)      31.5 (7.8)         10.4   2.19
    Form B                                15.4 (7.5)      31.8 (7.7)         10.2   2.16
    Form C                                15.5 (7.3)      31.1 (8.4)         9.5    1.703
  Criterion Reference Measures
    TOWRE Sight Word Eff. Form A          82.8 (6.9)      97.1 (7.7)         9.3    1.96
    TOWRE Sight Word Eff. Form B          82.7 (7.2)      95.5 (6.9)         8.7    1.82
    TOWRE Phonetic Decoding Eff. Form A   79.2 (8.4)      104.9 (9.8)        13.5   2.82
    TOWRE Phonetic Decoding Eff. Form B   78.3 (8.7)      101.6 (8.5)        12.8   2.71
    Woodcock-Johnson Letter-Word Ident.   81.2 (9.5)      93.8 (7.2)         7.1    1.49
    Woodcock-Johnson Word Attack          84.7 (8.4)      96.6 (7.3)         7.2    1.51

Note. All t-test values are significant at the p < .001 level.

The results for the cross-validation sample are presented in Table 15.

Table 15. Descriptive Statistics, Tests of Group Differences, and Effect Sizes for the Cross-Validation Sample

                                          Poor Decoders   Adequate Decoders
                                          (N = 46)        (N = 45)
  Measure                                 Mean (SD)       Mean (SD)          t      Cohen's d
  SPI Fluency Scores
    Form A                                21.2 (8.0)      28.8 (8.0)         4.6    0.95
    Form B                                21.6 (8.2)      30.3 (7.9)         5.2    1.07
    Form C                                21.1 (8.0)      28.5 (8.0)         4.4    0.92
  Criterion Reference Measures
    TOWRE Sight Word Eff. Form A          86.7 (7.4)      97.9 (6.7)         7.5    1.59
    TOWRE Sight Word Eff. Form B          86.1 (7.0)      97.0 (6.3)         7.7    1.64
    TOWRE Phonetic Decoding Eff. Form A   80.2 (9.2)      104.1 (9.3)        12.4   2.58
    TOWRE Phonetic Decoding Eff. Form B   80.0 (8.2)      101.2 (6.0)        14.1   2.9
    Woodcock-Johnson Letter-Word Ident.   83.5 (8.6)      94.5 (5.4)         7.3    1.53
    Woodcock-Johnson Word Attack          84.8 (7.3)      96.9 (5.8)         8.7    1.84

Note. All t-test values are significant at the p < .001 level.

The groups differed substantially and significantly on all three SPI Fluency Scores. The magnitude of these differences was comparable to that of the differences on TOWRE and Woodcock-Johnson scores for the scoring sample,

as evidenced by the effect sizes for the SPI scores and those for the TOWRE and Woodcock-Johnson III. The SPI Fluency Score differences were highly significant, but with smaller effect sizes, for the cross-validation sample. The magnitude of the group differences in SPI scores supports the construct-identification validity of the SPI scores.

Another way of examining the ability of SPI scores to differentiate poor decoders and adequate decoders is to examine validity coefficients in the form of point-biserial correlations between SPI scores and group membership (i.e., poor decoders versus adequate decoders). These results are presented in Table 16.

Table 16. Validity Coefficients (Point-Biserial Correlation Coefficients) for SPI Fluency Scores as Predictors of Poor Versus Adequate Decoding Group

Scoring Sample (N = 91)
                               Validity Coefficient
  SPI Fluency Score Form A     .74
  SPI Fluency Score Form B     .74
  SPI Fluency Score Form C     .71

Cross-Validation Sample (N = 91)
  SPI Fluency Score Form A     .44
  SPI Fluency Score Form B     .48
  SPI Fluency Score Form C     .42

Note. All validity coefficients are significant at p < .001.

Because the SPI was constructed to be a measure of word-level reading skills rather than a measure of perceptual motor speed, a second test of construct-identification validity is provided by comparing the average difference in response time between poor and adequate decoders on the initial matching items, which did not require word-level reading skills, and on the sight word and nonsense word items, which did. The difference in average response times between poor and adequate decoders was an order of magnitude greater for the items that required word-level reading skills (approximately 500 milliseconds) than for the matching items that did not (approximately 50 milliseconds).
This confirms that performance on the SPI is primarily determined by fluency at word-level reading as opposed to simple perceptual motor speed.
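The group-difference statistics reported in Tables 14 through 16 can be computed with a short sketch. This is an illustrative implementation of the standard formulas, not the study's analysis code, and the sample data in the usage note are hypothetical:

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m2 - m1) / pooled_sd

def point_biserial(scores, labels):
    """Point-biserial correlation between scores and a 0/1 group label.
    Equivalent to Pearson's r computed with a dichotomous variable."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels)) / n
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in scores) / n)
    sd_l = math.sqrt(sum((l - mean_l) ** 2 for l in labels) / n)
    return cov / (sd_s * sd_l)
```

For example, poor decoders scoring near 15 and adequate decoders near 31, each with standard deviations near 7 to 8, would yield a d slightly above 2, consistent with the scoring-sample values in Table 14.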

Classification analyses. The most stringent test of the construct-identification validity of the SPI is provided by classification analyses. A classification study was carried out in which SPI Fluency scores were used to predict group membership (i.e., poor decoders versus adequate decoders) for the scoring sample. For classification studies, four statistics are of importance:

1. Sensitivity. Sensitivity refers to the proportion of poor decoders who are correctly categorized by the SPI.

2. Specificity. Specificity refers to the proportion of adequate decoders who are correctly categorized by the SPI.

3. Positive Predictive Value. Positive predictive value refers to the proportion of students the SPI categorized as poor decoders who actually were poor decoders.

4. Negative Predictive Value. Negative predictive value refers to the proportion of students the SPI categorized as adequate decoders who actually were adequate decoders.

The example previously used to illustrate the calculation of sensitivity and specificity (Table 4) is extended here to positive predictive value and negative predictive value in Table 17.

Table 17. Example of Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value Calculations

                       ACTUAL LEVEL OF DECODING
  SPI Performance      Poor      Adequate
  Poor Decoders        4 (TP)    2 (FP)
  Adequate Decoders    1 (FN)    8 (TN)

Note. TP = true positive; TN = true negative; FP = false positive; FN = false negative.

As illustrated previously, sensitivity (i.e., the proportion of poor decoders who are correctly categorized by the SPI as poor decoders) is 4 (true positives) divided by 5 (true positives plus false negatives, or the total number of actual poor decoders), or .80.

Specificity (i.e., the proportion of adequate decoders who are correctly categorized by the SPI as adequate decoders) is 8 (true negatives) divided by 10 (true negatives plus false positives, or the total number of actual adequate decoders), or .80.

Positive predictive value (i.e., the proportion of students the SPI categorized as poor decoders who actually are poor decoders) is 4 (true positives) divided by 6 (true positives plus false positives, or the total number of students the SPI categorized as poor decoders), or .67.

Negative predictive value (i.e., the proportion of students the SPI categorized as adequate decoders who actually were adequate decoders) is 8 (true negatives) divided by 9 (true negatives plus false negatives, or the total number of students the SPI categorized as adequate decoders), or .89.

Different authorities have proposed different standards for what constitutes acceptable values for classification statistics (see Hammill, Wiederholt, & Allen, 2006, for a review). Wood, Flowers, Meyer, and Hill (2002) and Jansky (1978) proposed .70 as the standard for acceptable values of sensitivity and specificity. Wood et al. (2002) proposed accepting lower values for positive predictive value, whereas Jansky advocated a standard requiring that positive predictive value also reach .70. Gredler (2002) and Kingslake (1983) proposed that sensitivity, specificity, and positive predictive value should meet a higher standard of .75 or better. Hammill et al. (2006) proposed a system of three levels of acceptability for classification statistics, presented in Table 18.

Table 18. Levels of Acceptability for Classification Statistics Proposed by Hammill et al. (2006)

  Level 1. Sensitivity and Specificity, or Sensitivity and Positive Predictive Value, >= .70
  Level 2. Sensitivity, Specificity, and Positive Predictive Value >= .70
  Level 3. Sensitivity, Specificity, and Positive Predictive Value >= .75
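All four classification statistics can be computed from a 2x2 confusion matrix like Table 17. The sketch below uses the worked-example counts and an assumed Level 3 acceptability check in the spirit of the Hammill et al. (2006) criteria:

```python
def classification_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, positive and negative predictive value
    from a 2x2 confusion matrix (poor decoder = positive case)."""
    return {
        "sensitivity": tp / (tp + fn),  # 4 / 5  = .80
        "specificity": tn / (tn + fp),  # 8 / 10 = .80
        "ppv": tp / (tp + fp),          # 4 / 6  ~ .67
        "npv": tn / (tn + fn),          # 8 / 9  ~ .89
    }

stats = classification_stats(tp=4, fp=2, fn=1, tn=8)

# Level 3 check: sensitivity, specificity, and PPV must all reach .75.
meets_level_3 = all(stats[k] >= 0.75 for k in ("sensitivity", "specificity", "ppv"))
print(stats, meets_level_3)  # the PPV of .67 fails the .75 standard
```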
The results of the classification analyses for the scoring sample are presented in Table 19.

Table 19. Classification Statistics for Predicting Decoding Status Using SPI Fluency Scores for the Scoring Sample (N = 91)

                               Form A   Form B   Form C
  Sensitivity                  .84      .86      .98
  Specificity                  .87      .83      .72
  Positive Predictive Value    .86      .83      .77
  Negative Predictive Value    .85      .83      .97

All of these values achieve the highest level of acceptability in the Hammill et al. system. The results of the classification analyses for the cross-validation sample are presented in Table 20.

Table 20. Classification Statistics for Predicting Decoding Status Using SPI Fluency Scores for the Cross-Validation Sample (N = 91)

                               Form A   Form B   Form C
  Sensitivity                  .71      .80      .73
  Specificity                  .65      .63      .63
  Positive Predictive Value    .64      .65      .63
  Negative Predictive Value    .72      .78      .73

Sensitivity and negative predictive values exceeded .70 for all forms, but specificity and positive predictive values were in the .60 to .70 range.

Summary of the validity analyses. The content-description validity of the SPI was demonstrated by examining the extent to which the items represented the target domains of sight word and nonsense word decoding. The criterion-prediction validity of the SPI was demonstrated by the magnitudes of the predictive validity coefficients generated when SPI scores were used to predict reading criteria in two samples. The construct-identification validity of the SPI was supported by the magnitude of group differences in SPI scores for poor decoders and adequate decoders, and by the success of the SPI in predicting group membership in two classification studies. The

classification statistics met the highest standard of acceptability for the scoring sample, but not for the cross-validation sample. The construct-identification validity was also supported indirectly by the previously described results of the investigations of the content-description validity and criterion-prediction validity of the measure.

ACCOMMODATED ADMINISTRATION OF THE SCHOLASTIC PHONICS INVENTORY

Description of Accommodated Administration Procedures

Accommodated administration is available for students who cannot manipulate a mouse efficiently due to motor impairments or who exhibit severe attention disorders. An accuracy-only scoring system is available that does not penalize slow responses. Depending on the degree of difficulty in manipulating the mouse, students can either move the mouse themselves or an assistant can do it for them. In either case, only the accuracy of the responses is scored. The decision to use accommodated administration should be made by the student's teacher or by someone else knowledgeable about the student's limited ability to manipulate the mouse efficiently. Teachers can enable this feature for individual students in the Scholastic Achievement Manager (SAM). The number and types of test items remain the same. Because the accuracy-only scores are not as sensitive as the fluency-based scores, accommodated administration of the SPI should be used only when the student is unable to manipulate a mouse efficiently.

Form Equivalence

The equivalence of the three forms of the SPI Accuracy Scores was examined by comparing the means and standard deviations of the alternate forms. Means and standard deviations of SPI Accuracy Scores for the three forms of the SPI are presented for the scoring sample and the cross-validation sample in Table 21.