Examining Differential Item Functioning in Reading Assessments for Students with Disabilities

Jamal Abedi, Seth Leon, & Jenny C. Kao

January 2007

All rights reserved. Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Abedi, J., Leon, S., & Kao, J. (2007). Examining differential item functioning in reading assessments for students with disabilities. Minneapolis, MN: University of Minnesota, Partnership for Accessible Reading Assessment.

Abstract

This study examines group differences between students with disabilities and students without disabilities using differential item functioning (DIF) analyses in a high-stakes reading assessment. Results indicated that for grade 9, many items exhibited DIF. Items that exhibited DIF were more likely to be located in the second half of the assessment subscales. After accounting for reading ability using a proxy score from items on the first half of the subscales, students with disabilities consistently under-performed on items located in the second half relative to the items located in the first half, compared to students without disabilities. These results were seen in grade 9 for data from two different states, but they were not seen for grade 3. The data used in this study have several limitations. There was no access to information about the testing accommodations that students with disabilities might have received, and no information about the type of disabilities. Results of this study can shed light on potential factors affecting the accessibility of reading assessments for students with disabilities, in an ultimate effort to provide assessment tools that are conceptually and psychometrically sound for all students. A companion report is available examining differential distractor functioning for students with disabilities.
Introduction

More than 6 million students with disabilities, approximately 13 percent of all students, attended United States public schools during the 2003-2004 school year (U.S. Government Accounting Office, 2005). Accountability standards have been raised since the reauthorization of the Individuals with Disabilities Education Act (IDEA) and the authorization of the No Child Left Behind Act of 2001 (NCLB, 2002), which require states to include students with disabilities in annual assessments. In a review of state practices, Klein, Wiley, and Thurlow (2006) found that 44 states reported participation and performance for students with disabilities on all of their NCLB assessments during the 2003-2004 school year. According to data collected during the 2003-2004 school year, of the 48 reporting states and the District of Columbia, 41 states reported that at least 95 percent of students with disabilities participated in the statewide reading assessment (U.S. Government Accounting Office, 2005). Furthermore, most students with disabilities participated in regular reading assessments, while relatively few participated in alternate assessments. Nearly 84 percent of middle school students with Individualized Education Programs (IEPs) participated in general reading assessments, as reported by states in the 2002-2003 Annual Performance Reports (Thurlow, Moen, & Wiley, 2005).

Given the high rate of participation by students with disabilities in regular state and national assessments, as well as the implications of assessment outcomes for accountability, it is imperative that we ensure these assessments are as accessible to students with disabilities as possible. In other words, they must be as fair and accurate as possible. Students with disabilities may perform less well than their peers without disabilities for a variety of reasons, including their specific disability, lack of appropriate testing accommodations, or lack of opportunity to learn.
However, they may also perform less well because of factors directly related to the tests. For instance, there could be issues related to item quality or test item format. It is necessary to reduce irrelevant and extraneous sources of variance that are not related to the construct being measured. Test bias can occur when performance on a test requires sources of knowledge different from those intended to be measured, causing test scores to be less valid for a particular group (Penfield & Lam, 2000).

Test bias is often examined at the item level, with differential item functioning (DIF) analyses being part of the framework for probing item bias. If a certain group (e.g., a racial/ethnic group or a gender group) performs lower on average on a specific item, then one could say that the item is biased against that particular group. DIF analyses compare the performance of two groups of the same level of ability in order to disentangle the effects of unfairness and ability level. Matching on ability level is essential, since different groups may have different ability levels, in which case differences in performance are to be expected (Clauser & Mazor, 1998). Consistent differences between two groups of the same ability level would suggest that DIF is present. However, results of DIF analyses can only suggest that DIF is present, and not that the items are biased. To consider an item as biased also requires determining the non-target constructs that lead to the between-group differences in performance (Penfield & Lam, 2000). Thus, DIF is a necessary but not sufficient condition for item bias (Clauser & Mazor, 1998).

DIF analysis is often used to examine group differences between specific racial or ethnic groups or between males and females. For example, Hauser and Kingsbury (2004) explored differential functioning across student groups formed based on ethnicity and based on gender on items from the Idaho Standards Achievement Test.
Zenisky, Hambleton, and Robin (2004) explored gender DIF in a large-scale science assessment. Other research has examined incidences of DIF for limited English proficient students (Snetzler & Qualls, 2000). DIF analyses have also been conducted for students with disabilities. Specifically, DIF analyses have been used to examine effects of accommodations that are provided to students with disabilities during testing (Bolt, 2004; Cohen, Gregg, & Deng, 2005; Koretz & Hamilton, 1999).

This study aims to examine potential factors that may affect the accessibility of reading assessments for students with disabilities. Haladyna and Downing (2004) identified potential sources of systematic errors associated with construct-irrelevant variance, including factors relating to test development: (1) item quality; (2) test item format; and (3) differential item functioning. We were specifically interested in employing DIF analyses to examine any potential between-groups differences in a high-stakes reading assessment. Our study differs from previous research using DIF analyses for students with disabilities in that it seeks to investigate specific factors related to the test rather than to the accommodation.

There are several statistical procedures that can be used to identify differentially functioning test items, including the Mantel-Haenszel statistic, logistic regression, SIBTEST, the Standardization procedure, and various item response theory-based approaches (Clauser & Mazor, 1998). Our study uses a logistic regression approach as outlined by Zumbo (1999) because it is easier to employ and is more suitable for answering our research questions.

Research Questions

The following research questions guided the analyses and reporting of this study:
Methodology

Data Source

Data from two states were used to answer the research questions. We will refer to them as State X and State Y to ensure anonymity.

State X is a small state with an average number of students with disabilities. Data were obtained for the 1997-1998 academic year and included item-level information on students' responses on the Stanford Achievement Test, Ninth Edition (Stanford 9). Students with valid scores were included in our analyses. Students with LEP (limited English proficient) classifications (including LEP students with disabilities) were excluded from the analyses to reduce the possible confounding of language proficiency issues. Of the 6,611 third-grade students included in the present analyses, 448 (6.8%) were considered to be students with disabilities. Of the 5,287 ninth-grade students, 522 (9.9%) were considered to be students with disabilities.

State Y is a large state with an average number of students with disabilities. Data were obtained for the 1997-1998 academic year and included item-level information on students' responses on the Stanford 9. Students with valid scores were included in our analyses. Students with LEP classifications (including LEP students with disabilities) were excluded from the analyses to reduce the possible confounding of language proficiency issues. Of the 278,287 third-grade students included in the present analyses, 21,239 (7.6%) were considered to be students with disabilities. Of the 244,446 ninth-grade students, 17,321 (7.1%) were considered to be students with disabilities.

Published by Harcourt Brace Educational Measurement in 1996, the Stanford 9 is a standardized, norm-referenced test in several subject areas, including reading. According to the Harcourt Assessment website, the Stanford 9 uses an easy-hard-easy format in which difficult questions are surrounded by easy questions to encourage students to complete the test.
The reading portion of the test is characterized by three different types of reading selections (recreational, textual, and functional) and by items that assess initial understanding, interpretation, critical analysis, and reading strategy (HarcourtAssessment.com). The present study examines two subscales of the Stanford 9, Reading Comprehension (RC) and Word Analysis (WA) (more commonly known as phonics or decoding), from the two states. Public school students in grades 3 and 9 were analyzed to present data over a wider age range.

Procedure & Statistical Design

To determine whether items exhibit DIF for students with disabilities, a multi-step logistic regression procedure was employed. The outcome variable in each model was the dichotomous response to the item, which was coded as correct or incorrect. A total score on the applicable subscale (RC or WA) was computed as a proxy for ability on the construct. In step 1, the ability proxy was entered into the model and a measure of the explained variance (Nagelkerke R-square) was obtained. In step 2, the disability status grouping variable and an interaction between disability status and the ability proxy were entered into the model. Again the R-square estimate was obtained. The change in R-square between step 1 and step 2 was calculated and tested for significance. Items were identified for closer inspection as differentially functioning if the R-square change was at least 0.003 and was significant at p<0.01.

A similar approach was used to determine whether item order influences DIF for students with disabilities. Rather than using the total score as a proxy for ability, only the score on items from the first half of the assessment was used as an ability proxy (first 27 out of 54 items for RC; first 15 out of 30 items for WA). Items that exhibited DIF were examined more closely by looking at the odds ratios of the variables in the final model.
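The two-step screen described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names, the hypothetical grouped data produced by `make_item`, and the choice of a likelihood-ratio chi-square test with 2 degrees of freedom for the R-square change are all assumptions made for illustration. The ability values stand in for either ability proxy (total score or first-half score). Only the Python standard library is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [v] for row, v in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logit(x, y, iters=30):
    """Newton-Raphson logistic regression; returns (coefficients, log-likelihood)."""
    k = len(x[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(x, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * xi[i] * xi[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        loglik += math.log(p if yi else 1.0 - p)
    return beta, loglik

def nagelkerke(ll_model, ll_null, n):
    """Nagelkerke R-square: Cox-Snell R-square rescaled by its maximum."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - math.exp(2.0 * ll_null / n))

def dif_screen(ability, group, correct, r2_cut=0.003, alpha=0.01):
    """Two-step DIF screen for one item (group: 1 = focal, 0 = reference)."""
    n = len(correct)
    _, ll_null = fit_logit([[1.0]] * n, correct)
    # Step 1: ability proxy only.
    _, ll_1 = fit_logit([[1.0, a] for a in ability], correct)
    # Step 2: add the grouping variable and its interaction with ability.
    beta, ll_2 = fit_logit([[1.0, a, g, a * g]
                            for a, g in zip(ability, group)], correct)
    delta = nagelkerke(ll_2, ll_null, n) - nagelkerke(ll_1, ll_null, n)
    # Assumed test: likelihood-ratio chi-square with 2 df, whose upper-tail
    # probability is exp(-G/2) with G = 2 * (ll_2 - ll_1).
    p_value = math.exp(-max(ll_2 - ll_1, 0.0))
    return {"delta_r2": delta, "p": p_value,
            "dif": delta >= r2_cut and p_value < alpha,
            "or_group": math.exp(beta[2]),
            "or_interaction": math.exp(beta[3])}

def make_item(group_shift, cell_size=40):
    """Hypothetical grouped data: fixed counts per (group, ability) cell."""
    ability, group, correct = [], [], []
    for g in (0.0, 1.0):
        for a in (-2.0, -1.0, 0.0, 1.0, 2.0):
            hits = round(cell_size * sigmoid(0.2 + a + group_shift * g))
            for j in range(cell_size):
                ability.append(a)
                group.append(g)
                correct.append(1 if j < hits else 0)
    return ability, group, correct

no_dif = dif_screen(*make_item(group_shift=0.0))
with_dif = dif_screen(*make_item(group_shift=-1.2))
print(no_dif["dif"], with_dif["dif"])
```

In this sketch an item is flagged only when both criteria hold, mirroring the rule of an R-square change of at least 0.003 that is also significant at p<0.01. The returned odds ratios correspond to the group main effect and the group-by-ability interaction in the step 2 model; values below 1.0 indicate lower odds of a correct response for the focal group.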
If systematic differences in the DIF findings arose between the two approaches, they could then be compared. For example, if items from the latter portion of the assessment showed larger DIF effects when the second proxy was used, and if the odds ratios on those items were in a consistent direction, then it would be apparent that item order was influencing DIF.

Results

The analyses examine the following research questions:
The results are described by state and grade, and then by subscale. Detailed results of the DIF findings are available in the Appendix.

State X Grade 9

Reading Comprehension. Table 1 presents DIF results from State X in grade 9 for the 54-item Reading Comprehension subscale. The total score on the 54 items served as an ability proxy in this model. Items were identified as differentially functioning when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were 17 items that showed DIF, 13 of which were located in the second half of the assessment (Items 28 through 54). This suggests that item order might be influencing DIF. The second model used a similar method with the exception that the ability proxy was calculated only from the first 27 items. Using this method there were 23 items that showed DIF, 17 of which came from the second half of the assessment. The effect sizes using the first half ability proxy were larger, especially for the items from the second half of the assessment.

Table 1. State X Grade 9 Item-level Reading Comprehension
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 27 items was used as an ability proxy. Items that were found to exhibit DIF from Model 2 in Table 1 were examined more closely in Table 2 to determine whether item order might be systematically influencing DIF. Logistic regression models were re-run for each of the 17 DIF items from the second half of the test. Each of the three variables was entered in a separate step to determine each partial R-square addition. Odds ratios are presented for the full model. In 15 of the 17 items the main effect of the disability status grouping variable was significant and for all 15 of those items the odds ratio for the disability status grouping variable was less than 1.0. This strongly demonstrates that students with disabilities under-performed on each of those items relative to students without disabilities when controlling for performance on the first half of the assessment. Similarly, 14 of these 17 items had a significant interaction between the disability status grouping variable and the first half ability proxy and the odds ratio for each significant finding was less than 1.0. A significant interaction term with an odds ratio less than 1.0 indicates that a student with disabilities who scored well on the first 27 items would not score as well on the second half of the test as a student without disabilities who had scored similarly on the first 27 items. Table 2. State X Grade 9 Item-level Reading Comprehension Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 27 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01 Figures 1 and 2 of the expected probability of a correct response for Items 36 and 49, respectively, serve as examples to illustrate these respective effects. Figure 1. Expected Probability of a Correct Response for Item 36 in State X Grade 9 Reading Comprehension Figure 1 represents the relationship for a strong main effect on the disability status grouping variable. The odds ratio for the main effect of the disability status grouping variable was 0.51. Students with disabilities who scored similarly as students without disabilities on the first half of the assessment were less likely to answer Item 36 correctly. Figure 2 represents the relationship for an interaction between the disability status grouping variable and the ability proxy based on the score from the first half. The odds ratio for the interaction term on item 49 was 0.5. The performance gap between students with disabilities and students without disabilities becomes very large for students who performed well on the first half of the test and there is little gap for students who were one standard deviation or more below the mean on the first half of the assessment. Figure 2. Expected Probability of a Correct Response for Item 49 in State X Grade 9 Reading Comprehension Word Analysis. Table 3 presents DIF results from State X in grade 9 for the 30-item Word Analysis subscale. The total score on the 30 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were 12 items that showed DIF, 8 of which were located in the second half of the assessment (Items 16 through 30). Similar to the results for RC, this suggests that item order might be influencing DIF. The second model used a similar method with the exception that the ability proxy was calculated only from the first 15 items. 
Using this method there were 19 items that showed DIF, 13 of which came from the second half of the assessment. The effect sizes using the first half ability proxy were larger, especially for the items from the second half of the assessment. Table 3. State X Grade 9 Item-level Word Analysis
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 15 items was used as an ability proxy. Items that were found to exhibit DIF from Model 2 in Table 3 were examined more closely in Table 4 to determine if item order might be systematically influencing DIF. Logistic regression models were re-run for each of the 13 DIF items from the second half of the test. Each of the three variables was entered in a separate step to determine each partial R-square addition. Odds ratios are presented for the full model. In 12 of the 13 items the main effect of the disability status grouping variable was significant and for all 12 of those items the odds ratio for the disability status grouping variable was less than 1.0. Again this strongly demonstrates that students with disabilities under-performed on each of those items relative to students without disabilities when controlling for performance on the first half of the assessment. Additionally, 5 of these 13 items had a significant interaction between the disability status grouping variable and the first half ability proxy and the odds ratio for each significant finding was less than 1.0. All five significant interaction effects occurred on items located near the end of the test. Table 4. State X Grade 9 Item-level Word Analysis Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 15 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01.

Figures 3 and 4 show the expected probability of a correct response for Items 18 and 30, respectively, and serve as examples to illustrate these effects.

Figure 3. Expected Probability of a Correct Response for Item 18 in State X Grade 9 Word Analysis

Figure 3 represents the relationship for a strong main effect on the disability status grouping variable. The odds ratio for the main effect of the disability status grouping variable was 0.39. Students with disabilities who scored similarly to students without disabilities on the first half of the assessment were less likely to answer Item 18 correctly.

Figure 4 represents the relationship for an interaction between the disability status grouping variable and the ability proxy based on the score from the first half, along with a strong main disability effect. The odds ratio for the interaction term on Item 30 was 0.71. The odds ratio for the main disability effect was 0.28. Students with disabilities whose performance on the first 15 items was similar to that of students without disabilities are always predicted to score below students without disabilities. The gap between students with disabilities and students without disabilities in expected performance increases as performance on the first half of the test increases.

Figure 4. Expected Probability of a Correct Response for Item 30 in State X Grade 9 Word Analysis

State Y Grade 9

Reading Comprehension. Table 5 presents DIF results from State Y in grade 9 for the 54-item Reading Comprehension subscale. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. When using the total score on the 54 items as an ability proxy, no items showed DIF, although the R-square change for most items was significant at p<0.01.
In the second model, in which the ability proxy was calculated only from the first 27 items, 13 items showed DIF, 11 of which were located in the second half of the assessment. The effect sizes using the ability proxy based on the score from the first half were larger, especially for the items from the second half of the assessment. Table 5. State Y Grade 9 Item-level Reading Comprehension
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 27 items was used as an ability proxy. Items that were found to exhibit DIF from Model 2 in Table 5 were examined more closely in Table 6 to determine whether item order might be systematically influencing DIF. Logistic regression models were re-run for each of the 11 DIF items from the second half of the test. Each of the three variables was entered in a separate step to determine each partial R-square addition. Odds ratios are presented for the full model. In all 11 items the main effect of the disability status grouping variable was significant and for each of those items the odds ratio for the disability status grouping variable was less than 1.0. This strongly demonstrates that students with disabilities under-performed on each of those items relative to students without disabilities when controlling for performance on the first half of the assessment. Similarly, all 11 items had a significant interaction between the disability status grouping variable and the first half ability proxy and the odds ratio for each significant finding was less than 1.0. A significant interaction term with an odds ratio less than 1.0 indicates that a student with disabilities who scored well on the first 27 items would not score as well on the second half of the test as a student without disabilities who had scored similarly on the first 27 items. Table 6. State Y Grade 9 Item-level Reading Comprehension Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 27 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01 Word Analysis. Table 7 presents DIF results from State Y in grade 9 for the 30-item Word Analysis subscale. The total score on the 30 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. With this model, only one item showed DIF. In the second model, in which the ability proxy was calculated only from the first 15 items, 12 items showed DIF, 10 of which were located in the second half of the assessment. The effect sizes using the first half ability proxy were larger, especially for the items from the second half of the assessment. Table 7. State Y Grade 9 Item-level Word Analysis
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 15 items was used as an ability proxy.

Items that were found to exhibit DIF from Model 2 in Table 7 were examined more closely in Table 8 to determine whether item order might be systematically influencing DIF. Logistic regression models were re-run for each of the 10 DIF items from the second half of the test. Each of the three variables was entered in a separate step to determine each partial R-square addition. Odds ratios are presented for the full model. In all 10 items the main effect of the disability status grouping variable was significant, and for each of those items the odds ratio for the disability status grouping variable was less than 1.0. Again this strongly demonstrates that students with disabilities under-performed on each of those items relative to students without disabilities when controlling for performance on the first half of the assessment. Additionally, 8 of these 10 items had a significant interaction between the disability status grouping variable and the first half ability proxy, and the odds ratio for each significant finding was less than 1.0.

Table 8. State Y Grade 9 Item-level Word Analysis Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 15 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01.

State X Grade 3

Reading Comprehension. Table 9 presents DIF results from State X in grade 3 for the 54-item Reading Comprehension subscale. The total score on the 54 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were just three items that showed DIF, only one of which was located in the second half of the assessment. In the second model, in which the ability proxy was calculated only from the first 27 items, seven items showed DIF, five of which were located in the second half of the assessment. The effect sizes using the ability proxy based on the score from the first half were slightly larger for the items from the second half of the assessment.

Table 9. State X Grade 3 Item-level Reading Comprehension
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 27 items was used as an ability proxy. Items that were found to exhibit DIF from Model 2 in Table 9 were examined more closely in Table 10 to determine whether item order might be systematically influencing DIF. Logistic regression models were re-run for each of the five DIF items from the second half of the test. Each of the three variables was entered in a separate step to determine each partial R-square addition. Odds ratios are presented for the full model. In three of the five items the main effect of the disability status grouping variable was significant and for all three of those items the odds ratio for the disability status grouping variable was less than 1.0. This seems to suggest that students with disabilities under-performed on each of those items relative to students without disabilities when controlling for performance on the first half of the assessment. Similarly all five of the DIF items had a significant interaction between the disability status grouping variable and the first half ability proxy and the odds ratio for each significant finding was less than 1.0. A significant interaction term with an odds ratio less than 1.0 indicates that a student with disabilities who scored well on the first 27 items would not score as well on the second half of the test relative to a student without disabilities who had scored similarly on the first 27 items. There were fewer items exhibiting DIF in grade 3 Reading Comprehension than in grade 9 Reading Comprehension in State X. Table 10. State X Grade 3 Item-level Reading Comprehension Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 27 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01 Word Analysis. Table 11 presents DIF results from State X in grade 3 for the 30-item Word Analysis subscale. The total score on the 30 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were seven items that showed DIF, four of which were located in the second half of the assessment (Items 16 through 30). The second model used a similar method except that the ability proxy was calculated only from the first 15 items. Using this method there were nine items that showed DIF, just two of which came from the second half of the assessment (Items 16 through 30). The number of items showing DIF under this model from the second half of the test decreased. Table 11. State X Grade 3 Item-level Word Analysis
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 15 items was used as an ability proxy. These findings are unlike those seen in grade 9. In grade 9 there were more items exhibiting DIF from the second half than from the first half of the test when the score on the first half of the test was used as the ability proxy. For grade 3, logistic regression models were re-run for the two DIF items from the second half of the test, and are presented in Table 12. Only one of the two items had a significant main effect of the disability status grouping variable, and the odds ratio was less than 1.0. These results suggest that in State X the factors influencing the results for students with disabilities in grade 9 in WA were not present for students with disabilities in grade 3. Table 12. State X Grade 3 Item-level Word Analysis Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 15 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01.

State Y Grade 3

Reading Comprehension. Table 13 presents DIF results from State Y in grade 3 for the 54-item Reading Comprehension subscale. The total score on the 54 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were no items that showed DIF in grade 3 using this method. In the second model, in which the ability proxy was calculated only from the first 27 items, 7 items showed DIF, one of which was located in the second half of the assessment. The number of items showing DIF under this model from the second half of the test decreased.

Table 13. State Y Grade 3 Item-level Reading Comprehension
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 27 items was used as an ability proxy. These findings are unlike those seen in grade 9. In grade 9 there were more items exhibiting DIF from the second half than from the first half of the test when the score on the first half of the test was used as the ability proxy. For grade 3, logistic regression models were re-run for the one item showing DIF from the second half of the test, and are presented in Table 14. There was a significant main effect of the disability status variable and the odds ratio was less than 1.0. These results suggest that in State Y the factors influencing the results for students with disabilities in grade 9 in RC were not present for students with disabilities in grade 3. Table 14. State Y Grade 3 Item-level Reading Comprehension Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 27 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01 Word Analysis. Table 15 presents DIF results from State Y in grade 3 for the 30-item Word Analysis subscale. The total score on the 30 items served as an ability proxy in this model. Items were identified as DIF when the R-square change between steps 1 and 2 was at least 0.003 and was significant at p<0.01. There were no items that showed DIF using the total score on the 30 items as an ability proxy. In the second model, in which the ability proxy was calculated only from the first 15 items, 12 items showed DIF, 4 of which were located in the second half of the assessment. Table 15. State Y Grade 3 Item-level Word Analysis
Note: In Model 1, the total score was used as an ability proxy. In Model 2, the score on the first 15 items was used as an ability proxy. These findings are unlike those seen in grade 9. In grade 9 there were more items exhibiting DIF from the second half than from the first half of the test when the score on the first half of the test was used as the ability proxy. Logistic regression models were re-run for the four items that did indicate DIF from the second half of the test, and are presented in Table 16. The odds ratio for each of these items indicates that students with disabilities under-performed relative to students without disabilities after controlling for performance on the first half of the test. Table 16. State Y Grade 3 Item-level Word Analysis Logistic Regression Results for Items Showing DIF with Ability Proxy Based On First 15 Items Score
Note: * denotes significance at p<.05. ** denotes significance at p<.01.

Discussion

Students with disabilities tend to perform at lower levels than students without disabilities. While their lower performance can be attributed to their specific disability, there may be other factors that interfere with their performance. It is necessary to identify such factors and reduce their interference, so that we may obtain accurate measurements of the knowledge of students with disabilities. Recent reauthorizations of federal legislation make it imperative that the instruction and assessment of students with disabilities be as fair and adequate as possible. While we recognize that factors related to instruction and assessment are intricately intertwined, only a relatively small portion of students with disabilities have conditions that lower their performance potential. This study does not address that issue but instead focuses specifically on factors related directly to the assessments and the accuracy with which they reflect what students learn.

The present study explored whether items in a reading assessment functioned differentially for students with disabilities, as compared to their peers without disabilities. Results of this study can provide insight into potential factors affecting the accessibility of reading assessments for students with disabilities, as part of an effort to improve assessments for all students. The following research questions guided this study:
To answer these research questions, student responses on multiple-choice items were compared across the disability status categories in two reading subscales of the Stanford 9, Reading Comprehension (RC) and Word Analysis (WA), in two grade levels (3 and 9) from public schools in two different states (State X and State Y). A multi-step logistic regression procedure was used. Because it is essential in DIF analysis that the two groups being compared are matched on ability level, ability proxies were used, based on either the total score on the subscale or the combined score on the first half of the subscale.

After accounting for reading ability, results for grade 9 in both states indicated that a number of items exhibited DIF for students with disabilities on both the RC and WA subscales. Results also indicated that the items exhibiting DIF for students with disabilities were more likely to be located in the second half of the RC and WA subscales. When the reading ability proxy was based on the combined score from the first half of the RC or WA subscales, the effect size for DIF increased for the items located in the second half. Furthermore, students with disabilities consistently under-performed on the second half of the items relative to the first half of the items.

These results were not seen for grade 3. In other words, fewer items were shown to exhibit DIF for students with disabilities than in grade 9. This was true for both the RC and WA subscales and for both states. In grade 3, items that were shown to exhibit DIF for students with disabilities were no more likely to be located in the second half of the assessments than in the first half.

The findings of this study have multiple implications.
There are differences between grade 3 and grade 9 that may result from the cognitive development of reading skills, from differences in assessment standards for those grades, or from students with disabilities being more clearly identified as having disabilities in later grades. In grade 9, we might speculate about what factors contribute to the diminishing performance of students with disabilities as the test progresses. Perhaps students with disabilities did not have sufficient time or energy to complete the test and rushed through the answers at the end. It could be that they reached a certain cognitive overload, lost motivation, or became fatigued or frustrated. Our companion report, which examines differential distractor functioning, found that students with disabilities in grade 9 appear to make more random guesses, rather than educated guesses, on items located in the second half of the assessments than their peers without disabilities do (see Abedi, Leon, & Kao, 2006, for more detail). More research is needed to determine the actual cause or causes. Qualitative research with students may shed some light on these factors.

This study has several major limitations. For instance, it does not differentiate between categories of disabilities. Students with disabilities are not a homogeneous subgroup. Not only are there different types of disabilities, but even within the same type of disability there are differences among individuals. Further insight could be gained from analyzing data by specific disability groups. This study was also limited in scope. We did not have access to information on testing accommodations. Although our study was conducted on the assumption that students were properly accommodated, we cannot confirm this. It could be that students with disabilities did not receive adequate or appropriate accommodations, and knowing this could inform the results.
Also, we did not have access to the actual test booklets or test items, which could have provided further insight into the findings. Future studies should take into account accommodations and examine test booklets.

Nevertheless, findings of this study provide evidence that other factors related to the assessments may contribute to the performance gap between students with disabilities and their peers without disabilities. Controlling for factors that are not related to the content being assessed may help test developers provide more accessible and more valid assessments for students with disabilities. Additionally, being cognizant that other factors exist may help when interpreting test results for students with disabilities, especially in the context of accountability.

References

Abedi, J., Leon, S., & Kao, J. (2006). Examining differential distractor functioning in reading assessments for students with disabilities (PARA report). Minneapolis, MN: Partnership for Accessible Reading Assessment.
Bolt, S. E. (2004). Using DIF analyses to examine several commonly-held beliefs about testing accommodations for students with disabilities. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Diego, CA. Retrieved June 20, 2006, from http://education.umn.edu/NCEO/Presentations/NCME04bolt.pdf
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.
Cohen, A. S., Gregg, N., & Deng, M. (2005). The role of extended time and item content on a high-stakes mathematics test. Learning Disabilities Research & Practice, 20(4), 225-233.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.
HarcourtAssessment.com. (n.d.). Stanford Achievement Test series, Ninth edition - Complete battery. Retrieved May 18, 2006, from http://harcourtassessment.com/hai/ProductLongDesc.aspx?ISBN=E132C&Catalog=TPC-USCatalog&Category=AchievementAccountability
Hauser, C., & Kingsbury, G. (2004). Differential item functioning and differential test functioning in the Idaho Standards Achievement Tests for spring 2003. Lake Oswego, OR: Northwest Evaluation Association.
Klein, J. A., Wiley, H. I., & Thurlow, M. L. (2006). Uneven transparency: NCLB tests take precedence in public assessment reporting for students with disabilities (Technical Report 43). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Koretz, D., & Hamilton, L. (1999). Assessing students with disabilities in Kentucky: The effects of accommodations, format, and subject. Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing. Retrieved June 28, 2006, from http://www.cresst.org/Reports/TECH498.pdf
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment. Educational Measurement: Issues and Practice, 19(3), 5-15.
Snetzler, S., & Qualls, A. L. (2000). Examination of differential item functioning on a standardized achievement battery with limited English proficient students. Educational and Psychological Measurement, 60(4), 564-577.
Thurlow, M. L., Moen, R. E., & Wiley, H. I. (2005). Annual performance reports: 2002-2003 state assessment data. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved May 2, 2006, from http://education.umn.edu/nceo/OnlinePubs/APRsummary2005.pdf
U.S. General Accounting Office. (2005). No Child Left Behind Act: Most students with disabilities participated in statewide assessments, but inclusion options could be improved. Washington, DC: Author.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2004). DIF detection and interpretation in large-scale science assessments: Informing item writing practices. Educational Assessment, 9(1-2), 61-78.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved March 10, 2006, from http://educ.ubc.ca/faculty/zumbo/DIF/handbook.pdf

Acknowledgments

The work reported here was supported under a grant from the U.S. Department of Education, Office of Special Education Programs. The findings and opinions expressed in this report are those of the authors and do not necessarily reflect the positions or policies of the Office of Special Education Programs or the U.S. Department of Education.

The authors acknowledge the valuable contribution of colleagues in this study. The authors are thankful to Martha Thurlow, Ross Moen, Christopher Johnstone, and other staff at the National Center on Educational Outcomes, and other members of the Partnership for Accessible Reading Assessment for their helpful comments and suggestions. Danna Schacter at CRESST also helped us substantially with the preparation of this manuscript. The authors are also grateful to Eva Baker for her support of this work, and to Joan Herman for her extensive involvement, advice, and support of this work.

Appendix

Detailed DIF Results

Table A1. State X Grade 9 Item-level Reading Comprehension Ability Proxy Based On All 54 Items
Table A2. State X Grade 9 Item-level Reading Comprehension Ability Proxy Based On First 27 Items
Table A3. State X Grade 9 Item-level Word Analysis Ability Proxy Based On All 30 Items
Table A4. State X Grade 9 Item-level Word Analysis Ability Proxy Based On First 15 Items
Table A5. State Y Grade 9 Item-level Reading Comprehension Ability Proxy Based On All 54 Items
Table A6. State Y Grade 9 Item-level Reading Comprehension Ability Proxy Based On First 27 Items
Table A7. State Y Grade 9 Item-level Word Analysis Ability Proxy Based On All 30 Items
Table A8. State Y Grade 9 Item-level Word Analysis Ability Proxy Based On First 15 Items
Table A9. State X Grade 3 Item-level Reading Comprehension Ability Proxy Based On All 54 Items
Table A10. State X Grade 3 Item-level Reading Comprehension Ability Proxy Based On First 27 Items
Table A11. State X Grade 3 Item-level Word Analysis Ability Proxy Based On All 30 Items
Table A12. State X Grade 3 Item-level Word Analysis Ability Proxy Based On First 15 Items
Table A13. State Y Grade 3 Item-level Reading Comprehension Ability Proxy Based On All 54 Items
Table A14. State Y Grade 3 Item-level Reading Comprehension Ability Proxy Based On First 27 Items
Table A15. State Y Grade 3 Item-level Word Analysis Ability Proxy Based On All 30 Items
Table A16. State Y Grade 3 Item-level Word Analysis Ability Proxy Based On First 15 Items
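The flagging rule behind the item-level results in this report (an R-square change of at least 0.003 between steps 1 and 2 that is significant at p<.01) can be illustrated with a minimal Python sketch. It is not the authors' code. Several details are assumptions made only for illustration: step 1 is taken to contain the ability proxy alone and step 2 to add a 0/1 disability status indicator; the R-square is taken to be Nagelkerke's pseudo R-square; significance is assessed with a 1-df likelihood-ratio test; and the data come from a hypothetical helper, make_item, rather than from State X or State Y. All function names are hypothetical.

```python
import math

def _solve(a, b):
    """Solve a small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, y, max_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; an intercept is added here.
    Returns (coefficients, log-likelihood). Not hardened for separable data."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = _sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * xi[i] * xi[j]
        step = _solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * z - (z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z)))
    return beta, loglik

def nagelkerke_r2(loglik, loglik_null, n):
    """Nagelkerke pseudo R-square (assumed here; the report says only 'R-square')."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik) / n)
    return cox_snell / (1.0 - math.exp(2.0 * loglik_null / n))

def flag_dif(item, ability, group, min_r2_change=0.003, alpha=0.01):
    """Step 1: ability proxy only. Step 2: add group (1 = student with a disability)."""
    n = len(item)
    p_bar = sum(item) / n  # must lie strictly between 0 and 1
    ll_null = n * (p_bar * math.log(p_bar) + (1 - p_bar) * math.log(1 - p_bar))
    _, ll1 = fit_logistic([[a] for a in ability], item)
    beta2, ll2 = fit_logistic([[a, g] for a, g in zip(ability, group)], item)
    r2_change = nagelkerke_r2(ll2, ll_null, n) - nagelkerke_r2(ll1, ll_null, n)
    chi2 = max(2.0 * (ll2 - ll1), 0.0)          # likelihood-ratio statistic, 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square(1) upper tail
    return {"r2_change": r2_change, "p_value": p_value,
            "odds_ratio": math.exp(beta2[2]),
            "dif": r2_change >= min_r2_change and p_value < alpha}

def make_item(group_shift, per_cell=100):
    """Hypothetical data: proxy scores 0-15 crossed with two groups, using
    expected (rounded) counts of correct answers instead of random draws."""
    item, ability, group = [], [], []
    for a in range(16):
        for g in (0, 1):
            n_correct = round(per_cell * _sigmoid(-2.0 + 0.3 * a + group_shift * g))
            item += [1] * n_correct + [0] * (per_cell - n_correct)
            ability += [a] * per_cell
            group += [g] * per_cell
    return item, ability, group

no_dif = flag_dif(*make_item(0.0))     # identical response pattern in both groups
with_dif = flag_dif(*make_item(-1.0))  # group 1 disadvantaged at equal proxy score
print(no_dif["dif"], with_dif["dif"])  # prints: False True
```

With these hypothetical data, the item built with no group effect is not flagged, while the item built with a negative group effect is flagged and has an odds ratio below 1.0. Passing the first-half score instead of the total score as the ability argument corresponds to moving from Model 1 to Model 2 in the tables.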