The Design and Validation of a Motivating Large-Scale Accessible Reading Comprehension Assessment for Students with Disabilities

Deborah R. Dillon, David G. O'Brien, Kato Kentaro, Cassandra Scharber, Catherine Kelly, Anne Beaton, & Brad Biggs
(all authors are from the University of Minnesota, Twin Cities)

Contact person:
Deborah R. Dillon
University of Minnesota, Twin Cities
330B Peik Hall, 159 Pillsbury Dr. S. E.
Curriculum & Instruction Department
College of Education & Human Development
Minneapolis, MN 55455
PH:  612.626.8271; Fax: 952.944.2672
Email:  dillon@umn.edu  


In the United States there are over 6.7 million children and youth with disabilities (U.S. Department of Education, 2006). These students, served by federally supported programs for students with disabilities during the 2005-2006 school year, represented nearly 14 percent of total school enrollment. The 14% includes students with specific learning disabilities (41%), speech or language impairments (21.9%), intellectual and developmental disabilities (8.3%), emotional disturbance (7.1%), and students with hearing impairments, orthopedic impairments, other health impairments, visual impairments, multiple disabilities, deaf-blindness, autism, traumatic brain injury, and developmental delay. Historically, students with disabilities have been excluded from accountability testing, but recent legislation, including the reauthorization of the Individuals with Disabilities Education Act (IDEA) and the No Child Left Behind (NCLB) Act of 2001, mandates the inclusion of these students in statewide accountability assessments with the goal of promoting higher achievement. Although including students with disabilities in accountability assessment is an important endeavor, it has introduced challenges for states that seek to improve the quality of assessments for these students.

States are required to assess annually the reading achievement of at least 95 percent of students, including students with disabilities, and to report the academic achievement of students with disabilities as a separate subgroup (NCLB, 2002). While federal regulations allow states to count a small percentage of students taking alternate assessments toward their adequate yearly progress (AYP) calculations for NCLB, most students with less severe cognitive disabilities still take regular assessments. In addition, because comparability between alternate assessments and regular accountability assessments is a major concern, it is important that states provide regular assessments that are accessible to students with disabilities. By accessible we mean an assessment students can approach and participate in despite characteristics that might inhibit or preempt their participation in typical assessments. The goal of our research project is to design a large-scale reading comprehension test that is accessible to all students. To do so, we focused on a key characteristic of students with disabilities that influences their reading achievement—their motivation to participate in, and complete, reading comprehension assessments. Specifically, we identified texts based on topical interest and on popular genres read by children and youth, and selected passages from primary sources that represent students’ in- and out-of-school reading. We then reproduced these texts, complete with color pictures, when we created our motivating assessment.

THEORETICAL FRAMEWORKS

Students with disabilities often view the task of reading multiple passages and answering comprehension items on a test as insurmountable. Part of this stance is attributable to low self-efficacy (Authors, XXXXa). These low self-efficacy beliefs, in turn, impact the amount of effort these students expend, their willingness to persevere, and the control they exercise over their environment (Bandura, 1986; Pajares, 2002; Purdie & Hattie, 1996). If students with disabilities have expectations for success and value the given task, their self-efficacy is predicted to influence their effort, persistence, and achievement (Bandura, 1986). Additionally, self-efficacy plays a role in determining how resilient an individual will be in the face of adversity (Jinks & Morgan, 1999; Pajares, 1996). Thus, students’ self-efficacy beliefs impact their academic performance by influencing the choices students make in academic pursuits, how much effort they expend on a given task, and how much anxiety they experience during that task (Pajares, 2002). The research on self-efficacy provides a rationale for creating accessible reading assessments: If we create assessments in which students want to participate and feel they can succeed, then they may engage in the task and persevere, allowing us to gain a more valid picture of their reading proficiency.

Guthrie and Wigfield (2000; 2005) have identified constructs that, if attended to in reading assessments, could increase students’ motivation while being assessed. These constructs include “characteristics of the text, student choice and control of the setting, student goals, difficulty sequence of the items, complexity of the task, response opportunities, and accompanying activities that may influence the situational motivation of students during an assessment” (p. 198). However, as Guthrie et al. (2004) noted, despite the fact that motivation and engagement contribute to reading comprehension, “motivation and engagement have rarely been incorporated into experimental studies of instruction or interventions for reading comprehension” (p. 403).

As part of a large research and development project funded by the Research to Practice Division, Office of Special Education Programs, U.S. Department of Education (Authors, XXXXb), we attended to motivation constructs by designing a large-scale reading comprehension assessment that addressed students’ interests and their sense of self-efficacy, with the goal of making the assessment more accessible to 4th and 8th graders with a range of disabilities that affect their reading of typical large-scale comprehension tests. Specifically, we designed an assessment that included interesting texts because research indicates that when students find texts interesting, they comprehend better (Schiefele, 1999), particularly students who are typically poor comprehenders (deSousa & Oakhill, 1996).

Purpose of the Study and Guiding Questions

Our goal was to examine whether improving the motivational characteristics of a large-scale reading assessment increases its accessibility for students with disabilities by increasing their engagement. The purpose of this paper is to report the findings of the first phase of the assessment development: scaling, or calibrating, the measurement tools (passages and items) that will be used in the second phase (not reported here), the administration of a large-scale comprehension assessment to students with disabilities in which students will be allowed to choose the passages they read. The calibration process allowed us to empirically determine the comparability of passages and items (questions following the passages) used in the reading assessment by placing all passages and items on a common equal-interval measurement scale based on item response theory (IRT). We also gathered important data on students’ perceptions of passage interest and difficulty. This first phase of the study was guided by the following questions:

  1. How well do the engaging and accessible reading passages function when they are placed on a common interval measurement scale to allow scores from different passages (of equal or unequal difficulty) to be compared and equated? How well do the multiple choice items designed for the engaging reading passages function?
  2. Which engaging passages do students prefer to read? Which passages do they consider to be challenging?
  3. How do students’ interests and their ratings of difficulty of passages impact their reading performance?

RESEARCH METHODS

Participants

The study focused on a representative total sample of 1,245 students selected from schools in a large urban district in a midwestern city whose teachers volunteered to participate in the study. The district enrolls a total of 36,370 students in 99 schools. Of these students, 64.1% receive free or reduced-price lunch, and the federal categories for race and ethnicity include the following proportions of students: 28.2% White; 41.5% Black; 15.4% Hispanic; 10.8% Asian/Pacific Islander; and 4.1% American Indian/Alaska Native. Somali and Hmong students, who represent a significant proportion of students of color, are not yet broken out by the district into specific subgroups within the federal categories. A total of 627 students from intact classrooms in grades 3-5 (211 3rd graders, 211 4th graders, 205 5th graders) participated in the study, along with 618 students from intact classrooms in grades 7-9 (224 7th graders, 212 8th graders, 182 9th graders). Thus, the design involved students representing the full range of reading ability--a rectangular distribution. The sample also included students with disabilities and ELL students who were fluent in English, and schools provided accommodations to students in accordance with their IEPs. Teachers provided general demographic information about each class that participated in the testing (e.g., number of students with disabilities, number of English language learners, and reading level range). These data were used to describe the study sample and were not linked to individual students.

Developing the Assessment: Selecting Passages and Items

The first step in the calibration process involved selecting 40 passages: 20 for the 4th grade reading assessment and 20 for the 8th grade assessment. The passages at each grade level included 10 literary-fiction and 10 informational-expository texts, matching several text types identified in the new National Assessment of Educational Progress (NAEP) Reading Framework (2009). The passages were selected based on a review of research in children’s and young adult literature that indicated topics of interest to students at the targeted grade levels. Expertise was sought from reading and children’s literature scholars and classroom teachers during the selection process. Passage lengths were determined using technical specifications documentation for existing state assessments at each grade level, with the goal of using intact passages or short chapter excerpts from books. The difficulty of each passage for 4th and 8th grade readers was estimated using standard readability measures, including the Fry Readability Graph (Fry, 1977) and Lexiles (Lexile Framework, MetaMetrics, Inc.). The passages were also assigned a difficulty rating of easy, medium, or hard by reading experts. After passages were identified and analyzed, and permission was secured from publishers, illustrators, and photographers to use the texts in an accessible reading assessment, experienced test item developers wrote 10 multiple choice questions for each passage, including at least one vocabulary item. The item writers used the NAEP 2009 Reading Framework cognitive targets to construct the items in three categories: (a) locate/recall, (b) integrate/interpret, and (c) critique/evaluate. After the item writers drafted items, the passages and items were examined by reading experts for quality and potential bias.
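To illustrate the kind of computation behind the Fry Readability Graph referenced above, the sketch below estimates the two coordinates the graph uses (sentences and syllables per 100 words) for a roughly 100-word sample. The regular expressions and the vowel-group syllable counter are simplifying assumptions of ours, not part of the study's procedure, and Lexile measures come from MetaMetrics' proprietary analyzer rather than from a formula like this.

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fry_coordinates(sample_text):
    """Return (sentences per 100 words, syllables per 100 words) for a
    roughly 100-word sample -- the two values plotted on the Fry graph."""
    words = re.findall(r"[A-Za-z']+", sample_text)[:100]
    sentence_count = max(1, len(re.findall(r"[.!?]+", sample_text)))
    syllable_count = sum(count_syllables(w) for w in words)
    per_100 = 100.0 / len(words)
    return sentence_count * per_100, syllable_count * per_100
```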

Our goal was to retain the original visual characteristics (photos, pictures, creative fonts, maps, sketches, illustrations, or diagrams) that accompanied each passage because we believed those features would engage students and motivate them to read and, as others have found, support their comprehension (Brookshier, Scharff, & Moses, 2002; Levie & Lentz, 1982). However, some research suggests that visuals can be distracting to students with disabilities and impact their comprehension (Rose, 1986). To address this concern, we developed criteria to guide decisions about the inclusion or exclusion of visual items for each of the reading passages used in the assessment. The criteria were generated using the research literature and expert opinions from a noted scholar in children’s literature, reading professors, and expert K-12 classroom teachers (Authors, XXXXc). Consensus for retention or exclusion was reached through individual analysis followed by group analysis and discussion. Agreement was nearly 95% for all visual items prior to discussion between researchers and 100% after discussion. Although there is little definitive evidence on how best to design the layout of test pages to make them more readable and accessible (Chall & Squire, 1996; Waller, 1996), we drew on criteria from syntheses of research in cognate areas, including studies of readability and document analyses (Oakland & Lane, 2004) and universal design applied to large-scale assessments (Thompson, Johnstone, & Thurlow, 2002), to inform our design and layout. Test passages were reproduced by digital layout experts working with assessment professionals to look as much as possible like the passages as they appeared in the original publications, with a few modifications. A two-column format was used for all passages (in place of the varied column layouts found among the originals) to ensure that passages were presented consistently, without clutter and with ample white space on each page. The 10 items following each passage were arranged on the page to make the answer choices easy to read and to contrast the answer section with other text on the page. The end result was a test booklet with three reading passages and multiple choice questions; most passages included color pictures or visuals.

Assessment Administration Procedures

We developed five forms of the assessment, each consisting of a set of passages arranged in a counterbalanced order to reduce order effects. Across all five forms of the test we included six anchor passages (passages included in all forms) and 14 non-anchor passages (passages that varied across forms), of which three were included in each form of the test. Thus, each form included nine passages: six anchor and three non-anchor passages. Within classes, students were randomly assigned to one of the five test forms. After students read each passage and completed its 10 multiple choice questions, they answered two generic Likert-scale questions. The Likert questions (each with a 4-point scale) required students to rate their interest in the passage content and how challenging the text was to read. Since the Likert items immediately followed each passage, students were able to remember passage content well enough to rate how interesting or challenging the material was before they continued on to the next passage, yet the two brief questions were not so intrusive as to slow the momentum needed to complete the test. We administered the assessment in three separate testing sessions--all on different days of the week or across two weeks--in an effort to reduce fatigue and to allow students all the time they needed to complete the assessment.
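A minimal sketch of how the five counterbalanced forms and the random within-class assignment could be represented follows; the passage identifiers and the rotation scheme for distributing the 14 non-anchor passages across forms are illustrative assumptions of ours rather than the project's actual form maps.

```python
import random

ANCHOR = [f"A{i}" for i in range(1, 7)]        # 6 anchor passages (placeholder IDs)
NON_ANCHOR = [f"N{i}" for i in range(1, 15)]   # 14 non-anchor passages (placeholder IDs)

def build_forms(n_forms=5, non_anchor_per_form=3, seed=1):
    """Each form carries all anchors plus a rotating slice of non-anchors;
    passage order is shuffled per form to counterbalance order effects."""
    rng = random.Random(seed)
    forms = []
    for f in range(n_forms):
        start = f * non_anchor_per_form
        slice_ = [NON_ANCHOR[(start + i) % len(NON_ANCHOR)]
                  for i in range(non_anchor_per_form)]
        passages = ANCHOR + slice_
        rng.shuffle(passages)
        forms.append(passages)
    return forms

def assign_students(student_ids, n_forms=5, seed=2):
    """Randomly assign each student within a class to one of the forms."""
    rng = random.Random(seed)
    return {sid: rng.randrange(n_forms) for sid in student_ids}
```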

ANALYSIS AND RESULTS

In the first part of our analysis we sought to address research question #1: How well do the engaging and accessible reading passages function when they are placed on a common interval measurement scale to allow scores from different passages (of equal or unequal difficulty) to be compared and equated? How well do the multiple choice items designed for the engaging reading passages function?

How well did the multiple choice items designed for the engaging reading passages function? Following the analysis, over half of the 40 interesting passages we selected, together with at least 7 of the 10 multiple choice items written for each retained passage, were successfully calibrated, that is, placed on a common scale. Hence, the calibration resulted in a high quality, valid assessment that can be used in phase two of the research, wherein students will choose the passages they want to read on an accessible reading assessment. The following section details the calibration process, indicating how we created a valid and reliable set of passages and items.

Descriptive Item and Passage Analysis

The 4th grade and 8th grade assessments were calibrated separately with different sets of field-testing data, but in both cases we followed the same two-step procedure: (1) descriptive item/passage analysis, consisting of item selection, a reliability check, and passage selection; and (2) IRT calibration, consisting of an initial calibration and final item selection. Our final goals included the following: (a) selection of six passages in each text type (expository or literary), (b) selection of seven items in each passage, and (c) calibration of the retained items on a single scale.

Descriptive item/passage analyses were conducted for the following purposes: (a) screening the items for the subsequent IRT calibration by excluding items with undesired characteristics and (b) selecting a smaller number of passages based on motivation ratings.

Item Selection

We computed basic item statistics, including item difficulty and response proportions, point biserial correlations, and the omission rate, to identify items with undesirable characteristics to be excluded from the subsequent analysis. Response proportions were calculated for all response options in each item; the proportion for the correct response option represented the difficulty of the item. If any response option, whether the correct one or a distractor, attracted too large a proportion of responses, the item was eliminated because it did not elicit useful information about the examinees’ ability; most students gave the same response. Accordingly, items with a proportion greater than .90 on any response option, or with a correct response proportion smaller than .25 (the chance-level probability for four-alternative items), were flagged for exclusion.
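A short sketch of the response-proportion screen just described, assuming item responses are stored as option letters with omissions coded as empty strings; the data layout is hypothetical, but the .90 and .25 cutoffs follow the text.

```python
import numpy as np

def proportion_flag(responses, key, max_any=0.90, min_correct=0.25,
                    options=("A", "B", "C", "D")):
    """Flag an item whose response proportions carry little information:
    any option drawing more than max_any of the (non-omitted) responses,
    or a correct-response proportion below the chance level min_correct."""
    responses = np.asarray(responses)
    answered = responses[responses != ""]
    props = {opt: float(np.mean(answered == opt)) for opt in options}
    return any(p > max_any for p in props.values()) or props[key] < min_correct
```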

Point biserial correlations between response choice and total score. Point biserial correlations were calculated between each of the four response options and the total (number-correct) score for each item. The correlation between the correct response and the total score was regarded as a measure of discrimination, the extent to which the item discriminates among examinees on the target ability. Ideally, the correct response should correlate highly with the total score and, hence, be the best indicator of ability. In addition, the correlation between the correct response and the total score should not be too small, even if it is the highest among all response options. Accordingly, these items were flagged for exclusion: (a) items for which the largest point biserial correlation was observed for a response option other than the correct one, or (b) items for which the point biserial correlation for the correct response option was smaller than .15.
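The point biserial statistics above amount to Pearson correlations between a 0/1 option-choice indicator and the number-correct total. A sketch follows, under the assumption that choices are held in a students-by-options indicator matrix (a representation we introduce for illustration).

```python
import numpy as np

def point_biserial(indicator, total_score):
    """Pearson correlation between a 0/1 'chose this option' indicator and
    the number-correct total score (the point biserial correlation)."""
    return float(np.corrcoef(np.asarray(indicator, float),
                             np.asarray(total_score, float))[0, 1])

def discrimination_flag(option_matrix, key_index, totals, min_rpb=0.15):
    """Flag an item if the correct option's point biserial is not the largest
    among all options, or if it falls below min_rpb (.15 in the text)."""
    option_matrix = np.asarray(option_matrix, float)
    rpbs = [point_biserial(option_matrix[:, k], totals)
            for k in range(option_matrix.shape[1])]
    return int(np.argmax(rpbs)) != key_index or rpbs[key_index] < min_rpb
```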

Students who attempted all nine passages and received the same form across the three sessions were included in the computation of total scores. Omitted responses were treated as incorrect in the total score calculation. Ideally, students with any omitted responses would be excluded, but doing so would have substantially reduced the number of valid students in the current data pool. By selecting students who attempted all nine passages, the effect of omitted responses on the total score was kept to a minimum.

Omission rates. The omission rate is another potentially important index of item characteristics. If too many students skip an item, that item may have some undesirable characteristic. The denominator of the omission rate is the number of students who attempted the item. Items with an omission rate greater than .10 were flagged for exclusion. Items that failed to meet one or more of the above four criteria (difficulty range, maximum point biserial on the correct response, magnitude of the point biserial for the correct response, and omission rate) were considered for possible exclusion from calibration.
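A small companion sketch for the omission-rate criterion, with omissions again coded as empty strings (an assumed coding); combining the four flags into a single exclusion decision is then a simple Boolean OR, as the trailing comment indicates.

```python
import numpy as np

def omission_rate(responses):
    """Proportion of omitted responses (empty strings) among the students
    to whom the item was administered."""
    responses = np.asarray(responses)
    return float(np.mean(responses == ""))

# An item becomes a candidate for exclusion if it is flagged on any criterion,
# reusing the helpers sketched above:
# exclude = (proportion_flag(resp, key) or
#            discrimination_flag(options, key_idx, totals) or
#            omission_rate(resp) > 0.10)
```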

Reliability Check

We computed Cronbach’s coefficient alpha to check the internal consistency of the test items. Since we had multiple forms as well as passages, coefficient alpha was computed for (a) all items in the anchor passages (60 items common to all forms, less the deleted items), (b) all items in the non-anchor passages by form (30 items, which varied across forms, less the deleted items), (c) all items by form (90 items, less the deleted items), and (d) by passage. As in the computation of point biserial correlations, only students who both attempted all nine passages and received the same form across the three sessions were included in the computation of reliability. Since the number of items differed by comparison, we also computed reliability estimates corrected for the number of items by the Spearman-Brown (SB) formula. The hypothetical number of items in the SB correction was set to 14, which equals the number of items included in two passages in our final tests.
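For reference, a compact sketch of coefficient alpha and the Spearman-Brown length correction to a hypothetical 14-item test, assuming a students-by-items matrix of 0/1 scores (the data layout is our assumption).

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (n_students x n_items) matrix of 0/1 scores."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    return (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum()
                            / X.sum(axis=1).var(ddof=1))

def spearman_brown(alpha, n_items, target_items=14):
    """Project reliability to a test of target_items (14 = two 7-item
    passages in the final tests) using the Spearman-Brown formula."""
    ratio = target_items / n_items
    return ratio * alpha / (1 + (ratio - 1) * alpha)
```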

Passage Selection

Six passages were selected for each text type (expository or literary), yielding a total of 12 passages per grade level to be used in phase two of the study—the creation of the motivation assessment. We first ranked the passages based on the mean rating of how interesting each passage was, and then selected the top six passages for each text type. Passages for which too many items had been dropped in the item analysis were not included in the ranking. We also took into account the quality of the passages, including how interesting they were to all students, whether there was a reasonable representation of topics of interest to both genders, and whether students felt the passages were appropriately challenging (not too hard, not too easy). In Table 1 and Table 2 we list the top rated texts for each grade level by genre and provide a short description of each passage.

Calibration by IRT: Model and Estimation Details

The particular IRT model used in this study is the three-parameter logistic model (3PLM; see Baker, 2001, for a brief introduction to IRT; Hanson, 2002), which is suitable for multiple-choice items. In the 3PLM, the item response function (IRF) is specified by three parameters: discrimination (a), difficulty (b), and the guessing parameter (c). More specifically, the IRF for item j in the 3PLM is given by

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + \exp\{-a_j(\theta - b_j)\}} \qquad \text{(Equation 1)}$$

where $\theta$ denotes the ability of an examinee. Note that the normal metric correction factor (i.e., D = 1.702) is included in the a parameter in this study.
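A direct transcription of Equation 1 in Python is useful for checking item behavior; the example parameter values at the bottom are arbitrary illustrations, and D = 1.702 is assumed to be folded into a, as stated above.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response at
    ability theta, with the normal-metric constant absorbed into a."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Illustrative item: moderate discrimination, average difficulty, some guessing.
theta_grid = np.linspace(-3, 3, 7)
print(irf_3pl(theta_grid, a=1.2, b=0.0, c=0.20))
```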

We needed to estimate the item parameters (a, b, and c) for each item using our calibration sample. It is known that some item parameter estimates drift to extreme values, due to a relatively flat likelihood function, when the maximum likelihood method is used and the sample size is not large enough, as is the case in the current study. Even when the sample size is large, estimation of the c parameter is sometimes problematic (Zeng, 1997). In order to assess the extent to which parameter drift would occur in the given setting, preliminary simulation studies were conducted under the same conditions as the current study (i.e., 600 examinees take one of five forms, each containing 90 items; there are 60 anchor items and 140 non-anchor items; item parameters were randomly generated within their usual ranges). Results indicated that parameter drift would be likely to occur. Several approaches are available to stabilize item parameter estimates (e.g., Baldwin, 2006; Barnes & Wise, 1991). We adopted the Bayesian approach, which imposes prior distributions on item parameters to confine their estimates to reasonable ranges. In particular, a four-parameter beta prior distribution (Zeng, 1997) was used for each parameter. Another set of simulations showed that Bayesian estimates resulted in higher correlations with, and smaller errors relative to, the true parameter values than the usual maximum likelihood estimates, as also demonstrated by Swaminathan and Gifford (1986) and Swaminathan et al. (2003).

The main purpose of imposing prior distributions is to limit the parameter estimates to reasonable ranges (e.g., a parameters should be greater than zero, and their values are unlikely to exceed, say, 5.0); in other respects the prior distributions should be made as neutral as possible (in principle, prior distributions have a subjective nature, and a researcher may choose to have them reflect stronger prior information). The four-parameter beta distribution is specified by four parameters and denoted by B4(s1, s2, l, u), where s1 and s2 determine the shape of the distribution, and l and u specify the lower and upper limits, respectively. For all a parameters, we used B4(1.75, 3.00, 0, 5.10), which limits the range of a parameters to between 0 and 5.10, with mean, 10th, and 90th percentiles of 1.88, 0.58, and 3.33, respectively. For all b parameters, we used B4(1.15, 1.15, -6.00, 6.00), which limits the range of b parameters to between -6 and 6, with mean, 10th, and 90th percentiles of 0, -4.57, and 4.57, respectively. For all c parameters, we used B4(3.50, 4.00, 0, 0.50), which limits the range of c parameters to between 0 and .50, with mean, 10th, and 90th percentiles of 0.23, 0.12, and 0.35, respectively. Parameter estimation was implemented with the computer program ICL (Hanson, 2002). The above prior distributions are the defaults in ICL.
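The prior summaries quoted above can be reproduced by rescaling a standard beta distribution to the interval [l, u]; a sketch using SciPy (our choice of tool for illustration -- ICL handles the priors internally) follows.

```python
from scipy.stats import beta

def beta4_summary(s1, s2, lower, upper):
    """Mean and 10th/90th percentiles of a four-parameter beta B4(s1, s2, l, u),
    i.e., a Beta(s1, s2) variable rescaled to [lower, upper]."""
    dist = beta(s1, s2, loc=lower, scale=upper - lower)
    return dist.mean(), dist.ppf(0.10), dist.ppf(0.90)

# Priors used for the a, b, and c parameters (values from the text).
for name, args in (("a", (1.75, 3.00, 0.0, 5.10)),
                   ("b", (1.15, 1.15, -6.0, 6.0)),
                   ("c", (3.50, 4.00, 0.0, 0.50))):
    print(name, beta4_summary(*args))
```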

Initial IRT calibration. The multiple choice items that were retained in the screening step for each of the twelve selected passages were subjected to the initial IRT calibration. In addition, all anchor passages were included, whether or not they had been selected. We fit the 3PLM and estimated the a, b, and c parameters for each item.

Final item selection. Based on the initial IRT results, we selected the best 7 of the 10 items tested for each passage. Within each passage, we first listed items with poor item fit. For this purpose, empirical and IRT-based IRFs were plotted for each item. First, ability levels were formed by dividing the entire sample of students into six groups of approximately equal size by ability estimate (i.e., estimated θ score). All students were ordered by their ability scores; the top 17% were put in the first level, the second 17% in the second level, and so forth. Next, the sample proportion of correct responses and a 95% confidence interval were computed at each ability level and plotted against the mean θ value of that level (an empirical IRF). The corresponding IRT-based IRF was also plotted. We inspected each plot with respect to the following criteria: visual inspection, the goodness-of-fit chi-square statistic, the mean absolute difference (MAD), and balancing the number of cognitive targets.
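A sketch of the binning step behind the empirical IRFs, assuming estimated θ scores and 0/1 item scores are available as arrays; the normal-approximation confidence interval is our simplification of whatever interval the project's plotting software produced.

```python
import numpy as np

def empirical_irf(theta_hat, item_correct, n_levels=6):
    """Split students into n_levels ability groups of roughly equal size and
    return (mean theta, proportion correct, 95% CI lower, 95% CI upper)
    for each group -- the points of an empirical IRF."""
    theta_hat = np.asarray(theta_hat, float)
    item_correct = np.asarray(item_correct, float)
    order = np.argsort(theta_hat)
    points = []
    for idx in np.array_split(order, n_levels):
        p = item_correct[idx].mean()
        se = np.sqrt(p * (1 - p) / len(idx))
        points.append((theta_hat[idx].mean(), p, p - 1.96 * se, p + 1.96 * se))
    return points
```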

Visual inspection. Each pair of empirical and IRT-based IRFs was visually inspected for discrepancies and systematic deviations. If the 95% confidence intervals of the empirical IRF covered the corresponding IRT-based IRF, the fit at that point was considered acceptable.

Goodness-of-fit chi-square statistic. Yen’s (1981) Q1 statistic divided by its degrees of freedom was used as a chi-square item fit statistic. Larger values indicate poorer item fit.

Mean absolute difference (MAD). The MAD indicates how far apart the empirical and IRT-based IRFs are, on average, over all ability levels. Larger values indicate poorer item fit. No absolute thresholds were used with the above indices to determine poor item fit. Instead, items with relatively poor fit within each passage were flagged for deletion, apart from a few cases in which item fit was clearly unacceptable.
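Both fit indices can be computed from the binned proportions. The sketch below assumes the per-level observed proportions, model-implied probabilities, and group sizes are already in hand, and treats the degrees of freedom as the number of levels minus the three estimated item parameters (our assumption about the convention used).

```python
import numpy as np

def yen_q1_per_df(obs_p, model_p, counts, n_params=3):
    """Yen's Q1 chi-square item-fit statistic divided by its degrees of
    freedom; obs_p and model_p are proportions correct by ability level."""
    obs_p, model_p, counts = (np.asarray(x, float) for x in (obs_p, model_p, counts))
    q1 = np.sum(counts * (obs_p - model_p) ** 2 / (model_p * (1 - model_p)))
    return q1 / (len(obs_p) - n_params)

def mad(obs_p, model_p):
    """Mean absolute difference between empirical and model-based IRFs."""
    return float(np.mean(np.abs(np.asarray(obs_p) - np.asarray(model_p))))
```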

Balancing the number of cognitive targets. In addition to item fit, we considered balancing the cognitive targets (locate/recall, integrate/interpret, or critique/evaluate) within each passage; we required that each passage include at least one item for each cognitive target. If dropping a poor-fitting item led to the elimination of a cognitive target in the passage, then that item was retained and another item was sought for exclusion. If there was more than one candidate for exclusion, we excluded the item with lower discrimination (i.e., a lower a value). After dropping those items, we checked whether the proportions of cognitive targets in the whole test approximated those defined in the 2009 NAEP Reading Framework specification, and replaced items where necessary. We also checked to ensure that we had a good subset of items that measured vocabulary (typically one item per passage). The final set of items for the motivation assessment consisted of 84 items in total, seven items for each of the 12 passages.

Final IRT Calibration

The resulting 84 items (plus items in non-selected anchor passages, if any) were subjected to the final IRT calibration. This calibration produced item parameter estimates, from which plots of item response functions and item information functions were created. Item fit was also examined. In order to evaluate the performance of the final set of items, we calculated information functions at the item, passage, and test levels (IIF, PIF, and TIF, respectively) as well as passage and test characteristic functions (PCF and TCF, respectively). In the motivation study (phase two), a “test” will consist of four passages, two expository and two literary. Ability scores will be estimated separately for these tests. Thus, the average TIF for the calibration study was computed from all combinations of two passages (the number of ways to choose two passages out of six is 15) for each text type. The average standard error of measurement (SEM; the reciprocal of the square root of the average TIF) and the average TCF were also computed in the same manner (Authors, XXXXd; Kato, Moen, & Thurlow, 2007).
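A sketch of the averaging just described: for each text type, test information is summed over the items of every two-passage combination, averaged across the 15 combinations, and inverted through a square root to give the SEM. The 3PL information formula is standard, but the list-of-(a, b, c)-tuples data layout is our own representation, not the project's code.

```python
import numpy as np
from itertools import combinations

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (D absorbed into a)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    P = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

def average_tif_and_sem(theta, passages):
    """Average test information and SEM over all two-passage tests formed
    from `passages`, each given as a list of (a, b, c) tuples."""
    tifs = [sum(item_information(theta, *itm) for itm in p1 + p2)
            for p1, p2 in combinations(passages, 2)]   # C(6, 2) = 15 pairs
    avg_tif = np.mean(tifs, axis=0)
    return avg_tif, 1.0 / np.sqrt(avg_tif)             # SEM = 1 / sqrt(TIF)
```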

FURTHER ANALYSES AND RESULTS

After the calibration process was completed, we undertook a second analysis to address research question #2: Which engaging passages do students prefer to read? Which passages do they consider to be challenging? First, we analyzed students’ responses to the situated motivation questions in their test booklets, where they rated their interest in passage topics and indicated how challenging they found the passages after they read them and completed the items. At the 4th grade level, results indicated that, out of the 20 passages, 8 expository passages were rated higher than any of the literary passages by all students. The passages students rated highly were from sources such as Ranger Rick, Cricket, and Highlights Magazine, and some were excerpts from well-known children’s literature written by authors such as Dorothy Hinshaw Patent. At the 8th grade level, results indicated that, out of the 20 passages, 4 of the expository passages were rated higher than any of the literary passages by all students; overall, the highest ranked passages were expository. Some passages viewed as more challenging were also rated as most interesting, but no clear pattern was established between interest and perceived difficulty.

In the third part of the analysis we sought to address research question #3: How do students’ interests and their ratings of difficulty of passages impact their reading performance?

The purpose of this additional analysis was to examine whether students’ motivation (interest) affected their reading performance. Our hypothesis was that motivation is positively correlated with reading performance, especially for low-performing students. To test this hypothesis, the entire sample of students was divided into two groups based on their IRT ability estimates (i.e., theta scores). The low ability group consisted of students with theta less than 0, and the high ability group consisted of students with theta greater than 0. We created a variable “Low” to denote these groups: Low = 0 for the high ability group and Low = 1 for the low ability group. Linear regression analysis was conducted to test the hypothesis. The following regression model was fitted for each passage: (Passage Score) = b0 + b1(Interest) + b2(Low) + b3(Interest x Low) + e, where b0 through b3 are regression coefficients, Passage Score is the percent-correct score for the passage (ranging from 0 to 1), Interest is the interest rating (ranging from 1 to 4, with 4 being very interesting), and the term (Interest x Low) denotes the interaction between the interest rating and ability group. As already noted, Low is a variable that takes the value 0 or 1. Thus, for the high ability group (Low = 0), terms that include Low vanish and the above equation reduces to (Passage Score) = b0 + b1(Interest) + e, where coefficient b1 represents the effect of interest on the passage score for the high ability students. For the low ability group (Low = 1), the above equation becomes (Passage Score) = (b0 + b2) + (b1 + b3)(Interest) + e. In this case, the effect of interest on the passage score for the low ability group is (b1 + b3). Our hypothesis can be tested by examining the significance of b3 (i.e., the coefficient for the interaction term). If b3 = 0, then there is no difference in the effect of interest on the passage score between the low and high ability groups.
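The interaction model can be fit passage by passage with ordinary least squares; a sketch using statsmodels on a small, entirely made-up data frame (the real analysis used the field-test records) follows.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical rows: one student's percent-correct score on a passage,
# interest rating (1-4), and ability group (Low = 1 if theta < 0).
df = pd.DataFrame({
    "score":    [0.71, 0.43, 0.86, 0.29, 0.57, 0.50, 0.93, 0.36],
    "interest": [3, 2, 4, 1, 3, 2, 4, 1],
    "low":      [0, 1, 0, 1, 0, 1, 0, 1],
})

# score = b0 + b1*interest + b2*low + b3*(interest x low) + e;
# the hypothesis of interest concerns b3, the interaction coefficient.
fit = smf.ols("score ~ interest + low + interest:low", data=df).fit()
print(fit.params, fit.pvalues["interest:low"], sep="\n")
```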

Results indicated that for the lowest scoring 4th grade students (e.g., those scoring within the lowest quartile on the test), passage interest was positively correlated at the .05 level with performance on the test. This included expository and literary passages in equal proportion (7 of the 10 expository and 7 of the 10 literary passages showed a positive correlation). In addition, overall results for the lowest scoring 8th grade students (e.g., those scoring within the lowest quartile on the test) indicated that passage interest was correlated at the .05 level with performance on about half of the passages. This also included expository and literary passages in equal proportion (6 of the 10 expository and 6 of the 10 literary). In Figure 1 we provide two plots that show how interest for 4th and 8th graders is correlated with performance on the test.

We also examined gender and interest and how these factors impacted students’ performance. At the 4th grade level there was some correlation overall for females with respect to interest in each of the genre categories; for males there was some correlation for expository texts but little overall correlation. At the 8th grade level there was some correlation for males with respect to interest—more so with literary text, but for females there was no overall correlation between interest and performance.

SUMMARY AND IMPLICATIONS

The goal of this research was to demonstrate how large-scale assessments of reading proficiency can become more accessible and valid for all students, including the full range of students with disabilities, while also meeting the assessment requirements of the No Child Left Behind Act of 2001 (NCLB): (a) provision of a valid measure of proficiency against academic standards, and (b) provision of individual interpretive, descriptive, and diagnostic reports. Instead of creating an assessment from discarded passages and items no longer used on other large-scale assessments, we chose to create a reading assessment from the ground up, allowing us to address issues of motivation and engagement as part of test development. We successfully calibrated highly interesting passages and meaningful items using the NAEP 2009 cognitive targets.

Results indicated that students overwhelmingly embraced the highly interesting passage topics, the color pictures, and the format of the test. As other researchers have found (Mohr, 2006), our results indicated that expository texts containing interesting topics were of high interest to readers at both the 4th and 8th grade levels. In particular, 4th graders were drawn to informational texts about animals; 8th graders were drawn to expository texts on unusual and sometimes gruesome topics; and students at both grade levels enjoyed literary texts about young people their own age working through daily challenges and life issues. We also determined that passage interest is positively correlated with reading performance, especially for low-performing students at the 4th grade level and, to some extent, for students at grade 8. These results are similar to those reported by Guthrie and Wigfield (2005). We also observed students at all levels, including those with disabilities, persevering and doing well on the assessment.

Students who struggle with reading have likely had many unsuccessful attempts at literacy tasks and, in turn, are apt to avoid tasks at which they feel success is improbable (Bandura, 1993; Casteel, Isom, & Jordan, 2000; Jinks & Morgan, 1999; Margolis & McCabe, 2004). This research will contribute to the ultimate goal of making large-scale assessments more “universally designed,” that is, designed from the beginning to be accessible and valid for the widest range of students, including students with disabilities who lack self-efficacy. The second phase of this research, in which we will use the calibrated passages and items to offer students choice in what they read on large-scale reading assessments, will further test our hypothesis that using constructs of motivation when designing assessments can result in more valid measures of students’ abilities.

Table 1. Top Rated Passages for Grade 4 Based on Interest and Challenge

ID | Title | Text Type (E = Expository; L = Literary) | Difficulty Level (E = Easy; M = Medium; H = Hard) | Anchor Passage | Short Description of the Passage
13 | What is a Bear? | E | E | Yes | Types of bears and characteristics
2 | Bubble Power | E | H | Yes | How animals use bubbles to protect themselves
19 | What in The World is a Wolverine? | E | M | No | What wolverines are like and where they live
23 | Giants of The North | E | M | Yes | Facts about walruses
20 | Rice Balls | E | H | No | How to make rice balls from various ingredients
31 | A Really Big Family | E | E | No | Information about sperm whales
8 | Susie, Fudge, and The Big Race | L | H | No | Training for and participating in a dog sled race
26 | Kirby Drogan and the Two Giants | L | H | No | Three dragon brothers fight to protect a castle
24 | From Sea to Shining Sea | L | M | No | A family’s trip is almost delayed due to car trouble
17 | Grandma's Quilt | L | M | Yes | A grandma and her granddaughter help homeless people
21 | One-eyed Willie | L | E | Yes | A boy’s mom and his classmates help him cope with an eye problem
28 | Scaredy-Cat Abby | L | H | Yes | A babysitter overcomes her own fears to provide good care for younger kids

Table 2. Top Rated Passages for Grade 8 Based on Interest and Challenge

ID | Title | Text Type (E = Expository; L = Literary) | Difficulty Level (E = Easy; M = Medium; H = Hard) | Anchor Passage | Short Description of the Passage
52 | Boggs Bills | E | E | No | A man creates realistic drawings of money and what he does with it
30 | Dragon Spit | E | M | Yes | A description of the spit of Komodo dragons
53 | Liar Liar | E | E | Yes | Methods that tell you if someone is lying
4 | The Most Gruesome Fish in the World (pp. 35-39) | E | E | No | Information about the hagfish
54 | Sea Otter Rescue | E | H | No | Rescuing otters after an oil slick
9 | Body in the Bog | E | H | Yes | The mystery of the Tollund man found in a bog
34 | Snacks | L | E | Yes | A boy is challenged to create a snack that makes his teacher “think”
16 | Market Square Dog | L | E | No | An abandoned, hurt dog finds a family
48 | A Genius for Sauntering | L | M | Yes | A teenager’s appreciation of a boy walking in a pair of jeans
10 | Wolf Shadows (ch. 1, pp. 1-7) | L | E | No | Two boys return from deer hunting and face danger on the trip home
6 | Soledad f/The Circuit (pp. 8-11) | L | H | No | A story from the life of a migrant child
47 | Gullibility | L | H | No | A teenager enjoys telling tales to explain events in her life

Figure 1. The Relation Between Interest, Passage Type, and Overall Performance for High and Low Ability Groups for Grade 4 and Grade 8

Figure 1a

Figure 1b

Note for Slope Plots. The two columns are labeled "b1" and "b1 + b3" for the High and Low groups, respectively. For each plot, the horizontal axis is the slope for the High group (b1) and the vertical axis is the slope for the Low group (b1 + b3). The number at each point is the passage ID, or the point may be labeled "E" (average expository), "L" (average literary), or "A" (average for all passages). Passage type is distinguished by the shape of the points (square = expository, circle = literary, triangle = all passages). Each plot has a diagonal line: if a point is above this line, the Low group has a higher slope than the High group (although a point must also be above zero on the vertical axis to say that interest affects the passage score more positively for the Low group than for the High group). If a point is on this line, the slope is the same for the High and Low groups.

REFERENCES

Authors. (XXXXa). The role of motivation in engaged reading of adolescents. In K. Hinchman & H. Sheridan-Thomas (Eds.), Best practices in adolescent literacy instruction (pp. 78-96). New York, NY: Guilford.

Authors. (XXXXb). National Accessible Reading Assessment Project. Grant funded as part of the Institute of Education Sciences, National Center for Special Education. Grant Number H324F040002. Funded October 2004 through 2010.

Authors. (XXXXc). The role of “visual elements” in creating engaging reading assessments: Arguments and criteria. This technical paper reports the rationale for, and processes used, to select appropriate visual elements for use on a large-scale reading assessment. University of XXXX.

Authors. (XXXXd). The role of engagement in assessing reading comprehension: The role of challenging and interesting passages. This paper reports the technical aspects and resulting statistics from a large-scale calibration study with over 1,200 students from grades 3-9. University of XXXX.

Baker, F. B. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation. Retrieved May 14, 2002, from http://ericae.net/irt/baker/

Baldwin, P. (2006, April). A modified IRT model intended to improve parameter estimates under small sample conditions. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Bandura, A. (1986). Social foundations of thought and action: A social-cognitive theory. Englewood Cliffs, NJ: Prentice Hall.

Bandura, A. (1993). Perceived self-efficacy in cognitive development and functioning. Educational Psychologist, 28(2), 117-148.

Barnes, L. L. B., & Wise, S. L. (1991). The utility of a modified one-parameter IRT model with small samples. Applied Measurement in Education, 4, 143-157.

Brookshier, J., Scharff, L. F. V., & Moses, L. E. (2002). The influence of illustrations on children’s book preferences and comprehension. Reading Psychology, 23, 323-339.

Casteel, C. P., Isom, B. A., & Jordan, K. F. (2000). Creating confident and competent readers: Transactional strategies instruction. Intervention in School and Clinic, 36(2), 67-74.

Chall, J., & Squire, J. R. (1996). The publishing industry and textbooks. In R. Barr, M. L. Kamil, P. Mosenthal, & P. D. Pearson (Eds.), Handbook of reading research (Vol. II, pp. 120-146). New York: Longman.

deSousa, I., & Oakhill, J. (1996). Do levels of interest have an effect on children’s comprehension monitoring performance? British Journal of Educational Psychology, 66, 471-482.

Fry, E. B. (1977). Fry’s readability graph: Clarifications, validity, and extensions to level 17. Journal of Reading, 21, 242-252.

Guthrie, J.T., & Wigfield, A. (2000). Engagement and motivation in reading. In M.L. Kamil, P.B. Mosenthal, P.D. Pearson, & R. Barr (Eds.), Handbook of reading research: Volume III (pp. 403-422). New York: Erlbaum.

Guthrie, J. T., Wigfield, A., & Perencevich, K. C. (Eds.). (2004). Motivating reading comprehension: Concept-Oriented Reading Instruction. Mahwah, NJ: Erlbaum.

Guthrie, J. T., Wigfield, A., Barbosa, P., Perencevich, K. C., Taboada, A., Davis, M. H., Scafiddi, N. T., & Tonks, S. (2004). Increasing reading comprehension and engagement through Concept-Oriented Reading Instruction. Journal of Educational Psychology, 96(3), 403-423.

Guthrie, J.T., & Wigfield, A. (2005). Roles of motivation and engagement in reading comprehension assessment. In S.G. Paris, & S.A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 187-213). Mahwah, NJ: Lawrence Erlbaum Associates.

Hanson, B. A. (2002). IRT command language. Retrieved November 21, 2007, from http://www.b-a-h.com/software/irt/icl/icl_manual.pdf

Jinks, J., & Morgan, V. (1999). Children’s perceived academic self-efficacy: An inventory scale. The Clearing House, 72(4), 224-230.

Kato, K., Moen, R., & Thurlow, M. (2007). Examining DIF, DDF, and omit rate by discrete disability categories. Minneapolis, MN: University of Minnesota, Partnership for Accessible Reading Assessment.

Levie, W. H., & Lentz, R. (1982). Effects of text illustrations: A review of research. Educational Communication and Technology, 30, 195-232.

Lexile Framework for Reading. Lexile Framework, MetaMetrics. Retrieved February 10, 2009, from http://www.lexile.com/DesktopDefault.aspx?view=re&tabindex=2&tabid=31&tabpageid=358

Margolis, H., & McCabe, P. P. (2004). Self-efficacy: A key to improving the motivation of struggling learners. The Clearing House, 77(6), 241-249.

Mohr, K. (2006). Children’s choices for recreational reading: A three-part investigation of selection preferences, rationales, and process. Journal of Literacy Research, 38, 81-104.

National Assessment Governing Board. (2005). Specifications for the 2009 NAEP reading assessment. Washington, DC: Author.  http://www.nagb.org/pubs/reading_fw_08_05_prepub_edition.doc

No Child Left Behind Act of 2001, 20 U.S.C. 6301 et seq. (2002). (PL 107-110).

Oakland, T., & Lane, H. B. (2004). Language, reading, and readability formulas: Implications for developing and adapting tests. International Journal of Testing, 4(3), 239-252.

Pajares, F. (1996). Self-efficacy beliefs in academic settings. Review of Educational Research, 66(4), 543-578.

Pajares, F. (2002). Gender and perceived self-efficacy in self-regulated learning. Theory Into Practice, 41(2), 116-125.

Purdie, N., & Hattie, J. (1996). Cultural differences in the use of strategies for self-regulated learning. American Educational Research Journal, 33(4), 845-871.

Rose, T. L. (1986). Effects of illustrations on reading comprehension of learning disabled students. Journal of Learning Disabilities, 19, 542-544.

Schiefele, U. (1999). Interest and learning from text. Scientific Studies of Reading, 3, 257-279.

Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.

Swaminathan, H., Hambleton, R. K., Sireci, S. G., Xing, D., & Rizavi, S. M. (2003). Small sample estimation in dichotomous item response models: Effects of priors based on judgmental information on the accuracy of item parameter estimates. Applied Psychological Measurement, 27, 27-51.

Thompson, S. J., Johnstone, C., & Thurlow, M. L. (2002). Universal design applied to large scale assessments (Synthesis Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

U.S. Department of Education (2006). Individuals with Disabilities Education Act (IDEA) database. Author: Office of Special Education and Rehabilitative Services.

Waller, R. (1996). Typology and discourse. In R. Barr, M. L. Kamil, P. Mosenthal, & P. D. Pearson. (Eds.), Handbook of reading research, (Vol. II, pp. 341-380). New York: Longman.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.

Zeng, L. (1997). Implementation of marginal Bayesian estimation with four-parameter beta prior distributions. Applied Psychological Measurement, 21, 143-156.
