Student Think Aloud Reflections on Comprehensible and Readable Assessment Items:
Perspectives on What Does and Does Not Make an Item Readable

Technical Report 48

Christopher Johnstone, Kristi Liu, Jason Altman, Martha Thurlow

September 2007

All rights reserved. Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Johnstone, C., Liu, K., Altman, J., & Thurlow, M. (2007). Student think aloud reflections on comprehensible and readable assessment items: Perspectives on what does and does not make an item readable (Technical Report 48). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.


Table of Contents

Executive Summary
Introduction
Methods
Results
Discussion
References


Executive Summary

This document reports on research related to large-scale assessments for students with learning disabilities in the area of reading. As part of a process of making assessments more universally designed, we examined the role of "readable and comprehensible" test items (Thompson, Johnstone, & Thurlow, 2002). In this research, we used think aloud methods to better understand how interventions to improve readability affected student performance. Decreasing word counts in items and printing important words in bold did not seem to make any difference in student achievement (although students preferred that important words be printed in bold). Vocabulary, however, was a notable factor. Non-construct vocabulary in both the stem and the answer choices of items caused difficulty for students, as did words with negative prefixes (e.g., "dis-"). The implications of this research are that readability is correlated with vocabulary (see RAND Reading Study Group, 2002) and that construct and non-construct vocabulary must be clearly defined in order to increase the accessibility of assessments.


Introduction

Recent changes in Federal legislation (including the No Child Left Behind Act of 2001) have placed greater emphasis on large-scale assessment data as a measure of student learning. Because of this focus, schools and school districts are accountable for ensuring student success on assessments. According to federal guidelines, all students must succeed on state-mandated assessments of learning, including students with disabilities.

In order to ensure that all students, including students with disabilities, are taking assessments that are accessible to a wide variety of student needs, Thompson, Johnstone, and Thurlow (2002) recommended that a "universal design" approach be used when designing assessments. Universal Design of Assessment (UDA) is broadly defined as assessments that are "designed and developed from the beginning to allow participation of the widest possible range of students, and to result in valid inferences about performance for all students who participate in the assessment" (Thompson et al., 2002, p. 5).

The term universal design was originally used as an architectural concept (Center for Universal Design, n.d.). Philosophies of access and barrier removal were the foundations for early universal design approaches. As part of this design philosophy, ramps, elevators, expanded doorways, signs, bathrooms, and other features do not have to be added or modified at additional expense after the completion of a building. Rather, as part of universal design, these features are sketched into structures’ blueprints from the beginning. The promise of universal design is that some of the same architectural features that accommodate people with disabilities also benefit many others, including senior citizens, families with young children, and delivery people.

Thompson et al.’s (2002) definition was the first to apply universal design to general education assessments. The authors further explained UDA by defining seven "elements" of universal design: (1) inclusive assessment population; (2) precisely defined constructs; (3) accessible non-biased items; (4) amenable to accommodations; (5) simple, clear, and intuitive instructions and procedures; (6) maximum readability and comprehensibility; and (7) maximum legibility.

This study builds on a series of previous studies conducted by the National Center on Educational Outcomes. Beginning with Thompson et al. in 2002, studies by Johnstone (2003), Johnstone, Thompson, Moen, Bolt, and Kato (2005), and Johnstone, Bottsford-Miller, and Thompson (2006) have all sought to operationalize the term "universal design" as it applies to assessments and to understand the practical implications of universal design processes. One common theme across these studies was that language was a particularly important characteristic of items that were and were not accessible.

This study focused specifically on maximum readability and comprehensibility of items in reading assessments. The study was particularly salient to ongoing research on the accessibility of general education reading assessments (see http://www.narap.info) because it sought to better understand what particular characteristics make a test item readable and accessible.

Readable and comprehensible language was also selected as the focus for this study because it is a somewhat controversial concept in the field of measurement. Tampering with how items are written in order to increase readability has recently been challenged because of the potential effects that language changes may have on the intended constructs of items (Gong & Marion, 2003). In addition, language changes may make test items less authentic and reduce their similarity to real-world applications (Kahl, 2006).

Much of the research on the need for readable and comprehensible language as an aspect of universal design comes from literature on students who are English language learners. In some cases, readable text is operationalized as language that is concise and efficient, allowing readers to quickly access test items. For example, Abedi, Hofstetter, Baker, and Lord (2001) found that English language learners often have higher proficiency in mathematics than is reflected on large-scale mathematics tests, and recommended that language be clarified to reflect essential constructs. Likewise, Kopriva, Winter, Emick, and Chen (2006) noted that there are both target and non-target language skills required for assessment items and that by clarifying these language needs it is easier to create more concise language around relevant constructs. Once construct-relevant language is identified, the process of winnowing away superfluous language may improve assessment results and, more importantly, assessment validity.

Readable and comprehensible language can be quantified on readability scales. Shaftel, Belton-Kocher, Glasnapp, and Poggio (2006) applied formulae to predict how readable particular items were on a test. The authors found that students who typically struggle with reading the English language (English language learners and students with disabilities) may be at risk for items containing pronouns, complex verbs, and irrelevant language. In such cases, determining what students do and do not know is complicated by item language, which at times may not allow English language learners and students with disabilities to demonstrate their true capabilities.
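
For readers unfamiliar with how such readability formulae work, the sketch below (in Python) computes the Flesch-Kincaid grade level, one widely used readability index. It is offered only as an illustration: it is not necessarily one of the formulae applied by Shaftel et al., and the syllable counter is a rough heuristic.

    import re

    def count_syllables(word):
        # Rough heuristic: count groups of consecutive vowels (at least one per word).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        # Flesch-Kincaid grade level:
        # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * (len(words) / len(sentences))
                + 11.8 * (syllables / len(words)) - 15.59)

    # Example: compare an original and a plain-language rewording of an item stem.
    standard = "The author's conclusion was a somber reminder of what?"
    modified = "The ending of the book was a sad reminder of what?"
    print(round(flesch_kincaid_grade(standard), 1))
    print(round(flesch_kincaid_grade(modified), 1))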

As early as 1987, Rakow and Gee outlined a series of considerations for making test items more readable and comprehensible. These included:

  • Students are likely to have the experiences and prior knowledge necessary to understand the item

  • Vocabulary is appropriate for the intended grade level

  • Sentence complexity is appropriate for the intended grade level

  • Definitions and examples are clear and understandable

  • Required reasoning skills are appropriate for students' cognitive level

  • Relationships are made clear through precise, logical connectives

  • Content within items is clearly organized

  • Graphs, illustrations, and other graphic aids facilitate comprehension

  • Questions are clearly framed

The RAND Reading Study Group (2002) also stated that difficulty of text may vary due to "complex Boolean expressions." Such expressions are challenging because "the respondent needs to keep track of different options and possibilities" (p. 96). Negative expressions, in particular, may add an unnecessarily high cognitive load to an item (e.g., "Which of the following is not a reason why the captain wanted to turn her ship around?").

The RAND Reading Study Group also found a strong correlation between the level of vocabulary on items and their readability. To address the danger of unnecessarily inflating the difficulty level of items, Haladyna and Downing (2004) suggested removing nuisance variables (such as challenging vocabulary that is not relevant to the construct tested) as a way of improving the validity of tests. Reduction in the level of non-construct vocabulary can be coupled with reducing the density of text (i.e., text that does not "pack too many idea units" or propositions into a single clause) (RAND Reading Study Group, 2002, p. 96) to create readable, comprehensible test items.

In summary, a variety of sources provide information on how to operationalize "readable and comprehensible" text. The purpose of this research was to take the cumulative information from these sources to create readable and comprehensible text in reading items and then to examine their impact on students with learning disabilities. In order to determine the qualitative impact of such interventions, we chose to use think aloud methods.

According to Ericsson and Simon (1994), think aloud verbalizations (where students say all their thoughts aloud while they are approaching an item) are an important data source for understanding cognitive activities. Think aloud data are rich because cognitive processes pass through short-term memory and can be verbalized at the time they are processed. Thus, think aloud methods provide researchers with qualitative information about the reasons why students may struggle on test items (Leighton, 2004). Such data are informative for purposes of addressing universal design issues in large-scale assessment items (Johnstone et al., 2006).


Methods

Overview

In this study, released National Assessment of Educational Progress (NAEP) items were used as reference or "standard" items from which changes were made to create items that fit our definition of readable and comprehensible. Specifically, Grade 8 passages and their corresponding items were examined. Researchers at the National Center for the Improvement of Educational Assessment (NCIEA) developed all protocols. Each protocol contained a passage (one informational and one literary passage were used for this study), followed by a series of items. For each passage, standard NAEP released items and modified items designed to be readable and comprehensible were interspersed with one another. Items from the standard set of NAEP released items were reviewed by NCIEA researchers for the following characteristics.

  • Use of pronouns. A sentence with excessive pronoun use may cause a student to lose track of the main point of reference in an item. For example, "Jake left his house to go to visit his grandmother at her house. What was the first thing she said?"

  • Use of negatives. Negatives within an item may add an unnecessarily high cognitive load.

  • Vocabulary. There is strong evidence of an inverse relationship between the vocabulary level of text and its readability (RAND Reading Study Group, 2002). For the purposes of this study, we were not concerned with vocabulary in passages, but with the vocabulary level within items. Unless the construct of an item is to test vocabulary skills, we attempted to reduce vocabulary demands as a principle of access. For example, "The author's conclusion was a somber reminder of what?" might be rephrased as "The ending of the book was a sad reminder of what?"

  • Non-construct subject area language (specialized vocabulary). English language arts has a specialized vocabulary of its own (e.g., characterization, denouement, prose, iambic pentameter, etc.). When these words are part of the intended construct, it is appropriate to include them in items. When these terms are extraneous to the intended construct, they may introduce "nuisance variables" (Haladyna & Downing, 2004). We attempted to remove any non-construct subject area language from items.

  • Complex sentences, dense text. The RAND Reading Study Group (2002) defined sentences with "dense clauses" as sentences that "pack too many constituents or idea units (i.e., propositions) within a single clause" (p. 96). Likewise, the RAND Group defined sentences with "dense noun phrases" as sentences with "too many adjectives and adverbs modifying the head noun" (p. 96). Items with either one of the above sentence types were rewritten for clarity.

When items contained such features, they were modified to reflect language that was more readable and comprehensible.

Sample

This study took place in the summer of 2007. Eight students participated in the study. All eight had recently completed eighth grade. All of the students had diagnosed specific learning disabilities in the area of reading and were participants in a summer reading enrichment program, although they did not all attend the same home school. All received services for their disability in public or private schools and were familiar with large-scale assessments. Because the students paid tuition for the private reading enrichment program, it is likely that they were all from middle or high socioeconomic status (SES) backgrounds, although we did not collect SES data. None of the students, however, reported familiarity with the NAEP reading assessment (from which the sample items were derived).

Among the participants, five students were male and three were female. One student was African American, one was Asian American, and one was Hispanic. None was an English language learner.

Procedures

For this study, we followed a standard protocol for each student. First, each student participated in a practice activity in order to better understand how think aloud activities worked. Some of the students had experience with think aloud activities because their teachers had used them in reading and mathematics instruction. None of the students had serious difficulty with the process; when a student did pause or struggle to keep verbalizing, he or she was reminded to "keep talking" (Ericsson & Simon, 1994).

Each student read either a literary or an informational text passage. Students were randomly assigned to passage types (informational or literary) and "forms" (each form contained both standard and modified items, but the ordering of items differed across forms). The forms were designed so that every other question was standard or modified (e.g., on Form A, item 1 was a standard item, item 2 was a modified item, and so on; on Form B, item 1 was a modified item, item 2 was a standard item, and so on). Thus, the forms contained "matched pairs" of items written in standard and modified formats. As a result, every student was likely to answer both standard and modified items, but each student read only one passage type (informational or literary).
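
As an illustration of this counterbalanced design, the sketch below (in Python, with hypothetical item labels) shows how the matched pairs alternate between standard and modified versions across the two forms.

    # Hypothetical matched pairs for one passage; labels are illustrative only.
    item_pairs = ["item1", "item2", "item3", "item4", "item5", "item6"]

    # Form A: odd-numbered items standard, even-numbered items modified.
    form_a = [(item, "standard" if i % 2 == 0 else "modified")
              for i, item in enumerate(item_pairs)]

    # Form B: the reverse, so every item appears in both versions across forms,
    # but any one student sees only one version of each item.
    form_b = [(item, "modified" if i % 2 == 0 else "standard")
              for i, item in enumerate(item_pairs)]

    print(form_a)
    print(form_b)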

In this study, students read a passage either silently or aloud (students were given the choice of how they would like to read). Of the eight participants, only one chose to read the passage aloud. Once students completed reading the passage, they were asked to answer items one at a time while thinking aloud. Students were asked to read all items aloud while answering them. The interaction between researcher and student then took place in two phases.

In phase 1, the student engaged in a think aloud while the interviewer used only passive prompts to encourage the child to think out loud. During this phase, students read each item aloud and described their problem solving processes while attempting to correctly answer the item. In phase 2, the interviewer asked the child specific questions to probe his or her cognitive processes. Questions that were asked of the students included:

  • How did you get that answer?

  • What makes you believe that answer is the right one?

  • Was there anything that seemed tricky about this question?

  • Was there anything that confused you about this question?

  • Were there any words in this question that you did not know?

  • Could we do anything, change the item in any way, to make it clearer to you?

  • Did you think this passage was easy or hard to read?

  • Were there any words you did not understand?

  • Was any part of it confusing to you?

  • Could you find the answers to the questions in the passage?

  • Other questions related to content of passage or item.

Researchers praised students throughout the process for thinking out loud and verbalizing, but did not provide any information to students about whether particular answers were correct or incorrect. The purpose of this praise was to encourage students to continue verbalizing, because verbalizations were the primary data source for this study.

Each think aloud activity took approximately one hour to complete. Students were videotaped from two angles to capture their verbalizations as well as any pointing, circling, or other physical acts in which they engaged during the think aloud session.

Analysis

Data were transcribed from videotapes and coded using a standard coding form. This form asked reviewers to note the following information:

  • Student identification number

  • Whether it was a standard or modified question

  • Student comments relevant to the item

  • Whether the student answered the question correctly or incorrectly

  • If the student answered the question incorrectly, an analysis of her or his error

  • Data supporting conclusions made from the error analysis

  • Any advice the student had for test makers

Once coding was completed, secondary analyses were conducted to (a) determine the number of correct and incorrect answers for individual items, (b) make comparisons of matched pairs of standard and modified items, and (c) find qualitative patterns in errors made by students.

The small sample of students who participated in this research did not allow for tests of statistical significance. Thus, descriptive statistical information was compared with qualitative information for each item.
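
A minimal sketch of these secondary analyses is shown below (in Python). The record structure and field names are hypothetical; the sketch simply illustrates how percent correct by item version and by matched pair can be tallied from the coded data.

    from collections import defaultdict

    # Hypothetical coded records: one per item attempt.
    records = [
        {"student": 1, "pair": 1, "version": "standard", "correct": True},
        {"student": 2, "pair": 1, "version": "modified", "correct": False},
        # ... remaining records from the coding forms
    ]

    # (a) Percent correct by item version (standard vs. modified).
    totals = defaultdict(int)
    corrects = defaultdict(int)
    for r in records:
        totals[r["version"]] += 1
        corrects[r["version"]] += int(r["correct"])
    for version in totals:
        print(version, round(100 * corrects[version] / totals[version]), "% correct")

    # (b) Matched-pair comparison: correct/incorrect responses per pair and version.
    pair_results = defaultdict(lambda: defaultdict(list))
    for r in records:
        pair_results[r["pair"]][r["version"]].append(r["correct"])
    print(dict(pair_results))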


Results

Not all students answered all items in this research. This was because students were randomly assigned to either informational or literary passages, and further randomly assigned to "standard" (odd numbered items that students completed were standard and even numbered items were modified) or "modified" (odd numbered items students completed were modified and even numbered items were standard) conditions. Furthermore, we stopped all think aloud sessions after one hour to prevent student fatigue; thus, some items at the end of protocols were not answered at all. Each item pair (its standard and modified versions) is reported below, along with quantitative information on correct/incorrect responses and qualitative information supplied by students. After all items are reported, overall information on the number of correctly and incorrectly answered items by type (standard vs. modified) is reported.

Overall Results

Among students who answered both standard and modified questions, only one student answered more standard questions correctly than modified questions. Overall, students answered 46% of standard items correctly and 72% of modified items correctly. Individual results ranged from one student who answered three of four standard items (75%) but only one of three modified items (33%) correctly, to one student who answered only one of four standard items (25%) but three of three modified items (100%) correctly. Although sample sizes were small, the general pattern was that students achieved better results on modified items than on standard items. Table 1 presents overall results for standard and modified items.

Table 1. Overall Results for Standard and Modified Items

Item Type     % Correct Overall     Range of % Correct by Participant
Standard      46                    25%–75%
Modified      72                    50%–100%

Item Level Results

Grade 8, Matched Pair 1 (Literary Passage)

For the first item, the standard version asked students to "Use the dictionary entry below to answer the question, what meaning of fast is used in line 4?" The modified version simply asked "Which meaning of fast is used in line 4?" The modification therefore involved reducing the text load of the item.

Both the standard and modified versions of the item were answered correctly 50% of the time (one of two students answered each version correctly). Two of the four students looked back at the passage to examine the vocabulary word in question. One student who looked back at the text (and had the modified item) answered correctly, while another student (with the standard item) struggled with the entire item. Although she looked back, she said she was "confused" and wanted to go on to the next item. Students who answered the item correctly (in both the modified and standard formats) used deduction to determine the correct answer. Students who answered the item incorrectly did so because they did not grasp the meaning of "fast" within the context of the story. These errors were errors in construct-relevant skills.

Grade 8, Matched Pair 2 (Literary Passage)

The standard version of the second item began with "In lines 10–13, the poet uses a simile ‘his brown skin hung in strips/like wallpaper’ to emphasize the fish’s." The modified version asked "In lines 10–13 ‘his brown skin hung in strips/like ancient wallpaper’ is used to emphasize the fish’s." This modified version reduced the text load and decreased the vocabulary demands of the item. All students answered this item correctly in both versions (three of three correct for the standard item and one of one correct for the modified item). Two students (both with standard items) answered quickly from memory. The other two students looked back at the passage to answer correctly.

Grade 8, Matched Pair 3 (Literary Passage)

The standard version of the item asked "The image of the fish in lines 43 through 50 develops the fish as having a character that is?" The modified version asked "Lines 43–50 describe the fish as." The modified version contained only seven words and eliminated the word "character."

In both versions of the item, one student answered correctly (50%) and one student answered incorrectly (50%). One of the students who answered incorrectly (modified version) eliminated answers she thought were incorrect, but used faulty logic. The student who answered the standard item incorrectly did so because he could not read the word "character." In this case, an error analysis indicated that the word "character" may have been the cause of one error.

Grade 8, Matched Pair 4 (Literary Passage)

The standard version of item 4 asked "When the poet says ‘Like medals with their ribbons frayed and wavering’ (lines 61–62) she is referring to." The modified version of this item simply asked "‘Like medals with their ribbons frayed and wavering’ (lines 61 and 62) refers to." The modification removed the pronoun "she" and reduced the word count from 18 to 14 words.

None of the students who answered the standard question answered it correctly (0 of 2 students). One of two students answered the modified version correctly. Error analysis of incorrect answers indicates that the two students who answered the standard version incorrectly were unable to infer meaning from the quote provided. The student who answered the modified version incorrectly was confused by a vocabulary word in the item. The student who answered the question correctly re-read the quote, then used deduction (eliminating perceived incorrect answers). It appears that the elimination of "she" did not make any difference for students. The connection between students’ inability to infer meaning in the standard version and the characteristics of the item is unclear.

Grade 8, Matched Pair 5 (Literary Passage)

The standard version of this item asked students to "Reread the lines beginning with ‘I admired’ (line 45) and ending with ‘aching jaw’ (line 64). The speaker most admires the fish because she thinks he." The standard version contained two pronouns (she for the speaker and he for the fish) and 26 words. The modified version contained 12 words and only one pronoun (for the fish): "The speaker most admires the fish (lines 45–64) because of his." The word "most" was also presented in bold print in the modified version.

The two students who answered the standard question incorrectly did so because they did not know one of the terms used in the response choices. One student could not pronounce two of the words in the response choices. The one student who answered the modified version of the item incorrectly did so because he depended on his prior knowledge rather than rereading the text to find the correct answer.

For this item, it appears that the vocabulary level of the response choices was questionable. It is unknown why the same response choices were less problematic for the modified version of the item, but evidence points to a possible compounding effect of a challenging item stem and challenging answer choices. This item also demonstrated the problematic nature of students depending on prior knowledge to answer questions. Although activating prior knowledge is an important tool in the reading process, prior knowledge that is not combined with information in the actual text may lead students astray.

Grade 8, Matched Pair 6 (Literary Passage)

The standard item in matched pair 6 asked students "Which of the following best describes the person speaking in the poem?" The modified version of this item began "Which best describes the speaker of the poem?" A second set of revisions also reduced the text load in the latter part of the item by one word (making the total word count 12 words for the standard item and 8 words for the modified item).

All students answered this item correctly. Two students depended on memory, and two students eliminated answer choices they felt were incorrect until they arrived at the correct answer. Overall, it appeared that the reduction in word count (only four words) and the use of bold print for the word "best" did not make a difference for this item, as the item was easy for students in both versions.

Grade 8, Matched Pair 7 (Informational Passage)

The standard version of this item asked "What did the immigrants dislike most about their trip to America?" When modified, the item read "What upset the immigrants most about their trip?" For this item, the negative term "dislike" was replaced by the word "upset." There was clear evidence in this item that eliminating the negative term (dislike) changed how students interpreted the item. Both students who answered the standard item answered incorrectly, while both students who answered the modified item answered correctly. One student answered incorrectly because he relied on memory and did not revisit the passage; the other student read the word "dislike" as "like." The latter student clearly demonstrated the effect that words with negative prefixes such as "dis-" or "non-" can have on readers’ comprehension of test items.

Grade 8, Matched Pair 8 (Informational Passage)

The standard version of this item was worded "The statement that immigrants had to ‘contend with border guards, thieves, and crooked immigration agents’ means that the immigrants?" The modified version was rewritten for clarity of purpose: "Immigrants had to ‘contend with border guards, thieves, and crooked immigration agents.’ This means." The language of the item was simplified and the word "that" was removed twice to present a clearer message to students.

Two students answered the standard version of this item correctly. The student who answered incorrectly struggled to find the quote in the passage and claimed that the item was too difficult to understand. He suggested the item read "what was the main idea of the story." It is unknown whether the modified version of this item would have helped this student answer correctly, or if the item’s requirements were simply too difficult for this student. The one student who answered the modified item did so correctly.

Grade 8, Matched Pair 9 (Informational Passage)

The standard version of this item asked students to determine "What most worried the immigrants about the medical examinations?" The modified version of the item attempted to use language with which Grade 8 students may be more familiar, and asked "What were immigrants most afraid of during medical examinations?"

One of two students answered the standard item correctly. The student who answered incorrectly looked for the sections referred to in the answer choices, but was unable to find them in the text. It is unknown whether the modified version of this item would have helped this student, but his error appeared to be the result of his inability to recall features of the story, not an issue related to the item. Both students who answered the modified item did so correctly.

Grade 8, Matched Pair 10 (Informational Passage)

The standard version of this item asked students "The United States eventually reduced the number of immigrants allowed to enter the country because?" The modified version of this item asked "Why did the United States reduce the number of immigrants?" One of two students who answered the standard version of this question did so correctly. The student who answered incorrectly did so because he answered based on his own prior knowledge of the issue, which tainted his response. Only one student took the modified version of this item, and he answered incorrectly because he thought the word "reduce" meant to "make more." In this item, the version students answered did not seem to make a difference. It is possible, however, that defining the vocabulary term in the stem of the item (e.g., indicating that "reduce" means to make less of something) would have changed the outcome for this item.

Grade 8, Matched Pair 11 (Informational Passage)

The standard version of this item asked "In the passage, what is the main purpose of the subheadings?" The modified version of this question simply omitted the phrase "In the passage" and the word "main," asking "What is the purpose of the subheadings?"

Three students took the standard version of this item, but only one answered correctly. The two students who answered incorrectly did so because they were unfamiliar with the terminology used in the item (subheadings). The one student who answered the standard item correctly and the one student who answered the modified item correctly did so because they were familiar with the expository convention of subheadings from school. In this item, design was not a factor because student success depended on knowledge of the particular convention in the item. Reducing the number of words in the item had neither a positive nor negative effect.

Grade 8, Matched Pair 12 (Informational Passage)

This item again referred to a convention used in expository text. The standard version of the item asked "In the article, how do the quotations by immigrants relate to the sections of the article?" The modified version of this item asked students "The author most likely included the quotations by immigrants to."

Results for this item were similar to the previous item. One student answered the standard version of this item and two students answered the modified version. All students answered the item correctly because they were familiar with the convention to which the item referred. In this case, modifying the item did not appear necessary because all students were able to find the key convention in question and answer based on their prior knowledge of the convention.

Grade 8, Matched Pair 13 (Informational Passage)

The final item related to the main idea of the passage. The standard version of this item asked students "The main point the author is making in this passage is about the?" The modified version asked "This passage is mostly about the."

Although the reduction in language complexity was hypothesized to make the item more accessible, all students who answered this item did so correctly (two students for the standard version and one student for the modified version). All students answered this item very quickly, indicating that identifying the "main idea" or what a passage is "mostly about" is a task with which students are very familiar. In this case, simplified language was not necessary because of the ease with which students approached the construct.

Summary of Results

Results from this study indicate that modified items may be more accessible to students with learning disabilities. As a whole, students answered modified items correctly at a rate of 72% compared to answering standard items correctly at a rate of 46%. The differences indicate an overall positive effect on student outcomes when using items that are readable and comprehensible, but individual items provide further information on what specific item modification strategies were most meaningful.

For example, the text load (word count) was reduced for modified items 1–6 and 10–13, but the reduction in word count did not make a difference in student results. Likewise, the modified versions of items 5 and 6 contained bold print. Students commented that they liked the bold print, but it did not seem to affect results. Errors directly related to the design of items, however, were present. For example, item 3 contained a vocabulary term that was unknown to the student. Likewise, the standard version of item 7 contained a negative prefix (dis-) that was removed in the modified item; the negative prefix was the cause of a student error. Items 5 and 10 both had item-related errors that were not addressed in our modification process, but their results are instructive. Item 5 had answer choices that were too difficult for students to read, and item 10 had a vocabulary word in the item's stem that was problematic in both the standard and modified versions. Both appeared to be comprehension items, but the vocabulary in the items acted as a barrier to access. It is unknown whether changing the vocabulary level of these items would affect the intended construct. Because the items were not vocabulary items, it is possible that parenthetical definitions of terms might have helped students, but further research is needed to determine the effects of parenthetical definitions on both student outcomes and test validity.

For most items, the source of student error or success was the particular student’s knowledge of the content and story. For items 6, 9, and 11, students answered incorrectly because of flawed reading strategies. In contrast, there were no errors for either the standard or modified versions of items 12 and 13 because all students understood the constructs tested. The source of error for item 8 was difficult to determine from student responses. Table 2 shows how each item was modified and the likely source of error for each.

Table 2. Item Modification Strategies and Student Errors

Item   Modification                               Source of Error
1      Reduced text load (word count)             Student reading strategies
2      Reduced text load and vocabulary level     Student reading strategies
3      Reduced text load and vocabulary level     Item stem vocabulary*
4      Removed pronouns and reduced vocabulary    Student reading strategies
5      Reduced word count, important words bold   Vocabulary in answer choices*
6      Reduced text load, important words bold    Student reading strategies
7      Replacement of negative word               Item negative term*
8      Rewritten for clarity of purpose           Error source unknown
9      Rewritten with familiar language           Student reading strategies
10     Reduced text load                          Undefined vocabulary in stem*
11     Reduced text load                          Student reading strategies
12     Reduced text load                          No error
13     Reduced text load                          No error

*Indicates item-related sources of error.


Discussion

As states design tests that are intended to be accessible to the diverse general assessment population (e.g., universally designed assessments), continued research is important for defining how test items can become more "readable and comprehensible." To this end, specific research on what strategies are most effective in making test items more accessible can inform practitioners on ways to make items accessible to a wide variety of students, including students with learning disabilities.

This study was limited because of its sample size. Only eight participants engaged in think aloud activities, so results may not generalize to the wider population. Furthermore, the sample was drawn from public and private school students who were actively engaged in reading improvement programs over the summer. Thus, the particular students who participated may not represent the heterogeneity of students with learning disabilities in the United States. It is likely that the particular sample with which we worked was from a higher socioeconomic level than the general population of students with learning disabilities in the U.S. It will be important to conduct additional research, both to increase the sample size and to increase the diversity of students within the sample.

Despite the limitations of this study, informative results emerged. Because think aloud methods can provide in-depth information about student problem-solving processes for test items, error analyses could be conducted on all items to determine the likely source of error. For many items, students answered correctly or incorrectly based on their reading strategies and their knowledge of reading conventions. Item modifications such as reducing the number of words in an item and printing important words in bold did not seem to have any effect on student results.

Four items, however, provided information on ways that items can be made more readable and comprehensible. All of these items related to vocabulary. As noted above, English language arts has specific vocabulary and terms that are often tested in large-scale assessments. In this research, we found that other words, ones that were not part of the tested constructs, confused students. Because of this, it may be important to conduct further research on the effects of undefined, non-construct vocabulary in item stems; challenging non-construct vocabulary in item response choices; and words with negative prefixes (such as "dis-").

In this research, it was not the number of words that challenged students, but the words themselves. As the assessment community continues to work out how to make test items as accessible as possible without changing intended constructs, the role of vocabulary within the items themselves deserves consideration. Findings from this research indicate that non-construct vocabulary may obscure what knowledge we can gain about student reading comprehension from tests. Such findings imply a need for further, more comprehensive research on the effects of vocabulary, as well as continued review by content experts to ensure that the level of non-construct vocabulary in items is appropriate.


References

Abedi, J., Leon, S., & Mirocha, J. (2001). Validity of standardized achievement tests for English language learners. Paper presented at the American Educational Research Association Conference, Seattle, WA.

Center for Universal Design (n.d.). What is universal design? Center for Universal Design, North Carolina State University. Retrieved January, 2002, from the World Wide Web: www.design.ncsu.edu.

Ericsson, K. A., & Simon, H. A. (1994). Protocol analysis: Verbal reports as data (Revised edition). Cambridge, MA: MIT Press.

Gong, B., & Marion, S. F. (2003). Implementing Universal Design principles in educational assessment: Current challenges of construct clarity. Retrieved June 18, 2006, from the World Wide Web: http://www.nciea.org/

Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27.

Johnstone, C. J. (2003). Improving validity of large-scale tests: Universal design and student performance (Technical Report 37). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Johnstone, C. J., Bottsford-Miller, N. A., & Thompson, S. J. (2006). Using the think aloud method (cognitive labs) to evaluate test design for students with disabilities and English language learners (Technical Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Johnstone, C.J., Thompson, S.J., Moen, R.E., Bolt, S., & Kato, K. (2005). Analyzing results of large-scale assessments to ensure universal design (Technical Report 41). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Kahl, S. (2006). Universal design vs. depth of knowledge. Presented at the Council of Chief State School Officers Large Scale Assessment Conference, San Francisco, CA, June 25–28, 2006.

Kopriva, R., Winter, P., Emick, J., & Chen, C. (2006). Achieving accurate results for diverse learners: Access-based item development. Paper presented at the American Educational Research Association Meeting, April 7–11, 2006.

Leighton, J. P. (2004). Avoiding misconception, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 23, 6–15.

RAND Reading Study Group (2002). Reading for understanding: Toward an R&D program in reading comprehension. Retrieved March 10, 2007, from the World Wide Web: http://www.rand.org/pubs/monograph_reports/MR1465.pdf

Rakow, S. J. & Gee, T. C. (1987). Test science, not reading. Science Teacher, 54 (2), 28–31.

Shaftel, J., Belton-Kocher, E., Glasnapp, D., & Poggio, J. (2006). The impact of language characteristics in mathematics test items on the performance of English language learners and students with disabilities. Educational Assessment, 11(2), 105–126.

Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large scale assessments (Synthesis Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.