Using Systematic Item Selection Methods to Improve Universal Design of Assessments

Policy Directions 18

Christopher Johnstone • Martha Thurlow • Michael Moore  • Jason Altman

September 2006

All rights reserved. Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Johnstone, C., Thurlow, M., Moore, M., & Altman, J.  (2006). Using Systematic item selection methods to improve universal design of assessments (Policy Directions 18). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web:


The No Child Left Behind Act of 2001 (NCLB) and other recent changes in federal legislation have placed greater emphasis on accountability in large-scale testing. Included in this emphasis are regulations that require assessments to be accessible. States are accountable for the success of all students, and tests should be designed in a way that provides all students an opportunity to demonstrate their knowledge and skills. With the reauthorization of the Individuals with Disabilities Education Act in 2004, states are required for the first time to incorporate universal design principles in developing and administering tests, to the extent feasible.

Applying the concept of universal design to statewide assessment means that assessments are designed from the beginning and continually refined to result in more valid inferences about performance of students with diverse characteristics. These assessments are based on the premise that each child in school is a part of the population to be tested, and that test results should not be affected by disability, gender, race, or English language ability. While universally designed assessments are not intended to eliminate individualization, they may reduce the need for accommodations by eliminating access barriers associated with the tests themselves. At the same time, the intent of the measurement—the intended content and construct of the assessment—are not changed.

Including universal design in test construction is already taking place in the majority of states. According to a survey of states conducted by the National Center on Educational Outcomes (NCEO), during the 2004–2005 school year 43 states addressed issues of universal design. More than half of the states addressed universal design at the item development and review levels, and by including it in RFPs for test development.

There are many elements involved in creating universally designed assessments. They include making sure that students with disabilities are part of field testing, for example. But a major focus of universal design in assessments is making sure that the items included in the assessment are appropriate. There are several methods for selecting items to ensure that they optimize the characteristics of universal design. The purpose of this Policy Directions is to provide an overview of these item selection methods, and to suggest that a combination of the methods will produce the better result. Each method has strengths and weaknesses, may lead to different results, and is in different stages of current practice (see Table 1). Although each method has merits, NCEO recommends states employ all methods systematically and in conjunction with each other.

Table 1. Item Analysis Methods: Pros, Cons, and Current Practice




Current Practice

Improvement on Current Practice

Expert Review

Provides structured review of items by experts

Does not provide actual student data

Unstructured "sensitivity" review panels

Provides reviewers with tools to make decisions

Statistical Analyses

Provides significance data on field tested items; flags potentially problematic items

Validity questionable for small populations

Only DIF calculations performed

Multiple analyses conducted, providing patterns of flagged items

Think Aloud Methods

Provides information on why particular items function as they do

Does not provide data across groups

Not currently in widespread use

Provides important design information


Expert Review

Once an assessment is designed and in a format suitable for previewing, it is important for states to have sensitivity review teams examine the assessment. The use of review teams to examine items is common practice in many states, and is generally encouraged by test vendors. When creating bias and content review teams, it is important to involve members familiar with disability and language issues. Grade level experts, representatives of major cultural and disability groups—or those who can reflect their needs—researchers, and teaching professionals all make up an effective review team.

NCEO, working closely with experts in the fields of assessment, disability, reading, mathematics, and language acquisition, developed and refined a set of considerations for test developers and item reviewers to use to ensure that tests are universally designed. The considerations are listed in Table 2 (see Technical Report 42 listed in Resources). They ask six important questions about test items. Test item developers and reviewers can use these questions to help determine the extent to which an item provides appropriate accessibility without changing the intended construct.

There are considerations that item and test developers or reviewers can use to help determine whether a test overall—not just the individual items—is universally designed:

  • Do all visuals (e.g., images, pictures) and text provide information necessary to respond to the item?

  • Is information organized in a manner consistent with an academic English framework with a left-right, top-bottom flow?

  • Can booklets/materials be easily handled with limited motor coordination?

  • Are response formats easily matched to question?

  • Is there a place for taking notes (on the screen for computer-based tests) or extra white space with paper pencil?

These considerations show that it is not just each item that is important, but also how the whole test is put together that is also an important aspect of universal design.

Table 2. Considerations for Universally Designed Assessment Items

Does the Item...

Measure what it intends to measure?

  • Reflect the intended content standards (reviewers have information about the content being measured)?

  • Minimize knowledge and skills required beyond what is intended for measurement?

Respect the diversity of the assessment population?

  • Sensitive to test taker characteristics and experiences (consider gender, age, ethnicity, socio-economic level, region, disability, and language)?

  • Avoid content that might unfairly advantage or disadvantage any student subgroup?

Have clear format for text?

  • Standard typeface?

  • Twelve (12) point minimum size for all print, including captions, footnotes, and graphs (type size appropriate for age group), and adaptable font size for computers?

  • High contrast between color of text and background?

  • Sufficient blank space (leading) between lines of text?

  • Staggered right margins (no right justification)?

Have clear visuals (when essential to item)?

  • Visuals are needed to answer the question?

  • Visuals have clearly defined features (minimum use of gray scale and shading)?

  • Sufficient contrast between colors?

  • ·Color alone is not relied on to convey important information or distinctions?

  • Visuals are labeled?

Have concise and readable text?

  • Commonly used words (except vocabulary being tested)?

  • Vocabulary appropriate for grade level?

  • Minimum use of unnecessary words?

  • Idioms avoided unless idiomatic speech is being measured?

  • Technical terms and abbreviations avoided (or defined) if not related to the content being measured?

  • Sentence complexity is appropriate for grade level?

  • Question to be answered is clearly identifiable?

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)?

  • Allows for the use of braille or other tactile format?

  • Allows for signing to a student?

  • Allows for the use of oral presentation to a student?

  • Allows for the use of assistive technology?

  • Allows for translation into another language?

As states and other testers explore the use of computers for testing, it is important to have ways to judge their appropriateness and universal design features as well. There are additional considerations for computer-based tests to ensure that they are universally designed (see Table 3).

When using Expert Review considerations, NCEO recommends incorporating the following into the review:

  • Conduct the review as early as possible in the stages of test development.

  • Include disability, technology, and language acquisition experts in item reviews.

  • Provide professional development for item developers and reviewers on use of the universal design considerations.

  • Present the items in the format in which they will appear on the test.

  • Include standards being tested with the items being reviewed.

  • Try out items with students (use Think Aloud methods).

  • Field test items in accommodated formats.

  • Review computer-based items on computers.

Table 3. Considerations for Universally Designed Computer-based Tests

Layout and design

  • Sufficient contrast between background and text and graphics for easy readability.

  • Color alone is not relied on to convey important information or distinctions.

  • Font size and color scheme can be easily modified (through browser settings, style sheets, or on-screen options).

  • Stimulus and response options are viewable on one screen when possible.

  • Page layout is consistent throughout the test.

  • Computer interfaces follow Section 508 guidelines (


  • Students have received adequate training on use of test delivery system.

  • Navigation is clear and intuitive; it makes sense and is easy to figure out.

  • Navigation and response selection is possible by mouse click or keyboard.

  • Option to return to items and return to place in test after breaks.

Screen reader considerations

  • Item is intelligible when read by a text/screen reader.

  • Links make sense when read out of visual context ("go to the next question" rather than "click here").

  • Non-text elements have a text equivalent or description.

  • Tables are only used to contain data, and make sense when read by screen reader.

Test specific options

  • Access to other functions is restricted (e.g., e-mail, Internet, instant messaging).

  • Pop up translations and definitions of key words/phrases are available if appropriate to the test.

  • Students writing online can get feedback on length of writing on-demand in cases where there is a restriction on number of words.

  • Students are able to record their responses and read them back as an alternative to a human scribe.

  • Students are allowed to create persistent marks to the extent that they are already allowed on paper-based booklets (e.g., marking items for review; eliminating multiple choice items, etc.).

Computer capabilities

  • Adjustable volume.

  • Speech recognition available (to convert user’s speech to text).

  • Test is compatible with current screen reader software.

  • Computer-based option to mask items or text (e.g., split screen).

  • Computer software for test delivery is designed to be amenable to assistive technology.

Statistical Analysis

A quantitative approach to selecting items that appear to provide access to students with certain characteristics, such as disabilities, is to conduct statistical analyses on test item results. Many statistical methods exist, ranging from simple methods based on classical test theory to complex methods based on contemporary item response theories (IRT). Four statistical approaches currently used in the field by researchers have practical usefulness for identifying items that potentially violate universal design principles for groups of students.

Table 4 shows the four statistical analysis methods that are useful for flagging items to identify those that are potentially more challenging for particular students. Each of these has been used to identify items that may not be universally designed (see Technical Report 41 listed in the Resources). Statistical methods are based on various assumptions that determine when items should be flagged as producing different performance from what would be expected. The flagging of an item is taken as an indication that the item may not be as accessible as possible, and may be creating barriers that do not allow the student to demonstrate his or her knowledge and skills.

Statistical analyses are useful for understanding which items may be biased and need revision, but they do not provide information on why particular items function the way they do. Understanding why is often critical to knowing what to do in making revisions to items.

Table 4. Four Statistical Analysis Methods for Reviewing Test Item Results

Method Procedure

Item Rankings

Items are ranked from most to least difficult for total population and for particular groups. It is assumed that ranks should be similar for groups and total.

Item Total Correlation (ITC)

Within-group investigation of how individual items correlate with the total score on the same test; poor correlations may signal a problem. Different ITCs for the same item across different groups of test takers may indicate that the item behaves differently across those groups.

Differential Item Functioning (DIF) – Contingency Table Methods

Performance on items is compared for large groups that perform at similar levels on the total test. In contingency table methods DIF statistics are calculated by comparing the proportion of students answering an item correctly in target and comparison groups with the same total score range; statistically significant differences may point to an item’s problematic nature.

Differential Item Functioning (DIF) – Item Response Theory Approaches

In IRT, characteristics of each item are represented by an item response curve, which is a function of individual test takers’ characteristics called "latent traits." Items are compared based on their item response curves between target and comparison groups.

Think Aloud Methods

Think aloud methods provide a way to answer questions about why particular items may be problematic for students. For state assessments, think aloud methods tap into the short-term memory of students who complete assessment items while they verbalize. Researchers believe that the verbalizations produced in think aloud studies provide excellent information because they are not yet in the long-term memory. Once experiences enter long-term memory, they may be tainted by personal interpretations. Therefore, an excellent way to determine whether design issues really do exist for students is to have students try out items themselves.

Think aloud methods have been used in research to identify problematic items for students with disabilities (see Technical Report 44 in Resources). When students verbalize everything they think while completing an item, it becomes easy to see how the design of the item affects the understanding of the item. If a student does have difficulty with the item, it will also be easy to determine whether the difficulty is a result of design features or a lack of curricular knowledge. The depth of understanding that results from think aloud techniques is this method’s strength. Follow-up questions can supplement any unclear data derived from think aloud techniques.


Attaining universal design in statewide assessments is a goal for states as they continually refine and improve their assessments. With the understanding that this goal means that states are retaining the constructs and content that their tests are intended to measure, the question becomes how to identify items that may violate the principles of universal design.

It is important to have a systematic approach to reach universal design in assessments. Research is paving the way to identifying techniques that are workable. Any one technique by itself, however, may be insufficient. The methods identified here will reduce the possibility of erroneously flagging and eliminating items that reflect poor performance due to students’ lack of opportunity to learn.

Specifically, using sets of considerations for expert review can make the test development process more transparent, informed, and focused on the needs of the entire population of students and ensure that the assessment results are more meaningful for the widest range of students. Statistical analysis methods can help pin-point test items that are potentially problematic and that may have universal design issues. Think aloud methods can be used with students themselves who can provide information that will help illuminate whether there are design issues that need to be addressed. These are all aspects of striving to reach universal design, which holds the promise of improved student performance. This goal can be reached without compromising the validity of the assessment.


Using the Think Aloud Method (Cognitive Labs) to Evaluate Test Design for Students with Disabilities and English Language Learners (Technical Report 44). Johnstone, C.J., Bottsford-Miller, N.A., & Thompson, S.J. (2006). Available from the National Center on Educational Outcomes at

Analyzing Results of Large-Scale Assessments to Ensure Universal Design (Technical Report 41). Johnstone, C.J., Thompson, S.J., Moen, R.E., Bolt, S., & Kato, K. (2005). Available from the National Center on Educational Outcomes at

Consideration for the Development and Review of Universally Designed Assessments (Technical Report 42).Thompson, S.J., Johnstone, C.J., Anderson, M.E., & Miller, N.A. (2005). Available from the National Center on Educational Outcomes at

Universal Design Applied to Large-Scale Assessments (Synthesis Report 44). Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Available from the National Center on Educational Outcomes at

Universally designed assessments: Better tests for everyone! (Policy Directions No. 14). Thompson, S., & Thurlow, M. (2002). Available from the National Center on Educational Outcomes at