Universal Design Online Manual

Christopher Johnstone • Jason Altman •
Martha Thurlow • Michael Moore

September 2006

All rights reserved. Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Johnstone, C., Altman, J., Thurlow, M., & Moore, M. (2006). Universal design online manual. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.


The No Child Left Behind Act of 2001 and other recent changes in federal legislation have placed greater emphasis on accountability in large-scale testing. Previously exempt students, many with disabilities, now must be included, monitored, and reported by all states. Because large-scale assessments have such high stakes, it is important to ensure that assessments are an accurate measure of the knowledge and skills of ALL students. To ensure that tests are designed from the beginning with accessibility in mind, Thompson, Johnstone, and Thurlow (2002) developed seven elements of universally designed assessments, based on research from a variety of fields.

By the end of this manual you will better understand how the following design considerations improve testing for all students:

  • Providing inclusive assessment populations

  • Measuring what they are intended to measure

  • Reducing bias to a minimum

  • Having clear and understandable instructions and procedures

  • Ensuring amenability to accommodations

  • Having comprehensible language

  • Being legible.

This tool outlines steps that states can take to ensure universal design of assessments. The recommendations can be used for both computer-based and paper-based assessments. The National Center on Educational Outcomes (NCEO) recommends that states follow the steps in the order presented. Including any step in the design and review of tests may improve the design features of a state assessment.

This online document is accompanied by a more detailed "How-To" manual. See A State Guide to the Development of Universally Designed Assessments.

Overall Universal Design Principles

  1. Universally designed assessments DO NOT change the standard of performance - they are not watered down or made easier for some groups.

  2. Universally designed assessments are not meant to replace accommodations. Even by incorporating the elements of universal design in assessment design, accommodations may still be needed for some students in the areas of presentation, response, setting, timing, and scheduling.

  3. There is a reason we call these ideas "considerations." They are steps that should be considered when developing an assessment. They should be discussed openly, and decisions should be made after weighing the pros and cons of different design elements.

  4. ALL students, not only English language learners, benefit from having more accessible tests.

Step 1: Ensure the Presence of Universal Design in RFPs

All students must have the opportunity to demonstrate their achievement of the content standards. Therefore, to satisfy state and federal requirements for universal design, vendors should design state tests that allow the maximum number of students possible (and students with diverse characteristics) to take the same assessments without threat to the validity and comparability of the scores.

To this end, vendors must demonstrate how they will develop "universally designed assessments." Such assessments are designed from the beginning to allow participation of the widest range of students and result in valid inferences about the performance of all students, including students with disabilities, students with limited English proficiency, and students with other special needs.

Step 2: Review Teams

Once the assessment is designed and in a format suitable for previewing, it is important for states to let sensitivity review teams examine the assessment (in the format in which students will see the test). Reviews by these teams are common practice in states, and are often encouraged by test vendors. When creating bias and content review teams, it is important to involve members of major language groups, disability groups, and support groups. Grade level experts, representatives of major cultural and disability groups, researchers, and teaching professionals all make up an effective review team.

Reviewers will need the following information to perform a careful and comprehensive review:

  • Purpose of the test, and content standard tested by each item
  • Description of test takers (e.g., age, geographic region)
  • Field test results by item and subgroup
  • Test instructions
  • Overall test and response formats
  • Use of technology
  • State accommodation policies

Bias and design issues may arise in test development and are not problematic if caught by review teams. Sensitivity reviewers are charged with flagging items that may cause "problems" for certain subgroups, where the "problems" are due to students' subgroup status rather than their knowledge of the content. An efficient way to flag items is to use a review sheet, which gives reviewers an opportunity to mark potential issues with items and thus opens those items to further discussion among reviewers. By using a structured form, reviewers are more likely to provide specific feedback to test vendors. Such feedback allows items to be re-examined for design issues rather than (as is often the case) summarily rejected for unclear reasons. Structured forms thus create a "win-win" situation for advocates and vendors: reviewers can give test vendors specific information about what may be at issue in an item, and vendors can then determine whether changes can be made without removing the item from the item bank entirely. Item-specific review forms and whole-test review forms can both be used.
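
The tallying that a structured review sheet supports can be sketched in code. The following is a minimal, hypothetical illustration (the issue categories, item identifiers, and data structure are invented for this example, not drawn from any actual state review sheet): each reviewer's flags are aggregated per item so that vendors receive specific, item-level feedback rather than blanket rejections.

```python
# Hypothetical sketch of aggregating structured review-sheet flags.
# Issue categories and item ids are invented examples.

from collections import defaultdict

def tally_flags(review_sheets):
    """Aggregate reviewers' flags into per-item, per-issue counts."""
    tally = defaultdict(lambda: defaultdict(int))
    for sheet in review_sheets:          # one sheet per reviewer
        for item_id, issues in sheet.items():
            for issue in issues:
                tally[item_id][issue] += 1
    return {item: dict(issues) for item, issues in tally.items()}

# Three reviewers' sheets: item id -> list of flagged issues
sheets = [
    {"item_07": ["bias"], "item_12": ["language", "presentation"]},
    {"item_07": ["bias", "legibility"]},
    {"item_12": ["language"]},
]
print(tally_flags(sheets))
# item_07 was flagged for bias by two reviewers, a clear discussion point
```

A tally like this makes it easy to see which issues were raised by multiple reviewers and which were a single reviewer's concern.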

Step 3: Using Think Aloud Methods to Analyze Flagged and Unflagged Items

In an effort to validate the findings of experts, a series of items can be examined by students themselves using cognitive lab, or "think aloud," methods.

Think aloud methods were first used in the 1940s and have since been used for a variety of "end user" studies in the fields of ergonomics, psychology, and technology. In the case of statewide assessments, the end users are students who will take tests. Think aloud methods tap into the short-term memory of students who complete assessment items while they verbalize. The utterances produced by students are the data that researchers or states can use to better understand items.

The verbalizations produced in think aloud studies provide excellent information because they capture thinking before it enters long-term memory. Once experiences enter long-term memory, they may be tainted by personal interpretations. An excellent way of determining whether design issues really do exist for students, therefore, is to have students try out items themselves in "live time."

NCEO typically videotapes all think aloud activities, but states can also audiotape sessions or have several observers compare field notes. Inter-rater agreement is important for making decisions based on think aloud activities, so some strategy for confirming what is viewed or heard during think aloud activities should be undertaken. In addition, it is useful to include students who achieve at a variety of levels on statewide achievement tests. To this end, a sample population might include students without disabilities and majority culture children as well as students with disabilities, English language learners, and students from low socioeconomic status backgrounds. A recent NCEO research report on think aloud methods can be accessed at http://education.umn.edu/nceo/OnlinePubs/Tech44/
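
One common way to quantify inter-rater agreement is Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. The sketch below is illustrative only: the coding categories and the two raters' codes are invented, and the manual does not prescribe kappa specifically, only that some agreement-checking strategy be used.

```python
# Hypothetical sketch: Cohen's kappa for two raters coding the same
# think aloud videotapes. Codes and data are invented for illustration.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' codes."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Probability both raters assign the same code by chance
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two raters coding five items as OK, FORMAT issue, or LANGUAGE issue
a = ["OK", "FORMAT", "OK", "LANGUAGE", "OK"]
b = ["OK", "FORMAT", "OK", "OK", "OK"]
print(round(cohens_kappa(a, b), 2))  # -> 0.58
```

Kappa near 1.0 indicates strong agreement; low values suggest the coding scheme or rater training needs revisiting before decisions are based on the codes.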

An example of a process for selecting and conducting think aloud studies is described in Vignette #1.

Vignette #1

State X has recently conducted an expert review on its fourth grade mathematics test. Reviewers found that most items had only minor formatting issues that they would like to see improved, but that three of the items had major issues pertaining to bias, presentation, and comprehensible language. State X's assessment director was concerned that these items might cause students with a variety of characteristics to incorrectly answer these items because of design issues, thus reducing the validity of inferences that could be drawn from the test. State X then decided to conduct a think aloud study on the three items in question, as well as three items that generally met the approval of item reviewers.

Overview of Activities
State X opted to conduct the think aloud study with its own staff (alternatively, they may have decided to offer a subcontract to a local university or research organization to conduct the study). The study took place in a quiet room, where State X staff members could videotape the procedures.

Because State X's assessment director was concerned about the effects of bias, presentation, and language on students with particular disabilities and English language learners, she targeted these students, as well as students who were deemed "typically achieving, non-disabled, English proficient" students. In total, 50 Grade 4 students were contacted. Among these were: 10 students with learning disabilities, 10 students with mild mental retardation (who took the general education assessment), 10 students who were deaf, 10 students who were English language learners (but did not have a disability), and 10 non-disabled, English proficient students.

Each student was then individually brought to the quiet room. First, State X staff members explained the process. Then, students practiced "thinking aloud" by describing everything they do when they tie their shoes (sign language interpreters were present for students who are deaf). Once students understood the process, they were asked to think aloud while they answered mathematics items. The only time State X staff spoke was when students were silent for more than 10 seconds, at which time staff encouraged students to "keep talking." Each item took approximately 10 minutes per student.

After students completed items, State X staff asked post-hoc questions, simply to clarify any issues they did not understand. Data derived from post-hoc questions are not as authentic as think aloud data, but they can help to clarify issues that were unclear to staff.

Once all think aloud activities were completed, State X staff reviewed all the videotapes they had taken. Using NCEO's think aloud coding sheet, staff were easily able to determine if design issues were problematic for particular populations. The data they collected helped them to make recommendations for Step 4.

Step 4: Revisit Items Based on Information from Steps 2 and 3

Steps 2 and 3 are likely to produce rich data that identify concerns about particular items or the entire test. Prior to field testing (Step 5) it is important to analyze the data produced in Steps 2 and 3 and make any possible changes that can be made to the test. Some changes may be impossible prior to field testing, while others (such as formatting changes) may be quite easy to make. Regardless of whether changes are made to tests or not, data from Steps 2 and 3 are important sources for recommendations and cross-analyzing with field test results.

Step 5: Field Test

It is common practice for states to field test potential exam items well in advance of their actual inclusion in statewide testing systems. Somewhat less common is transferring potential items into accommodated formats and field testing those versions as well, to check for potential differential item functioning. It is important that test administrators be aware of item statistics, as well as the effects of accommodations on each test item, when deciding which items to include on exams.

Step 6: Analyze Field Test Data

A useful method for ensuring universal design of assessments is to conduct large-scale statistical analyses on test item results. Many methods exist for examining data to detect design issues, ranging from simple techniques based on classical test theory to more complex item response theory (IRT) techniques. Helpful statistical techniques include: item ranking, item-total correlation, differential item functioning (DIF) using contingency tables, and DIF using IRT approaches.
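
To make the contingency-table approach to DIF concrete, the sketch below shows a Mantel-Haenszel common odds ratio, one standard contingency-table DIF statistic: examinees are stratified by total test score, and within each stratum the odds of answering the item correctly are compared between a reference group and a focal group. The counts are invented for illustration, and this is only one of several defensible DIF procedures, not a method the manual mandates.

```python
# Hypothetical sketch of DIF screening with a Mantel-Haenszel
# contingency-table approach. All counts are invented example data.

def mantel_haenszel_odds_ratio(strata):
    """strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one 2x2 table per total-score stratum."""
    num = den = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        num += a * d / t             # reference-correct, focal-wrong
        den += b * c / t             # reference-wrong, focal-correct
    return num / den                 # ~1.0 suggests no DIF on this item

# Three score strata (low, middle, high total scores)
strata = [(30, 20, 25, 25), (40, 10, 35, 15), (45, 5, 44, 6)]
ratio = mantel_haenszel_odds_ratio(strata)
# Ratios well above or below 1.0 flag the item for closer review
```

Stratifying by total score is what separates DIF from a raw group difference: it asks whether students of comparable overall ability fare differently on the item, which is the pattern that points to a design issue rather than a knowledge difference.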

The analyses listed above will almost certainly produce disparate results because they are examining slightly different item functions. In fact, between disability groups and analyses, it is likely that many items on a test will be flagged at least once. Such a result does not necessarily mean that an entire test is flawed.

Rather, a reasoned approach to sorting through large amounts of data is the rule of halves: if an item is flagged in at least half of the analysis methods (n ≥ 2 analyses), that item is a candidate for re-examination.

Furthermore, if data are disaggregated by disability category (e.g., students with learning disabilities, hearing impairments, etc.), and an item has been flagged across more than half of those categories, it may also be a candidate for revision. Similar examinations can be conducted for populations who took tests with specific accommodations: if an item is flagged for half of the accommodations tested, it may have universal design issues.
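
The rule of halves reduces to a simple tally. The sketch below applies it across the four analysis methods named above; the analysis names, item identifiers, and flag sets are invented for illustration.

```python
# Hypothetical sketch of the rule of halves: an item flagged by at least
# half of the analyses run becomes a candidate for re-examination.
# Analysis names, item ids, and flags are invented example data.

ANALYSES = ["item_ranking", "item_total_corr", "dif_contingency", "dif_irt"]

def rule_of_halves(flags_by_analysis, n_analyses=len(ANALYSES)):
    """flags_by_analysis: analysis name -> set of flagged item ids."""
    counts = {}
    for flagged in flags_by_analysis.values():
        for item in flagged:
            counts[item] = counts.get(item, 0) + 1
    # Candidates: flagged in at least half of the analyses run
    return sorted(i for i, n in counts.items() if n >= n_analyses / 2)

flags = {
    "item_ranking": {"item_03", "item_11"},
    "item_total_corr": {"item_03"},
    "dif_contingency": {"item_03", "item_19"},
    "dif_irt": {"item_19"},
}
print(rule_of_halves(flags))   # only items flagged twice or more
```

The same tally works when the rows are disability categories or accommodation conditions instead of analysis methods; only the threshold denominator changes.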

An NCEO report demonstrating item review methods is available at http://www.nceo.info/OnlinePubs/Technical41.htm.

Step 7: Final Revision

After experts have reviewed items, students have explained how they approached items through think aloud methods, and field test results have been reviewed, states and contractors can discuss the final revisions that need to be made to tests. It is possible that no changes at all will be made. On the other hand, the final revision stage is the last chance states and contractors have to address design issues before tests are distributed and the stakes become high. This stage should be approached with caution, but in a cooperative spirit that serves all students as well as the state's finances and timelines.

Step 8: Testing

Step 8 is the culmination of months (or years) of hard work on the part of both the state and the contractor. During testing periods in states, students take the assessments designed by contractors under standard and accommodated conditions. Results are used for accountability purposes, and are monitored at both the school and district levels. Designing a test for accessibility is a challenging process, and culminates when students take the "live" test.

Step 9: Post-test Review

Once test results are available, the process starts again. States can examine results statistically, and begin the expert review and think aloud processes for the following year's test. When contractors develop a test that states deem acceptable for use for more than one year, the universal design process is streamlined because many of the potential problems with a test were caught during the design and field test stages. Universal design processes are then used as a tool for ongoing item improvement.

It is possible that there will never be a test that is accessible to all students for all items. While a perfectly accessible test may not be possible, a more accessible assessment is possible. Hard work, cooperation, and following the steps in this guide can help the process. In addition, states may develop their own universal design processes. As universal design research emerges, processes will become more succinct, efficient, and effective.

States that have made commitments to accessible assessments for all students will find their efforts rewarded in better measurement of what their students know and can do. It is our hope that this manual will support current and future processes and make the commitment to universal design an easier one to make.