Models for Understanding Task Comparability
in Accommodated Testing

 

Council of Chief State School Officers
State Collaborative on Assessment and Student Standards
Assessing Special Education Students (ASES) - Study Group III
Patricia Almond-Chairperson

 

Gerald Tindal
University of Oregon

March, 1998

 

Behavioral Research and Teaching
College of Education - 232 Education
5262 University of Oregon
Eugene, OR 97403-5262
(541) 346-1640
geraldt@darkwing.uoregon.edu


Preface

This paper was commissioned by a subgroup of the State Collaborative on Assessment and Student Standards (SCASS) focused upon Assessing Special Education Students (ASES). When students with disabilities take large-scale assessments, issues arise about testing accommodations and modifications. The SCASS-ASES group sought guidance as it addressed concerns about the effect of accommodations on the validity of test performance for students taking them. Because knowledge about the impact of accommodations on assessment results was limited, a paper about task comparability was conceived.

This document is the resulting paper; it reflects the search for understanding about accommodations in large-scale assessments for students with disabilities. The SCASS-ASES Study Group Three proposed the project, and its members acted as advisors during the development of the paper. They hope that the ideas presented in this paper spur discussion, inspire new research, and move educators forward in implementing the intent of the 1997 Amendments to the Individuals with Disabilities Education Act (IDEA).

The group wishes to acknowledge the dedication of Dr. Gerald Tindal, who researched and wrote the paper. He saw the project to completion, posed provocative questions, probed for answers, and provided rare insights into arguments raised by members of the group. He synthesized a collection of questions and conversations, as well as existing theories and knowledge in the field. His work has surpassed the initial vision, producing a fundamental perspective that already has filtered into the understanding of large-scale assessments and students with disabilities.

Martha Thurlow also should be acknowledged for the careful and insightful editing she completed on this paper. Her edits made the language and ideas much clearer and more readable.

Study Group Three

Patricia Almond (Chairperson, OR), Sue Bechard (CO), Jana Deming (TX), Edna Duncan (MS), John Haigh (MD), Jan Kirkland (MS), Ken Olsen (MSRRC), Paula Ploufchan (CCSSO), Martha Thurlow (NCEO), Becca Walk (WY).


Abstract

In this paper, three models are described for making decisions about the comparability of tasks when accommodations are used in large-scale assessments: (a) descriptive, (b) comparative, and (c) experimental. Within each of these models, criteria are presented that can be used to determine whether assessment tasks are considered similar (comparable) or different (noncomparable). The descriptive model involves the presentation or analysis of policy, providing historical or contextual information, and documentation. The comparative model involves the interpretation of policy, with or without data. When data are available and used, they provide post-hoc evaluations and the beginning of an empirical approach in which data are used in the decision making process. Finally, in the experimental model, not only are data used in decision making, but threats to validity are controlled, allowing for statements of relationship or cause-effect to evaluate the impact of accommodations on decisions. The use of three models should help to clarify the reasoning behind judgments of accommodations and task comparability in large-scale assessment programs.


Models for Understanding Task Comparability
in Accommodated Testing

The 1997 amendments to the Individuals with Disabilities Education Act (IDEA) contain a strong directive for the participation of students with disabilities in large-scale testing. Yet, the format of the assessments themselves is not specified. For example, the language of the reauthorized legislation includes the following statement: "Children with disabilities are included in general State and district-wide assessment programs, with appropriate accommodations, where necessary" [IDEA, section 612 (a)(17)(A)]. The question, then, is what constitutes an "appropriate" accommodation?

To what degree are accommodated and non-accommodated tasks within large-scale assessment programs comparable? If slight changes are made in the manner in which tasks are configured, administered, or responded to, perhaps these variations can be ignored and scores considered comparable. In contrast, if significant changes exist in the accommodated tasks, then the scores may not be similar and thus should not be compared. When are two tasks considered comparable and when are they considered not comparable?

The National Center on Educational Outcomes (NCEO) has reported that there "appears to be no formal consensus on the use of the terms accommodation, modification, and adaptation, [and] they are used interchangeably" (NCEO, 1993, p. 2). In a later draft of a position presented at a June 1996 State Collaborative on Assessment and Student Standards (SCASS) meeting, Ysseldyke describes an accommodation as "an alteration in how a test is presented to or responded to by the person tested; [it] includes a variety of alterations in presentation format, response format, setting in which the test is taken, timing, or scheduling. The changes are made in order to provide a student equal access to learning and equal opportunity to demonstrate what is known" (p. 1). Further clarification is provided in that the alterations should not substantially change the level of the test, the content of the test, or the performance criteria (what the test measures, i.e., construct validity).

Phillips (1994) argues that a differential effect is needed to help determine whether an accommodation is appropriate or not appropriate: The accommodation is effective for students with disabilities but is not effective for students without disabilities. If changing the test (how it is given and how it is completed) increases performance across the board for all students, then it may not be an appropriate accommodation. She specifically poses five issues to be addressed when considering accommodations that depart from normally used testing procedures:

  1. The measures used within any given eligibility area must be technically adequate (have established reliability and validity). For example, when students are found eligible to receive special education services, we need to document a disability and its adverse educational impact consistently and accurately.
  2. To the greatest degree possible, students should be able to adapt to the standard testing situation, and if any changes are made they should be only minor ones. That is, changes should not be made if they don’t need to be made.
  3. The skill being tested should be the same regardless of any changes made in the way the test is given or taken. Changes in testing must be made that are limited to the removal of "irrelevant sources of difficulty but still measure the same construct" (Thurlow, Ysseldyke, & Silverstein, 1995, p. 264).
  4. The meaning of scores should be the same regardless of any changes being made in the manner in which the test is given or taken. As in the third point above, not only should the skill remain the same, but its meaning and the implications for using the scores to make decisions should not be different with changes in the test administration or response.
  5. The accommodation should not have the potential for benefit for students without disabilities. Any change must, therefore, reflect differential performance among different students: Some students should be affected positively by the change and others should be unaffected.

This approach to determining appropriate accommodations is controversial for two reasons: (a) it is uncertain what the best comparison groups are - all students in general education, just low performers, or just average performers; and (b) it is not clear how students with disabilities should be sorted for comparisons - by disability category or by area in which services are provided (e.g., reading decoding, language processing, etc.). The finding of differential effects implies that there has been differential access when no accommodations are used. Equal access is achieved, for example, when an accommodation improves the performance of a student in that area where the student receives special education assistance, but does not improve the performance of average students not on IEPs.
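Phillips's criterion amounts to looking for an interaction between group membership and testing condition: the change should raise the scores of students with disabilities while leaving the scores of the comparison group essentially unchanged. The sketch below illustrates this "differential boost" contrast with hypothetical scores; the group labels, score values, and sample sizes are assumptions made only for illustration and are not drawn from any study cited here.

```python
# Minimal sketch of Phillips's "differential boost" criterion using
# hypothetical scores; group labels and values are illustrative only.
from statistics import mean

# Scores under standard and accommodated conditions (hypothetical).
scores = {
    ("disability", "standard"):     [12, 15, 11, 14, 13],
    ("disability", "accommodated"): [18, 20, 16, 19, 17],
    ("comparison", "standard"):     [22, 25, 21, 24, 23],
    ("comparison", "accommodated"): [23, 25, 22, 24, 23],
}

def boost(group):
    """Mean gain from the accommodation for one group."""
    return mean(scores[(group, "accommodated")]) - mean(scores[(group, "standard")])

boost_swd = boost("disability")   # gain for students with disabilities
boost_cmp = boost("comparison")   # gain for the comparison group

# The differential boost is the interaction contrast: gain for students
# with disabilities minus gain for the comparison group.
differential = boost_swd - boost_cmp

print(f"Gain with accommodation (disability group): {boost_swd:.1f}")
print(f"Gain with accommodation (comparison group): {boost_cmp:.1f}")
print(f"Differential boost (interaction contrast):  {differential:.1f}")

# Under Phillips's logic, a large positive differential (with a near-zero
# gain for the comparison group) supports treating the change as an
# accommodation rather than a modification of the construct.
```

In a formal analysis this contrast would be tested as a group-by-condition interaction within a two-way design rather than simply inspected, but the quantity of interest is the same.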


A Continuum of Testing Options

A framework can be constructed for placing accommodations on a continuum with the following definition used to create four groups: An accommodation is a change that (a) provides unique and differential access (to performance) so certain students may complete the tests and tasks without other confounding influences but (b) does not change the nature of the construct being tested. Such changes typically are designed for specific individuals and for particular purposes.

 

Figure 1. Accommodations and Modifications

Accommodations & Modifications Graphic

 

On the left side of the continuum (Figure 1 above) are a standard test and an accommodated test in which minor changes are made in the way the test is given or taken; such an accommodated test may still be considered a "standard" assessment that reflects the same construct. The purpose of making accommodations is to provide access to participation in an assessment program with a primary test or instrument. The changes may be implemented within or across special education populations, reflecting the unique needs of students in a manner that may or may not be disability oriented and/or related to an adverse educational impact. Because the construct has not changed, scores can be aggregated.

Changes also can be made for specific students that are substantial and modify the construct being tested. Although the assessment still focuses on documenting performance as part of an assessment program with a primary instrument or test, substantial changes are made in its administration or response requirements. The net effect is that different tasks and noncomparable scores are created compared to those generated in the standard-accommodated assessment. As a result, some type of disaggregated reporting system may need to be considered.

At the most extreme end of the continuum, changes in test administration and responses lead to alternate assessments. For students with severe disabilities and unique needs for whom the primary measure is not appropriate, such alternate assessments may be needed. Rather than documenting performance on the primary instrument or test, one might sample behavior using tasks uniquely created to meet the needs of the student. The tasks and the scores would be noncomparable to those reflected in the primary measures, and a different set of measures as well as disaggregated reports may be needed. These changes in the assessment program are depicted in the figure above as falling on a continuum that quite likely reflects both decreasing numbers of students and increasing amounts of change in moving from left to right.

As can be seen in the table of accommodations provided by NCEO (see Table 1), tests can be administered in a variety of ways (e.g., individually or in a whole class, in one session or in multiple sessions); such changes also can reflect the way in which the test is taken or behavior is sampled (e.g., responding orally for a scribe to write rather than writing). Four general classes of changes in testing practices have been used by various state departments: (a) timing/scheduling, (b) setting, (c) response, and (d) presentation (Thurlow, Ysseldyke, & Silverstein, 1995). Furthermore, responses can be changed through different test formats or with the use of assistive devices. Examples of the more obvious response accommodations include increasing the spacing on the test, using graph paper, using wider lines and/or wider margins, giving the response orally, using paper in an alternative format (word or line processed, Braille, etc.), and allowing the student to mark responses in a booklet instead of on an answer sheet. Likewise, within presentation changes, a distinction is made between changes to the test directions and the use of assistive devices or support changes. These authors also list a number of examples in which the test presentation format is changed by (a) using Braille, magnifying equipment, or large print; (b) signing directions; (c) interpreting directions; and finally, (d) orally reading the directions.

 

Table 1. Assessment Accommodations (NCEO, 1996, Working Draft)

Timing/Scheduling
  • Flexible schedule
  • Allow frequent breaks during testing
  • Extend the time allotted to complete the test
  • Administer the test in several sessions, specify duration
  • Provide special lighting
  • Time of day
  • Administer test over several days, specify duration
  • Provide special acoustics
Setting
  • Administer the test individually in a separate location
  • Administer the test to a small group in a separate location
  • Provide adaptive or special furniture
  • Administer test in locations with minimal distractions
  • In a small group, study carrel, individually



Presentation
  • Braille edition or large-type edition
  • Prompts available on tape
  • Increase spacing between items or reduce items/page-line
  • Increase size of answer bubbles
  • Reading passages with one complete sentence/line
  • Multi-choice, answers follow questions down bubbles to right
  • Omit questions which cannot be revised, prorate credit
  • Teacher helps student understand prompt
  • Student can ask for clarification
  • Computer reads paper to student
  • Highlight key words or phrases in directions

Test Directions

  • Dictation to a proctor/scribe
  • Signing directions to students
  • Read directions to student
  • Reread directions for each page of questions
  • Simplify language in directions
  • Highlight verbs in instructions by underlining
  • Clarify directions
  • Provide cues (e.g. arrows and stop signs) on answer form
  • Provide additional examples

Presentations-Assistive Devices/Supports

  • Visual magnification devices
  • Templates to reduce visible print
  • Auditory amplification device, hearing aid, or noise buffers
  • Audiotaped administration of sections
  • Secure papers to work area with tape/magnets
  • Questions read aloud to student
  • Masks or markers to maintain place
  • Questions signed to pupil
  • Dark heavy or raised lines or pencil grips
  • Assistive devices (please specify)
  • Amanuensis (scribe)
Response

Test Format

  • Increase spacing
  • Wider lines and/or wider margins
  • Graph paper
  • Paper in alternative format (word processed, Braille, etc.)
  • Allow student to mark responses in booklet instead of answer sheet

Responses-Assistive Devices/Supports

  • Word processor
  • Student tapes response for later verbatim transcription
  • Typewriter
  • Communication device
  • Alternative response such as oral, sign, typed, pointing
  • Brailler
  • Large diameter, special grip pencil
  • Copy assistance between drafts
  • Slantboard or wedge
  • Tape recorder
  • Calculator, arithmetic tables, abacus
  • Spelling dictionary or spell check


Although these sets of accommodations appear to make considerable sense in the practical world of large-scale assessment, and many states have adopted some of them along with others (Siskind, 1993b), it is difficult to justify or explain why they are not uniformly adopted: Some states allow use of particular accommodations and other states do not allow them to be used (Thurlow, Scott, & Ysseldyke, 1995; Thurlow, Seyfarth, Scott, & Ysseldyke, 1997).


Models for Making Decisions about Task Comparability

In this paper, three models are presented for determining task comparability, reflecting both the current state of decision making and a strategy for enhancing current practices. The first model is descriptive. It focuses on policy presentation, interpretation, and analysis. This approach is similar to that used in setting standards, in which judgments are made about passing scores. When setting standards, the judgment is about the cut-score; in identifying accommodations, the judgment is about task and response (score) comparability. The second major model for determining task comparability is comparative. This approach provides empirical, post-hoc evaluation information on the implementation of accommodations to help develop or revise policies. The third model is experimental. It provides a more formal data collection system to control threats to validity (internal and external). This model implies control over the selection of participants rather than the use of intact groups, and the assignment of subjects to conditions rather than post-hoc evaluations; both features help establish cause-effect relationships. Two types of experimental designs are available, for studying groups and for studying single cases.

In all three models, the common focus is on determining whether the construct being measured is modified when testing conditions or tasks are changed. Although differences exist among the models in the emphasis given to data to inform the decision-making process, all three models or combinations of them can be used to make decisions on task comparability. Therefore, the models should not be viewed as completely separate. For example, within the policy model, judgments range from simple reference to the policy to a more formal analysis. From a post-hoc evaluation perspective, decisions about accommodations are based on multiple data sources and the relationships among variables, ranging from process-implementation data to student outcome data, though in all instances the data come from within extant structures and procedures. Finally, with an experimental approach, task comparability rests on well-controlled designs for collecting data and making inferences from findings. For both types of experiments, group or single case, threats to internal validity are minimized so that cause-effect statements can be made. When groups of students are studied, the designs call for matching treatments (types of accommodations), students, and outcomes. When single cases are studied, a functional analysis is used to hypothesize and/or verify important distinctions in tasks or student responses.

In Figure 2 below, these models have been placed on a continuum from less to more data-based decisions. On the left, policy and data to inform policy are endlessly intertwined; moving to the right, data are used to justify, explicate, and validate policy, first being created from within policy and then being generated outside of the policy with increasing levels of sophistication and use.

 

Figure 2. Models and Types of Evidence

Models & Types of Evidence Graphic

 

This process for making decisions should help schools fulfill the mandates of the newly reauthorized IDEA legislation. In particular, it should help states ensure inclusion, provide accommodations or modifications in large-scale assessment programs, and report outcomes. Schools now are required to include students with disabilities in all district and state testing programs, as well as to provide appropriate accommodations when necessary; therefore, systematic procedures are needed for classifying accommodations. Furthermore, for students exempted from taking such tests, not only must this decision be explained, but alternate, modified measures need to be provided; these measures should be sensitive to individual student needs. Finally, performance must be reported for students with disabilities, whether they participate in accommodated or modified testing programs. Performance will have to be reported both aggregated with other students and disaggregated. If performance cannot be reported in the aggregate, then it still needs to be disaggregated and reported.

In Figure 2, the standard for making judgments appears as a continuum with an increasing scale of evidence. As states formulate policy, a very uneven mix of evidence may co-exist across the decision areas of inclusion, accommodation-modification, and reporting outcomes. Some states may have a better evidentiary base in some areas than in others. Furthermore, with states ramping up rapidly to meet the legislative mandates, the landscape is changing quickly. Therefore, in the examples noted below, actual state policy may be different now; nevertheless, at the time it was referenced, each policy provides a good example of the six types of evidence within the three models.


Descriptive Model

Three types of evidence are used in a descriptive model. They rely on current policy for decision-making. They range from a simple presentation of policy with no other information, to a justified presentation, and finally to an analysis of policy implementation. All use a common language that informs others through policy, with varying degrees of explanation or justification. Little external information, however, is presented in the policies to ascertain the worth of either the judgments or the policies, hence the term descriptive. External information would include, for example, information from outside the policy itself (e.g., from other states' policies).

Policy Presentation

In this type of evidence, decisions about task comparability are made by referring to policy. Only slight variations exist in the degree of justification for the decisions. At one extreme of the descriptive model is a simple list of allowable accommodations that are assumed to define task comparability.

 

Policy Presentation Example One.

In an August 4, 1997 memorandum on allowable test accommodations from the Nevada Department of Education, the following statement is made about the guidance available for accommodations: "We do not have a system for answering questions. Any questions regarding accommodations are directed to Special Education consultants and our consultant in charge of statewide assessments...We have no resources available to answer questions outside of the enclosed document and the individual consultants." An accompanying appendix outlines the policy for testing exceptional students, including permissible accommodations, which "have been judged as not violating the nature, content, or integrity of the test" (p. D2): test setting, scheduling, test directions, test format, test answer mode, and use of mechanical and non-mechanical aids. Within each of these accommodation categories, several specific applications are presented with a directive that they should not be interpreted broadly.

 

Some states simply present policy to define comparability, with little justification provided. In general, the major guiding premise has been that the accommodation be explicitly listed as allowable in state or district policy or in the test publisher's manual and that it be described on the IEP, therefore making it part of the daily instruction and testing situations. At the other extreme are policy directives with explanatory notes or interpretations. In these cases, the reasoning behind assumed comparability or noncomparability is explained in the policy.

 

Policy Presentation Example Two.

The Hawaii State Test of Essential Competencies (HSTEC) contains a similar policy presentation in its Guidelines and Procedures for Students with Disabilities (Hawaii Department of Education, September, 1996). Although slightly more expansive with justifications, the decision to consider accommodation as changing the nature of the task rests on two principles: (a) "The HSTEC may never be read to a student because of variation provided by different readers" (p. 3), and (b) consideration of whether the skill being tested is changed by the use of an accommodation. For example, "essential competency #1 requires the student read and use printed material from daily life. Because of the demands of the competency, Essential Competency #1 items must always be read independently by each student" (p. 3). Essential Competency #5, Math Computation, requires that students demonstrate their use of computational skills and therefore any items within this competency must be performed without a calculator.

 

Policy Presentation Example Three.

Mississippi includes a list of allowable accommodations organized around seating/setting, scheduling, format, and recording/transferring. In general, appropriate accommodations must (a) not affect the validity of the test, (b) function only to allow the test to measure what it purports to measure, and (c) be narrowly tailored to address a specific need in order to justify the request (Mississippi Assessment System Exclusions and Accommodations, revised, 1995). All three criteria must be met for the accommodation to be allowed. Two of the four tests used in this accountability system, the Iowa Tests of Basic Skills and the Tests of Achievement and Proficiency, allow for very limited accommodations in any of the areas listed. For some specific accommodations, the student's scores are not to be included in summary statistics because the student's results "cannot be interpreted in the same manner" as the results of students who meet the qualifications for test standardization procedures. Whenever unallowable accommodations are utilized, or when any Special Education, 504, or LEP student who has met the criteria for exclusion elects to take the test anyway, the scores are excluded from the summary statistics. While these students' scores are not included in the summary statistics, the policy maintains that the results provide valuable information that should be used as a tool to tailor the student's educational goals.

 

Policy Interpretation

When policy is interpreted, not just documented, explanations are provided that help guide the decision-making process. Although little external information is presented outside of the policy, the judgment process is clearly more transparent and can be internally cross-referenced. The two examples of such systems are from Texas and Maryland. Policy interpretation may anticipate implementation strategies, for example, with a list of allowable accommodations to be distributed to parents, students, and school personnel; training and technical support may be provided then to help educators select appropriate accommodations. Interpretations may help teachers understand which accommodations are appropriate for students: how to base them on the needs of individual students and how to ensure that accommodations have been used during instruction prior to testing.

 

Policy Interpretation Example One.

In Texas, a list of accommodations is presented in a test manual, followed by a review of written comments from stakeholders suggesting the need for clearer guidance in the manual on what accommodations are allowable; it was reported that accommodations were interpreted differently across the state. At one end was the argument for individualized decisions about the use of specific accommodations; at the other end was the argument that all accommodations and modifications used in instruction should be allowed. The following revised policy on test accommodations was proposed: Disseminate widely a comprehensive list of allowable test modifications and train educators to use them, and continue to allow schools or districts to request additional accommodations. The justification offered was that current terminology regarding the use of accommodations that do not affect the validity of the assessment may lead to varying interpretations of what is and is not allowed. Some stakeholders asserted that teachers or administrators might be unwilling to provide accommodations currently allowed by the agency because there is little evidence available concerning the effects of accommodations on test validity (Thurlow, Ysseldyke, & Silverstein, 1995). The stakeholders also suggested that other teachers or administrators might be unaware of what is and is not permissible. This proposed change directly ties assessment accommodation to the IEP and to classroom instruction and testing. It will promote testing situations that reflect classroom practice and will preserve the ARD committee's primary role in decisions about appropriate accommodations. Also, by providing additional detail in the list of accommodations permitted on assessments, this policy will increase the participation of students receiving special education services in the assessments, and those students will have a better chance to demonstrate what they know and can do (p. 18).

 

Policy Interpretation Example Two.

In Maryland's Guidelines for Accommodations, Excuses, and Exemptions (revised, 9/3/96), a listing of general principles is provided, as well as definitions and procedures for the Maryland Functional Testing Program (MFTP), the Comprehensive Tests of Basic Skills (CTBS, 5th edition), and the Maryland School Performance Assessment Program (MSPAP). The general principles include (p. 2):

  • "accommodations are made to ensure valid assessment of a student’s real achievement."

  • "accommodations must not invalidate the assessment for which they are granted...accommodations must be based upon individual needs and not upon a category of disability, level of instruction, environment, or other group characteristics."

  • "accommodation must have been operational in the student’s ongoing instructional program and in all assessment activities during the school year; they may not be introduced for the first time in the testing of an individual."

  • "the decision of the validity or efficacy of not allowing an accommodation for testing purposes does not imply that the accommodation cannot be used for instructional purposes."

A summary is presented of five broad categories of accommodation: scheduling, setting, equipment, presentation, and response, each of which includes a list of specific accommodations permitted for all three types of tests. In general, few differences exist for the three types of tests, though interesting inferences can be deduced where allowable accommodations are differentiated. For example, calculators can be used with the Functional Testing Program, cannot be used with the CTBS/5, and can be used but invalidate the mathematics score with the MSPAP. A similar pattern exists for the use of electronic devices (allowed with the MFTP, not allowed with the CTBS/5, and invalidating the language usage score with the MSPAP).

Finally, in explicating the decision-making process, a series of case studies is presented in which students are described with information on student background and classroom functioning. For example, for Student #3 from an elementary school, a calculator is recommended as an allowable accommodation: He is described as a student with learning disabilities, memory problems with no mastery of facts, and an Individualized Educational Plan (IEP) addressing goals and objectives in reading, mathematics, and written expression. No statement is made about participation in statewide assessments. For middle school Student #4, who is identified with learning disabilities and an IEP addressing written communication and basic reading skills, participation is required in all classroom, system, and state testing programs. Furthermore, dictation of a response is allowed (with verbatim transcription by school personnel) for extended response tasks. In contrast, another middle school student (#7) is described with similar needs. Although it is recommended that the student participate in all classroom, system, and functional tests with similar accommodations (dictating the response to an examiner for verbatim transcription), the student is exempted from the reading, written language, and oral presentation portions of the MSPAP "due to required accommodations that invalidate the test."

 

The use of case studies to explicate the decision-making process provides an interesting range of examples by focusing on minimal differences (the slight nuances that distinguish two examples from each other, one reflecting a positive instance and the other a negative instance). This strategy is an efficient way to explicate the central feature of a concept (i.e., accommodation versus modification) because all other (irrelevant) features of the concept can be ignored except those that are minimally different.

 

Policy Implementation Analysis

At some level, task comparability may be considered by tracking the implementation of various accommodations. In conducting this kind of analysis, data are collected to understand factors associated with the use of accommodations. The focus of this perspective is to understand the degree to which accommodations are proposed and/or used within the context of other (student) mitigating factors so that one could begin to explain why some accommodations tend to be selected.

Policy Implementation Example One.

In Rhode Island, the following questions focus on student needs with respect to the assessment requirements. They represent questions the IEP team should be able to answer to identify needed accommodations:

  1. Can the student work independently?
  2. Can the student work in a room with 25 to 30 other students in a quiet setting?
  3. Can the student work continuously for 45-60 minutes?
  4. Can the student listen and follow oral directions?
  5. Can the student use paper and pencil to write paragraph length responses to open-ended questions?
  6. Based on the sample questions, can the student read and understand these questions?
  7. Can the student manipulate a tag board ruler and various tag board shapes in small sizes?
  8. Can the student operate a calculator?
  9. Can the student follow oral directions in English?
  10. Can the student write paragraph length responses to open-ended questions in English?
  11. Based on the sample questions, can the student read and understand these questions in English?

A "No" answer to any of these questions for one or more students should result in the identification of an appropriate accommodation. It is then recommended that the principal and/or relevant school and district staff be consulted on making any accommodation recommendations.

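One way to operationalize such a checklist is as a simple decision aid that flags candidate accommodation categories whenever the IEP team answers "no" to a question. The sketch below is a minimal illustration only; the question summaries and the mapping of questions to accommodation categories are assumptions for the example and are not taken from Rhode Island policy.

```python
# Illustrative sketch of turning the IEP-team checklist into candidate
# accommodation categories. The question-to-category mapping is assumed
# for illustration and is not taken from Rhode Island policy.

# Each entry: (question summary, accommodation category suggested by a "no").
CHECKLIST = [
    ("works independently",                    "setting (individual administration)"),
    ("works in a room of 25-30 students",      "setting (small group or separate location)"),
    ("works continuously for 45-60 minutes",   "timing/scheduling (breaks, multiple sessions)"),
    ("listens to and follows oral directions", "presentation (repeated or simplified directions)"),
    ("writes paragraph-length responses",      "response (scribe, word processor, dictation)"),
    ("reads and understands sample questions", "presentation (read-aloud where allowable)"),
    ("manipulates small tag board tools",      "equipment/response (adapted manipulatives)"),
    ("operates a calculator",                  "equipment (calculator support, if allowable)"),
]

def candidate_accommodations(answers):
    """Return accommodation categories for every question answered 'no'.

    `answers` is a list of booleans, one per checklist question,
    True meaning the student can do the task without support.
    """
    return [category for (_, category), ok in zip(CHECKLIST, answers) if not ok]

# Example: a student who cannot work continuously or write long responses.
answers = [True, True, False, True, False, True, True, True]
for category in candidate_accommodations(answers):
    print("Consider:", category)
```

In practice such a mapping would only organize the IEP team's discussion; the actual selection of accommodations would still follow state policy and the student's IEP.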

By analyzing the degree to which students arrive at testing situations with accompanying problems such as those noted above, it should be possible to understand why some accommodations are recommended while others are not recommended. For example, rather than simply noting that accommodations include separate testing with flexible scheduling, it would be possible to justify an accommodation based on the student’s capacity to work in a group, work independently, and remain on-task for 45-60 minutes. For students who cannot be engaged in this manner, it can be inferred that the performance is adversely affected and the test is not measuring what it should be measuring. This information, documented by the IEP team, would help teachers make decisions about accommodations.

In reporting accommodations data, meaning and the basis for understanding are provided in the relationships among the variables. For example, the same data reported for two groups of students may lead to an understanding of why an accommodation is appropriate or permitted some of the time for some students and not at other times for different students. In the data reported by CCSSO, considerable differences exist in the number of states allowing various accommodations for students with disabilities versus those with limited English proficiency (see Table 2, which reflects accommodations permitted by various states, ordered from most to least frequent for students with disabilities).

 

Policy Implementation Example Two.

The Council of Chief State School Officers conducts an annual survey of state testing practices. In their latest published report, the following data have been presented on the number of states permitting various types of accommodation for students with disabilities and students with limited English proficiency (LEP). These data have been reprinted in the NCES document by Olson and Goldstein (1997).

Table 2. Number of States that Permit Accommodation for Students

Type of Accommodation                 With Disabilities    LEP
Large Print                                   34            10
Braille or Sign Language                      33             8
Small Group Administration                    33            15
Flexible Scheduling                           33            15
Separate Testing Session                      31            17
Extra Time                                    30            14
Audiotaped Instructions/Questions             27             9
Multiple/Extra Testing Sessions               25             9
Word Processor                                21             8
Simplification of Directions                  15            11
Audiotaped Responses                          12             4
Other Accommodation                           12            10
Use of Dictionaries                            9             9
Alternative Test                               6             3
Other Languages                                2             4

SOURCE: The Status of State Student Assessment Programs in the United States (CCSSO/NCREL 1996).

 

Further distinctions can be made comparing types of decisions (individual accountability versus group evaluation) or types of test instruments (published, norm-referenced versus state-specific, standards-based) to help inform policy about implementation, particularly the implications. In effect, task comparability is likely a function of the students being tested, the decision being made, and the type of test being used.

Clearly, policy implementation analysis begins to move task comparability judgments into a comparative realm, with data collected at various levels of the testing program. In the example cited earlier (see Policy Implementation Example Two), the primary information was the number of states permitting an accommodation for various types of students. Another strategy would be to collect data on the actual frequency of use, as done in Missouri (see Policy Implementation Example Three).

 

Policy Implementation Example Three.

The following data are reported in the 1996 Missouri Mathematics Field Test: Accommodations and Frequency of Use for 5th grade, presented in April 1997 at the State Collaborative on Assessment and Student Standards (SCASS) meeting. Out of 8,700 students tested, administration of the test was changed with (a) oral reading for 212 students, (b) other unknown accommodations for 96 students, (c) repeated directions for 51 students, (d) oral translation for 12 students, and (e) amplification equipment for 4 students. Various other accommodations (e.g., Braille, signing, etc.) were implemented for a few students. Finally, accommodations in the timing of the test also were implemented with varying frequencies, as noted in Table 3 below.

Table 3. Missouri Data on Timing Accommodation Frequencies

TIMING                        Freq.    Percent    Cum. Freq.    Cum. Percent
Total                         8,151       89.8         8,151            89.8
Extended Time                    98        1.1         8,249            90.9
More Frequent Breaks              6        0.1         8,255            90.9
Several Sessions                 28        0.3         8,283            91.2
Different Days                  795        8.8         9,078           100.0

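The percent and cumulative columns in Table 3 follow directly from the raw frequencies. The short sketch below recomputes them, assuming the original percentages were based on the 9,078 total shown in the table and rounded to one decimal place; the label of the first row is reproduced as it appears in the source.

```python
# Recompute the percent and cumulative columns of Table 3 from the raw
# frequencies reported for the Missouri field test.
rows = [
    ("Total",                8151),   # first row reproduced as labeled in the source
    ("Extended Time",          98),
    ("More Frequent Breaks",    6),
    ("Several Sessions",       28),
    ("Different Days",        795),
]

grand_total = sum(freq for _, freq in rows)  # 9,078 students
running = 0
print(f"{'TIMING':<24}{'Freq.':>8}{'Pct.':>8}{'Cum. Freq.':>12}{'Cum. Pct.':>11}")
for label, freq in rows:
    running += freq
    pct = 100 * freq / grand_total
    cum_pct = 100 * running / grand_total
    print(f"{label:<24}{freq:>8}{pct:>8.1f}{running:>12}{cum_pct:>11.1f}")
```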
 

Other perception and survey data also could have been collected. For example, how important are accommodations for helping students participate and succeed (as judged by students, parents, and teachers), and to what degree are there differences among various groups of people? How much better did certain students do when provided accommodations than comparable students who were not provided the same or similar accommodations? When questions like these are asked, judging comparability begins to move toward a comparative model involving post-hoc analyses. The shift is from collecting singular data within systems and making inferences across them to collecting multiple data within a system and correlating the information. In this shift, policy analysis moves to post-hoc evaluation.

 

Comparative Model

The comparative model shifts the definition of task comparability to include multiple data sources within a system. This model incorporates frequency data on use of accommodations, judgments about their appropriateness, and outcomes data on performance when accommodations are used. With these data sources, relational statements can be made about process and performance.

 

Comparative Model Example One.

In a study by Grise, Beattie, and Algozzine (1982), about 350 students in fifth grade took the Florida State Student Assessment Test with seven different changes made in the format of the test: (a) items were presented within a hierarchical progression of skills, (b) multiple-choice options had bubbles placed to the right, (c) the shape of the bubble was elliptical, (d) sentences were not broken to make a fill-justified paragraph, (e) reading passages were placed in shaded boxes, (f) examples were included for each skill, and (g) directional symbols were used for continuing and stopping; finally, this test also was enlarged by 30%.

They found that students with learning disabilities performed slightly higher on the regular print version (vs. the enlarged version) on only one of six subsections. Yet, 20% to 30% more of the students who were administered the modified version performed at mastery levels in various subsections of the test.

In a comparable study using the same modifications, Beattie, Grise, and Algozzine (1983) investigated the effects for a third grade sample of students (n= 345). Again, they found few differences on most subsections when comparing performance on the regular print version versus the enlarged print version. And, as in the other study, more students with learning disabilities mastered most of the skills when taking the modified test; on many skills, 20% more students reached mastery levels when the modified version was used than when taking the test under the standard conditions.

 

As the next example illustrates, one of the limitations of post-hoc evaluations is the sample used, in particular how representative it is of other samples. Furthermore, little information is known about how the accommodation was related to previous instructional programs and other sources of influence that may have been operational at the time of testing. Therefore, it is difficult to determine the exact cause of performance differences with and without accommodations.

 

Comparative Model Example Two.

Several studies on test accommodations were conducted by Educational Testing Service (ETS) on the Graduate Record Examination (GRE) and the Scholastic Aptitude Test (SAT) (Willingham, Ragosta, Braun, Rock, & Powers, 1988). They analyzed test scores of students who took the tests over several years, examining the scores of those with and without disabilities to compare the effects when accommodations were used versus when they were not used. The accommodations included alternative test formats (modifying the presentation by using Braille or audio presentations), assistive devices, and separate locations. While they considered task comparability (test content, testing accommodations, and test timing), their primary concern was score comparability, using indicators like reliability, factor structure, differential item functioning, prediction of performance, and admissions decisions.

In general, they found that between the standard and nonstandard (accommodated) administrations, there was (a) comparable reliability (Bennett, Rock, & Jirele, 1986; Bennett, Rock, & Kaplan, 1985, 1987); (b) similar factor structures (Rock, Bennett, & Kaplan, 1987); (c) similar item difficulties for examinees with and without disabilities (Bennett, Rock, & Kaplan, 1985, 1987); (d) noncomparable predictions of academic performance (with the nonstandard test scores less valid and test scores substantially underpredicting college grades for students with hearing impairments) (Braun, Ragosta, & Kaplan, 1986); and (e) comparable admissions decisions (Benderson, 1988). Furthermore, Willingham et al. (1988) found that although students with disabilities perceived the test to be harder, they performed comparably to peers without disabilities. They also found that college performance was overpredicted when extended time was allowed.

In the end, these researchers recommended that those analyzing any test results "(a) use multiple criteria to predict academic performance of disabled students, (b) give less weight to traditional predictors and more consideration to students' background and nonscholastic achievement, (c) avoid score composites, (d) avoid the erroneous belief that nonstandard scores are systematically either inflated or deflated, and (e) where feasible and appropriate, report scores in the same manner as those obtained from standard administrations" (ETS, 1990, Executive Summary Report).

 

Although these outcomes provide the field with a rich data source for considering accommodations, they represent post-hoc evaluation data that are confounded with several other variables. For example, the research is limited to college admissions testing, which involves a limited group of students with disabilities (e.g., those who are both secondary students and college bound). The proportions of students with disabilities who participate in such tests are very small and may not be representative of the larger group. And, the tests are unlike those used in statewide accountability systems.

 

Comparative Model Example Three.

Koretz (1996) analyzed student outcomes in an assessment program that categorized accommodations into four major classes: (a) dictation, (b) oral reading, (c) rephrasing, and (d) cueing. He examined data on these accommodations singly and in combination, along with actual test performance in grades 4 and 8. He also provided a detailed analysis of the population of students with disabilities, including information on (a) the degree to which participation in the testing program was inclusive, and (b) the comparability of the population receiving accommodations to both a national sample and to others in the testing program who did not receive any accommodations. Based on the frequency of accommodation use, several comparisons were made to determine the effect of single and multiple accommodations. Specifically, three major analyses were used: (a) comparisons of those with disabilities receiving the accommodation to those in general education not receiving the accommodation, (b) predictions of accommodation influence on outcome when applied singly, and (c) performance on specific items and differential item functioning when receiving accommodations.

He found that when fourth grade students with mild retardation were provided dictation with other accommodations, they performed much closer to the mean of the general education population, and actually above the mean in science. Similar results occurred for students with learning disabilities. For students in grade 8, the results were similar but less dramatic. In a second analysis using multiple regression to obtain an optimal estimate of each single accommodation and then comparing predicted performance with the accommodation to that without the accommodation, dictation appeared to have the strongest effect across the subject areas of math, reading, and science, as well as across grade levels. This influence was significantly stronger than that attained for paraphrasing and oral presentation, respectively.
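The regression step described above can be illustrated with ordinary least squares on indicator (0/1) variables, one per accommodation class, where each fitted coefficient estimates the change in score associated with that single accommodation when the others are held constant. The sketch below uses fabricated data solely to show the mechanics; the effect sizes, sample size, and error assumptions are illustrative and do not reproduce Koretz's analysis.

```python
# Minimal sketch of estimating single-accommodation effects with least
# squares on indicator variables; the data are fabricated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Indicator variables: 1 if the student received the accommodation.
dictation    = rng.integers(0, 2, n)
oral_reading = rng.integers(0, 2, n)
paraphrasing = rng.integers(0, 2, n)
cueing       = rng.integers(0, 2, n)

# Fabricated scores with an assumed strong dictation effect.
score = (50 + 8 * dictation + 3 * oral_reading + 1 * paraphrasing
         + 0.5 * cueing + rng.normal(0, 5, n))

# Design matrix: intercept plus the four indicators.
X = np.column_stack([np.ones(n), dictation, oral_reading, paraphrasing, cueing])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

for name, b in zip(["intercept", "dictation", "oral reading", "paraphrasing", "cueing"], coef):
    print(f"{name:<13} estimated effect: {b:6.2f}")

# Predicted performance with and without dictation (other indicators at 0)
# parallels the comparison of predicted scores described above.
print(f"Predicted score with dictation:    {coef[0] + coef[1]:.1f}")
print(f"Predicted score without dictation: {coef[0]:.1f}")
```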

Finally, for the item level analyses, descriptive statistics, item-to-total test correlations, and differential item functioning were used to explain why some accommodations may have worked better than others. Students with disabilities performed more poorly on common items than did students in general education. These common items, however, were found to correlate consistently and highly with the total test, reflecting their adequacy as measures of performance. Finally, performance on test items correctly predicted group membership (receiving or not receiving accommodations). For those receiving no accommodations, statistically significant differences in correct performance appeared on 5 of 22 items, while for those receiving accommodations, significant differences in correct performance appeared for 13 of 22 items (an equal number were more difficult and easier).

Koretz asserted that the frequency of accommodation use was high (80% in 4th grade and 67% in 8th grade). Furthermore, he suggested that the four accommodations were biased in that students with disabilities scored comparably to those without disabilities: "The highest scoring group of mentally retarded students (those assessed with oral presentation, paraphrasing, and dictation) scored near the mean for nondisabled students in all subjects other than mathematics - an implausible result given that these students have generalized cognitive defects" (Koretz, 1996, p. 64).

 

As can be seen with these three examples, a post-hoc evaluation perspective provides considerable data on both process and outcome. Generally, the results can be organized so that the findings may be somewhat conclusive, though the explanations for them are not as certain. The major problem with this approach for examining task comparability is that cause-effect relationships are difficult to establish using post-hoc evaluation data. For example, by using intact groups, many threats to internal validity exist concerning selection of subjects, as well as the mortality of subjects (those who drop out or appear with incomplete data), historical events that occur during the study period, and subject maturation (physical or psychological changes). Furthermore, subject selection may interact with any of these latter threats. For example, in many school systems, differential dropout rates occur for students from various ethnic backgrounds, reflecting a problem in which subject selection interacts with mortality. Another major limitation is the manner in which subjects are assigned to treatments (accommodations): The assignment process often is not clear, and the representativeness of the subjects is uncertain. Finally, in most evaluations, no control or comparison groups are used to determine the differential effect of a treatment (accommodation). For example, while the two studies from Florida (Beattie, Grise, & Algozzine, 1983; Grise, Beattie, & Algozzine, 1982) provide important initial findings about the comparability of tasks, no general education students received the modified tests.

 

Experimental Model

In the experimental model, a research design is established before data are collected. Appropriate controls are implemented in the manner in which data are collected to ensure that conclusions have integrity. Two components to an experimental model include (a) a research design for organizing the investigation and controlling threats to the validity of the findings, and (b) a technically adequate measurement system to scale behavior so that inferences can be made from the outcomes.

Either group designs or single-subject designs can be used within the experimental model. Both designs should address four specific threats to validity identified by Cook & Campbell (1979): (a) internal, to allow appropriate inferences about cause and effect; (b) statistical conclusion, to ensure that data are appropriately analyzed; (c) external, to consider other populations and settings for applying the findings; and (d) construct, to explain the theoretical network in which the findings are placed. Group designs for assessing the effects of accommodations typically involve comparing students with disabilities using and not using accommodations to either themselves or another group of students. Single subject designs most often use the same student, comparing performance when an accommodation is used to performance when the accommodation is not used.

 

Group Designs

Three factors need to be considered in conducting a group design: (a) students, (b) treatments (accommodations), and (c) outcomes (measures). These three factors may be either crossed with each other (all factors are presented with each other) or nested (only some factors are presented with each other). Below is a study that depicts each of these relationships for two factors: students and treatment accommodations.

 

Group Designs Example One.

In a study of a response accommodation (marking the booklet versus using the standard bubble sheet) and an administration accommodation (reading the math test aloud versus the standard silently read administration), Tindal et al. randomly assigned students to various conditions. All students took the test with and without the response accommodation (crossed), but participated in only one of the two administration conditions (nested). While no differences were found between the two response conditions, both statistical and practical differences were found between the two administration conditions favoring the read-aloud accommodation (Tindal, Heath, Hollenbeck, Harniss, & Almond, in press).

The reason for using these two design features (with students taking part in both response conditions while participating in only one administration condition) was to control for two critical threats to internal validity. To be certain students did not perform better in one response condition than the other simply because it was provided first, the order of responding was counterbalanced: Half the group marked the booklet first and half bubbled the answer sheet first. Yet, the investigators also assumed that potential drift might occur in the administration accommodation (i.e., contamination from exposure to the accommodation) and, therefore, one of the factors (subjects) was nested within another factor (treatment): Students were randomly assigned to participate in one or the other administration condition. Such controls typically are not present in post-hoc program evaluations.
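The assignment logic just described can be summarized in a brief sketch: the response factor is crossed (every student experiences both response formats, in counterbalanced order), whereas the administration factor is nested (each student is randomly assigned to only one administration condition). The student identifiers and group sizes below are hypothetical and do not come from the cited study.

```python
# Sketch of the crossed/nested assignment described above, using
# hypothetical student IDs; not the procedure from the cited study.
import random

random.seed(1)
students = [f"S{i:02d}" for i in range(1, 21)]
random.shuffle(students)

# Nested factor: each student is randomly assigned to ONE administration
# condition (standard silent reading vs. read-aloud).
half = len(students) // 2
administration = {s: "read-aloud" for s in students[:half]}
administration.update({s: "standard" for s in students[half:]})

# Crossed factor: EVERY student completes both response conditions
# (mark the booklet, bubble the answer sheet), with order counterbalanced
# so half the students experience each condition first.
orders = [("booklet", "bubble sheet"), ("bubble sheet", "booklet")]
response_order = {s: orders[i % 2] for i, s in enumerate(sorted(students))}

for s in sorted(students):
    first, second = response_order[s]
    print(f"{s}: administration={administration[s]:<10} response order={first} then {second}")
```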

Because the study was done in fourth grade, however, generalization of the findings (external validity) may be limited to students who are learning to read versus reading to learn. Therefore, another read-aloud study was conducted. This second investigation used a videotaped read-aloud of math problems, in part to determine whether reading math problems aloud helped older students and to ensure that the read-aloud condition was administered consistently. In this study, 6th grade students were presented 30 math word problems in standard test booklet format and 30 problems read aloud by a trained reader (Helwig, Tedesco, & Tindal, in press). These researchers found that for math problems containing many difficult words and a large total number of words, students with low reading skill and high math skill performed significantly better when the problems were read aloud. For other students, the differences were either nonsignificant or in the opposite direction.

 

With an experimental model, not only must a research design be used to investigate the relationship among factors (students, treatments, and measures), but a valid measurement system must be used. Validation, however, is not of the tests or measures but of the inferences made from them (Messick, 1989). Messick employs a two-by-two matrix for understanding the construct validity of score meaning and social values. One facet is the source of justification for testing, which may consider evidence to understand score meaning or values to understand consequences. The other facet is the function or outcome of testing, which addresses how the test is to be interpreted or used. Crossing these two facets creates a four-celled table highlighting a unitary view of validity that integrates meaning and values. The matrix is designed to build understanding in a progressive manner so that any construct being measured is neither underrepresented nor contaminated with construct-irrelevant variance.

  1. The evidential basis of test interpretation reflects construct validity (CV) in addressing convergent-discriminant evidence; the focus of interpretation is primarily scientific and empirical.
  2. The evidential basis of test use focuses on the construct validity (CV) of performance in applied settings with the benefits of testing considered in relation to costs and relevance/utility (R/U).
  3. The consequential basis of test interpretation is comprised of construct validity (CV) with reference to broad theories and philosophical views, all of which address value implications (VI) and become embedded within score meaning. This block often "triggers" score-based actions.
  4. The consequential basis of test use considers construct validity (CV), relevance/utility (R/U), value implications (VI), and potential as well as actual social consequences (SC) in applied settings, focusing on equity and fairness along with many other broad social interpretations, in a sense the functional worth of the test.

 

Figure 3. Facets of Validity as a Progressive Matrix

 

                          Test Interpretation                 Test Use

  Evidential Basis        1. Construct Validity (CV)          2. CV + Relevance/Utility (R/U)

  Consequential Basis     3. CV + Value Implications (VI)     4. CV + R/U + VI + Social Consequences (SC)

The validation process in this progressive matrix is never definitive but proceeds in an iterative manner, with evidence interacting with values to create score meaning and interpretations. The validation process combines evidence and values within an evaluative argument. Finally, validation is interpretative, addressing both the meaning of scores and their use in making decisions in applied settings (Messick, 1995).

 

Group Designs Example Two.

In this set of studies, a creative writing task was used that incorporated the statewide administration and scoring procedures with both handwritten and word-processed compositions. A total of 164 8th graders (152 general education, 12 special education) participated in the study. All students took part in two writing examinations. The first involved students composing handwritten essays over 3 days as part of the Oregon State Writing Assessment. The second occurred approximately 3 months later under identical conditions, except students composed their essays on a word processor. All essays were scored by certified state raters using the state's six-trait analytic writing scale.

  1. CV: Handwritten and word-processed essays were rated on six traits. A correlation matrix reveals stronger relationships between traits within a mode (handwritten or word-processed) than between modes within traits (i.e., Ideas and Content, Voice, etc.). Factor analysis shows (a) a handwritten factor with six traits and (b) a word-processed factor with the same six traits. The evidential basis for test interpretation is that separate trait scoring is not needed when students write their compositions by hand - they form one factor. The same is true when students write their compositions with computers - a single factor is found. However, these two factors are different from each other. One explanation is that the tasks (writing by hand versus writing with a computer) are not comparable (Helwig, Stieber, Tindal, Hollenbeck, Heath, & Almond, 1997). A sketch following this list illustrates the logic of this within-mode versus cross-mode comparison.
  2. CV+R/U: When the original handwritten compositions are transcribed into a typed (word-processed) format and then rated by the state judges, the handwritten composition is rated significantly higher than the typed composition on four of the six traits. The evidential basis of test use implies that the two tasks (writing compositions by hand and writing them with computers) are not comparable and should not be used in the same evaluation system. The rating differences likely reflect the judges' sensitivity to format, and until this confound can be removed from the judgment process, instructional programs based on the outcomes would be incorrectly recommended. For example, it may be presumed that a student not passing on Conventions, as judged on a word-processed composition, needs more instruction on this trait when, in fact, the handwritten composition reflects a passing score (Tindal, Hollenbeck, Heath, & Stieber, 1997).
  3. CV+VI: When students are allowed to use computers differentially, the judged quality of some traits appears to vary. For example, when students are allowed to use a spell checker while entering the composition into the computer, ratings on Organization, Word Choice, Sentence Fluency, and Conventions are significantly higher than those of students who had no spell checker available during three days of writing with a computer. The consequential basis for test interpretation is that the theory of writing itself may need to be reconsidered. Presently, writing is generally viewed as a linear process in which brainstorming, writing, and editing are built into the test administration. Yet, with findings such as these, the process may be much more recursive, with editing and writing co-occurring in a non-linear fashion. Furthermore, writing tools may help with different functions in the process, with spell checking not only addressing problems in Conventions but also in word usage (Voice and Fluency) (Tindal, Hollenbeck, Heath, & Almond, 1997).
  4. CV+R/U+VI+SC: When the analysis of judgment reliability addresses not just score agreement (using both exact matches and adjacent agreements) but the consistency of decisions about passing a standard, considerable discord is found. In fact, when disagreements occur, they are most likely to occur at the cut score rather than at the extremes, ranging from 15% to 34%. The implication for test use from a consequential basis is that many students may be incorrectly denied a certificate of passing and be required to participate in alternative learning environments. Furthermore, curricular changes may be implemented systemically, further limiting the options available in the language arts areas to primarily remediation rather than enrichment. As a consequence, students who exceed the cut score may not have enrichment courses available as schools focus their resources on students who fail to pass at a minimum level (Hollenbeck, Tindal, Heath, & Almond, 1997).
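As noted in item 1 above, the factor-analytic argument rests on within-mode trait correlations exceeding cross-mode correlations for the same traits. The following minimal sketch uses simulated ratings (not the study's data) to show that comparison; the trait names and the assumed two-factor structure are illustrative only.

```python
# A minimal sketch (simulated ratings) of the within-mode versus cross-mode
# correlation comparison described in item 1 above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 150
traits = ["Ideas", "Organization", "Voice", "WordChoice", "Fluency", "Conventions"]

# Simulate a general handwriting factor and a separate word-processing factor.
hand_factor = rng.normal(size=n)
wp_factor = rng.normal(size=n)

ratings = {}
for t in traits:
    ratings[f"hand_{t}"] = 4 + hand_factor + rng.normal(scale=0.7, size=n)
    ratings[f"wp_{t}"] = 4 + wp_factor + rng.normal(scale=0.7, size=n)
scores = pd.DataFrame(ratings)

corr = scores.corr()
within_hand = corr.loc[[f"hand_{t}" for t in traits], [f"hand_{t}" for t in traits]]
cross_mode = corr.loc[[f"hand_{t}" for t in traits], [f"wp_{t}" for t in traits]]

# If the two modes form separate factors, within-mode correlations exceed
# cross-mode correlations for the same traits.
print("mean within-mode r:", within_hand.values[np.triu_indices(6, k=1)].mean().round(2))
print("mean cross-mode r (same trait):", np.diag(cross_mode.values).mean().round(2))
```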

 

 

In summary, group design experimental studies need to rely not only on systematic and prospective data collection procedures but also on technically adequate measurement systems. By engaging in a prospective design, explanations of outcomes are less confounded and contain fewer threats to validity. Furthermore, experimental investigations with groups of students need to reflect a program of research in which studies are linked together not just to generate findings but also to inform decision-making. Tests need to be used and interpreted with classroom implications; they need to be validated in the context of both evidence and consequences in this interpretation and use. In the final analysis, however, group designs cannot be used to make predictions for individuals. Although the average (mean) performance may have been higher or lower in either of these groups, not all students may have been equally affected.

 

Single Case Studies

For individual decisions, only a behavioral approach, utilizing single case studies, can be considered. When the experimental research is aimed at judging task comparability for individual students, the research paradigm is anchored to the unique needs of the student and maximizes the relationship between the process and the outcomes. Using group designs, an accommodation may be found to be effective for students with disabilities and not for students without disabilities. Nevertheless, this general finding cannot be used to make predictions of effect or lack of effect for every specific student. All effects are based on group averages and reflect only a likelihood of finding that effect with any individual student.

Functional analysis focuses on the function of various environmental variables in controlling behavior within the context of reinforcement contingencies. In a functional analysis, the single case is used to define the comparability of a task or response. Generalizations across individual cases can proceed only tentatively, because the hallmark of a functional analysis is its focus on the individual case. When the function of behavior is experimentally analyzed, the proof of comparability is documented consistently within and across phases in which tasks have been changed. Task comparability can be considered from two perspectives within a functional analysis: (a) defining a response class or (b) inferring the function of the behavior and response class.

Within a functional analysis approach, contingencies are considered in terms of specific reinforcement paradigms (i.e., positive and negative reinforcement, escape, avoidance, etc.). Basically, a class of responses has a common effect on the environment (the definition of an operant); the analysis therefore is premised upon the assessment of environmental variables that influence the behavior or class of behaviors. Two caveats need to be considered in this perspective: (a) reinforcers are defined in terms of the probability of increasing or decreasing a behavior, and (b) a history of reinforcement "in turn, influences how an individual responds to current environmental contingencies" (Mace, 1994, p. 385); the same "reinforcers" may not operate uniformly across individuals.

Response class analysis. In determining task comparability, it is important to entertain the possibility that different specific behaviors may function as part of the same response classes. This provides a useful way to think about slightly different behaviors that are maintained in a similar manner in the environment. Response classes are used in behavior analysis to describe discrete behaviors that (a) may have different topographies but serve the same function (i.e., are controlled by the same reinforcement paradigm) or (b) may have similar topographies but serve different functions (are controlled by different reinforcement paradigms). Individual behaviors may belong to more than one response class (i.e., may have multiple controlling contingencies).

Response classes that comprise social behaviors are easy to describe. For example, many behaviors reflect "compliance": attending, making eye contact, initiating behaviors upon request, responding with speed and accuracy, and so on. All of these have the same function: They reflect a connection between the "mand" (command or demand) and the response. Students who "comply" when interacting with others perform upon request and with attention to the task; those who do not "comply" delay their response or fail to fully perform it.

 

Single Case Studies Example One.

How does this apply to testing students with disabilities in large scale assessments? Consider a student with Attention Deficit-Hyperactivity Disorder (ADHD). A number of behaviors may be exhibited in the presence of academic tasks of considerable difficulty (worksheets, tests, assignments, etc.). When presented with a paper-pencil task and given directions to read and complete problems independently, the student may exhibit many different behaviors, all of which result in attention from the teacher and eventually removal of the task: fidgeting, playing with pencils (tapping, chewing, throwing), talking out, making distracting noises, etc. Although these behaviors differ in topography, duration, frequency, or intensity, they all have the same function of providing the student escape from an aversive stimulus. When an IEP team meets, it may decide that the student needs to take the test in a one-to-one situation, when in fact the contingencies maintaining the behavior are embedded in the classroom itself. Therefore, a behavior management program also may be needed in the classroom to extinguish these problem behaviors.

 

Examples of response classes in academic situations are less common. In this kind of functional analysis, the focus needs to be on the manner in which the student generates a response that serves the same function as other behaviors. For example, for a student with a physical disability, a paper-pencil task may be impossible because of a lack of limbs or digits. If this student can read the problems and choices on a multiple-choice test but cannot fill in the bubbles because of this physical disability, the test may include an accommodation in which the student presses a keypad on a communication board or signals the correct response verbally. Few people would have trouble considering this response comparable to the physical act of filling in a bubble on a response sheet.

But what if the test is designed to measure a broad construct with a number of possible responses used to provide a solution, as is the case with many performance tasks? Then, it is critical to make decisions about task comparability in relation to the outcomes, not the indicators. By distinguishing between outcomes and indicators, it is possible to consider tasks as comparable at one level (as a reflection of an outcome) when obvious differences exist in their topography (as reflections of different indicators).

 

Single Case Studies Example Two.

The Oregon Certificate of Initial Mastery (CIM) provides an excellent example of a response class in which indicators and outcomes have been properly distinguished. This assessment system requires students to participate in three forms of testing in which they (a) take a multiple-choice test in reading and mathematics, (b) complete an on-demand problem (in math) that is evaluated using scoring guides of 1-6 points, and (c) submit work samples that are evaluated using guides similar to those used for the on-demand tasks. Using a series of benchmarks established at grades 3, 5, and 8, students are prepared to take the final CIM tests and tasks in 10th grade. Passing the CIM requires meeting specific levels of proficiency in these three assessment components across several subject areas. In essence, the CIM within any subject area can be considered a response class in which a functional response reflects many different specific behaviors engaged in while completing the task.

Figure 4 presents Oregon's reading outcome as an example of a response class, showing the reading behaviors included in the Oregon standards (Common Curriculum Goals and Content Standards). The goal itself is described in terms of a broad construct, with outcomes clearly differentiated from indicators: Reading is not limited to decoding text from print.

English includes knowledge of the language itself, its use as a basic means of communication, and appreciation of its artistry as expressed in literature. The study of English prepares students to understand and use information and to communicate fluently and effectively.

READING: Comprehend a variety of printed materials.

 

Figure 4. Common Curriculum Goals, Content Standards, and Benchmarks in Oregon

Content Standard: Use a variety of reading strategies to increase comprehension and learning. Locate information and clarify meaning by skimming, scanning, close reading and other reading strategies.

  Grade 3 Benchmarks: Read accurately by using phonics, language structure, word meaning, and visual cues. Read orally with natural phrasing, expressive interpretation, flow and pace. Determine meanings of words using contextual clues and illustrations.

  Grade 5, 8, and 10 Benchmarks: Determine meanings of words using contextual clues, illustrations, and other reading strategies (grades 3 and 5). Determine meanings of words, including those with multiple meanings, using contextual and structural clues and other reading strategies (grades 8 and 10).

  Grade 3: Locate information using illustrations, tables of contents, glossaries, indexes, headings, graphs, charts, diagrams and/or tables.
  Grade 5: Locate information and clarify meaning by using illustrations, tables of contents, glossaries, indexes, headings, graphs, charts, diagrams and/or tables.
  Grade 8: Locate information and clarify meaning by using tables of contents, glossaries, indexes, headings, graphs, charts, diagrams and/or tables.
  Grade 10: Locate information and clarify meaning by using tables of contents, glossaries, indexes, headings, graphs, charts, diagrams, tables and other reference sources.

Content Standard: Demonstrate literal comprehension of a variety of printed materials.
  Grade 3: Retell, summarize or identify sequence of events, main ideas and facts in literary and informative selections.
  Grade 5: Identify in literary, informative and practical selections sequence of events, main ideas, facts and supporting details.
  Grade 8: Identify in literary, informative and practical selections sequence of events, main ideas, facts, supporting details and opinions.
  Grade 10: Identify in literary, informative and practical selections sequence of events, main ideas, facts, supporting details and opinions.

Content Standard: Demonstrate inferential comprehension of a variety of printed materials.
  Grade 3: Identify cause and effect relationships and make simple predictions.
  Grade 5: Identify relationships, images, patterns or symbols and draw conclusions about their meanings.
  Grade 8: Identify relationships, images, patterns or symbols and draw conclusions about their meanings.
  Grade 10: Identify relationships, images, patterns or symbols and draw conclusions about their meanings.

Content Standard: Demonstrate evaluative comprehension of a variety of printed materials.
  Grade 3: (no benchmark specified)
  Grade 5: Analyze and evaluate information and form conclusions.
  Grade 8: Analyze and evaluate whether a conclusion is validated by the evidence in a selection.
  Grade 10: Analyze and evaluate whether an argument, action or policy is validated by the evidence in a selection.

 

At the outcome level, several different behaviors form a response class, all of which are different topographically. Phillips (1994) suggests this approach when she states: "To determine whether an accommodation affects test validity, one must examine the test objectives as they are currently written...if test objectives are to communicate accurately to test users, the skills tested must be measured as specified in the objective so that the inferences made from the test scores will be the same for everyone" (p. 99).

By considering whether test behaviors function within the same response classes, it is possible both to differentiate indicators from outcomes and to ascertain the degree to which they are maintained by similar contingencies. With the reading outcome defined above in Oregon, a number of different behaviors can be used to demonstrate reading, essentially providing a relatively robust response class. For example, consider a person who is visually impaired to such a degree that only Braille or print-to-voice transcription can be used to acquire information from text. In Oregon, the person would be considered to be receiving a comparable task if it can be demonstrated that the behavior results in the same effect upon the environment as the regular reading task does for those with sight. For example, the verbs within the various content standards include locating, skimming, identifying, determining, analyzing, and evaluating. All of these behaviors essentially belong to a large response class in which the person being tested interacts with printed text, though no limitation is placed on how that text is accessed. The outcome is interacting with text, and task comparability is determined by analyzing the objectives of the test, not the indicators reflecting their attainment. In this case, many different indicators could reflect effective reading using a variety of tasks, all of which are comparable at the level of the effect they have on the environment.

In contrast, if outcomes and indicators are confused, then task comparability is difficult to analyze by reference to response classes. For example, in states that mandate passing a specific norm-referenced test as the outcome, any administration of the test in a manner different from that used in developing the normative standard would be a different task, making it noncomparable.

The issue of behavior-maintaining contingencies, which is considered in the next section, can only be considered after clarity is brought to the issue of response classes in defining the critical behaviors for determining success. In the next section, it is assumed that specific behaviors are clearly defined in terms of outcomes and indicators; obviously, if confusion exists, these subsequent strategies may be less than useful.

 

Functional analysis through interviews, observations, and behavioral manipulation

Assuming that specific behaviors are defined and their meaning has been qualified in terms of outcomes and indicators, three methodologies for conducting a functional analysis can be used in ascertaining task comparability and developing accommodations for students. First, interviews can be conducted with key individuals who are knowledgeable about the student. Second, direct observations can be made of students to determine what contingencies maintain the occurrence of specific behaviors in various settings. Third, behavioral manipulations and interventions can be made to determine how behavior fluctuates with the presence or administration of various reinforcers and punishers. All three approaches, however, are aimed at understanding how the behavior functions in the environment and how it is maintained in level and rate (O’Neil, Horner, Albin, Storey, & Sprague, 1990).

Interviews. With interviews, the following issues are addressed: (a) dimensions of behavior (topography, frequency, duration, and intensity), (b) ecological events that have a potential impact on behavior (e.g., medications, sleep cycles, eating routines, or activity schedules), (c) events and situations that predict the occurrence of behaviors (e.g., time of day, setting, social control, or activity), (d) the functions of undesirable behaviors and a history of their occurrence, (e) the efficiency of undesirable behaviors (e.g., physical effort, along with likelihood of "payoff" and delay), (f) primary mode of communication, (g) events, actions, and objects that are perceived as positive, and (h) the presence of "functional alternative" behaviors. With answers to such questions, it should be possible to determine the degree to which various behaviors interfere with or facilitate appropriate social and academic functioning. A sketch of how such interview information might be organized follows this paragraph.
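As referenced above, one way to keep such interview information organized is a simple structured record. The sketch below is only illustrative; the field names and the sample entries are assumptions based on the dimensions listed, not a published interview protocol.

```python
# A minimal sketch of a functional assessment interview record.
# Field names and sample values are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FunctionalInterviewRecord:
    behavior_dimensions: dict          # topography, frequency, duration, intensity
    ecological_events: List[str]       # medications, sleep cycles, eating routines, schedules
    predictors: List[str]              # time of day, setting, social control, activity
    presumed_functions: List[str]      # what the behavior obtains or avoids, plus its history
    efficiency_notes: str              # physical effort, likelihood and delay of "payoff"
    communication_mode: str            # primary mode of communication
    preferred_events: List[str] = field(default_factory=list)       # events/objects seen as positive
    functional_alternatives: List[str] = field(default_factory=list)

record = FunctionalInterviewRecord(
    behavior_dimensions={"topography": "out of seat", "frequency": "6/hour",
                         "duration": "1-3 min", "intensity": "moderate"},
    ecological_events=["stimulant medication, morning dose"],
    predictors=["independent seatwork", "late morning"],
    presumed_functions=["escape from difficult written tasks"],
    efficiency_notes="low effort; escape usually succeeds within minutes",
    communication_mode="verbal",
)
print(record.presumed_functions)
```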

 

Functional Analysis Example One.

This kind of information, for example, would be very useful in developing Individualized Educational Plans that include various accommodations for participating in and completing statewide tests. It may be discovered that a student taking medications has times during the day when the uptake or dissipation of a drug influences different dimensions of behavior, such as attention, time on task, or concentration. Without knowing this from the parents, the specialist, or the nurse, a teacher administering a large-scale test would assume the student is "taking" the test during the prearranged time when, in fact, the test is merely in front of the student with no attention being given to its completion. By discovering this information, the test could be given at another time of the day, with the assumption that the task is more comparable to the standard administration given that the student is now actively interacting with the materials.

 

Direct observations. With direct observations, similar information is collected on the ecological context, as well as the antecedent-consequent functions, of various behaviors. With this format for collecting information, the following dimensions are considered: (a) the setting to observe, (b) the time of day, (c) specific behaviors, (d) setting events that appear to serve as discriminative stimuli, (e) the perceived functions of behavior (i.e., whether to obtain something or to avoid-escape something), (f) the actual consequences that occur, and (g) any anecdotal reactions noted by the observer. In addition to helping members of IEP teams collect useful information for developing appropriate accommodations, this kind of information, when collected across various settings, can be used to begin explaining why some behaviors tend to occur more or less often than other behaviors. The advantage of collecting direct observational data over interview data is that the information is more objective (verifiable) and contemporary. A sketch of how such observational records might be tallied follows this paragraph.
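As referenced above, a minimal sketch of how observation entries might be tallied follows; the settings, behaviors, and perceived functions shown are hypothetical examples, not data from any study.

```python
# A minimal sketch (hypothetical observation log) of tallying perceived functions
# of behavior across settings from direct-observation entries.
from collections import Counter

# Each entry: (setting, antecedent/setting event, behavior, perceived function, consequence)
observations = [
    ("math class", "independent worksheet", "talks out", "escape task", "sent to hall"),
    ("math class", "teacher directions", "taps pencil", "obtain attention", "teacher reprimand"),
    ("reading class", "silent reading", "leaves seat", "escape task", "redirected"),
    ("test session", "directions read aloud", "looks away", "escape task", "no consequence"),
]

function_by_setting = Counter((setting, function) for setting, _, _, function, _ in observations)
for (setting, function), count in function_by_setting.items():
    print(f"{setting}: {function} observed {count} time(s)")
```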

 

Functional Analysis Example Two.

In many test-taking situations, students need to be attentive to several components of the actual test, such as listening to directions, reading various problems, scanning alternate choices on multiple-choice measures, and planning-writing responses on performance measures. Yet it is very possible for students to proceed in completing a test without fully engaging in all of these dimensions of the test. For example, students may not be paying attention while the directions are being given and, when the actual response is being encoded, they may complete the wrong sequence. On multiple-choice tests, the student may place an "x" in the bubble sheet or make the bubble too large, in either case leaving a protocol that would be scanned incorrectly and render a low score if the problem is not noticed upon collection. Students may fail to provide full responses on performance tasks or to show their work, even when the directions focused exactly on these requirements. When the performance is judged by trained raters in a group session weeks after the test administration and no one realizes the performance is incomplete because the student never really knew what to do, an incorrect inference may be made that a low score correctly reflects low skill. Direct observation would have been helpful to highlight that, in many classroom situations, the student tends to pay attention to distracting things when teachers are talking at the front of the room, whether teachers are instructing or providing directions for how to take the test. For this student, an adult reading from a note sheet or book at the front of the room is all the same; it reflects a situation from which escape has been successful in the past.

 

Behavioral manipulation. Finally, systematic manipulations of reinforcement contingencies can be implemented, based on either the interview or the observational data, to determine whether specific behavioral changes can be made. Because minimal inferences about cause and effect are required, this method of conducting a functional analysis can provide the most persuasive information about why specific behaviors are present and persist in relation to various environmental events. With this system for conducting a functional analysis, several designs are available within two broad classes: withdrawal-reversal and multiple-baseline designs.

 

Functional Analysis Example Three.

In the following case, a student was presented an accommodation using Irlen lens technology (reduction of various light frequencies to reduce distracting stimulation and focus attention) (Robinson & Miles, 1987) as part of a research design using successive treatment phases (an ABC design). This student, with an attention deficit-hyperactivity disorder, had seen a local physician for treatment and had been prescribed the following accommodation, oriented around the reduction of artificial light that was thought to be overstimulating his sight and brain: (a) wear a baseball cap, (b) sit near the window (for maximal exposure to natural light), and (c) use colored overlays on all his work. The special education teacher had been using Direct Instruction (A), then Direct Instruction with a point reinforcement system (B). Upon recommendation of the IEP team, she then added the Irlen lens accommodation (C).

The data in Figures 5, 6, and 7 can be interpreted as not supportive of the accommodation: With the Irlen lens (a) correct reading performance decreased, (b) incorrect reading performance increased, and (c) on-task behavior decreased. Had the measure of performance been a statewide test, it could be inferred that access to the test items and response opportunities is better without the recommended accommodation.
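A minimal sketch of how such phase data could be summarized is shown below; the session values are hypothetical stand-ins, not the values plotted in Figures 5 through 7.

```python
# A minimal sketch (hypothetical session data) of summarizing ABC phase data
# to judge whether the accommodation helped.
import pandas as pd

sessions = pd.DataFrame({
    # A = Direct Instruction, B = A + point system, C = B + Irlen lens accommodation
    "phase": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "words_correct_per_min": [62, 65, 63, 66, 70, 72, 71, 74, 60, 58, 61, 57],
    "errors":                [ 6,  5,  6,  5,  4,  4,  3,  4,  7,  8,  7,  8],
    "percent_on_task":       [70, 72, 71, 74, 80, 82, 81, 83, 65, 63, 66, 62],
})

# A drop in correct rate and on-task behavior (and a rise in errors) from B to C
# would argue against the accommodation, as the text describes.
print(sessions.groupby("phase").mean(numeric_only=True).round(1))
```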

 

When a functional analysis is used to consider whether behavior is being maintained under different conditions, "the function of a behavior may be affected by the context in which the behavior occurs" (Day, Horner, & O’Neill, 1994, p. 279). With a functional analysis, the focus is on the larger investigation of the maintaining contingencies: the identification and manipulation of the variables that control behavior.

 

Figure 5. Academic Performance under Varying Accommodations: Words Correct Per Minute

Figure 6. Academic Performance under Varying Accommodations: Error Count and Prosody Rating

Figure 7. On-Task Behavior under Varying Accommodations

 

Summary of single case approaches. In summary, functional analyses of individual students provide a basis for determining whether tasks are comparable across accommodated and standard administrations. Tasks may be judged comparable in two major ways: (a) by classifying test behaviors within comparable response classes and (b) by identifying potential variables that appear to be maintaining behavior in various settings. When considering tested behaviors as part of a response class, accommodations may focus either on the adjunct behaviors through which the student accesses the test-taking situation or on the tested behavior itself. When considering task comparability through a functional analysis, interviews, observation, and direct manipulation may be used to provide a line of evidence. Obviously, fewer inferences are required as one proceeds first through the use of interviews (identifying plausible explanations), then through the direct observation of students in various settings (comparing actual contingencies), and finally through behavioral manipulation of contingencies (providing confirmation of cause-effect relationships).

One caveat should be noted in using a functional analysis to determine task comparability when accommodations are used. In many instances, students are performing on tests and in testing situations with a complex arrangement of contingencies maintaining their behavior. Generally, these contingencies have developed over a long time and involve many individuals, from parents to teachers and peers. In considering accommodations to maximize student participation in and successful completion of large-scale tests, the accommodations are not considered prosthetics for life. Rather, they are an arrangement of the environment representing a next step toward independent behavior management (that is, self-managed behavior). This view of accommodations is very different from one in which a prosthetic is an ever-present part of the student’s life (for example, Braille is not a temporary accommodation for students with vision impairments). A consistent reinforcement program aimed at getting students to attend, listen, read, and respond may well be needed for only a short while and eventually be faded into the more standard approach, in which a few short, motivating statements are made at the beginning of the test for students to do their best.


Summary and Conclusions

Six approaches for determining the comparability of tasks have been presented in this paper, organized into three models: descriptive, comparative, and experimental. This evidential continuum was developed to reflect an increase in both the amount and the systematicity of the data-information collected to support the decision-making process. The overall purpose of this data-information is to support consistent decisions about the acceptability of variations in tasks, and therefore about whether those variations can be considered accommodations or whether they are modifications that change the nature of the construct being measured.

The continuum began with the descriptive model and its three types of data-information. In the first, task comparability is simply presented, with decisions made in relative isolation within policies. No other evidence appears; policy serves as the reference to be followed. For educators to use this form of decision making, they simply have to know the policy. The advantage of this strategy is ease of use and clarity of direction: Either tasks appear on the list of accepted accommodations or they do not. The disadvantage is that the list is finite when, in fact, a nearly infinite number of variations in assessment construction and administration is possible: When an accommodation is proposed that is not on the list, little guidance is provided for determining whether it is acceptable.

Further along on the evidential continuum, task comparability was determined by reference to policy that was elaborated and interpreted. By having information that explains policies, educators can make decisions about the use of various accommodations and also can apply a similar logic to instances not specifically addressed in the policy. Through the use of examples and nonexamples, minimum differences can be highlighted, providing educators a frame of reference for making decisions that not only supports a list of accepted accommodations but also provides the rationale for their placement into either acceptable or unacceptable changes. The advantage clearly is the breadth of instances that can be addressed, moving well beyond a finite list. Yet, in this breadth, it is possible that rules and rationales become misinterpreted and misapplied, reflecting a disadvantage in the confusion and lack of standardization that is likely to result.

The last approach within the descriptive model was a policy analysis perspective, in which the decision to consider tasks as comparable was made by reference to data-information within the assessment system. This perspective provides educators a frame of reference for not only interpreting but also understanding the implications of any decisions that are made. Although based on singular data sources that are non-relational, this information can be extremely valuable in providing outcomes that contextualize any decisions. Obviously, the advantage of this perspective is the possibility of better understanding what it means when a task is deemed comparable; educators thus have the ability to make predictions. The major disadvantage is the expense of the data collection needed to implement this perspective. This perspective was considered descriptive because the data-information were singular and not relational. In contrast, the remaining perspectives can be viewed in a relational manner, in which understanding an accommodation depends upon several other variables that can be inter-related.

In a comparative strategy, post-hoc evaluations are used to make judgments of task comparability, with several variables considered in the process. Truly relational statements can be made in which implications can be framed and analyzed not only to make decisions but also to test predictions, with follow-up data collected to formatively evaluate the effects of those decisions. This perspective has great appeal in providing an empirical basis for change and therefore presents educators with the significant advantage of being data-based; furthermore, over time, decisions can continue to be tracked in terms of outcomes and implications. Nevertheless, this perspective has the disadvantage of being post-hoc, with intact groups of students and a lack of control over many threats to validity. Therefore, although results may be data-based, cause-effect statements may be made inadvertently, potentially creating more harm than good.

Finally, experimental strategies may be used to organize either a group or a single case study. Both of these approaches help improve the decision-making process not only with data-information but also with many threats to validity controlled. For example, rather than only considering intact groups, studies are devised to include groups that represent important constituencies for comparison. In addition, resources are strategically allocated so that cause-effect statements can be made with more certainty. Rather than simply reporting outcomes and postulating inferences, these approaches provide information leading to explanations. While group designs are used to study task comparability with well-defined collectives of students having common demographics, they cannot be used to understand task comparability for individuals; therefore, single case designs are used to extend the logic of this scientific approach to individual students with unique needs. The clear advantage of experimental strategies is the integrity of the understanding that accrues and the strength of the predictions that can be made in similar situations. Yet, this certainty is attained only with considerable investment in collecting data-information. Furthermore, in many educational circumstances, it is difficult both to manage programs as they are currently configured and to develop alternative ones concurrently. Therefore, such strategies are less likely to be implemented without external funds.

In the end, the continuum is meant to be broad and encompassing so that any educational agency can place itself on it. Furthermore, it is displayed as a true continuum, increasing in the amount and quality of the information used in making decisions about task comparability. As the 1997 IDEA mandates are implemented, it may be important to begin taking inventory not of task comparability itself, but of the manner in which such decisions are made.


References

Benderson, A. (Ed.) (1988). Testing, equality, and handicapped people. Focus (ETS Publication; Princeton, NJ), 21 (23).

Bennett, R. E., Rock, D. A., & Kaplan, B. A. (1985, November). The psychometric characteristics of the SAT for nine handicapped groups (ETS Research Report RR-85-49). Princeton, NJ: Educational Testing Service.

Bennett, R. E., Rock, D. A., & Kaplan, B. A. (1987). SAT differential item performance for nine handicapped groups. Journal of Educational Measurement, 24(1), 44-55.

Bennett, R. E., Rock, D. A., & Kaplan, B. A. (1988). Level reliability and speededness of SAT scores for nine handicapped groups. Special Services in the Schools, (3/4), 37-54.

Cook, D., & Campbell, D. (1979). Experimental and quasi-experimental research designs. Boston: Rand McNally.

Helwig, R., Stieber, S., Tindal, G., Hollenbeck, K., Heath, W., & Almond, P. (1997). A comparison of factor analyses of handwritten and word-processed writing of middle school students. Eugene, OR: University of Oregon Research Consultation, and Teaching Program.

Hollenbeck, K., Tindal, G., Heath, W., & Almond, P. (1997). Reliability and decision consistency: An analysis of writing mode at two times on a statewide test. Eugene, OR: University of Oregon Research Consultation, and Teaching Program.

Koretz, D. (1996). The assessment of students with disabilities in Kentucky (CSE Technical Report No. 431). Los Angeles: CRESST/Rand Institute on Education and Training.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-105). New York: Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

National Center on Educational Outcomes. (1993). NCEO finds states active in ‘outcomes’ efforts. Outcomes, 2(2), 2. Minneapolis: University of Minnesota.

Nolen, S. B., Haladyna, T. M., & Haas, N. S. (1992). Uses and abuses of achievement test scores. Educational Measurement: Issues and Practice, 11(2), 9-15.

O’Neil, R., Horner, R., Albin, R., Storey, K., & Sprague, G. (1990). Functional analysis of problem behavior. Sycamore, IL: Sycamore Publishing Company.

O’Sullivan, R. G., & Chalnick, M. K. (1991). Measurement-related course work requirements for teacher certification and recertification. Educational Measurement: Issues and Practice, 10(1), 17-19, 23.

Phillips, S. E. (1994). High stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education, 7(2), 93-120.

Robinson, G. L., & Miles, J. (1987). The use of coloured overlays to improve visual processing - A preliminary survey. The Exceptional Child, 34(1), 63-70.

Siskind, T. G. (1993a). Modifications in statewide criterion-referenced testing programs to accommodate pupils with disabilities. Diagnostique, 18(3), 233-249.

Siskind, T. G. (1993b). Teachers’ knowledge about test modifications for students with disabilities. Diagnostique, 18(2), 145-157.

Thurlow, M. L., Ysseldyke, J. E., & Silverstein, B. (1995). Testing accommodations for students with disabilities. Remedial and Special Education, 16(5), 260-270.

Thurlow, M. L., Scott, D. L., & Ysseldyke, J. E. (1995). A compilation of states’ guidelines for accommodations in assessments for students with disabilities (Synthesis Report No. 18). Minneapolis: University of Minnesota, National Center on Educational Outcomes.

Tindal, G., Hollenbeck, K., Heath, W., & Almond, P. (1997). The effect of using computers as an accommodation in a statewide writing test. Eugene, OR: University of Oregon Research Consultation, and Teaching Program.

Tindal, G., Hollenbeck, K., Heath, W., & Stieber, S. (1997). Trait differences on handwritten versus word-processed compositions: Do judges rate them differently? Eugene, OR: University of Oregon Research Consultation, and Teaching Program.
