Setting Standards on Alternate Assessments
NCEO Synthesis Report 42
Published by the National Center on Educational Outcomes
Prepared by:
Ed Roeber
Measured Progress
April 2002
Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:
Roeber, E. (2002). Setting standards on alternate assessments (Synthesis Report 42). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://cehd.umn.edu/NCEO/OnlinePubs/Synthesis42.html
As more and more states and
districts develop their alternate assessments for those students unable to
participate in the regular assessment, they are faced with the challenge of
setting standards for their alternate assessments. Despite the variability in
alternate assessments currently developed, including checklists, structured and
unstructured observations, performance assessments, samples of student work, and
portfolios, all need to be scored and assigned proficiency levels. With
background information on the nature of student scores from alternate
assessments and the ways in which alternate assessment results are reported,
this report identifies common standards-setting techniques and how they might be
applied to alternate assessments. The techniques addressed are: reasoned
judgment, contrasting groups, modified Angoff, bookmarking or item mapping, body
of work, and judgmental policy capturing. It is recommended that the technique
selected take into account not only the technical aspects of the alternate
assessment strategies, but also the practical aspects of implementing the
standard-setting technique for the alternate assessment process.
The Individuals with
Disabilities Education Act Amendments of 1997 (IDEA 97) require that all
students with disabilities, even those with the most significant disabilities,
participate in state and district-wide assessment systems. Participation
generally occurs in one of three ways: with or without accommodations in the
general, on-demand assessment, or through an alternate assessment. IDEA 97
requires that states report the performance of students with disabilities in the
regular assessment and the alternate assessment with the same frequency and in
the same detail that they report on the performance of non-disabled student
[Section 612(a)(17)(B)(iii)]. Approaches to reporting that include both
aggregation of all students with disabilities and non-disabled students
together, as well as the disaggregated performance of students with disabilities
are consistent with the requirements of IDEA 97 (Heumann & Warlick, 2000).
These requirements suggest
that regardless of the nature of students’ disabilities and nature of the
alternate assessment used with the students, the alternate assessment results
and those of the regular assessment program will need to be reported in a common
fashion. Many states are opting to aggregate scores from the regular assessment
and the alternate assessment (Thompson & Thurlow, 2001). The challenge in
summing these assessment results is that the assessments, and the types of
results each produces, are different. How can these assessments be reported
together?
Most statewide assessments
are one of two types: standards-based or norm-referenced (Olson, Jones, & Bond,
2001). Standards-based assessment programs directly measure the state’s content
standards, often using both multiple-choice and constructed-response items.
Student performance is reported relative to the content standards of the state.
The goal of these programs is to encourage all students to achieve all standards
at high levels.
Norm-referenced tests are
ones in which the performance of students is reported relative to a norm group
(a representative group of students used as a comparison sample). Results are
reported relative to this norm group, either in percentile ranks, normal-curve
equivalents, grade levels, or other comparative scores. However, since the
participation rate of students with disabilities in the norming samples is often
low, the on-going participation rate of students with disabilities on such tests
is also quite low (Thompson & Thurlow, 1999).
In this report, I focus
primarily on how the results of standards-based assessments are reported. These
results may be reported in several ways, including how students performed on
each test item, how the student performed on a cluster of items measuring a
content standard or a sub-unit thereof (e.g., benchmark or performance
indicator), how the student performed overall on the assessment (raw score,
scaled score, or other metric), and finally, the overall level of achievement,
that is, which of several predetermined levels of performance the student’s
achievement fell into. States use three, four, or more levels to describe the
performance of students. Terms such as “novice,” “basic,” “proficient,” “meeting
the standard,” “advanced” or “exceeding the standard” may be used to describe
the overall level of achievement of each student.
Types of Alternate Assessment
Because the small group of
students who, due to the nature of their disabilities, require an alternate
assessment, are quite diverse, the manner in which they are alternately assessed
will also differ. Some common ways in which students are assessed include:
Checklists.
This method relies on teachers to remember whether students are able to carry
out certain activities. This technique has the advantage of permitting the rapid
collection of information, but due to the nature of the observation, may not be
highly reliable. Scores reported are usually the number of skills that the
student was able to successfully perform. This method will permit the scores of
students to be added up and reported.
Observation in
Structured and Unstructured Settings. This assessment method encourages
teachers, after training, to observe whether students are able to perform
certain activities. Observation in unstructured situations is on-going
observation of the student in everyday classroom and other settings, without any
overt attempt to increase the likelihood that the skill will occur. By setting
up structured situations, the teachers is setting up a structure in which the
skill being observed is more likely to occur, thus making the observation of it
more likely. Scores reported are usually the number of skills that the student
was able to successfully perform. This method will permit the scores of students
to be added up and reported.
Performance
Assessments. These assessments are direct measures of the skill,
usually in a one-on-one assessment. Due to the nature of students’ disabilities,
rarely are these paper-based assessments. More likely, the teacher and the
student work through an assessment that uses manipulatives, and the teacher
observes whether students are able to perform the assigned tasks. Such
assessments have the disadvantage of being time-intensive, so that an assessment
may be limited to only a handful of skills. Scores are typically assigned to
each performance assessment, although in more complex performance assessments,
there is an underlying scale of task complexity that may form the basis for
reporting.
Samples of Student
Work. Students may in the course of learning produce samples of work
that demonstrate the skills being assessed. These “artifacts” can be assessed.
While this assessment method has the advantage of using existing work in the
assessment process, not all students will be able to produce samples, and even
for those who do, it may not be possible to determine how much of the work is
that of the student. Scores are assigned to each piece of work.
Portfolios.
This assessment method uses a collection of student work, performance
assessments, observations, and other data about students to judge student
achievement. Usually, the various pieces collected to demonstrate the
performance on each standard or group of standards are judged together, although
occasionally, the entire content area or entire portfolio may be assigned a
single score.
The nature of the scores
derived from alternate assessments depends on the nature of the alternate
assessment. Those that are comprised of checklists will yield multiple items
scored 1-4 or 1-5, and an overall total score, which may be the number of items
checked with 1s, 2s, 3s and 4s, or some other way of summarizing overall student
performance. The same may be true for observational assessments, and even
performance events, particularly when the latter consist of a number of small
steps along a scale of performance, each scored on a 1-4 or 1-5 scale.
Portfolios, and some
performance tasks, are scored using a scoring rubric, and the student score is
derived from the nature and number of rubrics employed. In some assessments,
students’ performances or portfolios are scored holistically, such as across multiple content areas using an overall
performance rubric. In other cases, students’ performances are scored
analytically, that is, according to multiple dimensions.
The dimensions that states
are using to score student work analytically focus on student
performance, program opportunities
provided to students, or a combination of the two (Thompson & Thurlow, 2001).
The student dimensions can include a qualitative judgment about the overall
level of performance of the student, as well as how close the student’s
performance was to the written content standards. The program dimensions could
include whether students were provided instruction in multiple settings, whether
they were provided opportunities to plan, monitor, and evaluate their work,
whether they was evidence that they worked with non-disabled peers, and whether
they were provided with appropriate human and technological supports.
These program opportunities
can also be expressed as student performance dimensions as well: whether the
student could demonstrate the skill in multiple settings; plan, monitor, and
evaluate his or her work; work with non-disabled peers; and work independently
(using appropriate human and technological supports). Depending on the nature of
the scoring dimensions used, scores may be a simple sum across dimensions, be
multiplied with one another, or be weighted in some fashion.
Reporting Alternate Assessment Information
While each state will
report its alternate assessment information in somewhat different ways (in
concert with differences in which their regular assessment program information
is being reported), there are some similarities to these reporting schema.
Depending on the nature of the regular assessment program – its purposes and
methods – reports of the alternate assessment program may focus on program information or individual student information, or both.
Program data are reported
in order to hold the school in which students are being taught, or the school
that sent the student to the school in which the student is being taught,
accountable
for the student’s performance. Programs might be held accountable for whether
needed supports (human or technological) were made available, the inclusiveness
of the student’s program, the number of settings in which the student’s
accomplishments were evidenced, or the extent to which students are given
opportunities to plan, monitor, and evaluate their own performances.
Individual student
information is reported in order to describe the current level of the student’s
performance. It may focus on the qualities of the student’s performance on the
assessment, as well as how close to the content standards (as written) the
student was able to come, the breadth or number of standards achieved, and the
levels of supports needed to achieve at the level observed.
In almost all cases, the
use of both student and program data is not for student accountability (to
promote or graduate students, nor to retain them). Instead, it is to hold the
school accountable for the learning opportunities afforded students, whether
evidenced directly through student performance measures or more indirectly,
through program measures.
While the parameters on
reporting may vary from state to state, many states are opting to
aggregate
the performance of all students with disabilities with the performance of
students without disabilities, so that 100% of the students are counted (not
simply “accounted for” by reporting them in a “no report” category). Since IDEA
97 requires that the performance of students with disabilities be disaggregated
from the performance of other students, it has been suggested that the
performance of the students with disabilities who take the tests with or without
accommodations be added to the performance of students without disabilities and
reported together (Heumann & Warlick, 2000).
The key question is how to
accomplish these combined and disaggregated performances, when some students
take a test and others participate in an alternate assessment comprised of a
portfolio, performance events, or a checklist. Some sort of total report of
results by content area is needed.
One way that states using
portfolios accomplish this is by summing the performances of students to arrive
at a total score. Once a series of total scores is determined, how the scores
will be reported along with scores from the regular assessment program is
determined by the manner in which these scores are labeled.
Score Scale for Reporting
States typically report the
assessment results from their regular assessment program summarized at an
overall test level, according to one of several performance descriptors. These
performance descriptors or performance standards serve to describe “how good is
good enough.” This helps to give additional meaning to the reports of results.
There are two ways to
report the results. The first is to use an
absolute scale, where the score scale used for the alternate assessment is
equated to the score scale for the regular assessment program. This means that
most of the alternate assessments will be reported in the bottom category of the
regular score scale, that given the level of the performance and linkage to the
standards as written, is viewed by some as an accurate representation of
performance (Bechard, 2001). Nonetheless, to consign all of the students with
significant disabilities to the bottom of the score scale also serves to
reinforce low expectations for these students and perhaps to “punish” the
educators who serve them. In the long run, this approach may discourage
educators from offering a quality program or challenging the student to strive
to accomplish more (since more challenging skills may lead to lower performance
scores).
An alternative is to adopt
the policy of reporting students in the alternate assessment on a relative scale. In this model, a score
scale is constructed for students with significant disabilities in the alternate
assessment, without equating the scale to the test scale. This can result in
some students with significant disabilities being labeled “proficient” or
“advanced,” even though their accomplishments are viewed by some as lower than
students who took the test (Bechard, 2001). The result of using this scale is to
reward educators who are offering a quality program in which students
demonstrate significant accomplishments. This recognizes the successful work of
educators, even when the nature of the student’s disability prevents the student
from demonstrating typical performance levels. In the long run, this should
encourage better programs for students. However, the “downside” of this is that
the alternate assessment scale is different from the one used with the regular
assessment program, and the difference may be interpreted as meaning “lower.”
Standard-setting Techniques
However the total score is derived (absolute or relative scales), the manner in which the performance of the student in the alternate assessment is categorized will depend on which of several standard-setting strategies is used. The strategies used for regular assessment programs may or may not be appropriate for standard-setting on the alternate assessment, depending on the regular assessment, the techniques used to set standards on it, and the nature of the alternate assessment component. Several techniques are used to set standards (Cizek, 2001) Each of these is described here and summarized in Table 1.
Table 1. Standard-setting Techniques that Might be Applied to Alternate Assessments
Technique |
Description |
Reasoned Judgment |
A score scale (e.g., 32
points) is divided into a desired number of categories (e.g., 4) in some way
(equally, larger in the middle, etc.); the categories are determined by a
group of experts, policymakers, or others. |
Contrasting Groups |
Teachers separate students into groups based on their observations of the
students in the classroom; the scores of the students are then calculated to
determine where scores will be categorized in the future. |
Modified Angoff |
Raters estimate the percentage of students at the bottom score range who are
expected to “pass” each test item; these individual estimates are summed to
produce an overall percentage of items correct that correspond to the
minimum passing score for that level. |
Bookmarking or Item Mapping |
Standard-setters mark the spot in a specially constructed test booklet
(arranged in order of item difficulty) where a desired percentage of
minimally proficient (or advanced) students would pass the item; or,
standard-setters mark where the difference in performance of the proficient
and advanced student on an exercise is a desired minimum percentage of
students. |
Body
of Work |
Reviewers examine all of the data for a student and use this information to
place the student in one of the overall performance levels. Standard setters
are given a set of papers that demonstrate the complete range of possible
scores from low to high. |
Judgmental Policy Capturing |
Reasoned Judgment.
The most straight-forward manner in which to set standards is for an appropriate
group (either an expert panel, a representative group of users, or a policymaker
group) to examine the score scale and to divide the full range of possible
scores into the number of desired categories (Kingston, Kahl, Sweeney, & Bay,
2001). For example, a 32-point scale might be divided into 4 categories of
approximately equal numbers of points (or different numbers of points in each of
the categories), as the group sees fit.
The advantages of this
strategy are that it takes little time, requires little in the way of a process,
and does not hide the standard-setting in a cloak of mysterious statistical
procedures. Presumably, the rationale for the choices is relatively evident. The
major disadvantage is that rarely do natural divisions of performance occur, so
that it may be difficult to defend the choices that were made or the assignment
of particular students to one level or another, since other reasonable people
could arrive at different choices.
A special case of this
technique that has been used with alternate assessments is to locate solid
student exemplars for each score scale point. For example, if portfolios are
scored on a four-point scale, the goal of this strategy is to locate solid 1s,
2s, 3s, and 4s of student work on all pertinent dimensions. These exemplars,
which represent the different score levels of the scoring rubric, are then used
for training purposes in scoring. A set of rules, which are predetermined, help
determine the total score assigned to portfolios that are not given a consistent
score across the various scoring dimensions.
Contrasting Groups.
In this technique, a group of teachers familiar with the students, and with the
definitions of the various groups into which students are to be placed, separate
the students into these groups based on their observations of the students in
their classroom (Livingston & Zeikey, 1982); then, the assessment scores in each
of the groups are calculated. The distribution of scores among the different
groups is examined; typically, where the scores between the two groups overlap
is where the “cut score” between the two groups is set, since this is the point
at which the classification errors are minimized.
A problem with this method
is that it is highly dependent on the distributional characteristics of the
sample. That is the rationale behind the development of a similar method called
“Classroom Teacher Judgment” (Roeber, 2001).
The contrasting groups
technique can be used with any type of assessment. One major disadvantage of it
for alternate assessment is that teachers may not know what the performance of
the student is on the types of skills (e.g., academic skills) measured by the
alternate assessment. Nevertheless, it is a relatively easy technique to
implement and is easily understood by educators and parents.
Modified Angoff.
In this technique, an appropriate group (either an expert panel, a
representative group of users, or a policymaker group) examines each test item
in a multiple-choice exam. What each rater does is to estimate the percentage of
students at the bottom of the score range (e.g., the “minimally proficient” or
the “minimally advanced” students) who will be able to pass each test item.
These individual estimates are then summed and result in an overall percentage
of the items correct that correspond to the minimum passing score for that level
of the test.
This technique is typically
used with multiple-choice items, and is a classic standard-setting strategy for
tests. It might be applied to checklists, such as those used to assess students
with significant disabilities, although this use has not been tried. The major
challenge to using the modified Angoff technique is a conceptual one: raters not
only need to understand the theoretical “minimally-proficient student,” they
also have to determine how many of these students will pass the assessment. This
is not an easy task, and hence one of the reasons why psychometricians looked
for an improved way to set standards.
Bookmarking or Item
Mapping. In this technique, an appropriate group (either an expert
panel, a representative group of users, or a policymaker group) reviews a
specially-constructed test booklet that is arranged in item difficulty order
(Lewis, Mitzel, & Green, 1996). The standard-setter is asked to mark the spot in
the booklet where a set percentage of minimally-proficient or minimally-advanced
students would pass the item. An alternative method is for the standard-setter
to mark where the difference in performance of the proficient and advanced
student on an exercise is a set minimum percentage of students.
This technique has the
advantage of being usable with both multiple-choice and constructed-response
exercises. It could be used with inventories or checklists since the percentage
of students at each level in an inventory or checklist could be calculated. It
would be challenging, however, to use this technique with portfolio assessments,
where an overall score is derived.
Body of Work.
In this technique, an appropriate group (an expert panel, a representative group
of users, or a policymaker group) examines all of the data for a student and
uses all of the information to place the student in one of the overall
performance levels (Kingston et al., 2001). On a test, the multiple-choice and
constructed-response performance of the student is examined together. Rather
than examining test items, standard-setters examine students, and determine what
combination of scores from the various test components would place a student in
the advanced or proficient category. Standard-setters are given a set of papers
that demonstrate the complete range of possible scores from low to high.
The advantage of this
method is that all of the information about a student is used to set standards,
which is an easier, more logical decision for a standard-setter to make.
Discussions are more focused on tangible students rather than intangible
percentages of students passing test items. This strategy could be used with
checklists, inventories, and assessments using scoring rubrics (such as
performance events or portfolios).
Judgmental Policy
Capturing. In this technique, an appropriate group (either an expert
panel, a representative group of users, or a policymaker group) reviews the
various components of an overall assessment (which might be quite similar or
quite dissimilar) and determines which of the components are more important than
others. This might suggest weighting one type of item more important than
another, or might be to weight one type of evidence (e.g., performance measures)
as more important than another (e.g., a checklist) (Jaeger, 1994, 1995).
This method allows very
dissimilar types of information to be used to make decisions about students, and
permits these to be weighted differentially. This is not a technique that has
been used widely, so little is known about its technical characteristics,
particularly in student assessment.
Which Technique to Use with Which Alternate
Assessments
The nature of the alternate
assessment used will help determine the type of standard-setting procedure that
will be used. In the case of portfolios, which include a variety of types of
evidence, the arbitrary, preponderance of evidence, whole student, or
policy-capturing procedures can be used. For alternate assessments that use
performance events, with a range of indicators associated with them, any of the
procedures could be used.
The technique used should
take into account not only technical aspects of the alternate assessment
strategies being used and the standard-setting strategy, but also the practical
aspects of implementing the standard-setting technique for the alternate
assessment process. For example, portfolios that take an hour to review may make
the whole student procedure, while technically sound, impractical to implement
on a statewide basis. Examining the amount of time that such reviews take – both
in the beginning of such an effort and after reviewers are experienced – is an
important part of the practical aspects that must be considered.
It will be important in the
years to come to document the standard-setting approaches used with various
types of alternate assessments. Unexpectedly high rates of students in the
alternate assessment system who are achieving at high levels may be a reason for
rethinking standard-setting approaches.
Angoff, W. M. (1971)
Scales, norms, and equivalent scores. In R. L. Thorndike (ed.) Educational measurement (2nd edition, pp. 508-600). Washington, DC: American
Council of Education.
Bechard, S. (2001). Models for reporting the results of alternate
assessments within state accountability systems (Synthesis Report 39).
Minneapolis, MN: University of Minnesota, National Center on Educational
Outcomes.
Cizek, G. J. (2001) Setting performance standards: Concepts,
methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Heumann, J. E., & Warlick,
K. R. (2000). Questions and answers about
provisions in the Individuals with Disabilities Education Act Amendments of 1997
related to students with disabilities and state and district-wide assessments
(Memorandum OSEP 00-24). Washington, DC: U.S. Department of Education, Office of
Special Education and Rehabilitative Services.
Jaeger, R. M. (1994, April) Setting standards through two-stage
judgemental policy capturing. Paper presented at the American Educational
Research Association and the National Council on Measurement in Education.
Jaeger, R. M. (1995)
Setting performance standards through two-stage judgmental policy capturing. Applied Measurement in Education, 8,
15-40.
Kingston, N., Kahl, S. R.,
Sweeney, K., & Bay, L. Setting performance standards using the body of work
method. In G. J. Cizek (ed.), Setting
performance standards: Concepts, methods, and perspectives. Mahwah, NJ:
Lawrence Erlbaum.
Lewis, D. M., Mitzel, H.
C., & Green, D. R. (1996, June) Standard
setting: A bookmark approach. Paper presented at the Council of Chief State
School Officers Large-Scale Assessment Conference, Colorado Springs, CO.
Livingston, S. A., & ,
Zeikey, M. J. (1982). Passing scores: A
manual for setting standards of performance on educational and occupational
tests. Princeton, NJ: Educational Testing Service.
Olson, J. F., Jones, I., &
Bond, L. (2001). State student assessment
programs annual survey: Data on 1998-99 statewide student assessment programs.
Washington, DC: Council of Chief State School Officers.
Roeber, E. D. (2001). Setting Standards Using Classroom Teacher
Judgment.
Dover, New Hampshire: Measured Progress.
Thompson, S. J., & Thurlow, M. L. (2001). 2001 State special education outcomes: A report on state activities at the beginning of a new decade. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.