Test Publishers' Views on Out-of-Level Testing


Out-of-Level Testing Project Report 3

Published by the National Center on Educational Outcomes

Prepared by John Bielinski, Jim Scott, Jane Minnema, and Martha Thurlow

November 2000


This document has been archived by NCEO because some of the information it contains may be out of date.


Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Bielinski, J., Scott, J., Minnema, J., & Thurlow, M. (2000). Test publishers' views on out-of-level testing (Out-of-Level Testing Project Report 3). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://cehd.umn.edu/NCEO/OnlinePubs/OOLT3.html


Executive Summary

Out-of-level testing is the administration of a test at a level above or below the level generally recommended for students based on their age or grade level. The existing literature on out-of-level testing includes several empirical studies that yielded inconsistent results and offered conflicting conclusions about the appropriateness of the practice. It therefore provides little direction for a practitioner who must decide whether, or how, to implement an out-of-level testing program. Given this lack of direction, it is important to determine what kinds of information or guidelines major test publishers make available to practitioners who wish to test students out of level.

We conducted a series of telephone interviews with test publishing company representatives. Additional information was obtained through "in-house" publications, catalogs, and technical manuals that were furnished by the testing companies. Through our questions we examined: (1) what kinds of information and guidance three major test publishers provide to their clients about the administration, scoring, and interpretation of out-of-level tests, (2) how test publishers address the technical issues associated with out-of-level testing, and (3) the extent to which the quality of information provided by test publishers is consistent with the Standards for Educational and Psychological Testing.

Our findings indicate that test publishers provide inadequate information on out-of-level test administration and score interpretation. Providing more complete information to clients should reduce the likelihood of misuse and misinterpretation. With the increasing role that test scores from large-scale assessments play in a student’s educational experience, it is incumbent upon test developers to ensure that misuse and misinterpretation are minimized.


Overview

Out-of-level testing is "the administration of a test at a level above or below the level generally recommended for students based on their age-grade level" (Study Group on Alternate Assessment, 1999). Out-of-level testing was originally introduced in school systems during the 1960s as a way to more accurately evaluate the effectiveness of Title I programs. Out-of-level testing was believed to result in more reliable scores for students who were achieving lower than their grade-level peers (Ayrer & McNamara, 1973). Adequate yearly progress (a measure of achievement growth) was used as the indicator of the effectiveness of the Title 1 program. Because it was thought that out-of-level testing would result in a more reliable measurement of student achievement, it should also result in a more reliable measurement of achievement growth.

Measurement theory provides the foundation for expecting out-of-level testing to improve the reliability of a student’s test score. Using item response theory, it can be shown that the precision with which a student’s score is estimated improves as the match between that student’s ability and the test’s difficulty increases (Crocker & Algina, 1986). A low score indicates that the test was too difficult for that student, and thus the match is poor; a score in the middle of the test score range indicates that the match is good. Giving a group of low-performing students an easier test should therefore increase test score reliability for that group, and targeting the appropriate level of test difficulty for every student should result in more accurate measurement of adequate yearly progress for all students.
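
This relationship can be made concrete with a small computation. The sketch below assumes the one-parameter (Rasch) IRT model and uses invented item difficulties; it shows that the standard error of a low-ability student's ability estimate shrinks when the student takes a test whose item difficulties match that ability.

```python
# A minimal sketch of the measurement-theory argument, assuming the Rasch model.
# Item information p*(1-p) peaks when ability (theta) equals item difficulty (b),
# so a test matched to the student estimates ability most precisely.
# All difficulty values are hypothetical, chosen only for illustration.
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def test_information(theta, difficulties):
    """Sum of item information; SE(theta) is 1/sqrt(information)."""
    return sum(p * (1.0 - p) for p in (rasch_prob(theta, b) for b in difficulties))

grade_level_test = [0.0, 0.5, 1.0, 1.5, 2.0]    # difficulties in logits
lower_level_test = [-2.0, -1.5, -1.0, -0.5, 0.0]

for theta in (-1.5, 0.0, 1.5):
    se_on = 1 / math.sqrt(test_information(theta, grade_level_test))
    se_off = 1 / math.sqrt(test_information(theta, lower_level_test))
    print(f"theta={theta:+.1f}: SE on grade-level test={se_on:.2f}, "
          f"SE on lower-level test={se_off:.2f}")
```

For a student at theta = -1.5, the lower-level test yields a noticeably smaller standard error than the grade-level test, which is the argument for out-of-level testing in a nutshell.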

Much of the research concerning out-of-level testing was conducted from the 1960s through the late 1980s. Several studies examined differences in test performance for groups of students who took both the grade-level test and an out-of-level test (Ayrer & McNamara, 1973; Crowder & Gallas, 1978; Easton & Washington, 1982; Slaughter & Gallas, 1978). The authors of these studies did not always articulate their expectations about how out-of-level test performance should differ from performance on the in-level test; however, when they found that test performance was higher on the in-level test, they usually attributed the difference to increased guessing. Some of these studies also compared test score reliability between the in-level scores and the out-of-level scores (Ayrer & McNamara, 1973; Cleland & Idstein, 1980; Easton & Washington, 1982). The premise was that low-performing students would be more likely to score above "chance" (a score that could be obtained by random guessing on every item) on the out-of-level tests, and scoring above chance was used to index the reliability of test scores in each group. Although scores below a certain threshold may be less reliable than scores above it, terms such as "chance level" can obscure the fundamental reason why low scores tend to be less reliable. The term also implies that any score below the threshold was obtained solely by random guessing, an assertion for which there is no basis.
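
The arithmetic behind a "chance" score is straightforward, and it supports the caution above. The sketch below uses a hypothetical 40-item test with four answer choices and a hypothetical threshold of 15 (numbers invented, not drawn from the studies cited): the expected guessing score is 10, yet clearing the threshold by guessing alone still has non-trivial probability, so a fixed chance cutoff cannot identify which students actually guessed.

```python
# A hedged illustration of the "chance level" notion: expected score from pure
# guessing, and the binomial probability of reaching a threshold by guessing
# alone, for a hypothetical 40-item, 4-option multiple-choice test.
from math import comb

n_items, n_options = 40, 4
p_guess = 1 / n_options
chance_score = n_items * p_guess  # expected raw score from pure guessing: 10

def prob_at_or_above(threshold):
    """P(raw score >= threshold) if every item is answered by random guessing."""
    return sum(comb(n_items, k) * p_guess**k * (1 - p_guess)**(n_items - k)
               for k in range(threshold, n_items + 1))

print(f"Expected chance score: {chance_score:.0f} of {n_items}")
print(f"P(score >= 15 by guessing alone): {prob_at_or_above(15):.3f}")
```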

Our review of the literature (Minnema, Thurlow, Bielinski, & Scott, 2000) indicated that there are many unresolved issues with out-of-level testing. Many studies yielded results that were either inconclusive or were inconsistent with the results of related studies. At this time there seem to be more questions associated with out-of-level testing than there are clear and definitive answers. The main issues relate to the following questions: Does out-of-level testing produce more reliable scores? Which students should take an out-of-level test? What level of the test should the student be given, and how is this determined? How should these scores be used or interpreted? The need for careful consideration before deciding to test students out-of-level is clear. An important part of this decision is to consider how test publishers have addressed these issues.

Test publisher responses to consumer needs related to out-of-level testing were not included in our literature review. Still, several studies specifically mentioned that CTB/McGraw-Hill, the publisher of the California Achievement Test, provided locator tests to help teachers select the most appropriate test level for individual students (Cleland & Idstein, 1980; Jones, Barnette, & Callahan, 1983; Slaughter & Gallas, 1978). Roberts (1976) reported that some test publishers provide tables that convert raw scores to expanded standard scores, raw scores to percentile ranks (CTB/McGraw-Hill, Harcourt Brace Educational Measurement), and expanded standard scores to percentile ranks (Riverside/Houghton Mifflin, Harcourt Brace Educational Measurement) for each grade and time of year. This suggests that some achievement test publishers are making an effort to assist their consumers in decisions about how to use scores from out-of-level tests.

In order to address some of the issues surrounding out-of-level testing, we interviewed major test publishers to determine what steps they have taken to ensure the integrity of the out-of-level testing process. We also examined the technical manuals, norm tables, bulletins, and other materials that are made available by test publishers. The following questions were explored:

• Can the company’s test be used for out-of-level testing?

• Do test publishers make recommendations about appropriate and inappropriate uses of test scores obtained from an out-of-level testing administration?

• Do test publishers provide data on how much error was introduced through their vertical equating procedure?

• Do test publishers provide guidelines or procedures to help identify students for whom out-of-level testing is appropriate, and do those guidelines or procedures address which level of the test such students should take?

• Are there recommendations about how many levels below or above grade level one can test and still use the test publishers’ norms?

• Are guidelines for scoring, interpreting, and reporting out-of-level tests provided?

This report summarizes the findings from interviews conducted with several test publishing company personnel who have expertise in psychometrics, and from an examination of technical manuals, locator test materials, and bulletins that are available to consumers. The purpose of the interviews was to obtain a current estimate of the availability of tests designed for out-of-level purposes, and to assess what types of information test publishers provide to practitioners who wish to use their tests for this purpose.


Method

Participants

Three companies were chosen to participate in this study: CTB/McGraw-Hill, Harcourt Educational Measurement (PsyCorp), and Houghton Mifflin (Riverside). A review of National Center on Educational Outcomes (NCEO) state survey publications and a Council of Chief State School Officers (CCSSO, 1999) publication showed that the tests published by these companies were being used by many states as part of their statewide large-scale assessment programs. The five most frequently administered norm-referenced achievement tests appeared to be the California Achievement Test (CAT), Iowa Tests of Basic Skills (ITBS), Metropolitan Achievement Test (MAT), Stanford Achievement Test (SAT), and the Terra Nova.

 

Survey Instrument

We generated a list of questions based on a critical review of the literature (Bielinski, Thurlow, Minnema, & Scott, 2000; Minnema et al., 2000). We determined that someone from each company with expertise in psychometrics would be the most appropriate person to respond to the range of questions we had developed. A research assistant contacted a representative from each publishing company and conducted interviews over a four-week period from November through December 1999. During the interviews the representatives were asked to furnish technical reports and any consumer publications that addressed issues related to out-of-level testing. The complete list of "standard" questions posed to each representative is shown in Table 1. Additional questions were asked during the interviews depending on the kind of information received, or when clarification was needed.

 

 Table 1. Questions designed for telephone interviews with test company representatives.

1. Is your test designed so that it can be used for out-of-level testing?

2. Is this explicitly stated in your technical manual?

3. Does your manual or do other publications include recommendations and guidelines for the practitioner who wishes to use the test for out-of-level testing?

4. How many levels below grade level would be appropriate for your test?

5. Are there locator tests available for determining which test level is most appropriate for an individual student? If no locator tests are available, does the publisher provide any recommendations to assist practitioners in making this decision?

6. Was item response theory used to estimate item parameters? Which model?

7. Which method was used to equate scores between test levels (common persons/common items)?

8. If common persons was the chosen method for equating, were kids with low scores oversampled, and was the equating sample representative of the norming sample for that grade?

9. Should an examinee get the same score on different levels of the test? If yes, is there any empirical evidence that supports this?

10. Do you include information in your technical manual concerning the standard error of equating?

11. Are norm tables provided to convert raw scores to standard scores and from standard score to normative scores?

12. Is information provided to guide consumers through the score conversion and interpretation process?

 

Table 2. Summary of responses to test publisher interview questions and information obtained from test company publication materials.

 

Are locator tests available?

• Riverside/Houghton Mifflin (ITBS): No. The "teacher will need to decide for each student whether a lower-level test better fits the student’s instructional level" after "the teacher becomes familiar with the content of the test levels under consideration."

• Harcourt Brace/PsyCorp (SAT-9, MAT-7): No.

• CTB/McGraw-Hill (CAT-5, Terra Nova): Yes, locator tests are provided. Locator test 1 is indicated for grades 1-6, and locator test 2 is indicated for grades 6-12. Complete directions for administering and scoring the locator test are available.

What method was used in equating?

• Riverside/Houghton Mifflin: Equipercentile.

• Harcourt Brace/PsyCorp: IRT.

• CTB/McGraw-Hill: IRT with a common forms design.

What amount of error is associated with the equating process?

• Riverside/Houghton Mifflin: Does not believe that anyone can determine this. Unaware of any empirical investigation of equating error.

• Harcourt Brace/PsyCorp: Not sure. Unaware of any empirical investigation of equating error.

• CTB/McGraw-Hill: Difficult to determine; "devilishly hard to get at." The representative did not believe that equating error would be greater than the standard error of measurement. Unaware of any empirical investigation of equating error.

Do you make explicit recommendations for using your instrument for out-of-level testing?

• Riverside/Houghton Mifflin: Publications address administering both higher and lower levels of the tests but do not include recommendations about reporting out-of-level test scores.

• Harcourt Brace/PsyCorp: Publications refer to "appropriate school policies and careful teacher judgments about the use of out-of-level testing to maximize the instructional benefits of standardized achievement testing." Publications do not recommend a method for determining which level to administer; however, they advise against going more than one level higher or lower.

• CTB/McGraw-Hill: Publications suggest that if locator tests indicate testing more than two levels out, content relevance should be considered. The representative would not recommend more than one or two levels out because of content validity issues, and stated that comparing derived scores is inappropriate. Publications cite advantages of out-of-level testing as "increased motivation, decreased frustration, and more accurate, meaningful scores."

Were low-scoring students oversampled?

• Riverside/Houghton Mifflin: Yes.

• Harcourt Brace/PsyCorp: No.

• CTB/McGraw-Hill: No. Stated that a national sample is a robust sample, although they did oversample for ethnicity.

Was the sample in the vertical equating study representative of the population for that grade?

• Riverside/Houghton Mifflin: Not necessary; it just has to represent the range of scores.

• Harcourt Brace/PsyCorp: Not available.

• CTB/McGraw-Hill: Not available.

Are norm tables provided to convert raw scores to standard scores and standard scores to normative scores (percentiles)? Is information provided to guide consumers through the score conversion and interpretation process?

• Riverside/Houghton Mifflin: There are tables that allow raw score conversion to standard scores, and standard score conversion to percentile ranks. IEP teams would determine which level students would take. Riverside does not do IRT scaling, although the technical representative stated that its scale is interpretable across all test levels. How scores are reported (aggregated/disaggregated) depends on the contract that Riverside has with the particular agency (unless the agency does the reporting itself).

• Harcourt Brace/PsyCorp: There are conversion tables (raw to scaled, scaled to normative). The representative did not believe there is anything to guide interpretation when converting out-of-level raw scores to in-level normative scores; rather, the representative stated that it can be done, but stressed again that the limit is one level.

• CTB/McGraw-Hill: Yes, there are conversion tables. A catalog was sent in response to the question concerning the conversion and interpretation process. The Terra Nova Assessment Products and Services Catalog referred to a coding system that could flag special populations (Title I students were mentioned) but did not address out-of-level test administration. The catalog indicates that CTB reports contain an explanation of how to interpret the reports, although it remains unknown whether this explanation includes details for an out-of-level interpretation. CTB assessment products can be scored locally by hand, locally with special software, by local scanner, or by CTB scoring services. It appears that interpretive guidance accompanies the reports prepared by CTB scoring services; one conclusion is that consumers may receive interpretive guidance only if they elect the CTB scoring option.

 


Results

The data from the interviews and the publications are organized and summarized in Table 2. Some of the most salient results are addressed here in more detail.

Can the Tests be Used for Out-of-Level Purposes?

Of the five norm-referenced tests that were examined, all were developed in a way that could accommodate out-of-level testing practices. In other words, each of the test publishers conducted some form of vertical equating so that scores from different levels of the test could be placed onto a single scale.
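
As an illustration of what vertical equating accomplishes, the sketch below shows one common linking approach: Rasch item difficulties estimated separately for two adjacent levels are aligned through the items the levels share, yielding a constant that places scores from the lower level onto the upper level's scale. The item names, difficulty values, and mean/mean method here are illustrative assumptions, not the procedure any of these publishers reported using.

```python
# A minimal sketch of mean/mean vertical linking through common items under
# the Rasch model. All difficulty estimates are invented for illustration.

# Hypothetical difficulties, each set calibrated on its own level's scale.
level_5_difficulties = {"itemA": -0.40, "itemB": 0.10, "itemC": 0.55}
level_6_difficulties = {"itemA": -1.20, "itemB": -0.70, "itemC": -0.25, "itemD": 0.90}

common = level_5_difficulties.keys() & level_6_difficulties.keys()
# Linking constant: average difference in difficulty for the shared items.
shift = sum(level_6_difficulties[i] - level_5_difficulties[i] for i in common) / len(common)

# Any level-5 ability estimate can now be reported on the level-6 scale.
theta_level5 = 0.30
print(f"Linking constant: {shift:+.2f} logits")
print(f"Level-5 theta {theta_level5:+.2f} -> level-6 scale {theta_level5 + shift:+.2f}")
```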

For Which Purposes is Out-of-Level Testing Appropriate or Inappropriate?

None of the test publishers we contacted make specific recommendations about the purposes for which out-of-level testing is appropriate (criterion-related decision-making, norm-referenced comparisons, monitoring academic growth over time, etc.).

Test Level Limitations

All three test publishers provide recommendations about how many levels above or below grade level a student can be tested and still obtain a score that can be interpreted with respect to the publisher’s norms. One publisher recommended that no student be tested more than one level above or below his or her grade level. Another recommended not going beyond two levels above or below grade level. Only one publisher allows for testing more than two levels above or below, and only if the teacher determines that the content of the test closely matches the content of the student’s curriculum.

Which Level of the Test Should a Student Receive?

CTB/McGraw-Hill provides locator tests to assist the practitioner in determining which level of the test to administer to individual students. A publication that accompanied the locator test indicated that the consumer could go more than two levels above or below grade level as long as the content of the test adequately matched the student’s curriculum, with the teacher making this decision. In contrast, the company’s technical representative believed that testing more than two levels out would be inappropriate, citing issues with content validity.

Riverside does not provide locator tests. Rather, its reports recommend that the student’s teacher should determine which level of the test to administer. Riverside recommended that teachers become familiar with the content of the different test levels before deciding which level to administer. Riverside’s technical representative did not make any recommendations about how many levels above or below grade level would be appropriate.

Harcourt Educational Measurement/PsyCorp does not provide locator tests. In its consumer publications, the company firmly recommends that its test not be administered more than one level above or below the student’s grade level.

How Should Scores from Out-of-Level Tests be Interpreted?

Riverside provides norm tables that allow conversions from the raw score to the standard score and from the standard score to the normative score. Tests are scored on a case-by-case basis according to the contract that the school system has developed with Riverside. School systems can, however, elect to score the tests themselves. Harcourt Educational Measurement provides similar conversion tables, although the representative did not believe that any information is provided to guide consumers through an out-of-level conversion and interpretation of test results.
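
The two-step conversion these norm tables support can be sketched schematically: a raw score on the level actually taken is converted to the publisher's standard (scaled) score, and the scaled score is then converted to a normative score such as a percentile rank for the student's enrolled grade. All table values below are invented for illustration; actual values come from a publisher's norm tables.

```python
# A schematic of norm-table lookup for an out-of-level administration.
# Hypothetical values: a level-3 raw-to-scaled table and grade-4 norms.
import bisect

raw_to_scaled_level3 = {20: 562, 21: 568, 22: 574, 23: 580}
grade4_norms = [(540, 10), (560, 20), (580, 35), (600, 50)]  # (scaled, percentile)

def percentile_rank(scaled):
    """Highest norm entry at or below the scaled score (clamped at the bottom)."""
    scores = [s for s, _ in grade4_norms]
    idx = bisect.bisect_right(scores, scaled) - 1
    return grade4_norms[max(idx, 0)][1]

raw = 22
scaled = raw_to_scaled_level3[raw]  # step 1: out-of-level raw -> scaled score
print(f"Raw {raw} on level 3 -> scaled {scaled} -> grade-4 percentile {percentile_rank(scaled)}")
```

The open question the report raises is precisely whether the second step, applying grade-level norms to a scaled score earned on a different level, is defensible, and under what limits.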

Psychometric Properties of Out-of-Level Test Scores: Reliability and Equating Error

No information was provided about the reliability of the equating constant that is used to place the scores from each level of the test onto a common scale. None of the technical representatives was able to provide a statistically derived estimate of the amount of error that was introduced through the equating process.
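
Although none of the representatives could supply one, a statistically derived estimate of equating error is obtainable in principle. The sketch below illustrates one generic approach: a bootstrap standard error for a mean/mean linking constant computed from common-item difficulty estimates. The data are simulated; this is a sketch of the idea, not any publisher's procedure.

```python
# A hedged sketch of estimating the standard error of equating by bootstrap.
# Simulated paired difficulty estimates (lower level, upper level) for 12
# common items; the true shift is -0.8 logits plus estimation noise.
import random
import statistics

random.seed(2000)
pairs = [(b, b - 0.8 + random.gauss(0, 0.15))
         for b in [random.uniform(-1, 1) for _ in range(12)]]

def linking_constant(sample):
    return statistics.mean(upper - lower for lower, upper in sample)

# Resample items with replacement and recompute the linking constant.
boot = [linking_constant(random.choices(pairs, k=len(pairs))) for _ in range(2000)]
print(f"Linking constant: {linking_constant(pairs):+.3f} logits")
print(f"Bootstrap standard error of equating: {statistics.stdev(boot):.3f}")
```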

When Should Consumers Decide to Test Students Out-of-Level?

All of the test companies stated that content alignment (the alignment between test items and the instruction that the student is receiving) is a key factor in determining whether out-of-level testing is appropriate. Two of the test company publications cited other considerations for deciding whether an out-of-level test should be used. A CTB/McGraw-Hill publication stated that any added challenges of administering out-of-level tests were outweighed by the "increased motivation, decreased frustration, and more accurate, meaningful scores" that result from out-of-level testing. Riverside (Houghton Mifflin) publications stated that testing a student with a test level that is too difficult "would likely frustrate the student immensely."


Discussion

The most recent edition of the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999) indicates that test publishers should provide consumers with information concerning each recommended interpretation and use of test scores, the procedures used to interpret the scores, and clear explanations of the meaning and intended interpretation of derived score scales and their limitations. Furthermore, test publishers should provide explicit forewarning if there is sound reason to believe that specific misinterpretations of a scale score are likely. It seems reasonable that consumers of achievement tests would be more likely to misinterpret or misuse scores obtained through an out-of-level testing program in the absence of firm guidelines from the test publisher.

The guidelines that test publishers furnish the consumer about who should take an out-of-level test vary. Generally, publishers leave the decision to the classroom teacher. The guidelines are short on specifics, but they stress the importance of the match between test content and each student’s curriculum. Because the guidelines are vague and vary from one publisher to another, informed decisions and best practices may be compromised. Decisions to test out of level are complex, and it is important that test publishers make readily available the technical assistance needed to guide them.

Test publishers need to provide clients with more detailed guidelines on how to use norm tables for out-of-level test scores. If a test publisher believes that scores such as the percentile rank or normal curve equivalent for a particular grade are appropriate only when the grade-level test is used, then that should be clearly stated. If, instead, the use of grade-level norms is appropriate under certain conditions (e.g., only if the student takes the grade-level test or one of the adjacent levels), then those conditions need to be articulated. Furthermore, data are needed to demonstrate that the scaled score a student obtains from one level of the test is the same (within measurement error) as his or her score from another level of the test. The test publishers simply did not provide this degree of specificity when we asked them for information.

The new Standards also indicate that test publishers should provide detailed technical information on the accuracy of equating functions and that information about the intended interpretation and limitations of the score conversions should be clearly defined. Test companies did not provide detailed information on the accuracy of their equating functions. The amount of error associated with the equating process is of particular importance if scores from one level of a test battery are to be converted and interpreted with scores that were obtained by students who took the in-level test.

The Standards also state that if a common interpretation of test scores is inconsistent with available evidence, that fact should be made clear, and potential users should be cautioned against making unsupported interpretations. Test publishers should also provide a comprehensive summary of the evidence and theory bearing on the intended use or interpretation of their product.

Out-of-level testing may represent an efficient way to increase test score reliability for examinees who would otherwise obtain low scores on the in-level test, but improving reliability does not ensure improved validity. When inferences are made about an examinee’s proficiency on well-defined standards, yet the examinee takes a test other than the one specifically designed to measure those standards, validity is compromised. Many states are using norm-referenced tests, or customized versions of these tests, to measure student performance against their standards, underscoring the need for well-defined guidelines on how out-of-level test scores should be interpreted.

Our findings indicate that test publishers need to provide more detailed information about the use of out-of-level testing and the interpretation of scores that result from its use. More complete information would reduce the likelihood of misuse and misinterpretation of such data. With the increasing role that test scores from large-scale assessments play in a student’s educational experience and in broader accountability measures, it is incumbent upon test developers to ensure that misuse and misinterpretation are minimized.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Arter, J. A. (1982, March). Out-of-level versus in-level testing: When should we recommend each? Paper presented at the annual meeting of the American Educational Research Association, New York, NY.

Ayrer, J. E., & McNamara, T. C. (1973). Survey testing on an out-of-level basis. Journal of Educational Measurement, 10(2), 79-84.

Bielinski, J., Thurlow, M., Minnema, J., & Scott, J. (2000). How out of level testing affects the psychometric quality of test scores (Out-of-Level Testing Project Report 2). Minneapolis: University of Minnesota, National Center on Educational Outcomes.

Clarke, M. (1983, November). Functional level testing decision points and suggestions to innovators. Paper presented at the meeting of the California Educational Research Association, Los Angeles, CA.

Cleland, W. E., & Idstein, P. M. (1980, April). In-level versus out-of-level testing of sixth grade special education students. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA.

Council of Chief State School Officers. (1999, Fall). State student assessment programs: Annual survey. Washington, DC: Author.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Crowder, C. R., & Gallas, E. J. (1978, March). Relation of out-of-level testing to ceiling and floor effects on third and fifth grade students. Paper presented at the annual meeting of the American Educational Research Association, Toronto, Ontario, Canada.

Easton, J. A., & Washington, E. D. (1982, March). The effects of functional level testing on five new standardized reading achievement tests. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.

Jones, E. D., Barnette, J. J., & Callahan, C. M. (1983, April). Out-of-level testing for special education students with mild learning handicaps. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Minnema, J., Thurlow, M., Bielinski, J., & Scott, J. (2000). Past and present understandings of out-of-level testing: A research synthesis (Out-of-Level Testing Project Report 1). Minneapolis: University of Minnesota, National Center on Educational Outcomes.

Roberts, A. (1976). Out-of-level testing (ESEA Title I Evaluation and Reporting System, Technical Paper No. 6). Mountain View, CA: RMC Research Corporation.

Slaughter, H. B., & Gallas, E. J. (1978, March). Will out-of-level norm-referenced testing improve the selection of program participants and the diagnosis of reading comprehension in ESEA Title I programs? Paper presented at the annual meeting of the American Educational Research Association, Toronto, Ontario, Canada.

Study Group on Alternate Assessment. (1999). Alternate assessment resource matrix: Considerations, options, and implications (ASES SCASS Report). Washington, DC: Council of Chief State School Officers.