Making Classroom Assessments Reliable and Valid. Robert J. Marzano
Such a coefficient ranges from 0.00 to 1.00, with 1.00 meaning there is no random error operating in an assessment and 0.00 indicating that the test scores consist entirely of random error. While there are no published tests with a reliability of 1.00 (simply because it’s impossible to construct such a test), there are also none published with a reliability even remotely close to 0.00. Indeed, David A. Frisbie (1988) notes that most published tests have reliabilities of about 0.90, whereas most teacher-designed tests have much lower reliabilities of about 0.50. Others have reported higher reliabilities for teacher-designed assessments (for example, Kinyua & Okunya, 2014). Leonard S. Feldt and Robert L. Brennan (1993) add a cautionary note to the practice of judging an assessment from its reliability coefficient:
Although all such standards are arbitrary, most users believe, with considerable support from textbook authors, that instruments with coefficients lower than 0.70 are not well suited to individual student evaluations. Although one may quarrel with any standard of this sort, many knowledgeable test users adjust their level of confidence in measurement data as a hazy function of the magnitude of the reliability coefficient. (p. 106)
As discussed earlier, the reliability coefficient tells us how much a set of scores for the same students would differ from administration to administration, but it tells us very little about the scores of individual students. The only way to examine the precision of individual scores is to calculate a confidence interval around each observed score. Confidence intervals are described in detail in technical note I.1 (page 110), but conceptually they can be illustrated rather easily. As an illustration, table I.2 depicts the 95 percent confidence interval around an observed score of seventy-five out of one hundred points for tests with reliabilities ranging from 0.55 to 0.85.
Table I.2: Ninety-Five Percent Confidence Intervals for Observed Score of 75
Note: The standard deviation of this test was 8.33 and the upper and lower limits have been rounded.
Table I.2 depicts a rather disappointing situation. Even when a test has a reliability of 0.85, an observed score of 75 has a 95 percent confidence interval of 69 to 81. When the reliability is as low as 0.55, the confidence interval extends from 64 to 86. From this perspective, CAs appear almost useless because they carry so much random error. Fortunately, there is another perspective on reliability that can render CAs more precise and, therefore, more useful.
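Although the book’s treatment of these intervals appears in technical note I.1, a minimal sketch of the arithmetic may help. It assumes the conventional standard error of measurement, SEM = SD × √(1 − reliability), and a 95 percent multiplier of 1.96; the function name and the use of Python are illustrative only, and the two reliabilities shown simply reproduce the endpoints of table I.2 cited above.

```python
import math

def confidence_interval(observed, sd, reliability, z=1.96):
    """95 percent confidence interval around an observed score, using the
    conventional standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Observed score of 75 and standard deviation of 8.33, as in table I.2.
for r in (0.55, 0.85):
    low, high = confidence_interval(75, 8.33, r)
    print(f"reliability {r:.2f}: {round(low)} to {round(high)}")
# reliability 0.55: 64 to 86
# reliability 0.85: 69 to 81
```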
The New CA Paradigm for Reliability
As long as the reliabilities of CAs are determined using reliability coefficients based on formulas that examine differences in the patterns of scores across students, there is little chance that teachers can demonstrate the precision of their assessments for individual students. These traditional formulas typically require a great many items and a great many examinees to be used in a meaningful way, but classroom teachers usually have relatively few items on their tests (which are administered to relatively few students).
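To make the contrast concrete, the sketch below computes one widely used coefficient of this type, Cronbach’s alpha. The book does not single out this particular formula; it is offered here only as an example of a coefficient that depends on variation in scores across many items and many examinees.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix, where item_scores[i][j] is
    examinee i's score on item j."""
    n_items = len(item_scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each item across examinees, and variance of examinees' total scores.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(n_items)]
    total_var = variance([sum(row) for row in item_scores])

    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
```

With only a handful of items administered to a classroom-sized group of students, an estimate from a formula like this is too unstable to say much about the precision of individual scores.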
This problem is solved, however, if we consider CAs in sets administered over time. The perspective of calculating reliability from sets of assessments administered over time has been in the literature for decades (see Rogosa, Brandt, & Zimowski, 1982; Willett, 1985, 1988). Specifically, a central tenet of this book is that one should examine the reliability of CAs from the perspective of groups of assessments on the same topic administered over time (as opposed to a single assessment at one point in time). To illustrate, consider the following five scores, each from a separate assessment on the same topic, administered to a specific student over time (such as a grading period): 71, 75, 81, 79, 84.
We must analyze the pattern that these scores exemplify to determine the reliability, or precision, of the student’s scores across the set. This requires a new foundational equation, different from the one used in traditional assessment, that accounts for the timing of each assessment. The basic equation for analyzing student learning over time is:
Observed score = time of assessment (true score) + error
What this equation adds to the basic equation from traditional assessment is that the true score for a particular student on a particular test exists at a particular time. A student’s true score, then, changes from assessment to assessment. Time is now a factor in any analysis of the reliability of CAs, and there is no need to assume that students have not changed from assessment to assessment.
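Expressed informally in code, the time-based equation treats the true score as a function of when the assessment occurs. The linear growth function and error spread below are arbitrary illustrations, not values taken from the book.

```python
import random

def observed_score(time, true_score_at, error_sd=3.0):
    """Time-based equation: observed score = true score (at the time of
    assessment) + random error."""
    return true_score_at(time) + random.gauss(0, error_sd)

# Illustrative only: a true score that rises three points per assessment occasion.
def true_score(time):
    return 69 + 3 * time

simulated = [observed_score(t, true_score) for t in range(1, 6)]
```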
As we administer more CAs to a student on the same topic, we have more evidence about the student’s increasing true score. Additionally, we can track the student’s growth over time. Finally, using this time-based approach, the pattern of scores for an individual student can be analyzed mathematically to compile the best estimates of the student’s true scores on each of the tests in the set. Consider figure I.2.
Figure I.2: Linear trend for five scores over time from an individual student.
Note that there are five bars and a line cutting across those bars. The five vertical bars represent the individual student’s observed scores on five assessments administered on one topic over a given period of time (let’s say a nine-week grading period).
Normally, an average of these five scores is computed to represent the student’s final score for the grading period. In this case, the average of the five scores is 78. This doesn’t seem to reflect the student’s learning, however, because three of the observed scores were higher than this average. Alternatively, the first four scores might be thought of as formative practice only. In this case, the last score of 84 is considered the summative score, and it would be the only one reported. But if we consider this single final assessment in isolation, we also must consider the error associated with it. As shown in table I.2, even if the assessment had a reliability coefficient of 0.85, we would have to add and subtract six points to be reasonably sure of capturing the student’s true score. That range of scores within the 95 percent confidence interval would be 78 to 90.
Using the new paradigm for CAs and the new time-based equation, estimates of the true score on each assessment can be made. This is what the line cutting through the five bars represents. The student’s observed score on the first test was 71, but the estimated true score was 72. The second observed score was 75, as was the estimated true score, and so on.
We consider how this line and others are computed in depth in chapter 4 (page 83), but here the point is that analyzing sets of scores for the same student on the same topic over time allows us to estimate the student’s true scores rather than rely on the observed scores alone. When we report a final summative score for the student, we can do so with much more confidence. In this case, the observed final score of 84 is the same as the predicted score, but now we have the evidence of the previous four assessments to support the precision of that summative score.
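Chapter 4 presents the book’s actual procedure for computing such trend lines. As a rough stand-in, an ordinary least squares line fit to the five observed scores reproduces the estimated true scores described above; the function below is only a sketch of that idea in Python (the book itself points readers to tools like Excel).

```python
def linear_trend(scores):
    """Fit a least squares line to scores ordered by assessment occasion and
    return the trend value (estimated true score) at each occasion."""
    times = list(range(1, len(scores) + 1))
    n = len(scores)
    mean_t = sum(times) / n
    mean_s = sum(scores) / n
    slope = (sum((t - mean_t) * (s - mean_s) for t, s in zip(times, scores))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_s - slope * mean_t
    return [intercept + slope * t for t in times]

print(linear_trend([71, 75, 81, 79, 84]))  # [72.0, 75.0, 78.0, 81.0, 84.0]
```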
This approach also allows us to see how much a student has learned. In this case, the student’s first score was 71, and his last score was 84, for a gain of thirteen points. Finally, chapter 3 (page 59) presents ways to estimate students’ true scores across a set of assessments that do not rely on complex mathematical calculations. I address the issue of measuring student growth in chapters 3 and 4. This book also presents formulas that allow educators to program readily available tools like Excel to perform all the calculations.
The Large-Scale Assessment Paradigm for Validity
The general definition for the validity of an assessment is that