Making Classroom Assessments Reliable and Valid. Robert J. Marzano
and [led] to development and adoption of a common core of state standards … in English language arts and mathematics for grades K–12” (as cited in Rothman, 2011, p. 62). This effort, referred to as the Common Core State Standards (CCSS), resulted in the establishment of two state consortia that were tasked with designing new assessments aligned to the standards. One consortium was the Partnership for Assessment of Readiness for College and Careers (PARCC); the other was the Smarter Balanced Assessment Consortium (SBAC):
Each consortium planned to offer several different kinds of assessments aligned to the CCSS, including year-end summative assessments, interim or benchmark assessments (used throughout the school year), and resources that teachers could use for formative assessment in the classroom. In addition to being computer-administered, these new assessments would include performance tasks, which require students to demonstrate a skill or procedure or create a product. (Marzano, Yanoski, Hoegh, & Simms, 2013, p. 7)
These efforts are still under way, although the assessments are now less widely used than they were at their inception.
Next, I discuss previous abuses of large-scale assessments that occurred in the first half of the 20th century (Houts, 1977). To illustrate the nature and extent of these abuses, consider the first usable intelligence test, which Alfred Binet developed in 1905. It was grounded in the theory that intelligence was not a fixed entity; rather, educators could remediate low intelligence if they identified it. As Leon J. Kamin (1977) notes, Binet’s book on the nature and use of his test includes a chapter, “The Training of Intelligence,” in which he outlines educational interventions for those who scored low on his test. The implied focus was clearly on helping low-performing students. It wasn’t until the Americanized version of the test (the Stanford-Binet, developed by Lewis M. Terman, 1916) that the concept of IQ solidified as a fixed entity with little or no chance of improvement. Consequently, educators used the IQ test to identify students with low intelligence so they could monitor and deal with them accordingly. Terman (1916) notes:
In the near future intelligence tests will bring tens of thousands of these high-grade defectives under the surveillance and protection of society. This will ultimately result in curtailing the reproduction of feeble-mindedness and in the elimination of an enormous amount of crime, pauperism, and industrial inefficiency. It is hardly necessary to emphasize that the high-grade cases, of the type now so frequently overlooked, are precisely the ones whose guardianship it is most important for the State to assume. (pp. 6–7)
The perspective that Lewis Terman articulated became widespread in the United States and led to the development of the Army Alpha test by Arthur Otis, one of Terman’s students. According to Kamin (1977), the National Academy of Sciences analyzed the performance scores of 125,000 draftees and published the results in a 1921 report titled Memoirs of the National Academy of Sciences: Psychological Examining in the United States Army (Yerkes, 1921). The report contains the chapter “Relation of Intelligence Ratings to Nativity,” which focuses on an analysis of about twelve thousand draftees who reported that they were born outside of the United States. Each draftee was assigned a letter grade from A to E, and the distribution of these letter grades was analyzed for each country of origin. The report notes:
The range of differences between the countries is a very wide one …. In general, the Scandinavian and English speaking countries stand high in the list, while the Slavic and Latin countries stand low … the countries tend to fall into two groups: Canada, Great Britain, the Scandinavian and Teutonic countries … [as opposed to] the Latin and Slavic countries. (Yerkes, 1921, p. 699)
Clearly, the perspective regarding intelligence has changed dramatically, and large-scale assessments have come a long way in their use of test scores since the early part of the 20th century. Yet even now, the mere mention of the terms large-scale assessment or standardized assessment prompts criticisms to which assessment experts must respond (see Phelps, 2009).
The Place of Classroom Assessment
An obvious question is, What is the rightful place of CA? Discussions regarding current uses of CA typically emphasize its inherent value and the advantages it provides over large-scale assessments. For example, McMillan (2013b) notes:
It is more than mere measurement or quantification of student performance. CA connects learning targets to effective assessment practices teachers use in their classrooms to monitor and improve student learning. When CA is integrated with and related to learning, motivation, and curriculum it both educates students and improves their learning. (p. 4)
Bruce Randel and Tedra Clark (2013) explain that CAs “play a key role in the classroom instruction and learning” (p. 161). Susan M. Brookhart (2013) explains that CAs can be a strong motivational tool when used appropriately. M. Christina Schneider, Karla L. Egan, and Marc W. Julian (2013) identify CA as one of three components of a comprehensive assessment system. Figure I.1 depicts the relationship among these three systems.
Figure I.1: The three systems of assessment.
As depicted in figure I.1, CAs are the first line of data about students. They provide ongoing evidence about students’ current status on specific topics derived from standards. Additionally, according to figure I.1, CAs should be the most frequently used form of assessment.
Next are interim assessments. Schneider and colleagues (2013) describe them as follows: “Interim assessments (sometimes referred to as benchmark assessments) are standardized, periodic assessments of students throughout a school year or subject course” (p. 58).
Year-end assessments are the least frequent type of assessments employed in schools. Schneider and colleagues (2013) describe them in the following way:
States administer year-end assessments to gauge how well schools and districts are performing with respect to the state standards. These tests are broad in scope because test content is cumulative and sampled across the state-level content standards to support inferences regarding how much a student can do in relation to all of the state standards. Simply stated, these are summative tests. The term year-end assessment can be a misnomer because these assessments are sometimes administered toward the end of a school year, usually in March or April and sometimes during the first semester of the school year. (p. 59)
While CAs have a prominent place in discussions about comprehensive assessment, they have continually exhibited weaknesses that limit their use or, at least, the confidence with which they can be interpreted. For example, Cynthia Campbell (2013) notes that “research investigating evaluation practices of classroom teachers has consistently reported concerns about the adequacy of their assessment knowledge and skill” (p. 71). Campbell (2013) lists a variety of concerns about teachers’ design and use of CAs, including the following.
■ Teachers have little or no preparation for designing and using classroom assessments.
■ Teachers’ grading practices are idiosyncratic and erratic.
■ Teachers have erroneous beliefs about effective assessment.
■ Teachers make little use of the variety of assessment practices available.
■ Teachers don’t spend adequate time preparing and vetting classroom assessments.
■ Teachers’ evaluative judgments are generally imprecise.
Clearly, CAs are important, and researchers widely acknowledge their potential role in the overall assessment scheme. But there are many issues that must be addressed before CAs can assume their rightful role in the education process.
Reliability and Validity at the Heart of the Matter
Almost all problems associated with CAs find their ultimate source in the concepts of reliability and validity. Reliability is generally described as the accuracy of a measurement. Validity is generally thought of as the extent to which an assessment measures