theory, and practice—paid little if any attention to classroom assessment. Finally, both editions of The Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, 2014)—which, as their titles indicate, are designed to set standards for testing in both psychology and education—made little explicit reference to classroom assessment. It wasn’t until the fourth edition, published in the first decade of the 21st century (Brennan, 2006), that a chapter addressing classroom assessment was included.
Most recently, the SAGE Handbook of Research on Classroom Assessment took a stand for the rightful place of classroom assessment: “This book is based on a single assertion: Classroom assessment (CA) is the most powerful type of measurement in education that influences student learning” (McMillan, 2013a, p. xxiii). Throughout this text, I take the same perspective. I also use the convention of referring to classroom assessment as CA; since the publication of the SAGE Handbook, this abbreviation has become the norm in many technical discussions of classroom assessment theory. My intent is for this book to be both technical and practical.
What, then, is the place of CAs in the current K–12 system of assessment, and what is their future? This resource attempts to lay out a future for CA that will render it the primary source of evidence regarding student learning, in stark contrast to the current situation, in which formal measurement of students is left to interim assessments, end-of-course assessments, and state assessments. In this introduction, I discuss the following topics with regard to CAs.
■ The curious history of large-scale assessments
■ The place of classroom assessment
■ Reliability and validity at the heart of the matter
■ The need for new paradigms
■ The large-scale assessment paradigm for reliability
■ The new CA paradigm for reliability
■ The large-scale assessment paradigm for validity
■ The new CA paradigm for validity
Before delving directly into the future of CA, it is useful to consider the general history of large-scale assessments in U.S. education, since that history is the foundation of current practices in CA.
The Curious History of Large-Scale Assessments
The present and future of CA are intimately tied to the past and present of large-scale assessments. In 2001, educational measurement expert Robert Linn published “A Century of Standardized Testing: Controversies and Pendulum Swings.” Linn notes that large-scale assessment began in the 19th century and that its original purpose was comparison.
Educators commonly refer to J. M. Rice as the inventor of the comparative large-scale assessment. This attribution is based on his 1895 assessment of the spelling ability of some thirty-three thousand students in grades 4 through 12, for which comparative results were reported (Engelhart & Thomas, 1966). However, assessments administered in 1845 to several hundred students in seventeen schools in Boston and one school in Roxbury predate Rice’s effort. For that reason, Horace Mann, who initiated the 1845 examinations, deserves credit as the first to administer large-scale tests. Lorrie A. Shepard (2008) elaborates on Mann’s contribution, noting:
In 1845, Massachusetts State Superintendent of Instruction, Horace Mann, pressured Boston school trustees to adopt written examinations because large increases in enrollments made oral exams unfeasible. Long before IQ tests, these examinations were used to classify pupils … and to put comparative information about how schools were doing in the hands of state-level authority. (p. 25)
Educators designed these early large-scale assessments to help solve perceived problems within the K–12 system. For example, in 1909, Leonard P. Ayres published the book Laggards in Our Schools: A Study of Retardation and Elimination in City School Systems. Despite its insensitivity in labeling large groups of students in unflattering ways, the book drew attention to the problems associated with repeatedly retaining students in grade levels and helped buttress the goal of reformers who wanted to develop programs that would mitigate failure.
The first half of the 20th century was not a flattering era for large-scale assessments. They focused on natural intelligence, and educators used them to classify examinees. To say the least, this era did not represent the initial or current intent of large-scale assessment. I address this period in more detail shortly.
By the second half of the 20th century, educators began to use large-scale assessments more effectively. Such assessments were a central component of James Bryant Conant’s (1953) vision of schools designed to provide students with guidance as to appropriate career paths and with support in pursuing them.
The use of large-scale assessment increased dramatically in the 1960s. According to Shepard (2008), the modern era of large-scale assessment started in the mid-1960s: “Title I of the Elementary and Secondary Education Act (ESEA) of 1965 launched the development of the field of educational evaluation and the school accountability movement” (p. 26). Shepard (2008) explains that it was the ESEA mandate for data with which to scrutinize the reform efforts that compelled the research community to develop more finely tuned evaluation tools: “The American Educational Research Association began a monograph series in 1967 to disseminate the latest thinking in evaluation theory, and several educational evaluation organizations and journals date from this period” (p. 26).
The National Assessment of Educational Progress (NAEP) began in 1969 and “was part of the same general trend toward large-scale data gathering” (Shepard, 2008, p. 27). However, researchers and policymakers designed NAEP for program evaluation rather than for the evaluation of individual student performance.
The need to gather and use data about individual students gave rise to minimum competency testing in the United States. The movement spread quickly, and by 1980 “all states had a minimum competency testing program or a state testing program of some kind” (Shepard, 2008, p. 31). But this effort, too, ran aground because of the time and resources that large-scale competency tests required.
The next wave of school reform was the “excellence movement” spawned by the high-visibility report A Nation at Risk (National Commission on Excellence in Education, 1983). The report cited low standards and a watered-down curriculum as reasons for the lackluster performance of U.S. schools. It also faulted the minimum competency movement, noting that focusing on minimum requirements distracted educators from the nobler and more appropriate goal of maximizing students’ competencies.
Fueled by these criticisms, researchers and policymakers focused on the identification of rigorous and challenging standards for all students in the core subject areas. Standards work in mathematics set the tone for the reform:
Leading the way, the National Council of Teachers of Mathematics report on Curriculum and Evaluation Standards for School Mathematics (1989) expanded the purview of elementary school mathematics to include geometry and spatial sense, measurement, statistics and probability, and patterns and relationships, and at the same time emphasized problem solving, communication, mathematical reasoning, and mathematical connections rather than computation and rote activities. (Shepard, 2008, p. 35)
By the early 1990s, virtually every major academic subject area had sample standards for K–12 education.
Shepard (2008) notes that standards-based reform, begun in the 1990s, “is the most enduring of test-based accountability reforms” (p. 37). However, she also cautions that the version of this reform enacted in No Child Left Behind (NCLB) “contradicts core principles of the standards movement,” mostly because the assessments associated with NCLB did not focus adequately on the application and use of knowledge reflected in the standards researchers had developed (Shepard, 2008, p. 37). In addition, the accountability system that accompanied NCLB focused on rewards and punishments.
The beginning of the new century saw an emphasis on testing that was highly focused on standards. In 2009, the National Governors Association Center for Best Practices (NGA) and the Council of Chief State School Officers