The New Art and Science of Classroom Assessment. Robert J. Marzano

operations are to demonstrating proficiency. It does not make clear how important remainders are to demonstrating proficiency, and it seems to treat four operations equally, although they have significant differences in their execution.

As another example of equivocality in standards, consider the following high school ELA standard and upper elementary civics standard.

ELA (high school): Analyze multiple interpretations of a story, drama, or poem (e.g., recorded or live production of a play or recorded novel or poetry), evaluating how each version interprets the source text (RL.11–12.7; NGA & CCSSO, 2010a)

Civics (grades 3–5): Identify the major duties, powers, privileges, and limitations of a position of leadership (e.g., class president, mayor, state senator, tribal chairperson, president of the United States) … evaluate the strengths and weaknesses of candidates in terms of the qualifications of a particular leadership role. (section H, standard 1; Center for Civic Education, 2014)

In both of these examples, assessments would be quite different depending on a teacher’s selection of available options. For example, in the ELA standard, comparing the treatment of the same content in a story and a poem is a quite different task from comparing a story and a play. In the civics standard, knowing the duties, powers, and privileges of a class president is a quite different task from knowing the duties of a state senator.

Standards as Inconsequential

We have observed two practices that appear to address the problems associated with standards but, in fact, render standards inconsequential: (1) tagging multiple standards and (2) relying on sampling across standards.

Tagging Multiple Standards

One common practice is for teachers to assess standards by simply tagging multiple standards in the tests they give. For example, assume that a teacher has created the following assessment in a seventh-grade ELA class.

We have been reading Roll of Thunder, Hear My Cry by Mildred D. Taylor, which tells the story of Cassie Logan and her family who live in rural Mississippi. In the novel, Taylor develops several themes. Describe how the author develops the theme of the importance of family through characters, setting, and plot. Compare the importance of family theme with one other theme from the book. Write a short essay that explains which of the two themes you think is the most important to the development of the novel. Justify your choice with logical reasoning and provide textual evidence.

Because the teacher must cover all the seventh-grade ELA standards, he or she simply identifies all those standards directly or tangentially associated with this assessment. For example, the teacher might assert that this assessment addresses the following Common Core standards to one degree or another:

RL.7.1

Cite several pieces of textual evidence to support analysis of what the text says explicitly as well as inferences drawn from the text.

RL.7.2

Determine a theme or central idea of a text and analyze its development over the course of the text; provide an objective summary of the text.

RL.7.10

By the end of the year, read and comprehend literature, including stories, dramas, and poems, in the grades 6–8 text complexity band proficiently, with scaffolding as needed at the high end of the range.

WHST.6–8.1.B

Support claim(s) with logical reasoning and relevant, accurate data and evidence that demonstrate an understanding of the topic or text, using credible sources.

WHST.6–8.10

Write routinely over extended time frames (time for reflection and revision) and shorter time frames (a single sitting or a day or two) for a range of discipline-specific tasks, purposes, and audiences. (NGA & CCSSO, 2010a)

In effect, then, the teacher uses the score on one test to represent a student’s standing on five separate standards. Such an approach gives the perception that teachers are addressing standards, but in reality it constitutes a record-keeping convention that wastes teachers’ time and renders the standards inconsequential. In fact, we believe that this approach is the antithesis of using standards meaningfully.

Relying on Sampling Across Standards

At first glance, it might appear that designing assessments that sample content from multiple standards solves the problem of too much content. If a teacher has seventy-three standards statements to cover in a year, he or she can design assessments that include items from multiple statements. One assessment might have items from three or more statements. If a teacher systematically samples across the standards in such a way as to emphasize all topics equally, then in the aggregate, the test scores for a particular student should paint an accurate picture of the student’s standing within the subject area. This is different from and better than tagging because the teacher designs assessments by starting with the standards. With tagging, the teacher designs assessments and then looks for standards that appear to be related.

Even though sampling has an intuitive logic to it, it still doesn’t work well with classroom assessments. Indeed, sampling was designed for large-scale assessments, but even there it doesn’t work very well. To illustrate, consider the following example:

You are tasked with creating a test of science knowledge and skills for grade 5 students. The school will report test results at both the individual and school levels to help students, parents, teachers, and leaders understand how well students are learning the curriculum. The test must address a variety of topics such as X, Y, Z and, in order to effectively assess their knowledge, many of the items require students to construct and justify responses. Some of the items are multiple choice.

Pilot testing of items indicates that students require about 10 minutes to complete a constructed response item and about two minutes to complete a multiple-choice item. Your team has created 32 constructed response items and 16 multiple choice items that you feel cover all topics in the grade 5 science curriculum. Based on your estimates of how much time a student needs to complete items, the test will require approximately 6 hours to complete, not including time for set up and instructions, and breaks. And that’s just one content area test. (Childs & Jaciw, 2003, p. 8)

We can infer from the comments of Ruth A. Childs and Andrew P. Jaciw (2003) that adequate sampling, even for three topics, requires a very long assessment. As a side note, Childs and Jaciw (2003) imply that fifth-grade science involves three topics only (for example, X, Y, and Z). In fact, Simms (2016) has determined that fifth-grade science involves at least twelve topics, four times the amount of content that Childs and Jaciw’s (2003) example implies.
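To make the arithmetic behind this example concrete, the following minimal Python sketch reproduces the time estimate from the item counts and per-item times in the quoted passage. The function names are our own, and the twelve-topic projection assumes that testing time grows in proportion to the number of topics, which is an illustrative assumption rather than a claim made by Childs and Jaciw (2003) or Simms (2016).

```python
# Per-item times come from the pilot-testing figures in the quoted example.
CONSTRUCTED_RESPONSE_MINUTES = 10
MULTIPLE_CHOICE_MINUTES = 2


def total_test_minutes(constructed_items: int, multiple_choice_items: int) -> int:
    """Estimated testing time in minutes, excluding setup, instructions, and breaks."""
    return (constructed_items * CONSTRUCTED_RESPONSE_MINUTES
            + multiple_choice_items * MULTIPLE_CHOICE_MINUTES)


def scaled_minutes(base_minutes: int, base_topics: int, target_topics: int) -> float:
    """Hypothetical projection: assumes time grows in proportion to topic count."""
    return base_minutes * target_topics / base_topics


# Item counts from the quoted example: 32 constructed response, 16 multiple choice.
three_topic_minutes = total_test_minutes(32, 16)

# Assumption (ours): twelve topics need four times the testing time of three topics.
twelve_topic_minutes = scaled_minutes(three_topic_minutes, 3, 12)

print(f"Three topics: {three_topic_minutes} minutes "
      f"({three_topic_minutes / 60:.1f} hours)")
print(f"Twelve topics: {twelve_topic_minutes:.0f} minutes "
      f"({twelve_topic_minutes / 60:.1f} hours)")
```

With the figures from the example, the three-topic test comes to 352 minutes, just under the approximately six hours that the example cites. Under the proportional assumption, covering twelve topics at the same depth would take roughly four times as long.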

Finally, even with a relatively slim version of the content involved in fifth-grade science (three topics as opposed to twelve), and a test that requires six hours to complete, the sampling process might not be robust enough to justify reporting scores for individual students. Childs and Jaciw (2003) describe the following concern for any test that purports to provide accurate scores for individual students:

Whether there is enough information at the student level to report subscores may be a concern. For example, research by Gao, Shavelson, and Baxter (1994) suggests that each student must answer at least nine or ten performance tasks to avoid very large effects of person-by-item interactions. To produce reliable subscores, even more items may have to be administered. Given that there are limits in test administration time, it may not be feasible to administer enough items to support student-level subscores. Instead, only overall scores might be reported at the student level, while both overall scores and subscores are reported at the school level. (p. 8)

Despite these clear flaws in sampling procedures as the basis for test design, educators use them all the time. Everyone in the system (students, teachers, leaders, parents) relies on the resulting information to make important decisions that influence student grades, placement in classes and coursework, and advancement to the next grade or course.

As we mention in the introduction, using proficiency scales solves a variety of assessment problems,
