have large impacts, substantial, useful policy learning has come from this class of experimental evaluations (e.g., Gueron & Rolston, 2013; Haskins & Margolis, 2014). For example, the experimentation that focused on reforming the U.S. cash public assistance program was incremental in its influence. Evaluations of that program, which was called Aid to Dependent Children (ADC) and then Aid to Families with Dependent Children (AFDC) from 1935 until 1996 and has been Temporary Assistance for Needy Families (TANF) since then, amassed evidence that informed many policy changes. That evidence persuaded policymakers to change various aspects of the program's rules, to emphasize a work focus rather than an education one, and to end the program's entitlement status.
Nudge or Opportunistic Experiments
In recent years, a surge of "opportunistic" or "nudge" experiments has arisen. An "opportunistic" experiment is one that takes advantage of a given opportunity: when a program plans a change, for funding or administrative reasons, the evaluation can capitalize on that plan and configure a way to learn about the effects of the change. A "nudge" experiment tends to focus on behavioral insights or administrative systems changes that can be randomized in order to improve program efficiency. Both opportunistic and nudge experiments tend to involve relatively small changes, such as to communications or to program enrollment or compliance processes, but they may apply to large populations, such that even a small change can result in meaningful savings or benefits. For example, in the fall of 2015, the Obama administration established the White House Social and Behavioral Sciences Team (SBST) to improve administrative efficiency and embed experimentation across the bureaucracy, creating a culture of learning and capitalizing on opportunities to improve government function.
The SBST 2016 Annual Report highlights 20 completed experiments that illustrate how tweaking programs' eligibility rules and processes can expand access, enrollment, and related favorable outcomes. For instance, a test of automatic enrollment in retirement savings among military service members boosted enrollment by 8.3 percentage points, from a low of 44% to over 52%, a start at bringing the savings rate closer to the 87% observed among civilian federal employees. Similarly, waiving the application requirement for some children in the National School Lunch and School Breakfast Programs increased enrollment, thereby enhancing access to food among vulnerable children. Both of these efforts were tested via an experimental evaluation design, which randomized who had access to the new policy so that the difference between the new regime's outcomes and the outcomes of the status quo could be interpreted as the causal result of the new policy. In both cases, these were relatively small administrative changes that took little effort to implement; they could be implemented across a large system, implying the potential for meaningful benefits in the aggregate.
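Under such a design, the impact estimate is simply the difference in mean outcomes between the randomly assigned groups. As a minimal sketch, using the enrollment figures cited above (the treatment-group rate of roughly 52.3% is implied by the reported 8.3 percentage point gain over the 44% baseline):

$$\widehat{\Delta} = \bar{Y}_T - \bar{Y}_C \approx 0.523 - 0.440 = 0.083,$$

where $\bar{Y}_T$ is the enrollment rate among those randomized to automatic enrollment and $\bar{Y}_C$ is the rate under the status quo, so the 8.3 percentage point difference can be read as the causal effect of the policy change.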
Rapid-Cycle Evaluation
Rapid-cycle evaluation is another relatively recent development within the broader field of program evaluation. In part because it is so new, it does not yet have a single, settled definition. Some scholars assert that rapid-cycle evaluation must be experimental in nature, whereas others define it as any quick-turnaround evaluation activity that provides feedback for ongoing program development and improvement. Regardless, rapid-cycle evaluations that use an experimental evaluation design are relevant to this book. To deliver quick turnaround, these evaluations tend to involve questions similar to those asked by nudge or opportunistic experiments, along with outcomes that can be measured in the short term and still be meaningful. Furthermore, the data that inform impact analyses for rapid-cycle evaluations tend to come from existing administrative sources, which are quicker to collect and analyze than survey or other new primary data.
Meta-Analysis and Systematic Reviews
The fourth set of evaluation research relevant to experiments involves meta-analysis, including tiered-evidence reviews. Meta-analysis involves quantitatively aggregating other evaluation results in order to ascertain, across studies, the extent and magnitude of program impacts observed in the existing literature. These analyses tend to prioritize larger and more rigorous studies, down-weighting results that are based on small samples or that use designs that do not meet criteria for establishing a causal connection between a program and change in outcomes. Indeed, some meta-analyses use only evidence that comes from experimentally designed evaluations. Likewise, evidence reviews—such as those provided by the What Works Clearinghouse (WWC) of the U.S. Department of Education—give their highest rating to evidence that comes from experiments. Because of this, I classify meta-analyses as a type of research that is relevant to experimentally designed evaluations.
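To make the weighting idea concrete, one standard formulation (a generic fixed-effect sketch, not the procedure of any particular review named here) averages each study's impact estimate using weights equal to the inverse of its sampling variance, so that large, precise studies count for more and small-sample studies are down-weighted:

$$\bar{\theta} = \frac{\sum_{i=1}^{k} w_i\, \hat{\theta}_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{\widehat{\mathrm{Var}}(\hat{\theta}_i)},$$

where $\hat{\theta}_i$ is the impact estimate from study $i$ and $k$ is the number of studies included in the synthesis.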
Getting Inside the Black Box
Across these four main categories of experimental evaluation, there has been substantial effort to move beyond estimating the average treatment effect and to understand more about how impacts vary along a variety of dimensions. For example, how do treatment effects vary across subgroups of interest? What are the mediators of treatment effects? How do treatment effects vary with program implementation features or with the fidelity of implementation to program theory? Most efforts to move beyond the average treatment effect involve data analytic strategies rather than evaluation design strategies. These analytic strategies have been advanced in order to expose what is inside the "black box."
As noted in Box 1.1, the black box refers to the program as implemented, which can be somewhat of a mystery in impact evaluations: We know what the impact was, but we have little idea what caused it. In order to expose what is inside the black box, impact evaluations often are paired with implementation evaluation. The latter provides the detail needed to understand the program's operations. That detail is helpful descriptively: It allows the user of the evaluation to associate the impact with some details of the program from which it arose. The way I have described this so far is at an aggregate level: The program's average impact represents what the program as a whole did or offered. Commonly, though, a program is not a single thing: It can vary by setting, by the population it serves, by design elements, by various implementation features, and also over time. The changing nature of interventions in practice demands that evaluation also account for that complexity.1
Within the field of program evaluation, the concept of impact variation has gained traction in recent years. The program's average impact is one metric by which to judge the program's worth, but that impact is likely to vary along multiple dimensions. For example, it can vary for distinct subgroups of participants. It might also vary depending on program design or implementation: Programs that offer X and Y might be more effective than those offering only X, and programs whose frontline staff have greater experience or whose manager is an especially dynamic leader might be more effective than those without such staff or leadership. These observations about what makes up a program and how it is implemented have become increasingly important as potential drivers of impact.
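One common way to express such moderated impacts analytically (a generic sketch, not a model specific to this book) is to interact the random assignment indicator with a subgroup or program-feature indicator:

$$Y_i = \alpha + \beta T_i + \gamma S_i + \delta\,(T_i \times S_i) + \varepsilon_i,$$

where $T_i$ indicates assignment to the program, $S_i$ indicates membership in the subgroup (or exposure to the program feature) of interest, $\beta$ is the impact for the reference group, and $\beta + \delta$ is the impact for the subgroup; a nonzero $\delta$ signals impact variation.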
Accordingly, the field has expanded the way it thinks about impacts and has become increasingly interested in impact variation. Assessments of how impacts vary (what works, for whom, and under what circumstances) are currently an important topic within the field. Although the field has expanded its toolkit of analytic strategies for understanding impact variation and addressing "what works" questions, this book will focus on design options for examining impact variation.2
1 In Peck (2015), I explicitly discuss “programmatic complexity” and “temporal complexity” as key factors that suggest specific evaluation approaches, both in design and analysis.
2 For a useful treatment of the relevant analytic strategies—including an applied illustration using the Moving to Opportunity (MTO) demonstration—I refer the reader to Chapter 7 in New Directions for Evaluation #152 (Peck, 2016).
The Ethics of Experimentation
Prior research and commentary consider whether it is ethical to randomize access to government and nonprofit services (e.g., Bell & Peck, 2016). Are those who "lose