Statistical Versus Policy Significance: An Illustration

Imagine that a large group of sixth graders in a particular school district scores an average of 75 on a year-end district math test, with a standard deviation of 10. Children who score below 55 are rated as low performing; from 55 to 64 as below average; from 65 to 84 as average; from 85 to 94 as above average; and 95 or above as exceptional. The gaps between lower- and higher-achieving children are substantial. The gap between a child in the middle of the below-average group and one in the middle of the above-average group is 30 points, or 3 SD. The gap between the top of the low-performing range and the bottom of the exceptional range is 40 points, or 4 SD.

The next year, the district sets up a yearlong tutoring program to raise the performance of the lowest-scoring children. Researchers conduct an end-of-year evaluation and find that the program increased those children's scores significantly, reporting a large effect size of 0.8 SD. The district concludes that it is a successful program that is closing the gap. In the real world, though, the "significant, large effect" the researchers found means that the average score of children in the tutoring group rose from 60 to 68—which is better, but makes only a small dent in the overall achievement gap and may not be the game-changer those children really need. The point is that an impact researchers describe as "significant" and "large" may not actually be significant or large from a policy point of view—and policymakers need to take that into account when making decisions about how to address a particular problem.
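For readers who want to check the arithmetic behind the illustration, here is a minimal sketch in Python. Every number is taken directly from the example above; the variable names are ours.

```python
# Working the sidebar's arithmetic: a district math test with mean 75
# and standard deviation 10. All figures come from the illustration.

sd = 10.0          # standard deviation on the district test
effect_size = 0.8  # reported effect of the tutoring program, in SD units

gain_points = effect_size * sd                  # 0.8 SD -> 8 points
score_before, score_after = 60.0, 60.0 + gain_points  # 60 -> 68

# Gaps from the illustration, converted to SD units:
mid_below_avg, mid_above_avg = 60, 90  # midpoints of 55-64 and 85-94
gap_middles = (mid_above_avg - mid_below_avg) / sd   # 30 points = 3.0 SD

top_low, bottom_exceptional = 55, 95
gap_extremes = (bottom_exceptional - top_low) / sd   # 40 points = 4.0 SD

print(f"Tutoring gain: {gain_points:.0f} points ({score_before:.0f} -> {score_after:.0f})")
print(f"Below-average to above-average gap: {gap_middles:.1f} SD")
print(f"Low-performing to exceptional gap: {gap_extremes:.1f} SD")
```

Run as written, this confirms the point of the illustration: even a gain researchers call "large" closes only 8 points of a 30- to 40-point spread.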
Policymakers must ultimately consider whether a program addresses an important social problem, whether outcomes justify the cost, and whether limited resources could be better spent on something else that more effectively addresses that particular problem.
Guidelines for Interpreting Study Findings

To interpret study findings correctly, three aspects of those findings are important to understand: effect sizes, the meaning of statistical versus practical significance, and the specific nature of the outcomes a study is measuring.

Effect Sizes. A program's impact on participating children is reported in different ways depending on the data presented. Sometimes results are reported in terms of percentages—such as "children who participated in the program had 30 percent less special education placement by fourth grade"—which are relatively intuitive. But sometimes results are reported in terms of "effect sizes"—such as "an effect size of 0.38 in math achievement"—which are harder to interpret.
An effect size is expressed as a fraction of one standard deviation (SD)—that is, an effect size of 0.2 is 20 percent of an SD. (See the sidebar above for an illustration of how standard deviations work.) Conventional guidelines consider effect sizes of less than 0.3 SD as "small," of 0.3 to 0.8 SD as "moderate," and of 0.8 SD or more as "large."15 For example, if researchers investigating a program's impact on children's math achievement find an effect size of 0.52 SD, they would usually describe that as a moderate effect on children's achievement.

However, what those labels mean in practice can vary depending on the context. To address that, effect sizes are sometimes translated into practical information, such as how many months of children's average annual gains an effect size represents. For example, a moderate effect size of 0.5 SD in reading would roughly translate to three months of the average achievement gain in kindergarten: in other words, children reached a particular level in September that they otherwise would not have reached until December. Similarly, a small effect size of 0.2 SD would mean they reached a level in September that they otherwise would not have reached until roughly mid-October.
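A minimal sketch of these two conversions follows. The rate of 0.5 SD per three months for kindergarten reading is taken from the text above; the function names, the simple mean-difference formula for the effect size, and the reuse of the tutoring sidebar's numbers are our own illustration, not a prescribed method.

```python
# Two calculations described above: computing an effect size, and
# translating it into months of average achievement gain.

def effect_size(treatment_mean: float, control_mean: float, sd: float) -> float:
    """Group difference expressed as a fraction of one standard deviation."""
    return (treatment_mean - control_mean) / sd

def months_of_learning(es: float, sd_per_month: float) -> float:
    """Translate an effect size into months of average achievement gain."""
    return es / sd_per_month

# The tutoring sidebar's numbers: scores rose from 60 to 68 with SD 10.
print(f"Tutoring program effect size: {effect_size(68, 60, 10):.1f} SD")  # 0.8 SD

# Kindergarten reading, per the text: 0.5 SD is roughly three months of gain.
K_READING_SD_PER_MONTH = 0.5 / 3.0
for es in (0.5, 0.2):
    months = months_of_learning(es, K_READING_SD_PER_MONTH)
    print(f"{es} SD is about {months:.1f} months of kindergarten reading gain")
# 0.5 SD -> 3.0 months (September to about December)
# 0.2 SD -> 1.2 months (September to about mid-October)
```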