Channelling Fisher: Randomization Tests and the Statistical
Insignificance of Seemingly Significant Experimental Results*
London School of Economics
This draft: February 2016
Abstract

I follow R.A. Fisher’s The Design of Experiments, using randomization statistical
inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003
regressions in 53 experimental papers drawn from the journals of the American Economic
Association. Randomization F/Wald tests of the significance of treatment coefficients find that 30 to 40 percent of equations with an individually significant coefficient cannot reject the null of no treatment effect. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect anywhere. Bootstrap and simulation methods support and confirm these results.
*I am grateful to Alan Manning, Steve Pischke and Eric Verhoogen for helpful comments, to Ho Veng-Si for numerous conversations, and to the following scholars (and by extension their co-authors) who, displaying the highest standards of academic integrity and openness, generously answered questions about their randomization methods and data files: Lori Beaman, James Berry, Yan Chen, Maurice Doyon, Pascaline Dupas, Hanming Fang, Xavier Giné, Jessica Goldberg, Dean Karlan, Victor Lavy, Sherry Xin Li, Leigh L. Linden, George Loewenstein, Erzo F.P. Luttmer, Karen Macours, Jeremy Magruder, Michel André Maréchal, Susanne Neckerman, Nikos Nikiforakis, Rohini Pande, Michael Keith Price, Jonathan Robinson, Dan-Olof Rooth, Jeremy Tobacman, Christian Vossler, Roberto A. Weber, and Homa Zarghamee.
I: Introduction

In contemporary economics, randomized experiments are seen as solving the problem of endogeneity, allowing for the identification and estimation of causal effects. Randomization, however, has an additional strength: it allows for the construction of exact test statistics, i.e. test statistics whose distribution does not depend upon asymptotic theorems or distributional assumptions and is known in each and every sample. Randomized experiments rarely make use of such methods, relying instead upon conventional econometrics and its asymptotic theorems.
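The logic of a randomization test can be sketched as a simple permutation test of the sharp null of no treatment effect. This is a minimal illustration on simulated data, not the procedure used in this paper, which re-draws assignments according to each experiment's actual randomization design:

```python
import numpy as np

def randomization_pvalue(y, d, n_draws=2000, seed=0):
    """Randomization test of the sharp null of no treatment effect.
    Under the null the outcomes y are unaffected by treatment, so the
    treatment labels d can be reshuffled and the statistic recomputed,
    giving an exact reference distribution in any sample."""
    rng = np.random.default_rng(seed)
    observed = abs(y[d == 1].mean() - y[d == 0].mean())
    hits = 0
    for _ in range(n_draws):
        d_new = rng.permutation(d)  # re-draw the random assignment
        hits += abs(y[d_new == 1].mean() - y[d_new == 0].mean()) >= observed
    return hits / n_draws

# illustrative data with no true treatment effect
rng = np.random.default_rng(42)
y = rng.normal(size=200)
d = np.repeat([0, 1], 100)  # balanced treatment assignment
p_null = randomization_pvalue(y, d)
```

The p-value is exact by construction: it is the share of admissible re-randomizations that produce a statistic at least as extreme as the one observed, with no appeal to asymptotic theory.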
In this paper I apply randomization tests to randomized experiments, using them to construct counterparts to conventional F and Wald tests of significance within regressions and, more ambitiously, an exact omnibus test of overall significance that combines all of the regressions in a paper in a manner that is, practically speaking, infeasible in conventional econometrics. I find that randomization F/Wald tests at the equation level reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent, while the omnibus test finds that, when all treatment outcome equations are combined, only 25 to 50 percent of papers can reject the null of no treatment effect. These results relate purely to statistical inference, as I do not modify published regressions in any way. I confirm them with bootstrap statistical inference, present empirical simulations of the bias of conventional methods, and show that the equation-level power of randomization tests is virtually identical to that of conventional methods in idealized situations where conventional methods are also exact.
Two factors lie behind the discrepancy between the results reported in journals and those produced in this paper. First, published papers fail to consider the multiplicity of tests implicit in the many treatment coefficients within regressions and the many regressions presented in each paper. About half of the regressions presented in experimental papers contain multiple treatment regressors, representing indicators for different treatment regimes or interactions of treatment with participant characteristics. When these regressions contain a .01-level significant coefficient, there are on average 5.8 treatment measures, of which only 1.7 are significant. I find treatment measures within regressions are generally mutually orthogonal, so the finding of a significant coefficient in a regression should be viewed as the outcome of multiple independent rolls of 20-sided or 100-sided dice. However, only 31 of 1036 regressions with multiple treatment measures report a conventional F- or Wald-test of the joint significance of all treatment variables within the regression.1 When tests of joint significance are applied, far fewer regressions show significant effects. I find that additional significant results appear, as additional treatment regressors are added to equations within papers, at a rate comparable to that implied by random chance under the null of no treatment effect. Specification search, as measured by the numbers of treatment regressors, produces additional significant results at a rate that is consistent with spurious correlation.
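The arithmetic behind the dice analogy is straightforward. As an illustration under the orthogonality finding above (treating the paper's reported average of 5.8 treatment measures as a count of independent tests):

```python
# Chance of at least one spuriously significant coefficient among m
# mutually independent treatment measures, under the null of no effect.
def prob_any_rejection(m, alpha):
    return 1 - (1 - alpha) ** m

# With the average of 5.8 treatment measures per regression noted above:
p_01 = prob_any_rejection(5.8, 0.01)  # roughly .057 at the .01 level
p_05 = prob_any_rejection(5.8, 0.05)  # roughly .257 at the .05 level
```

A reader who counts any individually significant coefficient as a significant result is thus implicitly running a test whose true size is several times its nominal level.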
While treatment coefficients within regressions are largely orthogonal, treatment coefficients across regressions, particularly significant regressions, are highly correlated. The typical paper reports 10 regressions with a treatment coefficient that is significant at the .01 level, and 28 regressions with no treatment coefficient that is significant at this level.2 I find that the randomized and bootstrapped distributions of the coefficients and p-values of significant regressions are highly correlated across equations, while the insignificant regressions are much more independent. Thus, the typical paper presents many independent tests that show no treatment effect and a small set of correlated tests that show a treatment effect. When combined, this information suggests that most experiments have no significant effects. I should note that this result is unchanged when I restrict attention only to regressions with dependent variables that produce a significant treatment coefficient in at least one regression. Thus, it is not a consequence of combining the results of regressions of variables that are never significantly correlated with treatment with those concerning variables that are consistently correlated with treatment. Dependent variables that are found to be significantly related to treatment in a subset of highly correlated specifications are not significantly related to treatment in many other, statistically independent, specifications.
The second factor explaining the lower significance levels found in this paper is the fact that published papers make heavy use of statistical techniques that rely upon asymptotic theorems that are largely invalidated, and rendered systematically biased in favour of rejection, by their regression design. Chief amongst these methods are the robust and clustered estimates of variance, which are designed to deal with unspecified heteroskedasticity and correlation across observations. The theorems that underlie these and other asymptotic methods depend upon maximal leverage in the regression going to zero, but in the typical regression design it is actually much closer to its upper limit of 1. High leverage allows for a greater spread in the bias of covariance estimates and an increase in their variance, producing an unaccounted-for thickening of the tails of test distributions, which leads to rejection rates greater than nominal size. The failure and potential bias of asymptotic methods is, perhaps, most immediately recognized by noting that no less than one fifth of the equation-level coefficient covariance matrices in my sample are singular, implying that their covariance estimate of some linear combination of coefficients is zero, i.e. a downward bias of 100 percent. I show that the conventional test statistics of my experimental papers, when corrected for the actual thickness of the tails of their distributions, produce significant results at rates that are close to those of randomization tests.

1 These occur in two papers. In an additional 8 regressions in two other papers the authors attempt to test the joint significance of multiple treatment measures, but accidentally leave out some treatment measures. In another paper the authors test whether a linear combination of all treatment effects in 28 regressions equals zero, which is not a test of the null of no treatment effect, but is closer. F-tests of the equality of treatment effects across treatment regimes (excluding control) or in non-outcome regressions (e.g. tests of randomization balance) are more common.

2 Naturally, I only include treatment outcome regressions in these calculations and exclude regressions related to randomization balance (participant characteristics) or attrition, which, by demonstrating the orthogonality of treatment with these measures, confirm the internal validity of the random experiment.
Conventional econometrics, in effect, cannot meet the demands placed on it by the regressions of published papers. Maximal leverage is high in the typical paper because the authors condition on a number of participant observables, either to improve the precision with which treatment effects are estimated or to convince sceptical referees and readers that their results are robust. These efforts, however, undermine the asymptotic theorems the authors rely on, producing test statistics that are biased in favour of rejecting the null hypothesis of no treatment effect when it is true. Randomization inference, however, remains exact regardless of the regression specification. Moreover, randomization inference allows the construction of omnibus Wald tests that easily combine all of the equations and coefficient estimates in a paper. In finite samples such tests are a bridge too far for conventional econometrics, producing hopelessly singular covariance estimates and biased test statistics when they are attempted. Thus, randomization inference plays a key role in establishing the validity of both themes of this paper: the bias of conventional methods and the importance of aggregating the multiplicity of tests implicitly presented in papers.
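The leverage condition invoked above is easy to check numerically. A minimal sketch, using a simulated design matrix rather than data from any surveyed paper:

```python
import numpy as np

def max_leverage(X):
    """Maximal diagonal element of the hat matrix H = X (X'X)^{-1} X'.
    The asymptotic theory behind robust/clustered variance estimates
    requires this to go to zero; values near its upper bound of 1
    signal that the asymptotic approximation is unreliable."""
    Q, _ = np.linalg.qr(X)  # for full column rank X, H = Q Q'
    return float((Q ** 2).sum(axis=1).max())

rng = np.random.default_rng(1)
n, k = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
h_max = max_leverage(X)  # modest in this well-behaved design
```

Conditioning on many participant observables, and especially on indicators that isolate a handful of observations, pushes some diagonal elements of the hat matrix toward 1: an observation picked out by its own dummy variable attains leverage of exactly 1.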
The reader looking for a definitive breakdown of the results between the contribution of the multiplicity of tests and the contribution of the finite sample bias of asymptotic methods should be forewarned that a unique deconstruction of this sort simply does not exist. The reason for this is that the coverage bias, i.e. rejection probability greater than nominal size, of conventional tests increases with the dimensionality of the test.3 I find, both in actual results and in size simulations, that the gap between conventional and randomization/bootstrap tests is small at the coefficient level, larger at the equation level (combining coefficients) and enormous at the paper level (combining all equations and coefficients, in the few instances where this is possible using conventional techniques). If one first uses conventional methods to move from coefficients to equations to paper level tests (where it is possible to implement them conventionally) and then compares the paper level results with randomization tests, one concludes that the issue of multiplicity is of modest relevance and the gap between conventional and randomization inference (evaluated at the paper level) explains most of the results. If, however, one first compares conventional and randomization results at the coefficient level and then uses randomization inference to move from coefficients to equations to paper level tests, one concludes that the gap between randomization and conventional inference is small, and multiplicity (as captured in the rapidly declining significance of randomization tests at higher levels of aggregation) is all important. 
The evaluation of these differing paths is further complicated by the fact that power also compounds with the dimensionality of the test, and that tests with excess size typically have greater power, which, depending upon whether one wishes to give the benefit of the doubt to the null or the alternative, alters one's view of conventional and randomization tests.
Although I report results at all levels of aggregation, I handle these issues by focusing on presenting the path of results with maximum credibility. F/Wald tests of the overall significance of multiple coefficients within an equation are eminently familiar and easily verifiable, so I take as the first step the conventional comparison of individual coefficient versus equation level significance. The application of conventional F/Wald tests to equations with multiple treatment measures finds that 12 and 26 percent of equations (at the .01 and .05 levels, respectively) that have at least one significant treatment coefficient are found to have, overall, no significant treatment effect. Allowing for single treatment coefficient equations whose significance is unchanged, these conventional tests reduce the number of equations with significant treatment effects by 8 to 17 percent at the .01 and .05 levels, respectively. Moving further, from the equation to the paper level, using conventional covariance estimates for systems of seemingly unrelated equations is largely infeasible, as the covariance matrices produced by this method are usually utterly singular. I am able to calculate such a conventional test for only 9 papers, and simulations show that the test statistics have extraordinarily biased coverage (i.e. a .30 rejection probability at the .01 level). Hence, it is not credible to advance to the paper level analysis using conventional methods.

3 A possible reason for this lies in the fact that coverage bias relative to nominal size for each individual coefficient is greater at smaller nominal probabilities, i.e. the ratio of tail probabilities is greater at more extreme outcomes. In the Wald tests below, after the transformation afforded by the inverse of the coefficient covariance matrix, the test statistic is interpreted as being the sum of independently distributed squared random variables. As the number of such variables increases, the critical value for rejection is increased. This requires, however, an accurate assessment of the probability each squared random variable can, by itself, attain increasingly extreme values. As the dimensionality of the test increases this assessment is proportionately increasingly wrong and the overall rejection probability rises.