Forschungsinstitut zur Zukunft der Arbeit
Institute for the Study of Labor

IZA Discussion Paper No. 8583

Statistical Power of Within and Between-Subjects Designs in Economic Experiments

Charles Bellemare
Laval University and IZA

Luc Bissonnette
Laval University

Sabine Kröger
Laval University and IZA

October 2014

IZA
P.O. Box 7240
53072 Bonn
Germany
Phone: +49-228-3894-0
Fax: +49-228-3894-180
E-mail: email@example.com

Any opinions expressed here are those of the author(s) and not those of IZA. Research published in this series may include views on policy, but the institute itself takes no institutional policy positions.
The IZA research network is committed to the IZA Guiding Principles of Research Integrity.
The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center and a place of communication between science, politics and business. IZA is an independent nonprofit organization supported by Deutsche Post Foundation. The center is associated with the University of Bonn and offers a stimulating research environment through its international network, workshops and conferences, data service, project support, research visits and doctoral program. IZA engages in (i) original and internationally competitive research in all fields of labor economics, (ii) development of policy concepts, and (iii) dissemination of research results and concepts to the interested public.
IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion.
Citation of such a paper should account for its provisional character. A revised version may be available directly from the author.
ABSTRACT
This paper discusses the choice of the number of participants for within-subjects (WS) designs and between-subjects (BS) designs based on simulations of statistical power allowing for different numbers of experimental periods. We illustrate the usefulness of the approach in the context of field experiments on gift exchange. Our results suggest that a BS design requires 4 to 8 times more subjects than a WS design to reach an acceptable level of statistical power. Moreover, the predicted minimal sample sizes required to correctly detect a treatment effect with a probability of 80% greatly exceed sizes currently used in the literature. Our results suggest that adding experimental periods in an experiment can substantially increase the statistical power of a WS design, but has very little effect on the statistical power of a BS design. Finally, we discuss issues relating to numerical computation and present the powerBBK package programmed for STATA. This package allows users to conduct their own analysis of power for the different designs (WS and BS), conditional on user-specified experimental parameters (true effect size, sample size, number of periods, noise levels for control and treatment, error distributions), statistical tests (parametric and nonparametric), and estimation methods (linear regression, binary choice models (probit and logit), and censored regression models (tobit)).
JEL Classification: C8, C9, D03

Keywords: within-subjects design, between-subjects design, sample sizes, statistical power, experiments
Corresponding author:

Sabine Kröger
Laval University
Department of Economics
Pavillon J.A. DeSève
Québec G1V 0A6
Canada
E-mail: firstname.lastname@example.org

* Part of the paper was written at the Institute of Finance at the School of Business and Economics at Humboldt Universität zu Berlin and at the Department of Economics at Zurich University. We thank both institutions for their hospitality. We thank Nicolas Couët for his valuable research assistance. We are grateful to participants at the ASFEE conference in Montpellier (2012), the ESA meeting in New York (2012), the IMEBE in Madrid (2013), and seminar participants at the Department of Economics at Zurich University (2013) and at Technische Universität Berlin (2013).
1 Introduction

Researchers planning an experimental study have to decide on the number of subjects, treatments, and experimental periods to employ, and whether to conduct a within- or between-subjects design. All these decisions require a careful balancing between the chance of finding an existing effect and the precision with which this effect can be measured.1 For example, subjects taking part in a within-subjects (WS hereafter) design are exposed to several treatment conditions, while subjects in a between-subjects (BS hereafter) design are exposed to only one. WS designs thus offer the possibility to test theories at the individual level and can boost statistical power, making it more likely to correctly reject a null hypothesis in favor of an alternative hypothesis. They can, however, also generate spurious treatment effects, notably order effects. BS designs, on the other hand, can attenuate order effects but may have lower statistical power, as we illustrate in this paper. Charness, Gneezy, and Kuhn (2012) summarize the tradeoff between the two designs by saying: “Choosing a design means weighing concerns over obtaining potentially spurious effects against using less powerful tests.” (p. 2). In addition, the number of subjects and the number of periods (McKenzie, 2012) affect the statistical power of a study. As a result, understanding the statistical power of WS and BS designs in relation to sample size and periods is an essential step in the process of designing economic experiments.
More generally, recent work has raised awareness about the relationship between the power of statistical tests and optimal experimental design (e.g., List, Sadoff, and Wagner (2011); Hao and Houser (forthcoming)). Yet statistical power remains largely undiscussed and unreported in published experimental economic research. Zhang and Ortmann (2013), for example, reviewed all articles published in Experimental Economics between 2010 and 2012 and failed to find a single study discussing optimal sample size in relation to statistical power.2 We conjecture that this can partly be explained by the incompatibility of existing power formulas, derived under very specific conditions, with experimental data. The formulas are not adapted to the diversity of experimental data (WS and BS designs; discrete, continuous, and censored outcomes; multiple periods; non-normal errors), nor are they available for the variety of statistical tests (nonparametric and parametric) used in the literature. This incompatibility poses challenges to experimentalists interested in predicting power for the designs they consider. As a result, researchers may unknowingly conduct underpowered experiments, which wastes scarce resources and can guide research in unwanted directions.3

The main objective of this paper is to provide experimental economists with a simple unified framework to compute the ex-ante power of an experimental design (WS or BS) using simulation methods. Simulation methods are general enough to be used in conjunction with a variety of statistical tests (nonparametric and parametric), estimation methods (for linear and non-linear models), and sample sizes used in experimental economics. They can also easily handle settings with non-normal errors. Closed-form expressions for statistical power, in contrast, are typically derived for simple statistical models and tests and tend to be valid only under specific conditions (e.g., large sample sizes, normally distributed errors). Under other conditions, power computation using closed-form expressions may overestimate the level of power in finite samples (see, e.g., Feiveson, 2002). The simulation approach to power computation is simple and well known in applied statistics and can help researchers determine the number of subjects, the number of periods, and the design (WS or BS) required to reach an acceptable level of statistical power.

In this paper we focus on simulating the statistical power of a test of the null hypothesis of no treatment effect against a specific alternative.4 For our simulations, we consider a population of agents whose outcome variable is generated by a possibly non-linear panel data model that depends on a binary treatment variable, individual unobserved heterogeneity, and idiosyncratic shocks. From this population, researchers sample subjects and assign them to either treatment or control over several periods. In this setup, a BS design assigns subjects to either the treatment or the control condition for all periods, while a WS design assigns each subject to both conditions, for at least one period each. We look at both balanced and unbalanced WS designs: subjects in a balanced WS design are observed for the same number of periods under both treatment conditions, while subjects in an unbalanced design are observed for different numbers of periods under the two conditions. Additionally, we look at the relationship between the statistical power of both designs and the number of experimental periods. All other aspects of the model (treatment effect sizes and noise parameters) require calibration using data from existing economic experiments.

1 The former influence is referred to in the literature as the power of a study, that is, the probability of rejecting the null hypothesis when it is in fact false, in other words of not committing a Type II error. The latter influence refers to the width of the confidence interval, that is, the confidence with which we avoid committing a Type I error, i.e., rejecting the null hypothesis when it is in fact true.

2 The practice of not reporting power or discussing optimal sample sizes is not specific to experimental economics, and applies more widely to other fields such as education (Brewer and Owen, 1973), marketing (Sawyer and Ball, 1981), and various sub-fields of psychology (Cohen, 1962; Chase and Chase, 1976; Sedlmeier and Gigerenzer, 1989; Rossi, 1990; Mone, Mueller, and Mauland, 1996).

3 Long and Lang (1992) reviewed 276 articles (not necessarily experimental) published in top journals in economics and proposed a method to estimate the share of papers falsely failing to reject the null hypothesis. Their estimates suggest that all non-rejection results in their sample of articles are false, a consequence of low statistical power.

4 The precise interpretation of the null hypothesis depends on the test used.
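To make the simulation approach concrete, the sketch below estimates power for balanced WS and BS designs in a simple linear version of such a panel model, y_it = beta*D_it + a_i + e_it, using a two-sided z-test. Everything here (the function name, the test, and all parameter values) is an illustrative assumption, not the paper's calibrated setup or the powerBBK implementation.

```python
import math
import random
from statistics import mean, variance

def simulate_power(n_subjects, n_periods, design, beta=1.0,
                   sigma_alpha=1.0, sigma_eps=1.0, n_sim=400, crit=1.96):
    """Monte Carlo power of a two-sided z-test of H0: no treatment effect in
    the panel model y_it = beta*D_it + a_i + e_it.  All numerical values are
    illustrative placeholders, not the paper's calibrated parameters."""
    rejections = 0
    for _ in range(n_sim):
        if design == "BS":
            # Between-subjects: each subject stays in one condition for all
            # periods; compare subject-level mean outcomes across the groups.
            def subject_mean(d):
                a = random.gauss(0.0, sigma_alpha)   # individual effect a_i
                return mean(beta * d + a + random.gauss(0.0, sigma_eps)
                            for _ in range(n_periods))
            g1 = [subject_mean(1) for _ in range(n_subjects // 2)]
            g0 = [subject_mean(0) for _ in range(n_subjects // 2)]
            se = math.sqrt(variance(g1) / len(g1) + variance(g0) / len(g0))
            z = (mean(g1) - mean(g0)) / se
        else:
            # Within-subjects (balanced): each subject is treated in half of
            # the periods; a_i cancels in the within-subject difference.
            diffs = []
            for _ in range(n_subjects):
                a = random.gauss(0.0, sigma_alpha)
                treated = [beta + a + random.gauss(0.0, sigma_eps)
                           for _ in range(n_periods // 2)]
                control = [a + random.gauss(0.0, sigma_eps)
                           for _ in range(n_periods - n_periods // 2)]
                diffs.append(mean(treated) - mean(control))
            z = mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
        rejections += abs(z) > crit
    return rejections / n_sim
```

With these placeholder parameters, adding periods mainly benefits the WS design: averaging over more periods shrinks the idiosyncratic noise, while the subject effect a_i cancels in the within-subject difference. BS power remains limited by the between-subject variance sigma_alpha, which does not average out.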
We illustrate the approach in the context of gift exchange experiments and calibrate our model using data from two existing field experiments. We find that the BS design requires approximately 4 times more subjects than the WS design to reach an acceptable level of power (80%) when the number of experimental periods is small (2 periods). The power of the WS design increases substantially with the number of experimental periods, while the power of the BS design is less sensitive to additional periods. As a result, the BS design requires approximately 12 times more subjects than a WS design when the number of experimental periods is larger (6 periods). We find that these results are relatively robust to the true treatment effect sizes. Increasing the noise level requires larger sample sizes in both designs, but the ratios become smaller: the BS design then requires approximately 3 times more observations when the number of periods is low and 6 times more when it is larger.

Our analysis suggests that the number of subjects needed to reach an acceptable level of power in this research area can be large. For example, we find that the minimal sample sizes required to reach a power of 80% with a BS design range from 232 to 1054 subjects under our low noise scenario and from 458 to 2200 subjects under our high noise scenario. The corresponding sample sizes with a WS design range from 20 to 218 subjects under our low noise scenario and from 66 to 738 subjects under our high noise scenario.
Finally, we present the powerBBK package for STATA, which we developed to simulate power with the needs of economists in mind. The package allows users to simulate the minimal sample size necessary to reach a user-specified level of statistical power, or to compute the statistical power of a particular design given information on sample size, variances, and the minimal detectable effect size. The package can handle panel data and can be used for nonparametric (e.g., the Wilcoxon sign test or the Mann-Whitney U test) and parametric tests.
It can also be used in the context of linear regression models with or without normal errors, binary response models (probit and logit), and censored regression models (tobit).
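powerBBK itself is implemented in Stata; the following Python sketch only illustrates the kind of search such a package performs when asked for a minimal sample size, namely increasing the number of subjects until simulated power reaches the target level. The function names, the test (a two-sample z-test on a one-period BS design), and all parameter values are hypothetical.

```python
import math
import random
from statistics import mean, variance

def power_bs(n_subjects, beta=0.5, sigma=1.5, n_sim=300, crit=1.96):
    """Simulated power of a two-sided two-sample z-test for a one-period BS
    design.  beta and sigma are placeholder values, not calibrated ones."""
    rejections = 0
    for _ in range(n_sim):
        g1 = [beta + random.gauss(0.0, sigma) for _ in range(n_subjects // 2)]
        g0 = [random.gauss(0.0, sigma) for _ in range(n_subjects // 2)]
        se = math.sqrt(variance(g1) / len(g1) + variance(g0) / len(g0))
        rejections += abs((mean(g1) - mean(g0)) / se) > crit
    return rejections / n_sim

def minimal_sample_size(target_power=0.80, start=20, step=20, max_n=2000):
    """Grid search for the smallest sample size whose simulated power reaches
    target_power; resetting the seed gives common random numbers across n."""
    for n in range(start, max_n + 1, step):
        random.seed(42)
        if power_bs(n) >= target_power:
            return n
    return None
```

On this coarse grid the search returns a multiple of the step size; a finer grid, or bisection around the first grid point that reaches the target, refines the answer at extra computational cost.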
The paper is organized as follows. Section 2 presents a brief survey of the experimental parameters used in recent articles published in Experimental Economics, the top ﬁeld journal for experimental work in economics, to illustrate typical sample sizes and design choices employed in this ﬁeld. Section 3 discusses the simulation of statistical power and introduces the powerBBK package. Section 4 presents our application to gift exchange.
Section 5 concludes.
2 Brief survey of experimental designs in Experimental Economics

In this section we present a brief analysis of the sample sizes and design choices of all papers published in Experimental Economics in volumes 15 and 16 (2012 and 2013). We focus on three aspects affecting statistical power: the choice of experimental design (WS vs. BS), the average number of subjects per treatment, and the distribution of subjects across treatments. In the two volumes we surveyed, a total of 71 papers were published. Our analysis focuses on papers with original data that provided sufficient information to determine the number of subjects in each treatment, leaving us with a sample of 58 papers (36 in 2012, 22 in 2013).
We first classify the experimental design in these studies as either WS or BS. Papers applying elements of both designs were classified as mixed designs. The first two columns of Table 2 present the frequency of each type of design in each year. We see from this table that the majority of the papers (41 out of 58) used a BS design.