
# Calculating power by bootstrap, with an application to cluster-randomized trials

Ken Kleinman1, Susan S. Huang2

1 Corresponding author, ken.kleinman@gmail.com, Department of Population Medicine, Harvard

Medical School and Harvard Pilgrim Health Care, Boston, MA, USA

2 Division of Infectious Diseases and Health Policy Research Institute, University of California Irvine

School of Medicine, Irvine, CA, USA

Abstract

Background

A key requirement for a useful power calculation is that the calculation mimic the data analysis that will be performed on the actual data, once it is observed. Close approximations may be difficult to achieve using analytic solutions, however, and thus Monte Carlo approaches, including both simulation and bootstrap resampling, are often attractive. One setting in which this is particularly true is cluster-randomized trial designs, though Monte Carlo approaches are useful in many additional settings as well. Calculating power for cluster-randomized trials using analytic or simulation-based methods is frequently unsatisfactory, due both to the complexity of the data analysis methods to be employed and to the sparseness of data to inform the choice of important parameters in these methods.

Methods

We propose that, among Monte Carlo methods, bootstrap approaches are most likely to generate data similar to the observed data. Means of implementation are described.

Results

We demonstrate bootstrap power calculation for a cluster-randomized trial with a survival outcome and a baseline observation period.

Conclusions

Bootstrap power calculation is a natural application of resampling methods. It provides a relatively simple solution to power calculation that is likely to be more accurate than analytic solutions or simulation-based calculations. It has several important strengths. Notably, it is simple to achieve great fidelity to the proposed data analysis method, and there is no requirement for parameter estimates, or estimates of their variability, from outside settings. So, for example, for cluster-randomized trials, power calculations need not depend on intracluster correlation coefficient estimates from outside studies. We are not aware of bootstrap power calculation being previously proposed or explored for cluster-randomized trials, but it can also be applied to other study designs. We demonstrated power calculations for a time-to-event outcome in a cluster-randomized trial setting, for which we are unaware of an analytic alternative. The method can easily be used when preliminary data is available, as is likely to be the case when research is performed in health delivery systems or other settings where electronic medical records can be obtained.

Keywords: Power and sample size; cluster-randomized trials; bootstrap; resampling

Background

Statistical power is defined as the probability of rejecting the null hypothesis, given that some particular alternative hypothesis ("the alternative") is true. Power is particularly important from the perspectives of ethics and of allocating scarce resources. It is often ethically unjustifiable to randomize more subjects than are required to yield sufficient power, and it is a waste of resources to invest time or money in studies which have little chance of rejecting the null or when power is far greater than necessary.

In many settings, the question of how to calculate power is reasonably well addressed by closed-form equations or easily tractable mathematical methods. For instance, the power for an ordinary least squares regression is described in basic textbooks [1]. Power for logistic regression can use iterative techniques or relatively simple formulae [2,3]. Major statistical packages such as SAS (SAS Institute, Cary, NC) contain routines for power calculation, and both functions and packages for power calculation are available for the free and open-source R environment [4]. There are also several stand-alone packages that simplify the calculation of power, for example, PASS (NCSS Inc., Kaysville, UT).

However, there are many settings in which these simple solutions are unsatisfactory. In order for power calculations to usefully inform our planning, the methods used must conform reasonably well to the planned analysis. If we plan to study a confounded relationship using a linear regression, the power assessment must include the confounder. If we know the outcome-predictor relationship is heteroscedastic, we should not use closed-form solutions that depend on homoscedasticity. If our study design includes a baseline period, we should not use a post-only comparison for estimating the power.

One setting in which power assessment is not simple is cluster-randomized trials. In this design, a relatively small number of administrative clusters, such as hospitals, classrooms, or physician practices, are recruited. Each cluster may contain a large number of individuals upon whom outcomes will be measured. Rather than randomize subjects individually to treatment arms, all of the individuals within a cluster are assigned to the same treatment arm, and in practice we say that the cluster itself is randomized to one treatment arm or another. This study design often reduces cost considerably, and in many settings it is the only way to get estimates of pragmatic effects: the effects of an intervention in a typical clinical population and in settings like those which non-trial patients are likely to encounter. For example, interventions on doctors to affect prescribing practices could hardly generate generalizable results if we randomized patients. We must randomize doctors, but examine the impact on patients.

Randomization by cluster leads to complications in data analysis that have long been recognized by statisticians [5,6]. These arise from the tendency of patients within a cluster to resemble each other, or, more formally, a lack of independence between subjects. This can be parameterized as the covariance or correlation between subjects within a cluster (the intracluster correlation coefficient, or ICC) or as the variance of cluster-specific parameters (the between-cluster variance). Valid approaches include calculating summary statistics by cluster in a first step and then comparing cluster summaries by treatment arm in a second step, and mixed effects models that incorporate all individual observations in a single model [5,6].
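The two parameterizations are linked: writing σ²_b for the between-cluster variance and σ²_w for the within-cluster variance, the ICC is σ²_b / (σ²_b + σ²_w), the share of total variance that lies between clusters. A minimal sketch, with hypothetical variance values:

```python
def icc(sigma2_between, sigma2_within):
    """ICC under the variance-components parameterization: the share
    of total variance that lies between clusters."""
    return sigma2_between / (sigma2_between + sigma2_within)

# Hypothetical values: a small between-cluster variance yields the
# small ICCs typical of cluster-randomized trials.
rho = icc(0.001, 0.999)  # about 0.001
```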

There are several existing analytic approaches to calculating the power for cluster-randomized trials [5-8]. The simplest is based on the "design effect", 1 + (m − 1)ρ, where m is the number of subjects per cluster and ρ is the ICC. The "effective sample size" is calculated by dividing the actual number of subjects by the design effect. Power assessment can then continue using methods for uncorrelated data, based on the effective sample size. While this approach can be surprisingly accurate, we do not recommend using it in practice. We mention the approach here because it helps clarify the importance of the ICC: with as few as 1000 subjects per cluster, increasing the ICC from 0.001 to 0.002 results in a 33% loss of effective sample size. In contrast, the confidence limits for an estimated ICC are likely to be much wider than 0.001. Cluster sizes of 1000 or greater are common in trials involving health delivery systems or communities [9,10].
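The 33% figure is easy to verify directly. The sketch below assumes the standard design effect 1 + (m − 1)·ICC for clusters of equal size m; the function names and the total of 10,000 subjects are our own illustration:

```python
# Design effect for clusters of equal size m: deff = 1 + (m - 1) * icc.
# Effective sample size = actual sample size / deff.

def design_effect(m, icc):
    """Variance inflation due to clustering (equal cluster sizes)."""
    return 1 + (m - 1) * icc

def effective_n(n_total, m, icc):
    """Sample size of an unclustered study with equivalent precision."""
    return n_total / design_effect(m, icc)

# The text's example: with 1000 subjects per cluster, moving the ICC
# from 0.001 to 0.002 costs about a third of the effective sample size.
n_low = effective_n(10_000, 1000, 0.001)   # e.g., 10 clusters of 1000
n_high = effective_n(10_000, 1000, 0.002)
print(round(1 - n_high / n_low, 3))  # 0.333
```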

While the effective sample size approach is an approximation, accurate analytic approaches also depend on the design effect, and are similarly dramatically affected by the ICC. However, many approaches based on the design effect require that each cluster have an equal number of subjects, which may well not be the case. Several investigations into the impact of unequal cluster sizes have been performed, though their results are not general [11-14]. Approximate methods of incorporating the impact of variable cluster size have been proposed, however [15-17].

These analytic and approximate options for power assessment become difficult or untenable when more complex study designs are used. For example, it is often possible to record a baseline period, in which neither the treatment clusters nor the control clusters receive the intervention, followed by an intervention period, in which only the clusters so randomized receive the intervention. This design is much stronger than an "intervention period only" design, since it can account for some pre-existing or baseline differences among the clusters. Analytic power calculations are known for normally distributed outcomes in this design (see, e.g., Murray, pages 368-369 [6], or Teerenstra et al. [18]). A Stata add-on due to Hemming and Marsh provides approximate power and sample size estimation with variable cluster size and can accommodate a baseline observation period [19]. For more complex designs and, e.g., dichotomous, count, or survival outcomes, analytic results may be unknown.

Another option useful in any difficult setting, and in cluster-randomized trials in particular, is to use simulation, as follows. First, generate data resembling the data anticipated for the study under the specific alternative hypothesis for which a power estimate is required, then perform the planned test on that data. Repeat this process many times: the proportion of the simulated data sets in which the null hypothesis is rejected is an estimate of the power. This approach is very flexible, and has been implemented for cluster-randomized trials with baseline observation periods in at least one package for R [20,21]. The package also accommodates more general crossover trials.
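As a concrete, non-clustered illustration of this simulation recipe, the sketch below estimates power for a simple two-arm comparison, assuming a normal outcome and a two-sample t-test; the sample size, effect size, and function name are our own, not from the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20240501)

def simulated_power(n_per_arm, effect, sd, n_sims=2000, alpha=0.05):
    """Estimate power: generate data under the alternative, run the
    planned test, and report the proportion of rejections."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sd, n_per_arm)
        treated = rng.normal(effect, sd, n_per_arm)
        _, p_value = stats.ttest_ind(control, treated)
        rejections += p_value < alpha
    return rejections / n_sims

# About 0.80 for a standardized effect of 0.5 with 64 subjects per arm.
power = simulated_power(n_per_arm=64, effect=0.5, sd=1.0)
```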

But despite the robustness of simulation-based methods to some design issues, they share one key weakness with the analytic approach: it is often extremely difficult to obtain credible estimates of the ICC or the between-cluster variance. Assessments of the variability of these parameters are even harder to find, and small differences in them can lead to large differences in the estimated power, as was demonstrated using the effective sample size approximation. The difficulty of obtaining estimates has led to reliance on rules of thumb and to articles which report ranges of ICCs, to serve as reference [22]. While perhaps better than no estimate at all, estimates from unrelated areas may lead to poor estimates of power.

In addition, covariate imbalance between arms is likely when few units are randomized. Though there remains debate among trialists about whether covariate adjustment is ever appropriate, it may be thought desirable in the case of a cluster-randomized trial. If so, the adjustment should also be incorporated into the power assessment. As the model gets more complex and parameters multiply, we should have less confidence in power estimates that depend on simplifying assumptions such as a lack of covariate effects, or on external ICC estimates.

Our purpose in the current article is to propose a means of avoiding these problems and obtaining the greatest possible verisimilitude in power calculation. In the Methods section, we discuss the general approach to power assessment using resampling methods, and outline two distinct settings in which they are likely to be useful. In the Results section we describe an application in which we implemented the method, and show the resulting power assessment.

Methods

In short, we propose to use bootstrapped samples to assess statistical power, modifying the samples as necessary to generate the desired alternative. This approach is a rather natural one. Bootstrapping for power calculation has been described previously in a few specific applications [23-26], but to the best of our knowledge its generality, flexibility, benefits, and heuristic motivation have not been fully explored. Nor have its application and unique advantages in cluster-randomized trials been described.

Bootstrap resampling is simply sampling from observed data, with replacement. Heuristically, the idea is that the observed data represent the population from which they were drawn, and thus sampling an observation from among the observed data can be substituted for sampling from the population itself.
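In code, a bootstrap sample is a single draw with replacement from the observed values; the data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical outcomes from a completed study.
observed = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5])

# One bootstrap sample: draw as many values as were observed, with
# replacement, treating the observed data as a stand-in for the population.
boot = rng.choice(observed, size=observed.size, replace=True)
```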

For estimating the power in a typical medical study, the method requires having relatively detailed data before the power must be calculated. Ideally, this data should be as similar to the prospective study data as possible; one example would be baseline data which would also then be used in the study itself. Another setting where the method may be possible is in laboratory studies, where new experiments may be quite similar to completed experiments. We will describe how the approach might be implemented in each of these cases.

We begin with the laboratory experiment, a simple non-clustered setting, to introduce the idea. Suppose conditions "A" and "B" were compared in "Study I", which has been completed. Now we wish to assess the power for a new experiment, "Study II", in which we will compare condition A to a new condition, "C", a modification of condition B. Let us assess the power under the alternative that the mean of condition C in Study II is 5 units greater than was observed for condition B in Study I. The procedure is as follows:

1. Draw a bootstrap sample from the condition A data of Study I; this serves as the simulated condition A data for Study II
2. Draw a bootstrap sample from the condition B data of Study I
3. Add 5 units to each value in the sample from step 2; this serves as the simulated condition C data
4. Perform the planned test on the simulated data and record whether the null hypothesis is rejected
5. Repeat steps 1-4 many times
6. The proportion of rejections is the estimated power

A diagram of steps 1 through 4 is presented in Figure 1.
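The steps above can be sketched as follows; the Study I measurements, the choice of a two-sample t-test, and all names here are illustrative assumptions, not part of the study described:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def bootstrap_power(a_obs, b_obs, shift, n_reps=2000, alpha=0.05):
    """Steps 1-6: resample each arm with replacement, shift the second
    arm to impose the alternative, test, and tally rejections."""
    rejections = 0
    for _ in range(n_reps):
        a_star = rng.choice(a_obs, size=a_obs.size, replace=True)  # step 1
        b_star = rng.choice(b_obs, size=b_obs.size, replace=True)  # step 2
        c_star = b_star + shift                                    # step 3
        _, p_value = stats.ttest_ind(a_star, c_star)               # step 4
        rejections += p_value < alpha
    return rejections / n_reps                             # steps 5 and 6

# Hypothetical Study I measurements for conditions A and B:
a_obs = np.array([49.2, 50.8, 48.9, 51.3, 50.1, 49.7, 50.5, 48.6, 51.0, 49.9])
b_obs = np.array([51.1, 50.3, 52.0, 49.8, 51.5, 50.9, 50.2, 51.8, 49.5, 51.4])
power = bootstrap_power(a_obs, b_obs, shift=5.0)
```

Note that no particular test is baked in; swapping `stats.ttest_ind` for any other two-sample test leaves the algorithm unchanged.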

The value of this approach is immediately obvious. Suppose the distribution in the second group is exponential, while that of the first is normal. An analytic approach to the power that accurately incorporates this difference in outcome distribution is not likely to be available. The choice of test with such distributions might be non-trivial, but the above routine will quickly generate the estimated power regardless of the chosen test; the algorithm above does not even specify a test. If we assume the new second condition will change the scale of the outcome, instead of or in addition to the location, we could easily modify step 3 in the above algorithm, and still generate the desired result.

Next, let us consider a cluster-randomized trial with a baseline observation period. Suppose we have collected baseline data on the presence or absence of an outcome among individuals at several sites, or clusters. We might be able to do this before the study was fully funded by using electronic medical records, for example. Each site may have a different number of subjects. We plan to use the collected data as a baseline against which we will compare data collected on other subjects while an intervention is applied to a random subset of sites. Suppose we need to know how much power we would have, given that the intervention increases the odds of the outcome at each site by a factor of 2.
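One simple way to impose this alternative when resampling is to redraw each intervention cluster's outcomes with the baseline odds doubled: if a probability p has odds p/(1 − p), the doubled-odds probability is 2p/(1 + p). The cluster sizes, prevalence, arm assignment, and redrawing scheme below are invented for illustration; other schemes are possible:

```python
import numpy as np

rng = np.random.default_rng(7)

def doubled_odds(p):
    """Probability whose odds are twice the odds of p: 2p / (1 + p)."""
    return 2 * p / (1 + p)

def resample_cluster(outcomes, intervention):
    """Bootstrap one cluster's binary outcomes; for intervention-arm
    clusters, impose the alternative by redrawing outcomes with the
    cluster's bootstrapped odds doubled (one simple scheme of many)."""
    boot = rng.choice(outcomes, size=outcomes.size, replace=True)
    if intervention:
        boot = rng.binomial(1, doubled_odds(boot.mean()), size=boot.size)
    return boot

# Invented baseline data: three clusters of unequal size, ~10% prevalence.
clusters = [rng.binomial(1, 0.10, size=n) for n in (120, 250, 80)]
arms = [True, False, True]  # one possible randomization of clusters
sim = [resample_cluster(c, a) for c, a in zip(clusters, arms)]
```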