
Journal of Privacy and Confidentiality (2009) 1, Number 2, pp. 197–211

# Secure Statistical Analysis of Distributed Databases, Emphasizing What We Don’t Know

Alan F. Karr (National Institute of Statistical Sciences, Research Triangle Park, NC, mailto:karr@niss.org)

Abstract. Over the past several years, the National Institute of Statistical Sciences (NISS) has developed methodology to perform statistical analyses that, in effect, integrate data in multiple, distributed databases, but without literally bringing the data together in one place. In this paper, we summarize that research, but focus on issues that are not understood. These include the inability to perform exploratory analyses and visualizations, protections against dishonest participants, inequities between database owners, and the lack of measures of risk and utility.

Keywords: data confidentiality, distributed databases, secure multi-party computation

## 1 Introduction

Many government, industrial, and academic investigations require statistical analyses based on data stored in multiple distributed databases, often each with a different owner.

But barriers to the actual integration of the databases are numerous:

- Confidentiality, as in “official statistics” (Karr et al., 2004, 2005b; Sanil et al., 2004, 2007) or homeland security (Karr et al., 2006b) settings.
- Proprietary data, as in the chemical database example in §3.2.
- Scale: despite advances in networking technology, the only sure way to move a petabyte of data from one point today to another point tomorrow may be by using FedEx or UPS.

For many analyses, using techniques from computer science known generically as secure multi-party computation, the database owners can share sufficient statistics anonymously, in such a way that the analysis can still be performed in a statistically valid manner.

The protocols provide the owners protection from one another in the sense that while each owner can compare the global analysis to the same analysis on its own data, it is not able to attribute any characteristics of the discrepancies to other specific databases.

In this paper, “database” means a flat file in which rows represent data subjects and columns represent attributes. We term the database owners “agencies,” although in the example in §3.2 they are companies. There are two structured data partitioning models. For horizontally partitioned data, the data subjects are partitioned among databases containing the same attributes (there are some subtleties associated with “the same”; see §3.4). For vertically partitioned data, the attributes are partitioned among the databases. More complex partitions are discussed in §5.

This paper summarizes computational protocols, but focuses on what is not understood. §2 introduces secure multi-party computation (SMPC) and the protocol we use—secure summation. §3 presents protocols for a variety of analyses for horizontally partitioned data, including regression, contingency tables, and maximum likelihood for exponential families. A briefer discussion of vertically partitioned data appears in §4, along with a similarly brief discussion of complex data partitions in §5. In each of these sections, we discuss the numerous gaps in our knowledge, ranging from the conceptual to the computational. Conclusions appear in §6.

## 2 Secure Multi-Party Computation

Consider $K$ agencies with values $v_1, \dots, v_K$ who wish to evaluate a known function $f$ at these values, subject to four constraints:

C1: The correct value $f(v_1, \dots, v_K)$ is obtained and known to all agencies.

C2: No agency $j$ learns more about the other agencies’ values $V_{-j} = \{v_k : k \neq j\}$ than it can deduce from $v_j$ and $f(v_1, \dots, v_K)$.

C3: No trusted third party—human or machine—is part of the process.

C4: Semi-honesty. Agencies perform agreed-upon computations correctly using their true data. However, they are permitted to retain the results of intermediate computations.

The computer science literature contains many papers on the theory of SMPC; general references are Goldwasser (1997) and Yao (1982). There are far fewer implemented algorithms, let alone functioning software systems.

In §3 we employ secure summation (Benaloh, 1987): $f(v_1, \dots, v_K) = v_1 + \cdots + v_K$, denoted by $V$. The steps are as follows:

- Initialization: Agency 1 generates and retains a very large, complex random number $R$, adds $R$ to its value $v_1$, and sends $R + v_1$ to agency 2.
- Iteration: Agency 2 adds its value $v_2$ to $R + v_1$, sends the result to agency 3, and so on.
- Sharing: Finally, agency 1 receives $R + v_1 + \cdots + v_K = R + V$ from agency $K$, subtracts $R$, and shares the result $V$ with the other agencies.
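To make the protocol concrete, here is a minimal Python sketch. It is ours, not NISS software: the ring is simulated in a single process (a real deployment would pass messages between separate parties), and the helper name `secure_sum` and the choice of modulus are assumptions. Working modulo a large modulus keeps every intermediate total uniformly distributed, so it reveals nothing about the values added so far.

```python
import secrets

def secure_sum(values, modulus=2**128):
    """Simulate the secure-summation ring among K agencies."""
    # Initialization: agency 1 draws a large random R and sends R + v_1.
    R = secrets.randbelow(modulus)
    running = (R + values[0]) % modulus
    # Iteration: agencies 2..K each add their own value and pass it on.
    for v in values[1:]:
        running = (running + v) % modulus
    # Sharing: agency 1 subtracts R and shares V with all agencies.
    return (running - R) % modulus

# Example: three agencies compute V = 12 + 7 + 30 = 49 without any
# agency revealing its individual value.
print(secure_sum([12, 7, 30]))  # 49
```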

There are issues with secure summation. First, it needs a “good” random number: in particular, one not ending in a string of zeroes, and not recoverable by guessing the seed. Second, collusion is possible: agencies $j-1$ and $j+1$, without sharing private information, can determine $v_j$ (one mitigation, used in some implementations, is to hide the order of the agencies from them). Production-quality implementation is subtle: the process must be safe from outsiders, masqueraders, and (if there is one) a central server. Finally, secure summation is not a Nash equilibrium: it breaks if semi-honesty fails (Karr et al., 2007).

## 3 Secure Analysis of Horizontally Partitioned Data

The protocols described here are based on one underlying idea: if the analysis uses sufficient statistics that are additive across agencies, then the agencies can use secure summation to compute and share the sufficient statistics, after which each agency completes the analysis on its own.


Figure 1: Scatterplots for the example discussed in §3.2. In all plots, the regression coefficients for the four-company regression appear on the x-axis. Left: y-axis contains regression coefficients for company 1 alone. Center: y-axis contains regression coefficients for company 2 alone. Right: y-axis contains regression coefficients for company 4 alone.

### 3.1 Secure Regression

Consider the linear model $y = X\beta + \varepsilon$, where the rows of the global data matrix $[X\ y]$ are partitioned among the agencies, with agency $k$ holding $[X^k\ y^k]$. The least squares estimator is
$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad (2)$$
which depends on the data only through the cross-product matrix
$$[X\ y]^T [X\ y] = \sum_{k=1}^{K} [X^k\ y^k]^T [X^k\ y^k]. \qquad (3)$$
Therefore, $[X\ y]^T [X\ y]$ can be computed entrywise using secure summation, and each agency can then calculate $\hat{\beta}$ using (2).

Calculation of $\hat{\beta}$ is only part of a valid, useful regression. A variety of other objects can be calculated from $[X\ y]^T [X\ y]$, or using secure summation directly. These include the coefficient of determination $R^2$, the least squares estimate $S^2$ of the error variance $\sigma^2$, and the “hat” matrix $H = X (X^T X)^{-1} X^T$, which can be used to identify outliers (Karr et al., 2005b, 2006b). It is also possible to use the secure data integration algorithm of Karr et al. (2007), together with methods for constructing (privacy-preserving) synthetic residuals in ordinary regressions (Reiter, 2003), in order to create secure synthetic residuals (Karr et al., 2006b).
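The additivity in (3) is easy to exercise in code. The numpy sketch below is ours rather than NISS software; an ordinary Python `sum` over the agencies’ local cross-product matrices stands in for the entrywise secure summation.

```python
import numpy as np

def cross_product(X, y):
    """Agency-local sufficient statistic: [X y]^T [X y]."""
    Z = np.column_stack([X, y])
    return Z.T @ Z

def secure_regression(local_stats):
    """Combine per-agency cross-products (stand-in for entrywise
    secure summation) and compute the global fit via equation (2)."""
    M = sum(local_stats)              # [X y]^T [X y] for the stacked data
    XtX, Xty = M[:-1, :-1], M[:-1, -1]
    return np.linalg.solve(XtX, Xty)  # beta-hat

# Example: three agencies hold different rows for the same 3 predictors.
rng = np.random.default_rng(0)
true_beta = np.array([1.0, -2.0, 0.5])
stats = []
for n in (50, 80, 20):
    X = rng.normal(size=(n, 3))
    y = X @ true_beta + rng.normal(scale=0.1, size=n)
    stats.append(cross_product(X, y))
print(secure_regression(stats))  # close to [1.0, -2.0, 0.5]
```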

### 3.2 Secure Regression: Example

We illustrate with a data set of 1,318 chemical compounds (Karr et al., 2005a), in which the response is water solubility and the 91 predictors are a constant and 90 chemical features of the compounds. Four database-owning companies were created whose databases contain 499, 572, 16(!), and 231 compounds, respectively. Mimicking real-world heterogeneity, each company’s database contains compounds with features that are absent from all compounds in all of the other companies’ databases. This increases the incentive for companies to participate, because each can learn about the importance of features for which it has no data. Of course, company 3 has the greatest incentive to participate, since it cannot even do the regression on its own.

Figure 1 summarizes the results. The three panels are scatterplots of the 91 regression coefficients for companies 1, 2 and 4 (y-axis) against the coefficients for the global (four-company) regression (x-axis). Coefficients with y-values of zero correspond to features missing from the company’s own database.

### 3.3 Other Analyses

The “additive sufficient statistics” idea is broadly applicable, and here we describe several other contexts.

Secure Contingency Tables. The algorithm for secure data integration described in Karr et al. (2007) has an important indirect application—constructing contingency tables containing counts or sums.

Let $D$ be a database containing only categorical attributes $A_1, \dots, A_J$. The associated contingency table is the $J$-dimensional array $T$ defined by
$$T(a_1, \dots, a_J) = \#\{r \in D : r_1 = a_1, \dots, r_J = a_J\}, \qquad (4)$$
where each $a_i$ is a possible value of the categorical attribute $A_i$ and $r_i$ is the $i$th attribute of record $r$. The $J$-tuple $(a_1, \dots, a_J)$ is called the cell coordinates. The table $T$ is a near-universal sufficient statistic, for example, for fitting log-linear models (Bishop et al., 1975).

The sparse representation of a table is the data structure consisting of (cell coordinate, cell count) pairs for the cells with non-zero counts.

Securely building a contingency table from databases $D_1, \dots, D_K$ requires the following steps:

List of Non-Zero Cells. Use secure data integration to build the list $L$ of cells with non-zero counts. The “databases” being integrated are the agencies’ individual lists of cells with non-zero counts. The protocol in Karr et al. (2007) allows each agency to avoid revealing even the cells in which it has data.

Non-Zero Cell Counts. For each cell in $L$, use secure summation to determine the associated count.
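A minimal Python sketch of the two steps, with our own function names: a set union stands in for secure data integration of the non-zero cell lists, and a per-cell Python `sum` stands in for secure summation.

```python
from collections import Counter

def secure_contingency_table(agency_tables):
    """Sketch: pool non-zero cell coordinates, then sum per cell.

    agency_tables: list of Counters mapping cell coordinates
    (tuples of attribute values) to local counts.
    """
    # Step 1: list L of non-zero cells (stand-in for secure data integration).
    L = set()
    for t in agency_tables:
        L |= set(t)
    # Step 2: per-cell global counts (stand-in for secure summation).
    return {cell: sum(t[cell] for t in agency_tables) for cell in sorted(L)}

# Example: two agencies holding records with attributes (sex, response).
t1 = Counter([("M", "yes"), ("M", "yes"), ("F", "no")])
t2 = Counter([("F", "no"), ("F", "yes")])
print(secure_contingency_table([t1, t2]))
# {('F', 'no'): 2, ('F', 'yes'): 1, ('M', 'yes'): 2}
```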

Secure Maximum Likelihood Estimation. Suppose now that the agencies’ databases partition a global database $\{x_i\}$ modeled as samples from an unknown density $f(\theta, \cdot)$ belonging to an exponential family:
$$f(\theta, x) = \exp\left( \sum_{\ell=1}^{L} c_\ell(\theta)\, t_\ell(x) + d(\theta) + S(x) \right). \qquad (5)$$
Up to an additive term not depending on $\theta$, the global log-likelihood is then
$$\log \mathrm{lik}(\theta) = \sum_{\ell=1}^{L} c_\ell(\theta) \left[ \sum_{k=1}^{K} \sum_{x_i \in D_k} t_\ell(x_i) \right] + N\, d(\theta), \qquad (6)$$
where $D_k$ is the database of owner $k$ and $N$ is the global sample size.

Assuming that the agencies have agreed in advance on the model (5), they can use secure summation to compute each of the $L$ terms within the brackets in (6), and then each can maximize the likelihood function by whatever means it wishes.
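As a concrete instance, here is a sketch of our own for the normal family $N(\mu, \sigma^2)$, whose additive sufficient statistics are the sample size, sum, and sum of squares; again an ordinary `sum` stands in for secure summation of the bracketed terms in (6), and the maximizer is available in closed form.

```python
import numpy as np

def local_stats(x):
    """Per-agency additive sufficient statistics for N(mu, sigma^2):
    sample size, sum, and sum of squares."""
    x = np.asarray(x, dtype=float)
    return np.array([x.size, x.sum(), (x ** 2).sum()])

def secure_normal_mle(agency_data):
    """Combine the agencies' statistics (stand-in for secure summation)
    and maximize the likelihood in closed form."""
    N, s1, s2 = sum(local_stats(x) for x in agency_data)
    mu_hat = s1 / N
    sigma2_hat = s2 / N - mu_hat ** 2
    return mu_hat, sigma2_hat

# Example: three agencies with samples from the same N(3, 4) population.
rng = np.random.default_rng(1)
data = [rng.normal(3.0, 2.0, size=n) for n in (200, 500, 50)]
print(secure_normal_mle(data))  # approximately (3.0, 4.0)
```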

### 3.4 Problems with the Approach

It may appear from §3.1 and §3.3 that the secure-summation-based approach is problem-free. This is anything but true.

Pre-specification. The process prevents us from being good statisticians. The analysis to be performed must be pre-specified—not only the model, but also variable transformations. There is no way to do exploratory data analysis or visualization.

No Protection Against Dishonesty. If all agencies but one are semi-honest, then that agency can not only ensure that it gets the right answer, but also ensure that none of the other agencies get the right answer, or are even aware when they don’t. To see this, suppose agency $j$ puts an incorrect value of $[X^j\ y^j]^T [X^j\ y^j]$ into (3). Then, once what the other agencies think is the correct value of $[X\ y]^T [X\ y]$ is calculated, agency $j$ can subtract its incorrect value, add the correct value, and perform the regression. Unless the incorrect value is absurd, no other agency can detect that anything has happened.
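The arithmetic of the attack is trivial, as this toy illustration of ours shows, with scalar statistics in place of the cross-product matrices:

```python
# Agency 2 contributes a perturbed statistic, then privately corrects
# the shared total; the honest agencies never see the true answer.
true_stats = [10.0, 20.0, 30.0]   # the agencies' true contributions
fake = 17.5                       # agency 2's dishonest contribution

shared_total = true_stats[0] + fake + true_stats[2]   # what everyone sees
cheater_total = shared_total - fake + true_stats[1]   # only agency 2 can do this

print(shared_total)   # 57.5 -- the wrong value the honest agencies use
print(cheater_total)  # 60.0 -- the correct value, known only to agency 2
```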

The concept of partially trusted third parties (PTTPs) introduced in Karr et al. (2007) reduces incentives to cheat, at the expense of introducing a central server performing calculations that the agencies cannot. For instance, in the setting of §3.1, a PTTP could perform the secure summation to calculate $[X\ y]^T [X\ y]$, but share only $\hat{\beta}$ (and related quantities derived from $[X\ y]^T [X\ y]$) with the agencies. Although the PTTP is trusted only with computed quantities, this may still be unacceptable to the agencies. There is no method to defeat a cheater who is content with wrecking the process and being the only one to know that it has been wrecked.

Data Heterogeneity. The notion that SMPC allows participants to learn only what is knowable from their input and the answer is most persuasive when agencies contribute approximately equally to the process. When they do not, the “information” surrendered can vary dramatically.

The simplest instance of this is when agencies have unequal database sizes (to get a sense of this in a realistic context, populations of US states vary by a factor of 100). Figure 1 provides some insight: the larger the database, the more closely the global regression resembles a company’s regression.

A more pointed instance, in the example in §3.2, lies in the question “Should Companies 1, 2 and 4 Allow 3 to Participate?”

Figure 2: Scatterplot comparing global regression coefficients (x-axis) to those for the regression excluding company 3 (y-axis).

Figure 2 addresses this question by plotting the estimated coefficients from the three-company regression (1, 2 and 4) against those for the global (four-company) regression. The relationship is strong, but not perfect, which leaves the question unanswered. See Risk-Utility Formulation below for related discussion.

A second issue is data heterogeneity across agencies. It is clear that there is something fundamentally different between the top panel in Figure 3, where the three agencies possess 2-dimensional data lying in the same region, and the bottom panel, where the x-variable ranges are very different. A properly performed process would elucidate a global quadratic structure in the data that is invisible to each agency on its own, but the Pre-specification issue above could keep such a model from being considered.

There is a deep point here. In the “top panel of Figure 3” context, the only real benefit to the agencies is increased sample size, whereas in the “bottom panel of Figure 3” context, there is a dramatic—but given current knowledge, unattainable—benefit, if only the right analysis were done.

Problems also arise when there is differential model fit across agencies. In the regression setting, if the global fit of the global model is good, but the fit of the global model to an agency’s data is poor, then that agency has potentially learned more about the other agencies’ data than they have about its data.