FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:   || 2 | 3 |

«Abstract. Over the past several years, the National Institute of Statistical Sciences (NISS) has developed methodology to perform statistical ...»

-- [ Page 1 ] --

Journal of Privacy and Confidentiality (2009) 1, Number 2, pp. 197–211

Secure Statistical Analysis of Distributed

Databases, Emphasizing What We Don’t Know

Alan F. Karr∗

Abstract. Over the past several years, the National Institute of Statistical Sciences (NISS) has developed methodology to perform statistical analyses that, in

effect, integrate data in multiple, distributed databases, but without literally bringing the data together in one place. In this paper, we summarize that research, but focus on issues that are not understood. These include inability to perform exploratory analyses and visualizations, protections against dishonest participants, inequities between database owners and lack of measures of risk and utility.

Keywords: data confidentiality, distributed databases, secure multi-party computation 1 Introduction Many government, industrial, and academic investigations require statistical analyses based on data stored in multiple distributed databases, often each with a different owner.

But, barriers to the actual integration of the databases are numerous:

Confidentiality, as in “official statistics” (Karr et al., 2004, 2005b; Sanil et al., 2004, 2007), or homeland security (Karr et al., 2006b) settings.

Proprietary data, as in the chemical database example in §3.2.

Scale: Despite advances in networking technology, the only sure way to move a petabyte of data from one point today to another point tomorrow may be by using FedEx or UPS.

For many analyses (using techniques from computer science known generically as secure multi-party computation), the database owners can share sufficient statistics anonymously, but in a way that the analysis can be performed in a statistically valid manner.

The protocols provide the owners protection from one another in the sense that while each owner can compare the global analysis to the same analysis on its own data, it is not able to attribute any characteristics of the discrepancies to other specific databases.

In this paper, “database” means a flat file in which rows represent data subjects and columns represent attributes. We term the database owners “agencies,” although in the example in §3.2 they are companies. There are two structured data partitioning models. For horizontally partitioned data, the data subjects are partitioned among the ∗ National Institute of Statistical Sciences, Research Triangle Park, NC, mailto:karr@niss.org c 2009 by the authors http://repository.cmu.edu/jpc databases containing the same attributes.1 For vertically partitioned data, the attributes are partitioned among the databases. More complex partitions are discussed in §5.

This paper summarizes computationalprotocols, but focuses on what is not understood. §2 introduces secure multi-party computation (SMPC) and the protocol we use—secure summation. §3 presents protocols for a variety of analyses for horizontally partitioned data, including regression, contingency tables, and maximum likelihood for exponential families. Briefer discussion of vertically partitioned data appears in §4, along with a similarly brief discussion of complex data partitions in §5. In each of these sections, we discuss the numerous gaps in our knowledge, ranging from the conceptual to the computational. Conclusions appear in §6.

2 Secure Multi-Party Computation Consider K agencies with values v1,..., vK who wish to evaluate a known function f

at these values subject to four constraints:

C1: The correct value f (v1,..., vK ) is obtained and known to all agencies.

C2: No agency j learns more about the other agencies’ values V−j = {vk : k = j} than it can deduce from vj and f (v1,..., vK ).

C3: No trusted third party—human or machine—is part of the process.

C4: Semi-honesty. Agencies perform agreed-upon computations correctly using their true data. However, they are permitted to retain the results of intermediate computations.

The computer science literature contains many papers on the theory of SMPC;

general references are Goldwasser (1997) and Yao (1982). There are many fewer implemented algorithms, let alone functioning software systems.

In §3 we employ secure summation (Benaloh, 1987): f (v1,..., vK ) = v1 + · · · + vk,

denoted by V. The steps are as follows:

Initialization: Agency 1 generates and retains a very large, complex random number R, adds R to its value v1, and sends R + v1 to agency 2.

Iteration: Agency 2 adds its value v2 to R + v1, sends the result to agency 3, and so on.

Sharing: Finally, agency 1 receives R+v1 +· · ·+vK = R+V from agency K, subtracts R, and shares the result V with the other agencies.

There are issues with secure summation. First, it needs a “good” random number, in particular, one not ending in a string of zeroes, and not recoverable by guessing 1 There are some subtleties associated with “the same;” see §3.4.

the seed. Second, collusion is possible: agencies j − 1 and j + 1, without sharing

private information, can determine vj.2 Production-quality implementation is subtle:

the process must be safe from outsiders, masqueraders, and (if there is one) a central server. Finally, secure summation is not a Nash equilibrium: it breaks if semi-honesty fails (Karr et al., 2007).

3 Secure Analysis of Horizontally Partitioned Data The protocols described here are based on one underlying idea: if the analysis uses sufficient statistics that are additive across agencies, then the agencies can use secure summation to compute and share the sufficient statistics, following that each agency completes the analysis on its own.

–  –  –

or, as in some implementations, by hiding the order from the agencies.

Figure 1: Scatterplots for the example discussed in §3.2. In all plots, the regression coefficients for the four-company regression appear on the x-axis. Left: y-axis contains regression coefficients for company 1 alone. Center: y-axis contains regression coefficients for company 2 alone. Right: y-axis contains regression coefficients for company 4 alone.

–  –  –

Therefore, [X y]T [X y] can be computed entrywise using secure summation, and each ˆ agency can then calculate β using (2).

ˆ Calculation of β is only part of a valid, useful regression. A variety of other objects can be calculated from [X y]T [X y], or using secure summation directly. These include the coefficient of determination R2, the least squares estimate S 2 of the error variance σ 2, and the “hat” matrix H = X(X T X)−1 X T, which can be used to identify outliers (Karr et al., 2005b, 2006b). It is also possible to use the secure data integration algorithm of Karr et al. (2007), together with methods for constructing (privacy-preserving) synthetic residuals in ordinary regressions (Reiter, 2003), in order to create secure synthetic residuals (Karr et al., 2006b).

3.2 Secure Regression: Example We illustrate with a data set of 1,318 chemical compounds (Karr et al., 2005a), in which the response is water solubility and the 91 predictors are a constant and 90 chemical features of the compounds. Four database-owning companies were created whose databases contain 499, 572, 16(!), and 231 compounds, respectively. Mimicking real-world heterogeneity, each company’s database contains compounds with features that are absent from all compounds in all of the other companies’ databases. This increases the incentive for companies to participate, because each can learn about the importance of features for which it has no data. Of course, company 3 has the greatest incentive to participate, since it cannot even do the regression on its own.

Figure 1 summarizes the results. The three panels are scatterplots of the 91 regression coefficients for companies 1, 2 and 4 (y-axis) against the coefficients for the global (four-company) regression (x-axis). Coefficients with y-values of zero correspond to features missing from each company’s database.

3.3 Other Analyses The “additive sufficient statistics” idea is broadly applicable, and here we describe several other contexts.

Secure Contingency Tables. The algorithm for secure data integration described in Karr et al. (2007) has an important indirect application—constructing contingency tables containing counts or sums.

Let D be a database containing only categorical attributes A1,..., AJ. The associated contingency table is the J-dimensional array T defined by

T (a1,..., aJ ) = #{r ∈ D : r1 = a1,..., rJ = aJ }, (4)

where each ai is a possible value of the categorical attribute Ai and ri is the ith attribute of record r. The J-tuple (a1,..., aJ ) is called the cell coordinates. The table T is a near-universal sufficient statistic, for example, for fitting log-linear models (Bishop et al., 1975).

The sparse representation of a table is the data structure of (cell coordinate, cell count) pairs

–  –  –

To securely build a contingency table from databases D1,..., DK requires the following steps:

List of Non-Zero Cells. Use secure data integration to build the list L of cells with non-zero counts. The “databases” being integrated are the agencies’ individual lists of cells with non-zero counts. The protocol in Karr et al. (2007) allows each agency not even to reveal in which cells it has data.

Non-Zero Cell Counts. For each cell in L, use secure summation to determine the associated count.

Secure Maximum Likelihood Estimation. Suppose now that the agencies’ databases partition a global database {xi } modeled as samples from an unknown density

f (θ, ·) belonging to an exponential family:

–  –  –

where Dk is the database of owner k.

Assuming that the agencies have agreed in advance on the model (5), they can use secure summation to compute each of the L terms within the brackets in (6), and then each can maximize the likelihood function by whatever means it wishes.

3.4 Problems with the Approach It may appear from §3.1 and §3.3 that the secure-summation-based approach is problemfree. This is anything but true.

Pre-specification. The process prevents us from being good statisticians. The analysis to be performed must be pre-specified—not only the model, but also variable transformations. There is no way to do exploratory data analysis or visualization.

No Protection Against Dishonesty. If all agencies but one are semi-honest, then that agency can not only ensure that it gets the right answer, but also that none of the other agencies get the right answer or are even aware when they don’t. To see this, suppose agency j puts an incorrect value [X j y j ]T in (3). Then once what the other agencies think is the correct value of [X y]T is calculated, it can subtract its incorrect value, add the correct value, and perform the regression. Unless the incorrect value is absurd, no other agency can detect that anything has happened.

The concept of partially trusted third parties (PTTPs) introduced in Karr et al.

(2007) reduces incentives to cheat at the expense of introducing a central server performing calculations that the agencies cannot. For instance, in the setting of §3.1, a ˆ PTTP could perform the secure summation to calculate [X y]T, but only share β (and T related quantities derived from [X y] ) with the agencies. Although the PTTP is trusted only with computed quantities, this may still be unacceptable to the agencies. There is no method to defeat a cheater who is content with wrecking the process and being the only one to know that it has been wrecked.

Data Heterogeneity. The notion that SMPC allows participants to learn only what is knowable from their input and the answer is most persuasive when agencies contribute approximately equally to the process. When they do not, the “information” surrendered can vary dramatically.

The simplest instance of this is when agencies have unequal database sizes.3 Figure 1 provides some insight: the larger the database, the more closely the global regression resembles a company’s regression.

A more pointed instance, in the example in §3.2, lies in the question “Should ComTo get a sense of this in a realistic context, populations of US states vary by a factor of 100.

Figure 2: Scatterplot comparing global regression coefficients (x-axis) to those for the regression excluding company 3 (y-axis).

panies 1, 2 and 4 Allow 3 to Participate?” Figure 2 addresses this question by plotting the estimated coefficients from the three-company regression (1, 2 and 4) against those for the global (four-company) regression. The relationship is strong, but not perfect, which leaves the question unanswered. See Risk-Utility Formulation below for related discussion.

A second issue is data heterogeneity across agencies. It is clear that there is something fundamentally different between the top panel in Figure 3, where the three agencies possess 2-dimensional data lying in the same region, and the bottom panel, where the x-variable ranges are very different. A properly performed process would elucidate a global quadratic structure in the data that is invisible to each agency on its own, but the Pre-specification issue above could keep such a model from being considered.

There is a deep point here. In the “top panel of Figure 3” context, the only real benefit to the agencies is increased sample size, whereas in the “bottom panel of Figure 3” context, there is a dramatic—but given current knowledge, unattainable—benefit, if only the right analysis were done.

Problems also arise when there is differential model fit across agencies. In the regression setting, if the global fit of the global model is good, but the fit of the global model to an agency’s data is poor, then it has potentially learned more about the other agencies’ data than they have about its data.

Pages:   || 2 | 3 |

Similar works:

«Materials 2013, 6, 1138-1158; doi:10.3390/ma6031138 OPEN ACCESS materials ISSN 1996-1944 www.mdpi.com/journal/materials Review Applications of Carbon Nanotubes for Lithium Ion Battery Anodes Zhili Xiong, Young Soo Yun and Hyoung-Joon Jin * Department of Polymer Science and Engineering, Inha University, Incheon 402-751, Korea; E-Mails: chilebear88@inha.edu (Z.X.); ysyun@inha.edu (Y.S.Y.) * Author to whom correspondence should be addressed; E-Mail: hjjin@inha.ac.kr; Tel.: +82-32-860-7483; Fax:...»

«CLIENT PUBLICATION ANTITRUST 10 September 2015 The General Court Judgments in Cathode Ray Tubes: A Reminder of Key Principles Concerning Cartel Enforcement In addition to annulling the fine of EUR28 million imposed If you wish to receive more information on the topics covered in this individually on Toshiba and reducing the fine imposed on publication, you may reach out to your Panasonic, the appeals against the European Commission’s decision regular Shearman & Sterling contact person or any...»

«Key Dates 2015-16 Monday 28 September Semester 1 and Autumn Term begin Tuesday 29 September Undergraduate induction events (see separate schedule) Wednesday 30 September Undergraduate and Postgraduate induction events (see separate schedule) Thursday 1 October Undergraduate induction events (see separate schedule) Friday 11 December Autumn Term ends Monday 11 January Spring Term begins Friday 22 January Semester 1 ends Monday 25 January Semester 2 begins Friday 18 March Spring Term ends Monday...»

«s e n o J i l i e t u vo s l i t e R at Ū R a, 3 0 k n yG a, 2 0 10 i s s n 18 2 23 6 5 6 Сергей Темчин RUTHENICA BOHEMICA: НОВОе иССЛеДОВаНие О ПереВОДах чеШСКих ОриГиНаЛОВ На руСьКу МОВу julia Verkholantsev, Ruthenica Bohemica: Ruthenian Translations from Czech in the Grand Duchy of Lithuania and Poland, (Slavische Sprachgeschichte, Bd. 3), Wien–Berlin: Lit Verlag, 2008, 215 p. ISBN 978-3-7000-0851-4 (Österreich); ISBN...»

«Session Paper Using a Range of Methods to Collect Travel Data The Experience of the British National Travel Survey Stephanie Freeth Office for National Statistics ABSTRACT The National Travel Survey has provided up-to-date information on personal travel in England, Scotland and Wales since 1965. This paper describes the sampling and fieldwork procedures of the survey and highlights how consideration for data quality has shaped the survey’s operations and helped in developing procedures for...»

«Table of Contents Table of Contents Introduction 4 Instrument Cluster 10 Warning and control lights 10 Gauges 15 Entertainment Systems 17 AM/FM stereo cassette with CD 17 CD changer 21 Navigation system 25 Climate Controls 88 Dual automatic temperature control 88 Rear window defroster 91 Lights 92 Headlamps 92 Turn signal control 95 Bulb replacement 96 Driver Controls 101 Windshield wiper/washer control 101 Steering wheel adjustment 102 Power windows 114 Mirrors 115 Speed control 117 Message...»

«EINFLUSS VON PLASMAEXPANDERN AUF ENDOTHELABHÄNGIGE, VASOAKTIVE FAKTOREN AN HERZKRANZGEFÄßEN VON SCHWEINEN MAITE ANN-KATRIN KLISA INAUGURAL-DISSERTATION zur Erlangung des Grades eines Dr. med. vet. beim Fachbereich Veterinärmedizin der Justus-Liebig-Universität Gießen édition scientifique VVB LAUFERSWEILER VERLAG Das Werk ist in allen seinen Teilen urheberrechtlich geschützt. Jede Verwertung ist ohne schriftliche Zustimmung des Autors oder des Verlages unzulässig. Das gilt insbesondere...»

«Volume II, Issue III, July 2014 ISSN 2321-7065 A REVIEW OF INDIAN ENGLISH LITERATURE Nasib Kumari Student J.k. Memorial College of Education Barsana Mor Birhi Kalan Charkhi Dadri Abstract Indian English literature (IEL) refers to the body of work by writers in India who write in the English language and wIhose native or co-native language could be one of the numerous languages of India. It is also associated with the works of members of the Indian diaspora, such as V. S. Naipaul, Kiran Desai,...»

«SLOPE STABILITY ASSESSMENT REPORT Aniseed Valley Road, Lots 1&2 DP 12751 Prepared for: Tasman Consulting Engineers on behalf of Andrew Strange Date: May 2015 Project Reference: 15105 GEOadvice Limited | PO Box 3377 | Richmond 7050 Phone: 022 043 2282 | sigfrid@geoadvice.co.nz | www.geoadvice.co.nz SLOPE STABILITY ASSESSMENT REPORT Aniseed Valley Road, Lots 1&2 DP 12751 Limitations This report has been prepared solely for the benefit of Tasman Consulting Engineers, Andrew Strange, property...»

«Jahresbericht 2014/2015 1  Jahresbericht Berufsschule Rüti 2014/2015 Inhalt Editorial Personelles Schulkommission Lehrerschaft Betrieb und Verwaltung Dienstjubilare Schulangaben Aktivitäten 2014/2015 Bauliches Veranstaltungen 2  Jahresbericht Berufsschule Rüti 2014/2015 3  Jahresbericht Berufsschule Rüti 2014/2015 Editorial Sehr geehrte Dame, sehr geehrter Herr Schön, dass Sie Interesse haben, was an der Berufsschule Rüti im Schuljahr 2014/2015 geschehen ist. Im Rahmen der...»

«Stony Brook University The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook University. © Allll Riightts Reserved by Autthor. © A R gh s Reserved by Au hor “A jest’s prosperity lies in the ear / Of him that hears it, never in the tongue / Of him that makes it” (LLL V.ii.849-51) A Thesis Presented by Laura Konigsberg to The Graduate School in Partial Fulfillment of the Requirements for the...»

«Aachener Forum für Gemeinsame Sicherheitsund Verteidigungspolitik 24. November 2012 Thema des Vortrags: „Gemeinsame europäische Streitkräfte Sachstand und Perspektiven“ Sehr geehrte Damen und Herren, ich bedanke mich für die Einladung und freue mich sehr, heute erstmals beim Aachener Forum für Gemeinsame Sicherheitsund Verteidigungspolitik zu Ihnen sprechen zu dürfen. Sicherheitspolitische Lage Verehrtes Publikum, die Herausforderungen im 21. Jahrhundert sind so weit reichend und...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.