Discovery in Complex or Massive Datasets:
Common Statistical Themes
A Workshop funded by the National Science Foundation
October 16-17, 2007
[Figure 1, top panel: "Bayes Meets SUSY" – the 5-D mSUGRA parameter space sampled using Markov Chain Monte Carlo with a prior flat in θM. Source: L. Roszkowski, PhyStat 2007, Statistical Issues at the LHC.]
Figure 1: Top: Bayes meets SUSY: a projection to 2 dimensions of a 5-dimensional posterior density, computed for a 5-dimensional model of possible physics beyond the current Standard Model of particle physics. Bottom: complex data types needed to predict function in the human genome (fragment shows ≥ 10⁶ basepairs).

Executive Summary

We report on a workshop, "Discovery in Complex or Massive Datasets: Common Statistical Themes", held in Washington, October 16-17, 2007, funded by NSF's Division of Mathematical Sciences. We connect with a later workshop, "Data Enabled Science in the Mathematical and Physical Sciences", held in Washington, March 29-30, 2010, funded by NSF's Directorate for Mathematical and Physical Sciences.
Research responding to important scientiﬁc and societal questions now requires the generation and understanding of vast amounts of often highly complex data. The 2007 workshop dealt with crosscutting issues arising in the analysis of such data sets with a particular focus on the role of statistical analysis. This was done through selected examples
matching scientific and societal interests. In particular, there were sessions on:
• Genomics and other areas of the biosciences that play a key role both in fundamental biology and in our current eﬀorts to cure human diseases.
• Computer models, with an emphasis on modeling in the atmospheric sciences, which plays a critical role in climate change forecasting.
• Finance, economics, and risk management, focusing on problems of financial and other economic forecasting and also on analysis of the flow of potential new regulatory data.
• Particle physics and astrophysics, pointing to a plethora of needs and issues, including scientific questions such as solving massive inverse problems as they arise in the study of dark energy, statistical modeling of galactic filamentary structures, and policy issues such as determining resource allocation among expensive experiments.
• Network modeling, pointing to an old type of data appearing with new complexity and size from many sources: the Internet, ecological networks, biochemical pathways, etc.
In addition, there were two cross-cutting sessions:
• Sparsity, which reﬂects how simply we can represent information, has been recognized as the key feature that the new massive data sets must have for us to analyze them at all.
Sparsity figures prominently in compressed sensing, now a major topic as the number and types of detectors, and the amount of data they can generate, have grown exponentially.
• Machine Learning, which developed in computer science and statistics to integrate computational considerations with data modeling. Methods such as clustering look for sparsity or, more generally, structure in the data. The field's principles are entirely statistical. Its methods play an important role in speech recognition, document retrieval, web search, computer vision, bioinformatics, neuroscience, and many other areas.
In its treatment of analysis, the 2007 Workshop foreshadowed the 2010 Data Enabled Science Workshop¹, although the latter examined and gave policy recommendations for all divisions in the directorate, rather than focusing on the nature of the science in one subdiscipline. But the same themes came up, with all or most divisional sections stressing the growth in size and complexity of data, interdisciplinary collaboration as key to modern progress, and the need for the development of common large databases for analysis. The use of such existing databases in the biomedical sciences and astrophysics was implicit in the presentations of the 2007 workshop. More broadly, advances in statistics and mathematics will be crucial for the development of Data Enabled Science (DES) in other disciplines.
In their respective ways both workshops point to the need to support organization and analysis of our massive and high-dimensional data sets as a key to future advances.
¹ [...] and a 2010 E.U. report, "Riding the wave: How Europe can gain from the rising tide of scientific data".
This document is the report of a Workshop on Discovery in Complex or Massive Datasets:
Common Statistical Themes, held October 16-17, 2007 in Washington, D.C. The idea and funding for the workshop came from Dr. Peter March, Director of the Division of Mathematical Sciences (DMS) at the National Science Foundation (NSF).
The impetus for the meeting was the observation that interdisciplinary research in statistics engages with so many fields of science that it is neither possible, nor perhaps appropriate, for DMS to fund all of it, either alone, or through partnerships – though successful examples of the latter certainly exist. At the same time, DMS is the primary disciplinary home for statistics within NSF, and so in particular is the primary locus within the Foundation for workforce development efforts in statistics. In such an environment, what ideas might guide DMS in its funding of statistics research?
The workshop and report develop the notion of “intersections” – that part of statistical methods and theory that has, or seems likely to have, impact in multiple scientiﬁc domains.
The intent for the short workshop was to be illustrative rather than encyclopedic. It is not, therefore, a report on the "future of statistics", and deliberately does not contain formal consensus recommendations. However, we hope that the sampling of research areas in this short report illustrates the existence of these intersectional topics and the importance of research into their development.
2 Introduction

The amount and complexity of data generated to support contemporary scientific investigation continues to grow rapidly, following its own type of Moore's Law. In domains from genomics to climate science, statisticians are actively engaged in interdisciplinary research teams. In some areas, automated processes collect and process huge amounts of information; in others, simulations of complex systems are designed to generate information about large-scale behavior; and in still other areas, the very sources of data are products of the information age.
There is substantial current activity to develop statistical ideas, methods and software in many of these domains, which include astronomy, genomics, climate science, ﬁnancial market analysis and sensor networks. Statisticians are engaged in (often large) interdisciplinary teams, and frequently receive signiﬁcant research support from the relevant scientiﬁc discipline.
The history of statistics shows that, while frequently arising initially in response to challenges in specific scientific domains, statistical methods and associated theory often achieve broader success and power by being subsequently applied to subjects far removed from those of origin. Well known examples include the analysis of variance, proportional hazards models and the application of sparsity ideas in signal recovery.
We see enormous opportunity, then, in advancing the study of the “intersections” arising from statistical research in today’s Age of Information – statistical problems, theories (including probabilistic models), tools and methods that arise in or are relevant to multiple domains of scientiﬁc enquiry, and as such, are moving or should move into the “core”.
The workshop aimed to enumerate some of today’s most intellectually compelling challenges arising out of these intersections, and was guided by the hope of stimulating future research advances that will extend and enhance our data analytic toolkit for scientiﬁc discovery.
In order to have a title with some focus that is nonetheless broadly inclusive, we chose "Discovery in Complex or Massive Datasets: Common Statistical Themes". Here "massive" means large relative to existing capability in some way, including, but not restricted to, many cases (sample size), many variables (dimension), or many datasets (sensor networks).
The workshop took a broad view of research in statistics, and included researchers who may not identify themselves as statisticians yet who feel that advances in statistics are central to advances in science and society.
The body of the report contains short summaries of each of the sessions at the workshop.
In this introduction, we illustrate three of the themes with brief paragraphs, indicating in parentheses the sessions in which these themes come up explicitly or implicitly. We conclude with some reflections on national needs that will be served by a focus on statistical intersections.
Sparsity. [§3.1, 3.2, 3.3, 3.4, 3.6] A preference for parsimony in scientific theories – captured in principles such as "Occam's razor" – has long influenced statistical modeling and estimation. The size of contemporary datasets and the number of variables collected make the search for, and exploitation of, sparsity even more important. For example, out of a huge list of proteins or genes, only an (unknown) few may be active in a particular metabolic or disease process, or sharp changes in a generally smooth signal or image may occur at a small number of points or boundaries. The sparsity of representation may be "hidden": revealed only with the use of new function systems such as wavelets or curvelets.
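To make the idea of hidden sparsity concrete, the short sketch below (a minimal illustration of our own, not drawn from any workshop presentation; the signal, its length and the number of retained coefficients are all invented) transforms a noisy piecewise-constant signal into an orthonormal Haar wavelet basis, keeps only a handful of the largest coefficients, and reconstructs a close approximation: dense in the raw samples, sparse in the transformed coordinates.

# A minimal sketch of "hidden" sparsity (illustrative values throughout): a signal
# that looks dense sample-by-sample has only a few large coefficients in a Haar
# wavelet basis, so keeping the largest coefficients reconstructs it well.
import numpy as np

def haar_forward(x):
    # Multi-level orthonormal Haar transform of a length-2^J signal.
    x = x.astype(float).copy()
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # smooth (scaling) part
        dif = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail (wavelet) part
        details.append(dif)
        x = avg
    details.append(x)                              # final scaling coefficient
    return np.concatenate(details[::-1])

def haar_inverse(c):
    # Invert haar_forward.
    x, pos = c[:1].astype(float), 1
    while pos < len(c):
        dif = c[pos:pos + len(x)]
        nxt = np.empty(2 * len(x))
        nxt[0::2] = (x + dif) / np.sqrt(2.0)
        nxt[1::2] = (x - dif) / np.sqrt(2.0)
        x, pos = nxt, pos + len(dif)
    return x

rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
signal = np.where(t < 100, 1.0, np.where(t < 180, 3.0, 0.5))   # piecewise constant
noisy = signal + 0.1 * rng.standard_normal(n)

coeffs = haar_forward(noisy)
k = 20                                   # keep only the 20 largest coefficients
sparse = np.zeros_like(coeffs)
top = np.argsort(np.abs(coeffs))[-k:]
sparse[top] = coeffs[top]
approx = haar_inverse(sparse)
print("relative error using", k, "of", n, "coefficients:",
      np.linalg.norm(approx - signal) / np.linalg.norm(signal))

With real data the right basis (wavelets, curvelets, or something learned from the data) is rarely obvious in advance, which is precisely where many of the research questions lie.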
The theme of sparsity draws upon and stimulates research in many areas of mathematics, statistics and computing: harmonic analysis and approximation theory (for the development and properties of representations), numerical analysis and scientiﬁc computation (the associated algorithms), statistical theory and methods (techniques and properties when applied to noisy data).
Sparsity ideas have recently given birth to a new circle of ideas and technologies known collectively as "Compressed Sensing". It is common experience that many images can be compressed greatly without significant loss of information. So, why not design a data collection, or sensing, mechanism that collects only roughly the number of bits required for the compressed representation? It has recently been shown that this can be done, in a variety of settings in which sparsity is present, by a judicious introduction of random sampling.
A number of intellectual trends in mathematics and statistics have pointed toward and culminated in the articulation of the Compressed Sensing phenomenon: approximation theory, geometric functional analysis, random matrices and polytopes, robust statistics and statistical decision theory. Once articulated mathematically, CS has stimulated development of new algorithms in ﬁelds ranging from magnetic resonance imaging to analog-to-digital conversion to seismic imaging.
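The sketch below is a minimal, self-contained illustration of the compressed sensing recipe just described (all dimensions, the sensing matrix, and the reconstruction parameters are invented for illustration): an s-sparse vector of length n is observed through m ≪ n random projections and then recovered by iterative soft-thresholding, a basic ℓ1-type reconstruction algorithm.

# A minimal compressed-sensing sketch (illustrative parameters throughout):
# recover an s-sparse signal of length n from m << n random linear measurements
# using ISTA, a simple iterative soft-thresholding (l1) reconstruction.
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 400, 100, 8                          # signal length, measurements, nonzeros
x_true = np.zeros(n)
support = rng.choice(n, s, replace=False)
x_true[support] = 3.0 * rng.standard_normal(s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
y = A @ x_true                                 # the m measurements actually "sensed"

def ista(A, y, lam=0.02, iters=3000):
    # Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient steps.
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - y) / L                            # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # soft threshold
    return x

x_hat = ista(A, y)
top = np.argsort(np.abs(x_hat))[-s:]
print("relative recovery error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
print("true support recovered:", set(top) == set(support))

The same random-projection idea underlies the imaging and analog-to-digital applications mentioned above; the mathematical and algorithmic work lies in establishing when, and how accurately, such reconstructions succeed.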
Computer and Simulation-Based Models. [§3.1, 3.3, 3.4, 3.5] Mathematical models intended for computational simulation of complex real-world processes are a crucial ingredient in virtually every ﬁeld of science, engineering, medicine, and business, and in everyday life as well. Cellular telephones attempt to meet a caller’s needs by optimizing a network model that adapts to local data, and people threatened by hurricanes decide whether to stay or ﬂee depending on the predictions of a continuously updated computational model.
Growth in computing power and matching gains in algorithmic speed and accuracy have vastly increased the applicability and reliability of simulation—not only by drastically reducing simulation time, thus permitting solution of larger and larger problems, but also by allowing simulation of previously intractable problems.
The intellectual content of computational modeling comes from a variety of disciplines, including statistics and probability, applied mathematics, operations research, and computer science, and the application areas are remarkably diverse. Despite this diversity of methodology and application, there are common challenges in developing, evaluating and using complex computer models of processes. In trying to predict reality (with uncertainty bounds), some of the key issues that have arisen are:
• use of model approximations (emulators) as surrogates for expensive simulators, for calibration/prediction tasks and in optimization or decision support (see the sketch after this list);
• dealing with high-dimensional input spaces;
• validation and utilization of computer models in situations with very little data, and/or functional (possibly multivariate) outputs;
• non-homogeneity, including jumps and phase changes as we move around the input space;
• implementation and transfer of methodology into current practice;
• efficient MCMC algorithms and prior assessments;
• optimization and design.
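To illustrate the first item on this list, the following sketch (a toy construction of our own; the "simulator", kernel, and design points are invented, and a real simulator would be far more expensive and higher-dimensional) fits a Gaussian-process emulator to eight runs of a stand-in simulator and then predicts its output, with uncertainty, at two hundred untried inputs without further simulator calls.

# A minimal emulator sketch (toy example; real simulators are expensive and
# high-dimensional): fit a Gaussian-process surrogate to a few simulator runs,
# then predict output and uncertainty at untried inputs without new runs.
import numpy as np

def expensive_simulator(x):
    # Stand-in for a costly computer model; here just a cheap analytic function.
    return np.sin(3.0 * x) + 0.5 * x

def sq_exp_kernel(a, b, length=0.3, var=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

# The only "expensive" evaluations we allow: a small designed set of runs.
x_train = np.linspace(0.0, 2.0, 8)
y_train = expensive_simulator(x_train)

# Standard GP regression equations (noise-free interpolation plus jitter).
x_new = np.linspace(0.0, 2.0, 200)
K = sq_exp_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))
K_star = sq_exp_kernel(x_new, x_train)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
mean = K_star @ alpha                                        # emulator prediction
v = np.linalg.solve(L, K_star.T)
var = np.clip(np.diag(sq_exp_kernel(x_new, x_new)) - (v ** 2).sum(axis=0), 0.0, None)

print("max |emulator - simulator| over the grid:",
      float(np.max(np.abs(mean - expensive_simulator(x_new)))))
print("mean predictive standard deviation:", float(np.sqrt(var).mean()))

In realistic settings the surrogate is then embedded in calibration, optimization or decision-support loops, and the open questions on the list above (high-dimensional inputs, functional outputs, non-homogeneity, validation with little data) are exactly where matters become difficult.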
Clustering. [§3.1, 3.6, 3.7] Clustering is another important core problem in data analysis. It is analogous to sparsity in that (1) it involves statistically sound methods for reducing the dimensionality of data, and (2) it is a nexus for the research efforts of multiple overlapping communities. One general motivation for clustering is that there are often limitations on resources available for data analysis, an issue that is particularly pertinent for massive data sets. Most statistical algorithms run in time that is at least proportional to the number of data points, and many run in time that is quadratic or cubic in the number of variables or observations (e.g., least-squares regression, which is cubic in the number of variables).
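As a concrete, hedged illustration of clustering under such resource constraints (again a toy construction with invented data and parameters), the sketch below runs mini-batch k-means: each update touches only a small random batch, so the per-step cost depends on the batch size, the number of clusters and the dimension, rather than on the total number of points.

# A minimal sketch of clustering with bounded per-step cost (toy data and
# parameters): mini-batch k-means in the style of Sculley (2010) updates centers
# from small random batches, so each step is O(batch * k * d), independent of n.
import numpy as np

def minibatch_kmeans(X, k, batch=256, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    counts = np.zeros(k)
    for _ in range(iters):
        B = X[rng.choice(len(X), batch, replace=False)]
        # assign each batch point to its nearest current center
        nearest = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # move each chosen center toward the point with a per-center learning rate
        for j, x in zip(nearest, B):
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]
    return centers

# Synthetic demo: 30,000 points from three well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([m + rng.standard_normal((10_000, 2)) for m in means])
print(np.round(minibatch_kmeans(X, k=3), 1))   # should land near the three means

Whether such computationally frugal procedures retain the statistical guarantees of their batch counterparts is exactly the kind of question that sits at the intersection of statistics and computation.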