
Computational Statistics in Data Science
Description
In Computational Statistics in Data Science, a team of distinguished mathematicians and statisticians delivers an expert compilation of concepts, theories, techniques, and practices in computational statistics for readers who seek a single, standalone sourcebook on statistics in contemporary data science. The book contains multiple sections devoted to key, specific areas in computational statistics, offering modern and accessible presentations of up-to-date techniques.
Computational Statistics in Data Science provides complimentary access to finalized entries in the Wiley StatsRef: Statistics Reference Online compendium. Readers will also find:
* A thorough introduction to computational statistics relevant and accessible to practitioners and researchers in a variety of data-intensive areas
* Comprehensive explorations of active topics in statistics, including big data, data stream processing, quantitative visualization, and deep learning
Perfect for researchers and scholars working in any field requiring intermediate and advanced computational statistics techniques, Computational Statistics in Data Science will also earn a place in the libraries of scholars researching and developing computational data-scientific technologies and statistical graphics.

Persons
WALTER W. PIEGORSCH is Professor of Mathematics at the University of Arizona and Director of Statistical Research & Education at the University's BIO5 Institute. He is also a former Chair of the UArizona Interdisciplinary Program in Statistics, and a past editor of the Journal of the American Statistical Association (Theory & Methods Section). He is a fellow of the American Statistical Association and an elected member of the International Statistical Institute.
RICHARD A. LEVINE is Professor of Statistics at San Diego State University and Faculty Advisor overseeing the Statistical Modeling Group in SDSU Analytic Studies and Institutional Research. He is former Chair of the SDSU Department of Mathematics and Statistics and past Editor of the Journal of Computational and Graphical Statistics. He is Associate Editor for Statistics of the Notices of the American Mathematical Society and is a fellow of the American Statistical Association.
HAO HELEN ZHANG is Professor of Mathematics at the University of Arizona and Chair of the UArizona Interdisciplinary Program in Statistics. She is Editor-in-Chief of STAT (the ISI journal) and Associate Editor of the Journal of the American Statistical Association and the Journal of the Royal Statistical Society. She is a fellow of the American Statistical Association and the Institute of Mathematical Statistics, and an elected member of the International Statistical Institute.
THOMAS C. M. LEE is Professor of Statistics and Associate Dean of the Faculty in Mathematical and Physical Sciences at the University of California, Davis. He is a former Chair of the Department of Statistics at the same institution and a past editor of the Journal of Computational and Graphical Statistics. He is an elected fellow of the American Association for the Advancement of Science, the American Statistical Association, and the Institute of Mathematical Statistics.
Content
List of Contributors
Preface
Part I Computational Statistics and Data Science
1 Computational Statistics and Data Science in the Twenty-first Century
Andrew J. Holbrook, Akihiko Nishimura, Xiang Ji, and Marc A. Suchard
1 Introduction
2 Core Challenges 1-3
3 Model-Specific Advances
4 Core Challenges 4 and 5
5 Rise of Data Science
2 Statistical Software
Alfred G. Schissler and Alexander D. Knudson
1 User Development Environments
2 Popular Statistical Software
3 Noteworthy Statistical Software and Related Tools
4 Promising and Emerging Statistical Software
5 The Future of Statistical Computing
6 Concluding Remarks
3 An Introduction to Deep Learning Methods
Yao Li, Justin Wang and Thomas C.M. Lee
1 Introduction
2 Machine Learning: An Overview
3 Feedforward Neural Networks
4 Convolutional Neural Networks
5 Autoencoders
6 Recurrent Neural Networks
7 Conclusion
4 Streaming Data and Data Streams
Taiwo Kolajo, Olawande Daramola, and Ayodele Adebiyi
1 Introduction
2 Data Stream Computing
3 Issues in Data Stream Mining
4 Streaming Data Tools and Technologies
5 Streaming Data Pre-Processing: Concept and Implementation
6 Streaming Data Algorithms
7 Strategies for Processing Data Streams
8 Best Practices for Managing Data Streams
9 Conclusion and the Way Forward
Part II Simulation-Based Methods
5 Monte Carlo Simulation: Are We There Yet?
Dootika Vats, James M. Flegal, and Galin L. Jones
1 Introduction
2 Estimation
3 Sampling Distribution
4 Estimating S
5 Stopping Rules
6 Workflow
7 Examples
6 Sequential Monte Carlo: Particle Filters and Beyond
Adam M. Johansen
1 Introduction
2 Sequential Importance Sampling and Resampling
3 SMC in Statistical Contexts
4 Selected Recent Developments
7 Markov Chain Monte Carlo Methods, A Survey with Some Frequent Misunderstandings
Christian P. Robert and Wu Changye
1 Introduction
2 Monte Carlo Methods
3 Markov Chain Monte Carlo Methods
4 Approximate Bayesian Computation
5 Further Reading
8 Bayesian Inference with Adaptive Markov Chain Monte Carlo
Matti Vihola
1 Introduction
2 Random-Walk Metropolis Algorithm
3 Adaptation of Random-Walk Metropolis
4 Multimodal Targets with Parallel Tempering
5 Dynamic Models with Particle Filters
6 Discussion
9 Advances in Importance Sampling
Víctor Elvira and Luca Martino
1 Introduction and Problem Statement
2 Importance Sampling
3 Multiple Importance Sampling (MIS)
4 Adaptive Importance Sampling (AIS)
Part III Statistical Learning
10 Supervised Learning
Weibin Mo and Yufeng Liu
1 Introduction
2 Penalized Empirical Risk Minimization
3 Linear Regression
4 Classification
5 Extensions for Complex Data
6 Discussion
11 Unsupervised and Semisupervised Learning
Jia Li and Vincent A. Pisztora
1 Introduction
2 Unsupervised Learning
3 Semisupervised Learning
4 Conclusions
12 Random Forest
Peter Calhoun, Xiaogang Su, Kelly M. Spoon, Richard A. Levine, and Juanjuan Fan
1 Introduction
2 Random Forest (RF)
3 Random Forest Extensions
4 Random Forests of Interaction Trees (RFIT)
5 Random Forest of Interaction Trees for Observational Studies
6 Discussion
13 Network Analysis
Rong Ma and Hongzhe Li
1 Introduction
2 Gaussian Graphical Models for Mixed Partial Compositional Data
3 Theoretical Properties
4 Graphical Model Selection
5 Analysis of a Microbiome-Metabolomics Data
6 Discussion
14 Tensors in Modern Statistical Learning
Will Wei Sun, Botao Hao, and Lexin Li
1 Introduction
2 Background
3 Tensor Supervised Learning
4 Tensor Unsupervised Learning
5 Tensor Reinforcement Learning
6 Tensor Deep Learning
15 Computational Approaches to Bayesian Additive Regression Trees
Hugh Chipman, Edward George, Richard Hahn, Robert McCulloch, Matthew Pratola, and Rodney Sparapani
1 Introduction
2 Bayesian CART
3 Tree MCMC
4 The BART Model
5 BART Example: Boston Housing Values and Air Pollution
6 BART MCMC
7 BART Extensions
8 Conclusion
Part IV High-Dimensional Data Analysis
16 Penalized Regression
Seung Jun Shin and Yichao Wu
1 Introduction
2 Penalization for Smoothness
3 Penalization for Sparsity
4 Tuning Parameter Selection
17 Model Selection in High-Dimensional Regression
Hao H. Zhang
1 Model Selection Problem
2 Model Selection in High-Dimensional Linear Regression
3 Interaction-Effect Selection for High-Dimensional Data
4 Model Selection in High-Dimensional Nonparametric Models
5 Concluding Remarks
18 Sampling Local Scale Parameters in High-Dimensional Regression Models
Anirban Bhattacharya and James E. Johndrow
1 Introduction
2 A Blocked Gibbs Sampler for the Horseshoe
3 Sampling (ξ, σ², β)
4 Sampling η
5 Appendix: A. Newton–Raphson Steps for the Inverse-cdf Sampler for η
19 Factor Modeling for High-Dimensional Time Series
Chun Yip Yau
1 Introduction
2 Identifiability
3 Estimation of High-Dimensional Factor Model
4 Determining the Number of Factors
Part V Quantitative Visualization
20 Visual Communication of Data: It Is Not a Programming Problem, It Is Viewer Perception
Edward Mulrow and Nola du Toit
1 Introduction
2 Case Studies Part 1
3 Let StAR Be Your Guide
4 Case Studies Part 2: Using StAR Principles to Develop Better Graphics
5 Ask Colleagues Their Opinion
6 Case Studies: Part 3
7 Iterate
8 Final Thoughts
21 Uncertainty Visualization
Lace Padilla, Matthew Kay, and Jessica Hullman
1 Introduction
2 Uncertainty Visualization Theories
3 General Discussion
22 Big Data Visualization
Leland Wilkinson
1 Introduction
2 Architecture for Big Data Analytics
3 Filtering
4 Aggregating
5 Analyzing
6 Big Data Graphics
7 Conclusion
23 Visualization-Assisted Statistical Learning
Catherine B. Hurley and Katarina Domijan
1 Introduction
2 Better Visualizations with Seriation
3 Visualizing Machine Learning Fits
4 Condvis2 Case Studies
5 Discussion
24 Functional Data Visualization
Marc G. Genton and Ying Sun
1 Introduction
2 Univariate Functional Data Visualization
3 Multivariate Functional Data Visualization
4 Conclusions
Part VI Numerical Approximation and Optimization
25 Gradient-Based Optimizers for Statistics and Machine Learning
Cho-Jui Hsieh
1 Introduction
2 Convex Versus Nonconvex Optimization
3 Gradient Descent
4 Proximal Gradient Descent: Handling Nondifferentiable Regularization
5 Stochastic Gradient Descent
26 Alternating Minimization Algorithms
David R. Hunter
1 Introduction
2 Coordinate Descent
3 EM as Alternating Minimization
3.1 Finite Mixture Models
4 Matrix Approximation Algorithms
5 Conclusion
27 A Gentle Introduction to Alternating Direction Method of Multipliers (ADMM) for Statistical Problems
Shiqian Ma and Mingyi Hong
1 Introduction
2 Two Perfect Examples of ADMM
3 Variable Splitting and Linearized ADMM
4 Multiblock ADMM
5 Nonconvex Problems
6 Stopping Criteria
7 Convergence Results of ADMM
28 Nonconvex Optimization via MM Algorithms: Convergence Theory
Kenneth Lange, Joong-Ho Won, Alfonso Landeros, and Hua Zhou
1 Background
2 Convergence Theorems
3 Paracontraction
4 Bregman Majorization
Part VII High-Performance Computing
29 Massive Parallelization
Robert B. Gramacy
1 Introduction
2 Gaussian Process Regression and Surrogate Modeling
3 Divide-and-Conquer GP Regression
4 Empirical Results
5 Conclusion
30 Divide-and-Conquer Methods for Big Data Analysis
Xueying Chen, Jerry Q. Cheng, and Min-ge Xie
1 Introduction
2 Linear Regression Model
3 Parametric Models
4 Nonparametric and Semiparametric Models
5 Online Sequential Updating
6 Splitting the Number of Covariates
7 Bayesian Divide-and-Conquer and Median-Based Combining
8 Real-World Applications
9 Discussion
31 Bayesian Aggregation
Yuling Yao
1 From Model Selection to Model Combination
2 From Bayesian Model Averaging to Bayesian Stacking
3 Asymptotic Theories of Stacking
4 Stacking in Practice
5 Discussion
32 Asynchronous Parallel Computing
Ming Yan
1 Introduction
2 Asynchronous Parallel Coordinate Update
3 Asynchronous Parallel Stochastic Approaches
4 Doubly Stochastic Coordinate Optimization with Variance Reduction
5 Concluding Remarks
1 Computational Statistics and Data Science in the Twenty-First Century
Andrew J. Holbrook¹, Akihiko Nishimura², Xiang Ji³, and Marc A. Suchard¹
¹ University of California, Los Angeles, CA, USA
² Johns Hopkins University, Baltimore, MD, USA
³ Tulane University, New Orleans, LA, USA
1 Introduction
We are in the midst of the data science revolution. In October 2012, the Harvard Business Review famously declared data scientist the sexiest job of the twenty-first century [1]. By September 2019, Google searches for the term "data science" had multiplied over sevenfold [2], one multiplicative increase for each intervening year. In the United States between the years 2000 and 2018, the number of bachelor's degrees awarded in either statistics or biostatistics increased over 10-fold (from 382 to 3964), and the number of doctoral degrees almost tripled (from 249 to 688) [3]. In 2020, seemingly every major university has established or is establishing its own data science institute, center, or initiative.
Data science [4, 5] combines multiple preexisting disciplines (e.g., statistics, machine learning, and computer science) with a redirected focus on creating, understanding, and systematizing workflows that turn real-world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed off and esoteric. Data science's emphasis on practical application only enhances the importance of computational statistics, the interface between statistics and computer science primarily concerned with the development of algorithms producing either statistical inference or predictions. Since both of these products comprise essential tasks in any data scientific workflow, we believe that the pan-disciplinary nature of data science only increases the number of opportunities for computational statistics to evolve by taking on new applications and serving the needs of new groups of researchers.
This is the natural role for a discipline that has increased the breadth of statistical application from the beginning. First put forward by R.A. Fisher in 1936 [6, 7], the permutation test allows the scientist (who owns a computer) to test hypotheses about a broader swath of functionals of a target population while making fewer statistical assumptions [8]. With a computer, the scientist uses the bootstrap [9, 10] to obtain confidence intervals for population functionals and parameters of models too complex for analytic methods. Newton–Raphson optimization and the Fisher scoring algorithm facilitate linear regression for binary, count, and categorical outcomes. More recently, Markov chain Monte Carlo (MCMC) has made Bayesian inference practical for massive, hierarchical, and highly structured models that are useful for the analysis of a significantly wider range of scientific phenomena.
Although computational statistics has historically broadened the range of statistical applications, certain central difficulties persist and will remain for the rest of the twenty-first century. In Section 2, we present the first class of Core Challenges, or challenges that are easily quantifiable for generic tasks. Core Challenge 1 is Big N, or statistical inference when the number "N" of observations or data points is large; Core Challenge 2 is Big P, or statistical inference when the model parameter count "P" is large; and Core Challenge 3 is Big M, or statistical inference when the model's objective or density function is multimodal (having many modes "M"). When large, each of these quantities brings its own unique computational difficulty. Since well over 2.5 exabytes (or $2.5 \times 10^{18}$ bytes) of data come into existence each day [15], we are confident that Core Challenge 1 will survive well into the twenty-second century.
But Core Challenges 2 and 3 will also endure: data complexity often increases with size, and researchers strive to understand increasingly complex phenomena. Because many examples of big data become "big" by combining heterogeneous sources, big data often necessitate big models. With the help of two recent examples, Section 3 illustrates how computational statisticians make headway at the intersection of big data and big models with model-specific advances. In Section 3.1, we present recent work in Bayesian inference for big N and big P regression. Beyond the simplified regression setting, data often come with structures (e.g., spatial, temporal, and network), and correct inference must take these structures into account. For this reason, we present novel computational methods for a highly structured and hierarchical model for the analysis of multistructured and epidemiological data in Section 3.2.
The growth of model complexity leads to new inferential challenges. While we define Core Challenges 1-3 in terms of generic target distributions or objective functions, Core Challenge 4 arises from inherent difficulties in treating complex models generically. Core Challenge 4 (Section 4.1) describes the difficulties and trade-offs that must be overcome to create fast, flexible, and friendly "algo-ware". This Core Challenge requires the development of statistical algorithms that maintain efficiency despite model structure and, thus, apply to a wider swath of target distributions or objective functions "out of the box". Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand. But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central processing units (CPU), graphics processing units (GPU), and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.
2 Core Challenges 1-3
Before providing two recent examples of twenty-first century computational statistics (Section 3), we present three easily quantified Core Challenges within computational statistics that we believe will always exist: big N, or inference from many observations; big P, or inference with high-dimensional models; and big M, or inference with nonconvex objective (or multimodal density) functions. In twenty-first century computational statistics, these challenges often co-occur, but we consider them separately in this section.
2.1 Big N
Having a large number of observations makes different computational methods difficult in different ways. In a worst-case scenario, the exact permutation test requires the production of $N!$ datasets. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets [8, 10]. When, say, population means are of interest, each Monte Carlo iteration requires summations involving $N$ expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression [16, 17]; it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale $O(N^3)$. As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression.
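As a rough illustration of the Monte Carlo permutation test mentioned above, the following Python sketch estimates a two-sided p-value for a difference in group means. It is not taken from the book: the function name `mc_permutation_test`, the choice of test statistic, the default of 10,000 permutations, and the add-one correction are all assumptions made for this example.

```python
import random


def mc_permutation_test(x, y, n_perm=10_000, seed=0):
    """Monte Carlo permutation test for a difference in means (two-sided).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def mean_diff(values):
        # Each evaluation sums over all N pooled observations.
        a, b = values[:n_x], values[n_x:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # one randomly produced dataset
        if abs(mean_diff(pooled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [1.0, 1.2, 0.9, 1.1, 1.3]
p_value = mc_permutation_test(group_a, group_b, n_perm=2000)
```

Every random relabeling costs one pass over all N pooled observations, so the total work grows with N times the number of permutations, which is where the cost of a large N shows up.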
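To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck. We are interested in performing Bayesian inference [18] based on a large vector of observations $x = (x_1, \dots, x_N)$. We specify our model for the data with a likelihood function $\pi(x \mid \theta) = \prod_{n=1}^{N} \pi(x_n \mid \theta)$ and use a prior distribution with density function $\pi(\theta)$ to characterize our belief about the value of the $P$-dimensional parameter vector $\theta$ a priori. The target of Bayesian inference is the posterior distribution of $\theta$ conditioned on $x$:
$$\pi(\theta \mid x) = \frac{\pi(x \mid \theta)\,\pi(\theta)}{\int \pi(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}. \tag{1}$$
The denominator's multidimensional integral quickly becomes impractical as $P$ grows large, so we choose to use the Metropolis–Hastings (M–H) algorithm to generate a Markov chain with stationary distribution $\pi(\theta \mid x)$ [19, 20]. We begin at an arbitrary position $\theta^{(0)}$ and, for each iteration $s = 0, \dots, S$, randomly generate the proposal state $\theta^{*}$ from the transition distribution with density $q(\theta^{*} \mid \theta^{(s)})$. We then accept proposal state $\theta^{*}$ with probability
$$a = \min\left(1, \frac{\pi(\theta^{*} \mid x)\, q(\theta^{(s)} \mid \theta^{*})}{\pi(\theta^{(s)} \mid x)\, q(\theta^{*} \mid \theta^{(s)})}\right) = \min\left(1, \frac{\pi(x \mid \theta^{*})\,\pi(\theta^{*})\, q(\theta^{(s)} \mid \theta^{*})}{\pi(x \mid \theta^{(s)})\,\pi(\theta^{(s)})\, q(\theta^{*} \mid \theta^{(s)})}\right). \tag{2}$$
The ratio on the right no longer depends on the denominator in Equation (1), but one must still compute the likelihood and its $N$ terms $\pi(x_n \mid \theta^{*})$.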
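The Metropolis–Hastings accept/reject step described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the source: a toy model with unit-variance normal observations and a scalar mean, a normal prior with standard deviation 10, and a symmetric Gaussian random-walk proposal (so the proposal densities cancel in the acceptance ratio). The function names and default tuning values are hypothetical.

```python
import math
import random


def log_likelihood(theta, data):
    # Sum of N terms: every call touches every observation (the bottleneck).
    return sum(-0.5 * (x - theta) ** 2 for x in data)


def log_prior(theta, scale=10.0):
    # Assumed Normal(0, scale^2) prior, up to an additive constant.
    return -0.5 * (theta / scale) ** 2


def metropolis_hastings(data, n_iter=5000, step=0.2, theta0=0.0, seed=1):
    rng = random.Random(seed)
    theta = theta0
    log_post = log_likelihood(theta, data) + log_prior(theta)
    chain = []
    for _ in range(n_iter):
        # Symmetric random-walk proposal, so q cancels in the ratio.
        proposal = theta + rng.gauss(0.0, step)
        log_post_prop = log_likelihood(proposal, data) + log_prior(proposal)
        log_ratio = log_post_prop - log_post
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta, log_post = proposal, log_post_prop
        chain.append(theta)
    return chain


data_rng = random.Random(0)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
chain = metropolis_hastings(data)
posterior_mean = sum(chain[1000:]) / len(chain[1000:])  # discard burn-in
```

Working on the log scale avoids numerical underflow when N is large, and caching the current state's log posterior means each iteration needs only one new pass over the data. That single pass is still a sum over all N observations.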
It is for this reason that likelihood evaluations are often the computational bottleneck for Bayesian inference. In the best case, these evaluations are $O(N)$, but there are many situations in which they scale $O(N^2)$ [21, 22] or worse. Indeed,...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.