Cover
Half Title
Title Page
Copyright Page
Dedication
Table of Contents
Preface
1 Introduction
1.1 Rise of Big Data and Dimensionality
1.1.1 Biological sciences
1.1.2 Health sciences
1.1.3 Computer and information sciences
1.1.4 Economics and finance
1.1.5 Business and program evaluation
1.1.6 Earth sciences and astronomy
1.2 Impact of Big Data
1.3 Impact of Dimensionality
1.3.1 Computation
1.3.2 Noise accumulation
1.3.3 Spurious correlation
1.3.4 Statistical theory
1.4 Aim of High-dimensional Statistical Learning
1.5 What Big Data Can Do
1.6 Scope of the Book
2 Multiple and Nonparametric Regression
2.1 Introduction
2.2 Multiple Linear Regression
2.2.1 The Gauss-Markov theorem
2.2.2 Statistical tests
2.3 Weighted Least-Squares
2.4 Box-Cox Transformation
2.5 Model Building and Basis Expansions
2.5.1 Polynomial regression
2.5.2 Spline regression
2.5.3 Multiple covariates
2.6 Ridge Regression
2.6.1 Bias-variance tradeoff
2.6.2 ℓ2 penalized least squares
2.6.3 Bayesian interpretation
2.6.4 Ridge regression solution path
2.6.5 Kernel ridge regression
2.7 Regression in Reproducing Kernel Hilbert Space
2.8 Leave-one-out and Generalized Cross-validation
2.9 Exercises
3 Introduction to Penalized Least-Squares
3.1 Classical Variable Selection Criteria
3.1.1 Subset selection
3.1.2 Relation with penalized regression
3.1.3 Selection of regularization parameters
3.2 Folded-concave Penalized Least Squares
3.2.1 Orthonormal designs
3.2.2 Penalty functions
3.2.3 Thresholding by SCAD and MCP
3.2.4 Risk properties
3.2.5 Characterization of folded-concave PLS
3.3 Lasso and L1 Regularization
3.3.1 Nonnegative garrote
3.3.2 Lasso
3.3.3 Adaptive Lasso
3.3.4 Elastic Net
3.3.5 Dantzig selector
3.3.6 SLOPE and sorted penalties
3.3.7 Concentration inequalities and uniform convergence
3.3.8 A brief history of model selection
3.4 Bayesian Variable Selection
3.4.1 Bayesian view of the PLS
3.4.2 A Bayesian framework for selection
3.5 Numerical Algorithms
3.5.1 Quadratic programs
3.5.2 Least angle regression*
3.5.3 Local quadratic approximations
3.5.4 Local linear algorithm
3.5.5 Penalized linear unbiased selection*
3.5.6 Cyclic coordinate descent algorithms
3.5.7 Iterative shrinkage-thresholding algorithms
3.5.8 Projected proximal gradient method
3.5.9 ADMM
3.5.10 Iterative local adaptive majorization and minimization
3.5.11 Other methods and timeline
3.6 Regularization Parameters for PLS
3.6.1 Degrees of freedom
3.6.2 Extension of information criteria
3.6.3 Application to PLS estimators
3.7 Residual Variance and Refitted Cross-validation
3.7.1 Residual variance of Lasso
3.7.2 Refitted cross-validation
3.8 Extensions to Nonparametric Modeling
3.8.1 Structured nonparametric models
3.8.2 Group penalty
3.9 Applications
3.10 Bibliographical Notes
3.11 Exercises
4 Penalized Least Squares: Properties
4.1 Performance Benchmarks
4.1.1 Performance measures
4.1.2 Impact of model uncertainty
4.1.2.1 Bayes lower bounds for orthogonal design
4.1.2.2 Minimax lower bounds for general design
4.1.3 Performance goals, sparsity and sub-Gaussian noise
4.2 Penalized L0 Selection
4.3 Lasso and Dantzig Selector
4.3.1 Selection consistency
4.3.2 Prediction and coefficient estimation errors
4.3.3 Model size and least squares after selection
4.3.4 Properties of the Dantzig selector
4.3.5 Regularity conditions on the design matrix
4.4 Properties of Concave PLS
4.4.1 Properties of penalty functions
4.4.2 Local and oracle solutions
4.4.3 Properties of local solutions
4.4.4 Global and approximate global solutions
4.5 Smaller and Sorted Penalties
4.5.1 Sorted concave penalties and their local approximation
4.5.2 Approximate PLS with smaller and sorted penalties
4.5.3 Properties of LLA and LCA
4.6 Bibliographical Notes
4.7 Exercises
5 Generalized Linear Models and Penalized Likelihood
5.1 Generalized Linear Models
5.1.1 Exponential family
5.1.2 Elements of generalized linear models
5.1.3 Maximum likelihood
5.1.4 Computing MLE: Iteratively reweighted least squares
5.1.5 Deviance and analysis of deviance
5.1.6 Residuals
5.2 Examples
5.2.1 Bernoulli and binomial models
5.2.2 Models for count responses
5.2.3 Models for nonnegative continuous responses
5.2.4 Normal error models
5.3 Sparsest Solution in High Confidence Set
5.3.1 A general setup
5.3.2 Examples
5.3.3 Properties
5.4 Variable Selection via Penalized Likelihood
5.5 Algorithms
5.5.1 Local quadratic approximation
5.5.2 Local linear approximation
5.5.3 Coordinate descent
5.5.4 Iterative local adaptive majorization and minimization
5.6 Tuning Parameter Selection
5.7 An Application
5.8 Sampling Properties in Low-dimension
5.8.1 Notation and regularity conditions
5.8.2 The oracle property
5.8.3 Sampling properties with diverging dimensions
5.8.4 Asymptotic properties of GIC selectors
5.9 Properties under Ultrahigh Dimensions
5.9.1 The Lasso penalized estimator and its risk property
5.9.2 Strong oracle property
5.9.3 Numeric studies
5.10 Risk Properties
5.11 Bibliographical Notes
5.12 Exercises
6 Penalized M-estimators
6.1 Penalized Quantile Regression
6.1.1 Quantile regression
6.1.2 Variable selection in quantile regression
6.1.3 A fast algorithm for penalized quantile regression
6.2 Penalized Composite Quantile Regression
6.3 Variable Selection in Robust Regression
6.3.1 Robust regression
6.3.2 Variable selection in Huber regression
6.4 Rank Regression and Its Variable Selection
6.4.1 Rank regression
6.4.2 Penalized weighted rank regression
6.5 Variable Selection for Survival Data
6.5.1 Partial likelihood
6.5.2 Variable selection via penalized partial likelihood and its properties
6.6 Theory of Folded-concave Penalized M-estimator
6.6.1 Conditions on penalty and restricted strong convexity
6.6.2 Statistical accuracy of penalized M-estimator with folded concave penalties
6.6.3 Computational accuracy
6.7 Bibliographical Notes
6.8 Exercises
7 High Dimensional Inference
7.1 Inference in Linear Regression
7.1.1 Debias of regularized regression estimators
7.1.2 Choices of weights
7.1.3 Inference for the noise level
7.2 Inference in Generalized Linear Models
7.2.1 Desparsified Lasso
7.2.2 Decorrelated score estimator
7.2.3 Test of linear hypotheses
7.2.4 Numerical comparison
7.2.5 An application
7.3 Asymptotic Efficiency*
7.3.1 Statistical efficiency and Fisher information
7.3.2 Linear regression with random design
7.3.3 Partial linear regression
7.4 Gaussian Graphical Models
7.4.1 Inference via penalized least squares
7.4.2 Sample size in regression and graphical models
7.5 General Solutions*
7.5.1 Local semi-LD decomposition
7.5.2 Data swap
7.5.3 Gradient approximation
7.6 Bibliographical Notes
7.7 Exercises
8 Feature Screening
8.1 Correlation Screening
8.1.1 Sure screening property
8.1.2 Connection to multiple comparison
8.1.3 Iterative SIS
8.2 Generalized and Rank Correlation Screening
8.3 Feature Screening for Parametric Models
8.3.1 Generalized linear models
8.3.2 A unified strategy for parametric feature screening
8.3.3 Conditional sure independence screening
8.4 Nonparametric Screening
8.4.1 Additive models
8.4.2 Varying coefficient models
8.4.3 Heterogeneous nonparametric models
8.5 Model-free Feature Screening
8.5.1 Sure independent ranking screening procedure
8.5.2 Feature screening via distance correlation
8.5.3 Feature screening for high-dimensional categorical data
8.6 Screening and Selection
8.6.1 Feature screening via forward regression
8.6.2 Sparse maximum likelihood estimate
8.6.3 Feature screening via partial correlation
8.7 Refitted Cross-Validation
8.7.1 RCV algorithm
8.7.2 RCV in linear models
8.7.3 RCV in nonparametric regression
8.8 An Illustration
8.9 Bibliographical Notes
8.10 Exercises
9 Covariance Regularization and Graphical Models
9.1 Basic Facts about Matrices
9.2 Sparse Covariance Matrix Estimation
9.2.1 Covariance regularization by thresholding and banding
9.2.2 Asymptotic properties
9.2.3 Nearest positive definite matrices
9.3 Robust Covariance Inputs
9.4 Sparse Precision Matrix and Graphical Models
9.4.1 Gaussian graphical models
9.4.2 Penalized likelihood and M-estimation
9.4.3 Penalized least-squares
9.4.4 CLIME and its adaptive version
9.5 Latent Gaussian Graphical Models
9.6 Technical Proofs
9.6.1 Proof of Theorem 9.1
9.6.2 Proof of Theorem 9.3
9.6.3 Proof of Theorem 9.4
9.6.4 Proof of Theorem 9.6
9.7 Bibliographical Notes
9.8 Exercises
10 Covariance Learning and Factor Models
10.1 Principal Component Analysis
10.1.1 Introduction to PCA
10.1.2 Power method
10.2 Factor Models and Structured Covariance Learning
10.2.1 Factor model and high-dimensional PCA
10.2.2 Extracting latent factors and POET
10.2.3 Methods for selecting number of factors
10.3 Covariance and Precision Learning with Known Factors
10.3.1 Factor model with observable factors
10.3.2 Robust initial estimation of covariance matrix
10.4 Augmented Factor Models and Projected PCA
10.5 Asymptotic Properties
10.5.1 Properties for estimating loading matrix
10.5.2 Properties for estimating covariance matrices
10.5.3 Properties for estimating realized latent factors
10.5.4 Properties for estimating idiosyncratic components
10.6 Technical Proofs
10.6.1 Proof of Theorem 10.1
10.6.2 Proof of Theorem 10.2
10.6.3 Proof of Theorem 10.3
10.6.4 Proof of Theorem 10.4
10.7 Bibliographical Notes
10.8 Exercises
11 Applications of Factor Models and PCA
11.1 Factor-adjusted Regularized Model Selection
11.1.1 Importance of factor adjustments
11.1.2 FarmSelect
11.1.3 Application to forecasting bond risk premia
11.1.4 Application to a neuroblastoma data
11.1.5 Asymptotic theory for FarmSelect
11.2 Factor-adjusted Robust Multiple Testing
11.2.1 False discovery rate control
11.2.2 Multiple testing under dependence measurements
11.2.3 Power of factor adjustments
11.2.4 FarmTest
11.2.5 Application to neuroblastoma data
11.3 Factor Augmented Regression Methods
11.3.1 Principal component regression
11.3.2 Augmented principal component regression
11.3.3 Application to forecast bond risk premia
11.4 Applications to Statistical Machine Learning
11.4.1 Community detection
11.4.2 Topic model
11.4.3 Matrix completion
11.4.4 Item ranking
11.4.5 Gaussian mixture models
11.5 Bibliographical Notes
11.6 Exercises
12 Supervised Learning
12.1 Model-based Classifiers
12.1.1 Linear and quadratic discriminant analysis
12.1.2 Logistic regression
12.2 Kernel Density Classifiers and Naive Bayes
12.3 Nearest Neighbor Classifiers
12.4 Classification Trees and Ensemble Classifiers
12.4.1 Classification trees
12.4.2 Bagging
12.4.3 Random forests
12.4.4 Boosting
12.5 Support Vector Machines
12.5.1 The standard support vector machine
12.5.2 Generalizations of SVMs
12.6 Sparse Classifiers via Penalized Empirical Loss
12.6.1 The importance of sparsity under high-dimensionality
12.6.2 Sparse support vector machines
12.6.3 Sparse large margin classifiers
12.7 Sparse Discriminant Analysis
12.7.1 Nearest shrunken centroids classifier
12.7.2 Features annealed independent rule
12.7.3 Selection bias of sparse independence rules
12.7.4 Regularized optimal affine discriminant
12.7.5 Linear programming discriminant
12.7.6 Direct sparse discriminant analysis
12.7.7 Solution path equivalence between ROAD and DSDA
12.8 Feature Augmentation and Sparse Additive Classifiers
12.8.1 Feature augmentation
12.8.2 Penalized additive logistic regression
12.8.3 Semiparametric sparse discriminant analysis
12.9 Bibliographical Notes
12.10 Exercises
13 Unsupervised Learning
13.1 Cluster Analysis
13.1.1 K-means clustering
13.1.2 Hierarchical clustering
13.1.3 Model-based clustering
13.1.4 Spectral clustering
13.2 Data-driven Choices of the Number of Clusters
13.3 Variable Selection in Clustering
13.3.1 Sparse clustering
13.3.2 Sparse model-based clustering
13.3.3 Sparse mixture of experts model
13.4 An Introduction to High Dimensional PCA
13.4.1 Inconsistency of the regular PCA
13.4.2 Consistency under sparse eigenvector model
13.5 Sparse Principal Component Analysis
13.5.1 Sparse PCA
13.5.2 An iterative SVD thresholding approach
13.5.3 A penalized matrix decomposition approach
13.5.4 A semidefinite programming approach
13.5.5 A generalized power method
13.6 Bibliographical Notes
13.7 Exercises
14 An Introduction to Deep Learning
14.1 Rise of Deep Learning
14.2 Feed-forward Neural Networks
14.2.1 Model setup
14.2.2 Back-propagation in computational graphs
14.3 Popular Models
14.3.1 Convolutional neural networks
14.3.2 Recurrent neural networks
14.3.2.1 Vanilla RNNs
14.3.2.2 GRUs and LSTM
14.3.2.3 Multilayer RNNs
14.3.3 Modules
14.4 Deep Unsupervised Learning
14.4.1 Autoencoders
14.4.2 Generative adversarial networks
14.4.2.1 Sampling view of GANs
14.4.2.2 Minimum distance view of GANs
14.5 Training Deep Neural Nets
14.5.1 Stochastic gradient descent
14.5.1.1 Mini-batch SGD
14.5.1.2 Momentum-based SGD
14.5.1.3 SGD with adaptive learning rates
14.5.2 Easing numerical instability
14.5.2.1 ReLU activation function
14.5.2.2 Skip connections
14.5.2.3 Batch normalization
14.5.3 Regularization techniques
14.5.3.1 Weight decay
14.5.3.2 Dropout
14.5.3.3 Data augmentation
14.6 Example: Image Classification
14.7 Additional Examples Using TensorFlow and R
14.8 Bibliographical Notes
References
Author Index
Index