Statistics for Data Science by Miller, James D.

Index

Title Page Copyright

Statistics for Data Science

Credits About the Author About the Reviewer www.PacktPub.com

Why subscribe?

Customer Feedback Preface

What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support

Downloading the example code Downloading the color images of this book Errata Piracy Questions

Transitioning from Data Developer to Data Scientist

Data developer thinking Objectives of a data developer

Querying or mining

Data quality or data cleansing Data modeling Issue or insights Thought process

Developer versus scientist

New data, new source Quality questions Querying and mining Performance Financial reporting Visualizing Tools of the trade

Advantages of thinking like a data scientist

Developing a better approach to understanding data Using statistical thinking during program or database designing Adding to your personal toolbox Increased marketability Perpetual learning Seeing the future

Transitioning to a data scientist

Let's move ahead

Summary

Declaring the Objectives

Key objectives of data science

Collecting data Processing data Exploring and visualizing data Analyzing the data and/or applying machine learning to the data Deciding (or planning) based upon acquired insight

Thinking like a data scientist Bringing statistics into data science Common terminology

Statistical population Probability False positives Statistical inference Regression Fitting Categorical data Classification Clustering Statistical comparison Coding Distributions Data mining Decision trees Machine learning Munging and wrangling Visualization D3 Regularization Assessment Cross-validation Neural networks Boosting Lift Mode Outlier Predictive modeling Big Data Confidence interval Writing

Summary

A Developer's Approach to Data Cleaning

Understanding basic data cleaning

Common data issues Contextual data issues Cleaning techniques

R and common data issues

Outliers

Step 1 – Profiling the data Step 2 – Addressing the outliers

Domain expertise Validity checking Enhancing data Harmonization Standardization

Transformations Deductive correction Deterministic imputation Summary

Data Mining and the Database Developer

Data mining

Common techniques Visualization

Cluster analysis Correlation analysis Discriminant analysis Factor analysis Regression analysis Logistic analysis Purpose

Mining versus querying

Choosing R for data mining Visualizations

Current smokers

Missing values A cluster analysis

Dimensional reduction

Calculating statistical significance

Frequent patterning

Frequent item-setting

Sequence mining Summary

Statistical Analysis for the Database Developer

Data analysis

Looking closer

Statistical analysis Summarization

Comparing groups

Samples Group comparison conclusions

Summarization modeling

Establishing the nature of data Successful statistical analysis R and statistical analysis Summary

Database Progression to Database Regression

Introducing statistical regression

Techniques and approaches for regression

Choosing your technique Does it fit?

Identifying opportunities for statistical regression

Summarizing data Exploring relationships Testing significance of differences

Project profitability R and statistical regression A working example

Establishing the data profile

The graphical analysis

Predicting with our linear model

Step 1: Chunking the data Step 2: Creating the model on the training data Step 3: Predicting the projected profit on test data Step 4: Reviewing the model Step 5: Accuracy and error

Summary

Regularization for Database Improvement

Statistical regularization

Various statistical regularization methods Ridge Lasso Least angles Opportunities for regularization

Collinearity Sparse solutions High-dimensional data Classification

Using data to understand statistical regularization Improving data or a data model

Simplification Relevance Speed Transformation Variation of coefficients Causal inference Back to regularization Reliability

Using R for statistical regularization

Parameter Setup

Summary

Database Development and Assessment

Assessment and statistical assessment

Objectives Baselines Planning for assessment Evaluation

Development versus assessment

Planning

Data assessment and data quality assurance

Categorizing quality Relevance Cross-validation

Preparing data

R and statistical assessment

Questions to ask Learning curves

Example of a learning curve

Summary

Databases and Neural Networks

Ask any data scientist

Defining neural network

Nodes Layers Training Solution Understanding the concepts

Neural network models and database models

No single or main node Not serial No memory address to store results

R-based neural networks

References Data prep and preprocessing Data splitting Model parameters Cross-validation R packages for ANN development ANN ANN2 NNET Black boxes

A use case

Popular use cases Character recognition Image compression Stock market prediction Fraud detection Neuroscience

Summary

Boosting your Database

Definition and purpose

Bias

Categorizing bias Causes of bias Bias data collection Bias sample selection

Variance

ANOVA

Noise

Noisy data

Weak and strong learners

Weak to strong Model bias Training and prediction time Complexity Which way?

Back to boosting

How it started AdaBoost

What you can learn from boosting (to help) your database

Using R to illustrate boosting methods

Prepping the data Training Ready for boosting

Example results

Summary

Database Classification using Support Vector Machines

Database classification

Data classification in statistics Guidelines for classifying data

Common guidelines

Definitions

Definition and purpose of an SVM

The trick Feature space and cheap computations Drawing the line More than classification Downside Reference resources Predicting credit scores

Using R and an SVM to classify data in a database

Moving on

Summary

Database Structures and Machine Learning

Data structures and data models

Data structures Data models

What's the difference? Relationships

Machine learning

Overview of machine learning concepts Key elements of machine learning

Representation Evaluation Optimization

Types of machine learning

Supervised learning Unsupervised learning Semi-supervised learning Reinforcement learning