Index
Title Page
Copyright
Statistics for Data Science
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Transitioning from Data Developer to Data Scientist
Data developer thinking
Objectives of a data developer
Querying or mining
Data quality or data cleansing
Data modeling
Issue or insights
Thought process
Developer versus scientist
New data, new source
Quality questions
Querying and mining
Performance
Financial reporting
Visualizing
Tools of the trade
Advantages of thinking like a data scientist
Developing a better approach to understanding data
Using statistical thinking during program or database designing
Adding to your personal toolbox
Increased marketability
Perpetual learning
Seeing the future
Transitioning to a data scientist
Let's move ahead
Summary
Declaring the Objectives
Key objectives of data science
Collecting data
Processing data
Exploring and visualizing data
Analyzing the data and/or applying machine learning to the data
Deciding (or planning) based upon acquired insight
Thinking like a data scientist
Bringing statistics into data science
Common terminology
Statistical population
Probability
False positives
Statistical inference
Regression
Fitting
Categorical data
Classification
Clustering
Statistical comparison
Coding
Distributions
Data mining
Decision trees
Machine learning
Munging and wrangling
Visualization
D3
Regularization
Assessment
Cross-validation
Neural networks
Boosting
Lift
Mode
Outlier
Predictive modeling
Big Data
Confidence interval
Writing
Summary
A Developer's Approach to Data Cleaning
Understanding basic data cleaning
Common data issues
Contextual data issues
Cleaning techniques
R and common data issues
Outliers
Step 1 – Profiling the data
Step 2 – Addressing the outliers
Domain expertise
Validity checking
Enhancing data
Harmonization
Standardization
Transformations
Deductive correction
Deterministic imputation
Summary
Data Mining and the Database Developer
Data mining
Common techniques
Visualization
Cluster analysis
Correlation analysis
Discriminant analysis
Factor analysis
Regression analysis
Logistic analysis
Purpose
Mining versus querying
Choosing R for data mining
Visualizations
Current smokers
Missing values
A cluster analysis
Dimensional reduction
Calculating statistical significance
Frequent patterning
Frequent item-setting
Sequence mining
Summary
Statistical Analysis for the Database Developer
Data analysis
Looking closer
Statistical analysis
Summarization
Comparing groups
Samples
Group comparison conclusions
Summarization modeling
Establishing the nature of data
Successful statistical analysis
R and statistical analysis
Summary
Database Progression to Database Regression
Introducing statistical regression
Techniques and approaches for regression
Choosing your technique
Does it fit?
Identifying opportunities for statistical regression
Summarizing data
Exploring relationships
Testing significance of differences
Project profitability
R and statistical regression
A working example
Establishing the data profile
The graphical analysis
Predicting with our linear model
Step 1: Chunking the data
Step 2: Creating the model on the training data
Step 3: Predicting the projected profit on test data
Step 4: Reviewing the model
Step 5: Accuracy and error
Summary
Regularization for Database Improvement
Statistical regularization
Various statistical regularization methods
Ridge
Lasso
Least angles
Opportunities for regularization
Collinearity
Sparse solutions
High-dimensional data
Classification
Using data to understand statistical regularization
Improving data or a data model
Simplification
Relevance
Speed
Transformation
Variation of coefficients
Causal inference
Back to regularization
Reliability
Using R for statistical regularization
Parameter Setup
Summary
Database Development and Assessment
Assessment and statistical assessment
Objectives
Baselines
Planning for assessment
Evaluation
Development versus assessment
Planning
Data assessment and data quality assurance
Categorizing quality
Relevance
Cross-validation
Preparing data
R and statistical assessment
Questions to ask
Learning curves
Example of a learning curve
Summary
Databases and Neural Networks
Ask any data scientist
Defining neural network
Nodes
Layers
Training
Solution
Understanding the concepts
Neural network models and database models
No single or main node
Not serial
No memory address to store results
R-based neural networks
References
Data prep and preprocessing
Data splitting
Model parameters
Cross-validation
R packages for ANN development
ANN
ANN2
NNET
Black boxes
A use case
Popular use cases
Character recognition
Image compression
Stock market prediction
Fraud detection
Neuroscience
Summary
Boosting your Database
Definition and purpose
Bias
Categorizing bias
Causes of bias
Bias data collection
Bias sample selection
Variance
ANOVA
Noise
Noisy data
Weak and strong learners
Weak to strong
Model bias
Training and prediction time
Complexity
Which way?
Back to boosting
How it started
AdaBoost
What you can learn from boosting (to help) your database
Using R to illustrate boosting methods
Prepping the data
Training
Ready for boosting
Example results
Summary
Database Classification using Support Vector Machines
Database classification
Data classification in statistics
Guidelines for classifying data
Common guidelines
Definitions
Definition and purpose of an SVM
The trick
Feature space and cheap computations
Drawing the line
More than classification
Downside
Reference resources
Predicting credit scores
Using R and an SVM to classify data in a database
Moving on
Summary
Database Structures and Machine Learning
Data structures and data models
Data structures
Data models
What's the difference?
Relationships
Machine learning
Overview of machine learning concepts
Key elements of machine learning
Representation
Evaluation
Optimization
Types of machine learning
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Most popular
Applications of machine learning
Machine learning in practice
Understanding
Preparation
Learning
Interpretation
Deployment
Iteration
Using R to apply machine learning techniques to a database
Understanding the data
Preparing
Data developer
Understanding the challenge
Cross-tabbing and plotting
Summary