R: Unleash Machine Learning Techniques by Lesmeister, Cory

R: Unleash Machine Learning Techniques

Table of Contents R: Unleash Machine Learning Techniques Credits Preface

What this learning path covers What you need for this learning path Who this learning path is for Reader feedback Customer support

Downloading the example code Errata Piracy Questions

I. Module 1

1. Getting Started with R and Machine Learning

Delving into the basics of R

Using R as a scientific calculator Operating on vectors Special values

Data structures in R

Vectors

Creating vectors Indexing and naming vectors

Arrays and matrices

Creating arrays and matrices Names and dimensions Matrix operations

Lists

Creating and indexing lists Combining and converting lists

Data frames

Creating data frames Operating on data frames

Working with functions

Built-in functions User-defined functions Passing functions as arguments

Controlling code flow

Working with if, if-else, and ifelse Working with switch Loops

Advanced constructs

lapply and sapply apply tapply mapply

Next steps with R

Getting help Handling packages

Machine learning basics

Machine learning – what does it really mean? Machine learning – how is it used in the world? Types of machine learning algorithms

Supervised machine learning algorithms Unsupervised machine learning algorithms Popular machine learning packages in R

Summary

2. Let's Help Machines Learn

Understanding machine learning Algorithms in machine learning

Perceptron

Families of algorithms

Supervised learning algorithms

Linear regression K-Nearest Neighbors (KNN)

Collecting and exploring data Normalizing data Creating training and test data sets Learning from data/training the model Evaluating the model

Unsupervised learning algorithms

Apriori algorithm K-Means

Summary

3. Predicting Customer Shopping Trends with Market Basket Analysis

Detecting and predicting trends Market basket analysis

What does market basket analysis actually mean? Core concepts and definitions Techniques used for analysis Making data driven decisions

Evaluating a product contingency matrix

Getting the data Analyzing and visualizing the data Global recommendations Advanced contingency matrices

Frequent itemset generation

Getting started Data retrieval and transformation Building an itemset association matrix Creating a frequent itemsets generation workflow Detecting shopping trends

Association rule mining

Loading dependencies and data Exploratory analysis Detecting and predicting shopping trends Visualizing association rules

Summary

4. Building a Product Recommendation System

Understanding recommendation systems Issues with recommendation systems Collaborative filters

Core concepts and definitions The collaborative filtering algorithm

Predictions Recommendations Similarity

Building a recommender engine

Matrix factorization Implementation Result interpretation

Production ready recommender engines

Extract, transform, and analyze Model preparation and prediction Model evaluation

Summary

5. Credit Risk Detection and Prediction – Descriptive Analytics

Types of analytics Our next challenge What is credit risk? Getting the data Data preprocessing

Dealing with missing values Datatype conversions

Data analysis and transformation

Building analysis utilities Analyzing the dataset Saving the transformed dataset

Next steps

Feature sets Machine learning algorithms

Summary

6. Credit Risk Detection and Prediction – Predictive Analytics

Predictive analytics How to predict credit risk Important concepts in predictive modeling

Preparing the data Building predictive models Evaluating predictive models

Getting the data Data preprocessing Feature selection Modeling using logistic regression Modeling using support vector machines Modeling using decision trees Modeling using random forests Modeling using neural networks Model comparison and selection Summary

7. Social Media Analysis – Analyzing Twitter Data

Social networks (Twitter) Data mining @social networks

Mining social network data Data and visualization

Word clouds Treemaps Pixel-oriented maps Other visualizations

Getting started with Twitter APIs

Overview Registering the application Connect/authenticate Extracting sample tweets

Twitter data mining

Frequent words and associations Popular devices Hierarchical clustering Topic modeling

Challenges with social network data mining References Summary

8. Sentiment Analysis of Twitter Data

Understanding Sentiment Analysis

Key concepts of sentiment analysis

Subjectivity Sentiment polarity Opinion summarization Feature extraction

Approaches Applications Challenges

Sentiment analysis upon Tweets

Polarity analysis Classification-based algorithms

Labeled dataset Support Vector Machines Ensemble methods

Boosting Cross-validation

Summary

II. Module 2

1. Introducing Machine Learning

The origins of machine learning Uses and abuses of machine learning

Machine learning successes The limits of machine learning Machine learning ethics

How machines learn

Data storage Abstraction Generalization Evaluation

Machine learning in practice

Types of input data Types of machine learning algorithms Matching input data to algorithms

Machine learning with R

Installing R packages Loading and unloading R packages

Summary

2. Managing and Understanding Data

R data structures

Vectors Factors Lists Data frames Matrixes and arrays

Managing data with R

Saving, loading, and removing R data structures Importing and saving data from CSV files

Exploring and understanding data

Exploring the structure of data Exploring numeric variables

Measuring the central tendency – mean and median Measuring spread – quartiles and the five-number summary Visualizing numeric variables – boxplots Visualizing numeric variables – histograms Understanding numeric data – uniform and normal distributions Measuring spread – variance and standard deviation

Exploring categorical variables

Measuring the central tendency – the mode

Exploring relationships between variables

Visualizing relationships – scatterplots Examining relationships – two-way cross-tabulations

Summary

3. Lazy Learning – Classification Using Nearest Neighbors

Understanding nearest neighbor classification

The k-NN algorithm

Measuring similarity with distance Choosing an appropriate k Preparing data for use with k-NN

Why is the k-NN algorithm lazy?

Example – diagnosing breast cancer with the k-NN algorithm

Step 1 – collecting data Step 2 – exploring and preparing the data

Transformation – normalizing numeric data Data preparation – creating training and test datasets

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Transformation – z-score standardization Testing alternative values of k

Summary

4. Probabilistic Learning – Classification Using Naive Bayes

Understanding Naive Bayes

Basic concepts of Bayesian methods

Understanding probability Understanding joint probability Computing conditional probability with Bayes' theorem

The Naive Bayes algorithm

Classification with Naive Bayes The Laplace estimator Using numeric features with Naive Bayes

Example – filtering mobile phone spam with the Naive Bayes algorithm

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – cleaning and standardizing text data Data preparation – splitting text documents into words Data preparation – creating training and test datasets Visualizing text data – word clouds Data preparation – creating indicator features for frequent words

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Summary

5. Divide and Conquer – Classification Using Decision Trees and Rules

Understanding decision trees

Divide and conquer The C5.0 decision tree algorithm

Choosing the best split Pruning the decision tree

Example – identifying risky bank loans using C5.0 decision trees

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – creating random training and test datasets

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Boosting the accuracy of decision trees Making some mistakes more costly than others

Understanding classification rules

Separate and conquer The 1R algorithm The RIPPER algorithm Rules from decision trees What makes trees and rules greedy?

Example – identifying poisonous mushrooms with rule learners

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Summary

6. Forecasting Numeric Data – Regression Methods

Understanding regression

Simple linear regression Ordinary least squares estimation Correlations Multiple linear regression

Example – predicting medical expenses using linear regression

Step 1 – collecting data Step 2 – exploring and preparing the data

Exploring relationships among features – the correlation matrix Visualizing relationships among features – the scatterplot matrix

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Model specification – adding non-linear relationships Transformation – converting a numeric variable to a binary indicator Model specification – adding interaction effects Putting it all together – an improved regression model

Understanding regression trees and model trees

Adding regression to trees

Example – estimating the quality of wines with regression trees and model trees

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data

Visualizing decision trees

Step 4 – evaluating model performance

Measuring performance with the mean absolute error

Step 5 – improving model performance

Summary

7. Black Box Methods – Neural Networks and Support Vector Machines

Understanding neural networks

From biological to artificial neurons Activation functions Network topology

The number of layers The direction of information travel The number of nodes in each layer

Training neural networks with backpropagation

Example – Modeling the strength of concrete with ANNs

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Understanding Support Vector Machines

Classification with hyperplanes

The case of linearly separable data The case of nonlinearly separable data

Using kernels for non-linear spaces

Example – performing OCR with SVMs

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Summary

8. Finding Patterns – Market Basket Analysis Using Association Rules

Understanding association rules

The Apriori algorithm for association rule learning Measuring rule interest – support and confidence Building a set of rules with the Apriori principle

Example – identifying frequently purchased groceries with association rules

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – creating a sparse matrix for transaction data Visualizing item support – item frequency plots Visualizing the transaction data – plotting the sparse matrix

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Sorting the set of association rules Taking subsets of association rules Saving association rules to a file or data frame

Summary

9. Finding Groups of Data – Clustering with k-means

Understanding clustering

Clustering as a machine learning task The k-means clustering algorithm

Using distance to assign and update clusters Choosing the appropriate number of clusters

Example – finding teen market segments using k-means clustering

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – dummy coding missing values Data preparation – imputing the missing values

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Summary

10. Evaluating Model Performance

Measuring performance for classification

Working with classification prediction data in R A closer look at confusion matrices Using confusion matrices to measure performance Beyond accuracy – other measures of performance

The kappa statistic Sensitivity and specificity Precision and recall The F-measure

Visualizing performance trade-offs

ROC curves

Estimating future performance

The holdout method

Cross-validation Bootstrap sampling

Summary

11. Improving Model Performance

Tuning stock models for better performance

Using caret for automated parameter tuning

Creating a simple tuned model Customizing the tuning process

Improving model performance with meta-learning

Understanding ensembles Bagging Boosting Random forests

Training random forests Evaluating random forest performance

Summary

12. Specialized Machine Learning Topics

Working with proprietary files and databases

Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files Querying data in SQL databases

Working with online data and services

Downloading the complete text of web pages Scraping data from web pages

Parsing XML documents Parsing JSON from web APIs

Working with domain-specific data

Analyzing bioinformatics data Analyzing and visualizing network data

Improving the performance of R

Managing very large datasets

Generalizing tabular data structures with dplyr Making data frames faster with data.table Creating disk-based data frames with ff Using massive matrices with bigmemory

Learning faster with parallel computing

Measuring execution time Working in parallel with multicore and snow Taking advantage of parallel with foreach and doParallel Parallel cloud computing with MapReduce and Hadoop

GPU computing Deploying optimized learning algorithms

Building bigger regression models with biglm Growing bigger and faster random forests with bigrf Training and evaluating models in parallel with caret

Summary

III. Module 3

1. A Process for Success

The process Business understanding

Identify the business objective Assess the situation Determine the analytical goals Produce a project plan

Data understanding Data preparation Modeling Evaluation Deployment Algorithm flowchart Summary

2. Linear Regression – The Blocking and Tackling of Machine Learning

Univariate linear regression

Business understanding

Multivariate linear regression

Business understanding Data understanding and preparation Modeling and evaluation

Other linear model considerations

Qualitative feature Interaction term

Summary

3. Logistic Regression and Discriminant Analysis

Classification methods and linear regression Logistic regression

Business understanding Data understanding and preparation Modeling and evaluation

The logistic regression model Logistic regression with cross-validation

Discriminant analysis overview Discriminant analysis application

Model selection Summary

4. Advanced Feature Selection in Linear Models

Regularization in a nutshell

Ridge regression LASSO Elastic net

Business case

Business understanding Data understanding and preparation

Modeling and evaluation

Best subsets Ridge regression LASSO Elastic net Cross-validation with glmnet

Model selection Summary

5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines

K-Nearest Neighbors Support Vector Machines Business case

Business understanding Data understanding and preparation Modeling and evaluation

KNN modeling SVM modeling

Model selection

Feature selection for SVMs Summary

6. Classification and Regression Trees

Introduction An overview of the techniques

Regression trees Classification trees Random forest Gradient boosting

Business case

Modeling and evaluation

Regression tree Classification tree Random forest regression Random forest classification Gradient boosting regression Gradient boosting classification

Model selection

Summary

7. Neural Networks

Neural network Deep learning, a not-so-deep overview Business understanding Data understanding and preparation Modeling and evaluation An example of deep learning

H2O background Data preparation and uploading it to H2O Create train and test datasets Modeling

Summary

8. Cluster Analysis

Hierarchical clustering

Distance calculations

K-means clustering Gower and partitioning around medoids

Gower PAM Business understanding

Data understanding and preparation Modeling and evaluation

Hierarchical clustering K-means clustering Clustering with mixed data

Summary

9. Principal Components Analysis

An overview of the principal components

Rotation Business understanding Data understanding and preparation

Modeling and evaluation

Component extraction Orthogonal rotation and interpretation Creating factor scores from the components Regression analysis

Summary

10. Market Basket Analysis and Recommendation Engines

An overview of a market basket analysis Business understanding Data understanding and preparation Modeling and evaluation An overview of a recommendation engine

User-based collaborative filtering Item-based collaborative filtering Singular value decomposition and principal components analysis

Business understanding and recommendations Data understanding, preparation, and recommendations Modeling, evaluation, and recommendations Summary

11. Time Series and Causality

Univariate time series analysis

Bivariate regression Granger causality Business understanding Data understanding and preparation

Modeling and evaluation

Univariate time series forecasting Time series regression Examining the causality

Summary

12. Text Mining

Text mining framework and methods Topic models

Other quantitative analyses Business understanding Data understanding and preparation

Modeling and evaluation

Word frequency and topic models Additional quantitative analysis

Summary

A. R Fundamentals

Introduction Getting R up and running Using R Data frames and matrices Summary stats Installing and loading the R packages Summary

A. Bibliography Index
