R: Data Analysis and Visualization, by Tony Fischetti

Table of Contents

Meet Your Course Guide

Course Structure

Course journey

The Course Roadmap and Timeline

I. Module 1: Data Analysis with R

1. RefresheR

Navigating the basics

Arithmetic and assignment Logicals and characters Flow of control

Getting help in R

Vectors

Subsetting Vectorized functions Advanced subsetting Recycling

Functions

Matrices

Loading data into R

Working with packages

2. The Shape of Data

Univariate data

Frequency distributions

Central tendency

Spread

Populations, samples, and estimation

Probability distributions

Visualization methods

3. Describing Relationships

Multivariate data Relationships between a categorical and a continuous variable Relationships between two categorical variables The relationship between two continuous variables

Covariance Correlation coefficients Comparing multiple correlations

Visualization methods

Categorical and continuous variables Two categorical variables Two continuous variables More than two continuous variables

4. Probability

Basic probability A tale of two interpretations Sampling from distributions

Parameters The binomial distribution

The normal distribution

The three-sigma rule and using z-tables

5. Using Data to Reason About the World

Estimating means The sampling distribution Interval estimation

How did we get 1.96?

Smaller samples

6. Testing Hypotheses

Null Hypothesis Significance Testing

One and two-tailed tests

When things go wrong

A warning about significance

A warning about p-values

Testing the mean of one sample

Assumptions of the one sample t-test

Testing two means

Don't be fooled! Assumptions of the independent samples t-test

Testing more than two means

Assumptions of ANOVA

Testing independence of proportions What if my assumptions are unfounded?

7. Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

8. Predicting Continuous Variables

Linear models Simple linear regression Simple linear regression with a binary predictor

A word of warning

Multiple regression Regression with a non-binary predictor Kitchen sink regression The bias-variance trade-off

Cross-validation Striking a balance

Linear regression diagnostics

Second Anscombe relationship Third Anscombe relationship Fourth Anscombe relationship

Advanced topics

9. Predicting Categorical Variables

k-Nearest Neighbors

Using k-NN in R

Confusion matrices Limitations of k-NN

Logistic regression

Using logistic regression in R

Decision trees Random forests Choosing a classifier

The vertical decision boundary The diagonal decision boundary The crescent decision boundary The circular decision boundary

10. Sources of Data

Relational Databases

Why didn't we just do that in SQL?

Using JSON XML Other data formats Online repositories

11. Dealing with Messy Data

Analysis with missing data

Visualizing missing data Types of missing data

So which one is it?

Unsophisticated methods for dealing with missing data

Complete case analysis

Pairwise deletion

Mean substitution

Hot deck imputation

Regression imputation

Stochastic regression imputation

Multiple imputation

So how does mice come up with the imputed values?

Methods of imputation

Multiple imputation in practice

Analysis with unsanitized data

Checking for out-of-bounds data

Checking the data type of a column

Checking for unexpected categories

Checking for outliers, entry errors, or unlikely data points

Chaining assertions

Other messiness

OpenRefine Regular expressions tidyr

12. Dealing with Large Data

Wait to optimize Using a bigger and faster machine Be smart about your code

Allocation of memory Vectorization

Using optimized packages Using another R implementation Use parallelization

Getting started with parallel R An example of (some) substance

Using Rcpp Be smarter about your code

13. Reproducibility and Best Practices

R Scripting

RStudio Running R scripts An example script Scripting and reproducibility

R projects Version control Communicating results

II. Module 2: R Graphs

1. R Graphics

Base graphics using the default package Trellis graphs using lattice Graphs inspired by Grammar of Graphics

2. Basic Graph Functions

Introduction Creating basic scatter plots

Getting ready How to do it... How it works... There's more...

A note on R's built-in datasets

See also

Creating line graphs

Getting ready How to do it... How it works... There's more... See also

Creating bar charts

Getting ready How to do it... How it works... There's more... See also

Creating histograms and density plots

How to do it... How it works... There's more... See also

Creating box plots

Getting ready How to do it... How it works... There's more... See also

Adjusting x and y axes' limits

How to do it... How it works... There's more... See also

Creating heat maps

How to do it... How it works... There's more... See also

Creating pairs plots

How to do it... How it works... There's more... See also

Creating multiple plot matrix layouts

How to do it... How it works... There's more... See also

Adding and formatting legends

Getting ready How to do it... How it works... There's more... See also

Creating graphs with maps

Getting ready How to do it... How it works... There's more... See also

Saving and exporting graphs

How to do it... How it works... There's more... See also

3. Beyond the Basics – Adjusting Key Parameters

Introduction Setting colors of points, lines, and bars

Getting ready How to do it... How it works... There's more... See also

Setting plot background colors

Getting ready How to do it... How it works... There's more...

Setting colors for text elements – axis annotations, labels, plot titles, and legends

Getting ready How to do it... How it works... There's more...

Choosing color combinations and palettes

Getting ready How to do it... How it works... There's more... See also

Setting fonts for annotations and titles

Getting ready How to do it... How it works... There's more... See also

Choosing plotting point symbol styles and sizes

Getting ready How to do it... How it works... There's more... See also

Choosing line styles and width

Getting ready How to do it... How it works... See also

Choosing box styles

Getting ready How to do it... How it works... There's more...

Adjusting axis annotations and tick marks

Getting ready How to do it... How it works... There's more... See also

Formatting log axes

Getting ready How to do it... How it works... There's more...

Setting graph margins and dimensions

Getting ready How to do it... How it works... See also

4. Creating Scatter Plots

Introduction Grouping data points within a scatter plot

Getting ready How to do it... How it works... There's more... See also

Highlighting grouped data points by size and symbol type

Getting ready How to do it... How it works...

Labeling data points

Getting ready How to do it... How it works... There's more...

Correlation matrix using pairs plots

Getting ready How to do it... How it works...

Adding error bars

Getting ready How to do it... How it works... There's more...

Using jitter to distinguish closely packed data points

Getting ready How to do it... How it works...

Adding linear model lines

Getting ready How to do it... How it works...

Adding nonlinear model curves

Getting ready How to do it... How it works...

Adding nonparametric model curves with lowess

Getting ready How to do it... How it works...

Creating three-dimensional scatter plots

Getting ready How to do it... How it works... There's more...

Creating Quantile-Quantile plots

Getting ready How to do it... How it works... There's more...

Displaying the data density on axes

Getting ready How to do it... How it works... There's more...

Creating scatter plots with a smoothed density representation

Getting ready How to do it... How it works... There's more...

5. Creating Line Graphs and Time Series Charts

Introduction Adding customized legends for multiple-line graphs

Getting ready How to do it... How it works... There's more... See also

Using margin labels instead of legends for multiple-line graphs

Getting ready How to do it... How it works... There's more...

Adding horizontal and vertical grid lines

Getting ready How to do it... How it works... There's more... See also

Adding marker lines at specific x and y values using abline

Getting ready How to do it... How it works... There's more...

Creating sparklines

Getting ready How to do it... How it works...

Plotting functions of a variable in a dataset

Getting ready How to do it... How it works... There's more...

Formatting time series data for plotting

Getting ready How to do it... How it works... There's more...

Plotting the date or time variable on the x axis

Getting ready How to do it... How it works... There's more...

Annotating axis labels in different human-readable time formats

Getting ready How to do it... How it works... There's more...

Adding vertical markers to indicate specific time events

Getting ready How to do it... How it works... There's more...

Plotting data with varying time-averaging periods

Getting ready How to do it... How it works...

Creating stock charts

Getting ready How to do it... How it works... There's more...

6. Creating Bar, Dot, and Pie Charts

Introduction Creating bar charts with more than one factor variable

Getting ready How to do it... How it works... See also

Creating stacked bar charts

Getting ready How to do it... How it works... There's more...

Adjusting the orientation of bars – horizontal and vertical

Getting ready How to do it... How it works... There's more...

Adjusting bar widths, spacing, colors, and borders

Getting ready How to do it... How it works... There's more...

Displaying values on top of or next to the bars

Getting ready How to do it... How it works... There's more... See also

Placing labels inside bars

Getting ready How to do it... How it works... There's more...

Creating bar charts with vertical error bars

Getting ready How to do it... How it works... There's more...

Modifying dot charts by grouping variables

Getting ready How to do it... How it works...

Making better, readable pie charts with clockwise-ordered slices

Getting ready How to do it... How it works... See also

Labeling a pie chart with percentage values for each slice

Getting ready How it works... There's more... See also

Adding a legend to a pie chart

Getting ready How to do it... How it works... There's more...

7. Creating Histograms

Introduction Visualizing distributions as count frequencies or probability densities

Getting ready How to do it... How it works... There's more

Setting the bin size and the number of breaks

Getting ready How to do it... How it works... There's more

Adjusting histogram styles – bar colors, borders, and axes

Getting ready How to do it... How it works... There's more

Overlaying a density line over a histogram

Getting ready How to do it... How it works...

Multiple histograms along the diagonal of a pairs plot

Getting ready How to do it... How it works...

Histograms in the margins of line and scatter plots

Getting ready How to do it... How it works...

8. Box and Whisker Plots

Introduction Creating box plots with narrow boxes for a small number of variables

Getting ready How to do it... How it works... There's more See also

Grouping over a variable

Getting ready How to do it... How it works... There's more See also

Varying box widths by the number of observations

Getting ready How to do it... How it works...

Creating box plots with notches

Getting ready How to do it... How it works... There's more

Including or excluding outliers

Getting ready How to do it... How it works... See also

Creating horizontal box plots

Getting ready How to do it... How it works...

Changing the box styling

Getting ready How to do it... How it works... There's more

Adjusting the extent of plot whiskers outside the box

Getting ready How to do it... How it works... There's more

Showing the number of observations

Getting ready How to do it... How it works... There's more

Splitting a variable at arbitrary values into subsets

Getting ready How to do it... How it works... There's more

9. Creating Heat Maps and Contour Plots

Introduction Creating heat maps of a single Z variable with a scale

Getting ready How to do it... How it works... There's more See also

Creating correlation heat maps

Getting ready How to do it... How it works... There's more

Summarizing multivariate data in a single heat map

Getting ready How to do it... How it works... There's more

Creating contour plots

Getting ready How to do it... How it works... There's more See also

Creating filled contour plots

Getting ready How to do it... How it works... There's more See also

Creating three-dimensional surface plots

Getting ready How to do it... How it works... There's more

Visualizing time series as calendar heat maps

Getting ready How to do it... How it works... There's more

10. Creating Maps

Introduction Plotting global data by countries on a world map

Getting ready How to do it... How it works... There's more See also

Creating graphs with regional maps

Getting ready How to do it... How it works... There's more

Plotting data on Google maps

Getting ready How to do it... How it works... There's more See also

Creating and reading KML data

Getting ready How to do it... How it works... See also

Working with ESRI shapefiles

Getting ready How to do it... How it works... There's more

11. Data Visualization Using Lattice

Introduction Creating bar charts

Getting ready How to do it… How it works… There's more… See also

Creating stacked bar charts

Getting ready How to do it… How it works… There's more… See also

Creating bar charts to visualize cross-tabulation

Getting ready How to do it… How it works… There's more…

Creating a conditional histogram

Getting ready How to do it… How it works… There's more… See also

Visualizing distributions through a kernel-density plot

Getting ready How to do it… How it works… There's more…

Creating a normal Q-Q plot

Getting ready How to do it… How it works… There's more…

Visualizing an empirical Cumulative Distribution Function

Getting ready How to do it… How it works… There's more…

Creating a boxplot

Getting ready How to do it… How it works… There's more…

Creating a conditional scatter plot

Getting ready How to do it… How it works… There's more…

12. Data Visualization Using ggplot2

Introduction Creating bar charts

Getting ready How to do it… How it works… There's more… See also

Creating multiple bar charts

Getting ready How to do it… How it works… There's more… See also

Creating a bar chart with error bars

Getting ready How to do it… How it works… There's more…

Visualizing the density of a numeric variable

Getting ready How to do it... How it works… There's more...

Creating a box plot

Getting ready How to do it... How it works…

Creating a layered plot with a scatter plot and fitted line

Getting ready How to do it... How it works… There's more...

Creating a line chart

Getting ready How to do it... How it works… There's more...

Graph annotation with ggplot

Getting ready How to do it... How it works...

13. Inspecting Large Datasets

Introduction Multivariate continuous data visualization

Getting ready How to do it… How it works… There's more… See also

Multivariate categorical data visualization

Getting ready How to do it… How it works… There's more…

Visualizing mixed data

Getting ready How to do it…

Zooming and filtering

Getting ready How to do it... How it works… There's more...

14. Three-dimensional Visualizations

Introduction Three-dimensional scatter plots

Getting ready How to do it… How it works… There's more… See also...

Three-dimensional scatter plots with a regression plane

Getting ready How to do it… How it works… There's more…

Three-dimensional bar charts

Getting ready How to do it… How it works…

Three-dimensional density plots

Getting ready How to do it... How it works…

15. Finalizing Graphs for Publications and Presentations

Introduction Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF

Getting ready How to do it... How it works... There's more See also

Exporting graphs in vector formats – SVG, PDF, and PS

Getting ready How to do it... How it works... There's more

Adding mathematical and scientific notations (typesetting)

Getting ready How to do it... How it works... There's more

Adding text descriptions to graphs

Getting ready How to do it... How it works... There's more

Using graph templates

Getting ready How to do it... How it works... There's more

Choosing font families and styles under Windows, Mac OS X, and Linux

Getting ready How to do it... How it works... There's more See also

Choosing fonts for PostScripts and PDFs

Getting ready How to do it... How it works... There's more

III. Module 3: Learning Data Mining with R

1. Warming Up

Big data

Scalability and efficiency

Data source

Data mining

Feature extraction Summarization The data mining process

CRISP-DM SEMMA

Social network mining

Social network

Text mining

Information retrieval and text mining Mining text for prediction

Web data mining

Why R?

What are the disadvantages of R?

Statistics

Statistics and data mining Statistics and machine learning Statistics and R The limitations of statistics on data mining

Machine learning

Approaches to machine learning Machine learning architecture

Data attributes and description

Numeric attributes Categorical attributes Data description Data measuring

Data cleaning

Missing values Junk, noisy data, or outlier

Data integration

Data dimension reduction

Eigenvalues and Eigenvectors Principal-Component Analysis Singular-value decomposition CUR decomposition

Data transformation and discretization

Data transformation Normalization data transformation methods Data discretization

Visualization of results

Visualization with R

2. Mining Frequent Patterns, Associations, and Correlations

An overview of associations and patterns

Patterns and pattern discovery

The frequent itemset The frequent subsequence The frequent substructures

Relationship or rules discovery

Association rules Correlation rules

Market basket analysis

The market basket model A-Priori algorithms

Input data characteristics and data structure The A-Priori algorithm The R implementation A-Priori algorithm variants

The Eclat algorithm

The R implementation

The FP-growth algorithm

Input data characteristics and data structure The FP-growth algorithm The R implementation

The GenMax algorithm with maximal frequent itemsets

The R implementation

The Charm algorithm with closed frequent itemsets

The R implementation

The algorithm to generate association rules

The R implementation

Hybrid association rules mining

Mining multilevel and multidimensional association rules Constraint-based frequent pattern mining

Mining sequence dataset

Sequence dataset The GSP algorithm

The R implementation

The SPADE algorithm

The R implementation

Rule generation from sequential patterns

High-performance algorithms

3. Classification

Classification Generic decision tree induction

Attribute selection measures Tree pruning General algorithm for the decision tree generation The R implementation

High-value credit card customers classification using ID3

The ID3 algorithm The R implementation Web attack detection High-value credit card customers classification

Web spam detection using C4.5

The C4.5 algorithm The R implementation A parallel version with MapReduce Web spam detection

Web key resource page judgment using CART

The CART algorithm The R implementation Web key resource page judgment

Trojan traffic identification method and Bayes classification

Estimating

Prior probability estimation Likelihood estimation

The Bayes classification The R implementation Trojan traffic identification method

Identify spam e-mail and Naïve Bayes classification

The Naïve Bayes classification The R implementation Identify spam e-mail

Rule-based classification of player types in computer games and rule-based classification

Transformation from decision tree to decision rules Rule-based classification Sequential covering algorithm The RIPPER algorithm

The R implementation

Rule-based classification of player types in computer games

4. Advanced Classification

Ensemble (EM) methods

The bagging algorithm The boosting and AdaBoost algorithms The Random forests algorithm The R implementation Parallel version with MapReduce

Biological traits and the Bayesian belief network

The Bayesian belief network (BBN) algorithm The R implementation Biological traits

Protein classification and the k-Nearest Neighbors algorithm

The kNN algorithm The R implementation

Document retrieval and Support Vector Machine

The SVM algorithm The R implementation Parallel version with MapReduce Document retrieval

Classification using frequent patterns

The associative classification

CBA

Discriminative frequent pattern-based classification The R implementation Text classification using sentential frequent itemsets

Classification using the backpropagation algorithm

The BP algorithm The R implementation Parallel version with MapReduce

5. Cluster Analysis

Search engines and the k-means algorithm

The k-means clustering algorithm

The kernel k-means algorithm

The k-modes algorithm

The R implementation

Parallel version with MapReduce

Search engine and web page clustering

Automatic abstraction of document texts and the k-medoids algorithm

The PAM algorithm The R implementation Automatic abstraction and summarization of document text

The CLARA algorithm

The CLARA algorithm The R implementation

CLARANS

The CLARANS algorithm The R implementation

Unsupervised image categorization and affinity propagation clustering

Affinity propagation clustering The R implementation Unsupervised image categorization The spectral clustering algorithm The R implementation

News categorization and hierarchical clustering

Agglomerative hierarchical clustering

The BIRCH algorithm

The chameleon algorithm

The Bayesian hierarchical clustering algorithm

The probabilistic hierarchical clustering algorithm

The R implementation

News categorization

6. Advanced Cluster Analysis

Customer categorization analysis of e-commerce and DBSCAN

The DBSCAN algorithm Customer categorization analysis of e-commerce

Clustering web pages and OPTICS

The OPTICS algorithm The R implementation Clustering web pages

Visitor analysis in the browser cache and DENCLUE

The DENCLUE algorithm The R implementation Visitor analysis in the browser cache

Recommendation system and STING

The STING algorithm The R implementation Recommendation systems

Web sentiment analysis and CLIQUE

The CLIQUE algorithm The R implementation Web sentiment analysis

Opinion mining and WAVE clustering

The WAVE cluster algorithm The R implementation Opinion mining

User search intent and the EM algorithm

The EM algorithm The R implementation The user search intent

Customer purchase data analysis and clustering high-dimensional data

The MAFIA algorithm The SURFING algorithm The R implementation Customer purchase data analysis

SNS and clustering graph and network data

The SCAN algorithm The R implementation Social networking service (SNS)

7. Outlier Detection

Credit card fraud detection and statistical methods

The likelihood-based outlier detection algorithm The R implementation Credit card fraud detection

Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

The NL algorithm

The FindAllOutsM algorithm

The FindAllOutsD algorithm

The distance-based algorithm

The Dolphin algorithm

The R implementation

Activity monitoring and the detection of mobile fraud

Intrusion detection and density-based methods

The OPTICS-OF algorithm The High Contrast Subspace algorithm The R implementation Intrusion detection

Intrusion detection and clustering-based methods

Hierarchical clustering to detect outliers The k-means-based algorithm The ODIN algorithm The R implementation

Monitoring the performance of the web server and classification-based methods

The OCSVM algorithm The one-class nearest neighbor algorithm The R implementation Monitoring the performance of the web server

Detecting novelty in text, topic detection, and mining contextual outliers

The conditional anomaly detection (CAD) algorithm The R implementation Detecting novelty in text and topic detection

Collective outliers on spatial data

The route outlier detection (ROD) algorithm The R implementation Characteristics of collective outliers

Outlier detection in high-dimensional data

The brute-force algorithm The HilOut algorithm The R implementation

8. Mining Stream, Time-series, and Sequence Data

The credit card transaction flow and STREAM algorithm

The STREAM algorithm The single-pass-any-time clustering algorithm The R implementation The credit card transaction flow

Predicting future prices and time-series analysis

The ARIMA algorithm Predicting future prices

Stock market data and time-series clustering and classification

The hError algorithm Time-series classification with the 1NN classifier The R implementation Stock market data

Web click streams and mining symbolic sequences

The TECNO-STREAMS algorithm The R implementation Web click streams

Mining sequence patterns in transactional databases

The PrefixSpan algorithm The R implementation

9. Graph Mining and Network Analysis

Graph mining

Graph

Graph mining algorithms

Mining frequent subgraph patterns

The gPLS algorithm

The GraphSig algorithm

The gSpan algorithm

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The R implementation

Social network mining

Community detection and the shingling algorithm The node classification and iterative classification algorithms The R implementation

10. Mining Text and Web Data

Text mining and TM packages

Text summarization

Topic representation The multidocument summarization algorithm The Maximal Marginal Relevance algorithm The R implementation

The question answering system

Genre categorization of web pages

Categorizing newspaper articles and newswires into topics

The N-gram-based text categorization The R implementation

Web usage mining with web logs

The FCA-based association rule mining algorithm The R implementation

IV. Module 4: Mastering R for Quantitative Finance

1. Time Series Analysis

Multivariate time series analysis

Cointegration Vector autoregressive models

VAR implementation example

Cointegrated VAR and VECM

Volatility modeling

GARCH modeling with the rugarch package

The standard GARCH model The Exponential GARCH model (EGARCH) The Threshold GARCH model (TGARCH)

Simulation and forecasting

References and reading list

2. Factor Models

Arbitrage pricing theory

Implementation of APT Fama-French three-factor model

Modeling in R

Data selection Estimation of APT with principal component analysis Estimation of the Fama-French model

References

3. Forecasting Volume

Motivation The intensity of trading The volume forecasting model Implementation in R

The data

Loading the data The seasonal component AR(1) estimation and forecasting SETAR estimation and forecasting Interpreting the results

References

4. Big Data – Advanced Analytics

Getting data from open sources Introduction to big data analysis in R K-means clustering on big data

Loading big matrices Big data K-means clustering analysis

Big data linear regression analysis

Loading big data Fitting a linear regression model on large datasets

References

5. FX Derivatives

Terminology and notations Currency options Exchange options

Two-dimensional Wiener processes The Margrabe formula Application in R

Quanto options

Pricing formula for a call quanto Pricing a call quanto in R

References

6. Interest Rate Derivatives and Models

The Black model

Pricing a cap with Black's model

The Vasicek model

The Cox-Ingersoll-Ross model

Parameter estimation of interest rate models

Using the SMFI5 package

References

7. Exotic Options

A general pricing approach

The role of dynamic hedging

How R can help a lot

A glance beyond vanillas

Greeks – the link back to the vanilla world

Pricing the Double-no-touch option

Another way to price the Double-no-touch option

The life of a Double-no-touch option – a simulation

Exotic options embedded in structured products

References

8. Optimal Hedging

Hedging of derivatives

Market risk of derivatives Static delta hedge Dynamic delta hedge Comparing the performance of delta hedging

Hedging in the presence of transaction costs

Optimization of the hedge Optimal hedging in the case of absolute transaction costs Optimal hedging in the case of relative transaction costs

Further extensions References

9. Fundamental Analysis

The basics of fundamental analysis

Collecting data

Revealing connections

Including multiple variables

Separating investment targets

Setting classification rules

Backtesting

Industry-specific investment

References

10. Technical Analysis, Neural Networks, and Logoptimal Portfolios

Market efficiency Technical analysis

The TA toolkit Markets Plotting charts - bitcoin Built-in indicators

SMA and EMA

RSI

MACD

Candle patterns: key reversal

Evaluating the signals and managing the position

A word on money management

Wrapping up

Neural networks

Forecasting bitcoin prices

Evaluation of the strategy

Logoptimal portfolios

A universally consistent, non-parametric investment strategy Evaluation of the strategy

References

11. Asset and Liability Management

Data preparation

Data source at first glance Cash-flow generator functions Preparing the cash-flow

Interest rate risk measurement Liquidity risk measurement Modeling non-maturity deposits

A Model of deposit interest rate development Static replication of non-maturity deposits

References

12. Capital Adequacy

Principles of the Basel Accords

Basel I Basel II

Minimum capital requirements Supervisory review Transparency

Basel III

Risk measures

Analytical VaR Historical VaR Monte-Carlo simulation

Risk categories

Market risk Credit risk Operational risk

References

13. Systemic Risks

Systemic risk in a nutshell The dataset used in our examples Core-periphery decomposition

Implementation in R Results

The simulation method

The simulation Implementation in R Results

Possible interpretations and suggestions References

V. Module 5: Machine Learning with R

1. Introducing Machine Learning

The origins of machine learning Uses and abuses of machine learning

Machine learning successes The limits of machine learning Machine learning ethics

How machines learn

Data storage Abstraction Generalization Evaluation

Machine learning in practice

Types of input data Types of machine learning algorithms Matching input data to algorithms

Machine learning with R

Installing R packages Loading and unloading R packages

2. Managing and Understanding Data

R data structures

Vectors Factors Lists Data frames Matrixes and arrays

Managing data with R

Saving, loading, and removing R data structures Importing and saving data from CSV files

Exploring and understanding data

Exploring the structure of data Exploring numeric variables

Measuring the central tendency – mean and median

Measuring spread – quartiles and the five-number summary

Visualizing numeric variables – boxplots

Visualizing numeric variables – histograms

Understanding numeric data – uniform and normal distributions

Measuring spread – variance and standard deviation

Exploring categorical variables

Measuring the central tendency – the mode

Exploring relationships between variables

Visualizing relationships – scatterplots Examining relationships – two-way cross-tabulations

3. Lazy Learning – Classification Using Nearest Neighbors

Understanding nearest neighbor classification

The k-NN algorithm

Measuring similarity with distance Choosing an appropriate k Preparing data for use with k-NN

Why is the k-NN algorithm lazy?

Example – diagnosing breast cancer with the k-NN algorithm

Step 1 – collecting data Step 2 – exploring and preparing the data

Transformation – normalizing numeric data Data preparation – creating training and test datasets

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Transformation – z-score standardization Testing alternative values of k

4. Probabilistic Learning – Classification Using Naive Bayes

Understanding Naive Bayes

Basic concepts of Bayesian methods

Understanding probability Understanding joint probability Computing conditional probability with Bayes' theorem

The Naive Bayes algorithm

Classification with Naive Bayes The Laplace estimator Using numeric features with Naive Bayes

Example – filtering mobile phone spam with the Naive Bayes algorithm

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – cleaning and standardizing text data Data preparation – splitting text documents into words Data preparation – creating training and test datasets Visualizing text data – word clouds Data preparation – creating indicator features for frequent words

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

5. Divide and Conquer – Classification Using Decision Trees and Rules

Understanding decision trees

Divide and conquer The C5.0 decision tree algorithm

Choosing the best split Pruning the decision tree

Example – identifying risky bank loans using C5.0 decision trees

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – creating random training and test datasets

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Boosting the accuracy of decision trees Making some mistakes more costly than others

Understanding classification rules

Separate and conquer The 1R algorithm The RIPPER algorithm Rules from decision trees What makes trees and rules greedy?

Example – identifying poisonous mushrooms with rule learners

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

6. Forecasting Numeric Data – Regression Methods

Understanding regression

Simple linear regression Ordinary least squares estimation Correlations Multiple linear regression

Example – predicting medical expenses using linear regression

Step 1 – collecting data Step 2 – exploring and preparing the data

Exploring relationships among features – the correlation matrix Visualizing relationships among features – the scatterplot matrix

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Model specification – adding non-linear relationships Transformation – converting a numeric variable to a binary indicator Model specification – adding interaction effects Putting it all together – an improved regression model

Understanding regression trees and model trees

Adding regression to trees

Example – estimating the quality of wines with regression trees and model trees

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data

Visualizing decision trees

Step 4 – evaluating model performance

Measuring performance with the mean absolute error

Step 5 – improving model performance

7. Black Box Methods – Neural Networks and Support Vector Machines

Understanding neural networks

From biological to artificial neurons Activation functions Network topology

The number of layers The direction of information travel The number of nodes in each layer

Training neural networks with backpropagation

Example – Modeling the strength of concrete with ANNs

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Understanding Support Vector Machines

Classification with hyperplanes

The case of linearly separable data The case of nonlinearly separable data

Using kernels for non-linear spaces

Example – performing OCR with SVMs

Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

8. Finding Patterns – Market Basket Analysis Using Association Rules

Understanding association rules

The Apriori algorithm for association rule learning Measuring rule interest – support and confidence Building a set of rules with the Apriori principle

Example – identifying frequently purchased groceries with association rules

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – creating a sparse matrix for transaction data Visualizing item support – item frequency plots Visualizing the transaction data – plotting the sparse matrix

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

Sorting the set of association rules Taking subsets of association rules Saving association rules to a file or data frame

9. Finding Groups of Data – Clustering with k-means

Understanding clustering

Clustering as a machine learning task The k-means clustering algorithm

Using distance to assign and update clusters Choosing the appropriate number of clusters

Example – finding teen market segments using k-means clustering

Step 1 – collecting data Step 2 – exploring and preparing the data

Data preparation – dummy coding missing values Data preparation – imputing the missing values

Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance

10. Evaluating Model Performance

Measuring performance for classification

Working with classification prediction data in R A closer look at confusion matrices Using confusion matrices to measure performance Beyond accuracy – other measures of performance

The kappa statistic Sensitivity and specificity Precision and recall The F-measure

Visualizing performance trade-offs

ROC curves

Estimating future performance

The holdout method

Cross-validation Bootstrap sampling

11. Improving Model Performance

Tuning stock models for better performance

Using caret for automated parameter tuning

Creating a simple tuned model Customizing the tuning process

Improving model performance with meta-learning

Understanding ensembles Bagging Boosting Random forests

Training random forests Evaluating random forest performance

12. Specialized Machine Learning Topics

Working with proprietary files and databases

Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files Querying data in SQL databases

Working with online data and services

Downloading the complete text of web pages Scraping data from web pages

Parsing XML documents Parsing JSON from web APIs

Working with domain-specific data

Analyzing bioinformatics data Analyzing and visualizing network data

Improving the performance of R

Managing very large datasets

Generalizing tabular data structures with dplyr Making data frames faster with data.table Creating disk-based data frames with ff Using massive matrices with bigmemory

Learning faster with parallel computing

Measuring execution time Working in parallel with multicore and snow Taking advantage of parallel with foreach and doParallel Parallel cloud computing with MapReduce and Hadoop

GPU computing Deploying optimized learning algorithms

Building bigger regression models with biglm Growing bigger and faster random forests with bigrf Training and evaluating models in parallel with caret

A. Reflect and Test Yourself Answers

Module 1: Data Analysis with R

Chapter 1: RefresheR Chapter 2: The Shape of Data Chapter 3: Describing Relationships Chapter 4: Probability Chapter 5: Using Data to Reason About the World Chapter 6: Testing Hypotheses Chapter 7: Bayesian Methods Chapter 8: Predicting Continuous Variables Chapter 9: Predicting Categorical Variables Chapter 10: Sources of Data Chapter 11: Dealing with Messy Data Chapter 12: Dealing with Large Data

Module 2: R Graphs

Chapter 1: R Graphics Chapter 2: Basic Graph Functions Chapter 3: Beyond the Basics – Adjusting Key Parameters Chapter 4: Creating Scatter Plots Chapter 5: Creating Line Graphs and Time Series Charts Chapter 6: Creating Bar, Dot, and Pie Charts Chapter 7: Creating Histograms Chapter 8: Box and Whisker Plots Chapter 9: Creating Heat Maps and Contour Plots

Module 4: Mastering R for Quantitative Finance

Chapter 1: Time Series Analysis Chapter 3: Forecasting Volume Chapter 4: Big Data – Advanced Analytics Chapter 5: FX Derivatives Chapter 6: Interest Rate Derivatives and Models Chapter 7: Exotic Options Chapter 8: Optimal Hedging Chapter 9: Fundamental Analysis

Module 5: Machine Learning with R

Chapter 1: Introducing Machine Learning Chapter 2: Managing and Understanding Data Chapter 3: Lazy Learning – Classification Using Nearest Neighbors Chapter 4: Probabilistic Learning – Classification Using Naive Bayes Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules Chapter 6: Forecasting Numeric Data – Regression Methods Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules

B. Bibliography Index
