R: Data Analysis and Visualization
Table of Contents R: Data Analysis and Visualization
Meet Your Course Guide Course Structure Course journey The Course Roadmap and Timeline
I. Module 1: Data Analysis with R
1. RefresheR
Navigating the basics
Arithmetic and assignment Logicals and characters Flow of control
Getting help in R Vectors
Subsetting Vectorized functions Advanced subsetting Recycling
Functions Matrices Loading data into R Working with packages
2. The Shape of Data
Univariate data Frequency distributions Central tendency Spread Populations, samples, and estimation Probability distributions Visualization methods
3. Describing Relationships
Multivariate data Relationships between a categorical and a continuous variable Relationships between two categorical variables The relationship between two continuous variables
Covariance Correlation coefficients Comparing multiple correlations
Visualization methods
Categorical and continuous variables Two categorical variables Two continuous variables More than two continuous variables
4. Probability
Basic probability A tale of two interpretations Sampling from distributions
Parameters The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
5. Using Data to Reason About the World
Estimating means The sampling distribution Interval estimation
How did we get 1.96?
Smaller samples
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests When things go wrong A warning about significance A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled! Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions What if my assumptions are unfounded?
7. Bayesian Methods
The big idea behind Bayesian analysis Choosing a prior Who cares about coin flips Enter MCMC – stage left Using JAGS and runjags Fitting distributions the Bayesian way The Bayesian independent samples t-test
8. Predicting Continuous Variables
Linear models Simple linear regression Simple linear regression with a binary predictor
A word of warning
Multiple regression Regression with a non-binary predictor Kitchen sink regression The bias-variance trade-off
Cross-validation Striking a balance
Linear regression diagnostics
Second Anscombe relationship Third Anscombe relationship Fourth Anscombe relationship
Advanced topics
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees Random forests Choosing a classifier
The vertical decision boundary The diagonal decision boundary The crescent decision boundary The circular decision boundary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON XML Other data formats Online repositories
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis Pairwise deletion Mean substitution Hot deck imputation Regression imputation Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data Checking the data type of a column Checking for unexpected categories Checking for outliers, entry errors, or unlikely data points Chaining assertions
Other messiness
OpenRefine Regular expressions tidyr
12. Dealing with Large Data
Wait to optimize Using a bigger and faster machine Be smart about your code
Allocation of memory Vectorization
Using optimized packages Using another R implementation Use parallelization
Getting started with parallel R An example of (some) substance
Using Rcpp Be smarter about your code
13. Reproducibility and Best Practices
R Scripting
RStudio Running R scripts An example script Scripting and reproducibility
R projects Version control Communicating results
II. Module 2: R Graphs
1. R Graphics
Base graphics using the default package Trellis graphs using lattice Graphs inspired by Grammar of Graphics
2. Basic Graph Functions
Introduction Creating basic scatter plots
Getting ready How to do it... How it works... There's more...
A note on R's built-in datasets
See also
Creating line graphs
Getting ready How to do it... How it works... There's more... See also
Creating bar charts
Getting ready How to do it... How it works... There's more... See also
Creating histograms and density plots
How to do it... How it works... There's more... See also
Creating box plots
Getting ready How to do it... How it works... There's more... See also
Adjusting x and y axes' limits
How to do it... How it works... There's more... See also
Creating heat maps
How to do it... How it works... There's more... See also
Creating pairs plots
How to do it... How it works... There's more... See also
Creating multiple plot matrix layouts
How to do it... How it works... There's more... See also
Adding and formatting legends
Getting ready How to do it... How it works... There's more... See also
Creating graphs with maps
Getting ready How to do it... How it works... There's more... See also
Saving and exporting graphs
How to do it... How it works... There's more... See also
3. Beyond the Basics – Adjusting Key Parameters
Introduction Setting colors of points, lines, and bars
Getting ready How to do it... How it works... There's more... See also
Setting plot background colors
Getting ready How to do it... How it works... There's more...
Setting colors for text elements – axis annotations, labels, plot titles, and legends
Getting ready How to do it... How it works... There's more...
Choosing color combinations and palettes
Getting ready How to do it... How it works... There's more... See also
Setting fonts for annotations and titles
Getting ready How to do it... How it works... There's more... See also
Choosing plotting point symbol styles and sizes
Getting ready How to do it... How it works... There's more... See also
Choosing line styles and width
Getting ready How to do it... How it works... See also
Choosing box styles
Getting ready How to do it... How it works... There's more...
Adjusting axis annotations and tick marks
Getting ready How to do it... How it works... There's more... See also
Formatting log axes
Getting ready How to do it... How it works... There's more...
Setting graph margins and dimensions
Getting ready How to do it... How it works... See also
4. Creating Scatter Plots
Introduction Grouping data points within a scatter plot
Getting ready How to do it... How it works... There's more... See also
Highlighting grouped data points by size and symbol type
Getting ready How to do it... How it works...
Labeling data points
Getting ready How to do it... How it works... There's more...
Correlation matrix using pairs plots
Getting ready How to do it... How it works...
Adding error bars
Getting ready How to do it... How it works... There's more...
Using jitter to distinguish closely packed data points
Getting ready How to do it... How it works...
Adding linear model lines
Getting ready How to do it... How it works...
Adding nonlinear model curves
Getting ready How to do it... How it works...
Adding nonparametric model curves with lowess
Getting ready How to do it... How it works...
Creating three-dimensional scatter plots
Getting ready How to do it... How it works... There's more...
Creating Quantile-Quantile plots
Getting ready How to do it... How it works... There's more...
Displaying the data density on axes
Getting ready How to do it... How it works... There's more...
Creating scatter plots with a smoothed density representation
Getting ready How to do it... How it works... There's more...
5. Creating Line Graphs and Time Series Charts
Introduction Adding customized legends for multiple-line graphs
Getting ready How to do it... How it works... There's more... See also
Using margin labels instead of legends for multiple-line graphs
Getting ready How to do it... How it works... There's more...
Adding horizontal and vertical grid lines
Getting ready How to do it... How it works... There's more... See also
Adding marker lines at specific x and y values using abline
Getting ready How to do it... How it works... There's more...
Creating sparklines
Getting ready How to do it... How it works...
Plotting functions of a variable in a dataset
Getting ready How to do it... How it works... There's more...
Formatting time series data for plotting
Getting ready How to do it... How it works... There's more...
Plotting the date or time variable on the x axis
Getting ready How to do it... How it works... There's more...
Annotating axis labels in different human-readable time formats
Getting ready How to do it... How it works... There's more...
Adding vertical markers to indicate specific time events
Getting ready How to do it... How it works... There's more...
Plotting data with varying time-averaging periods
Getting ready How to do it... How it works...
Creating stock charts
Getting ready How to do it... How it works... There's more...
6. Creating Bar, Dot, and Pie Charts
Introduction Creating bar charts with more than one factor variable
Getting ready How to do it... How it works... See also
Creating stacked bar charts
Getting ready How to do it... How it works... There's more...
Adjusting the orientation of bars – horizontal and vertical
Getting ready How to do it... How it works... There's more...
Adjusting bar widths, spacing, colors, and borders
Getting ready How to do it... How it works... There's more...
Displaying values on top of or next to the bars
Getting ready How to do it... How it works... There's more... See also
Placing labels inside bars
Getting ready How to do it... How it works... There's more...
Creating bar charts with vertical error bars
Getting ready How to do it... How it works... There's more...
Modifying dot charts by grouping variables
Getting ready How to do it... How it works...
Making better, readable pie charts with clockwise-ordered slices
Getting ready How to do it... How it works... See also
Labeling a pie chart with percentage values for each slice
Getting ready How it works... There's more... See also
Adding a legend to a pie chart
Getting ready How to do it... How it works... There's more...
7. Creating Histograms
Introduction Visualizing distributions as count frequencies or probability densities
Getting ready How to do it... How it works... There's more
Setting the bin size and the number of breaks
Getting ready How to do it... How it works... There's more
Adjusting histogram styles – bar colors, borders, and axes
Getting ready How to do it... How it works... There's more
Overlaying a density line over a histogram
Getting ready How to do it... How it works...
Multiple histograms along the diagonal of a pairs plot
Getting ready How to do it... How it works...
Histograms in the margins of line and scatter plots
Getting ready How to do it... How it works...
8. Box and Whisker Plots
Introduction Creating box plots with narrow boxes for a small number of variables
Getting ready How to do it... How it works... There's more See also
Grouping over a variable
Getting ready How to do it... How it works... There's more See also
Varying box widths by the number of observations
Getting ready How to do it... How it works...
Creating box plots with notches
Getting ready How to do it... How it works... There's more
Including or excluding outliers
Getting ready How to do it... How it works... See also
Creating horizontal box plots
Getting ready How to do it... How it works...
Changing the box styling
Getting ready How to do it... How it works... There's more
Adjusting the extent of plot whiskers outside the box
Getting ready How to do it... How it works... There's more
Showing the number of observations
Getting ready How to do it... How it works... There's more
Splitting a variable at arbitrary values into subsets
Getting ready How to do it... How it works... There's more
9. Creating Heat Maps and Contour Plots
Introduction Creating heat maps of a single Z variable with a scale
Getting ready How to do it... How it works... There's more See also
Creating correlation heat maps
Getting ready How to do it... How it works... There's more
Summarizing multivariate data in a single heat map
Getting ready How to do it... How it works... There's more
Creating contour plots
Getting ready How to do it... How it works... There's more See also
Creating filled contour plots
Getting ready How to do it... How it works... There's more See also
Creating three-dimensional surface plots
Getting ready How to do it... How it works... There's more
Visualizing time series as calendar heat maps
Getting ready How to do it... How it works... There's more
10. Creating Maps
Introduction Plotting global data by countries on a world map
Getting ready How to do it... How it works... There's more See also
Creating graphs with regional maps
Getting ready How to do it... How it works... There's more
Plotting data on Google maps
Getting ready How to do it... How it works... There's more See also
Creating and reading KML data
Getting ready How to do it... How it works... See Also
Working with ESRI shapefiles
Getting ready How to do it... How it works... There's more
11. Data Visualization Using Lattice
Introduction Creating bar charts
Getting ready How to do it… How it works… There's more… See also
Creating stacked bar charts
Getting ready How to do it… How it works… There's more… See also
Creating bar charts to visualize cross-tabulation
Getting ready How to do it… How it works… There's more…
Creating a conditional histogram
Getting ready How to do it… How it works… There's more… See also
Visualizing distributions through a kernel-density plot
Getting ready How to do it… How it works… There's more…
Creating a normal Q-Q plot
Getting ready How to do it… How it works… There's more…
Visualizing an empirical Cumulative Distribution Function
Getting ready How to do it… How it works… There's more…
Creating a boxplot
Getting ready How to do it… How it works… There's more…
Creating a conditional scatter plot
Getting ready How to do it… How it works… There's more…
12. Data Visualization Using ggplot2
Introduction Creating bar charts
Getting ready How to do it… How it works… There's more… See also
Creating multiple bar charts
Getting ready How to do it… How it works… There's more… See also
Creating a bar chart with error bars
Getting ready How to do it… How it works… There's more…
Visualizing the density of a numeric variable
Getting ready How to do it... How it works… There's more...
Creating a box plot
Getting ready How to do it... How it works…
Creating a layered plot with a scatter plot and fitted line
Getting ready How to do it... How it works… There's more...
Creating a line chart
Getting ready How to do it... How it works… There's more...
Graph annotation with ggplot
Getting ready How to do it... How it works...
13. Inspecting Large Datasets
Introduction Multivariate continuous data visualization
Getting ready How to do it… How it works… There's more… See also
Multivariate categorical data visualization
Getting ready How to do it… How it works… There's more…
Visualizing mixed data
Getting ready How to do it…
Zooming and filtering
Getting ready How to do it... How it works… There's more...
14. Three-dimensional Visualizations
Introduction Three-dimensional scatter plots
Getting ready How to do it… How it works… There's more… See also...
Three-dimensional scatter plots with a regression plane
Getting ready How to do it… How it works… There's more…
Three-dimensional bar charts
Getting ready How to do it… How it works…
Three-dimensional density plots
Getting ready How to do it... How it works…
15. Finalizing Graphs for Publications and Presentations
Introduction Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF
Getting ready How to do it... How it works... There's more See also
Exporting graphs in vector formats – SVG, PDF, and PS
Getting ready How to do it... How it works... There's more
Adding mathematical and scientific notations (typesetting)
Getting ready How to do it... How it works... There's more
Adding text descriptions to graphs
Getting ready How to do it... How it works... There's more
Using graph templates
Getting ready How to do it... How it works... There's more
Choosing font families and styles under Windows, Mac OS X, and Linux
Getting ready How to do it... How it works... There's more See also
Choosing fonts for PostScripts and PDFs
Getting ready How to do it... How it works... There's more
III. Module 3: Learning Data Mining with R
1. Warming Up
Big data
Scalability and efficiency
Data source Data mining
Feature extraction Summarization The data mining process
CRISP-DM SEMMA
Social network mining
Social network
Text mining
Information retrieval and text mining Mining text for prediction
Web data mining Why R?
What are the disadvantages of R?
Statistics
Statistics and data mining Statistics and machine learning Statistics and R The limitations of statistics on data mining
Machine learning
Approaches to machine learning Machine learning architecture
Data attributes and description
Numeric attributes Categorical attributes Data description Data measuring
Data cleaning
Missing values Junk, noisy data, or outlier
Data integration Data dimension reduction
Eigenvalues and Eigenvectors Principal-Component Analysis Singular-value decomposition CUR decomposition
Data transformation and discretization
Data transformation Normalization data transformation methods Data discretization
Visualization of results
Visualization with R
2. Mining Frequent Patterns, Associations, and Correlations
An overview of associations and patterns
Patterns and pattern discovery
The frequent itemset The frequent subsequence The frequent substructures
Relationship or rules discovery
Association rules Correlation rules
Market basket analysis
The market basket model A-Priori algorithms
Input data characteristics and data structure The A-Priori algorithm The R implementation A-Priori algorithm variants
The Eclat algorithm
The R implementation
The FP-growth algorithm
Input data characteristics and data structure The FP-growth algorithm The R implementation
The GenMax algorithm with maximal frequent itemsets
The R implementation
The Charm algorithm with closed frequent itemsets
The R implementation
The algorithm to generate association rules
The R implementation
Hybrid association rules mining
Mining multilevel and multidimensional association rules Constraint-based frequent pattern mining
Mining sequence dataset
Sequence dataset The GSP algorithm
The R implementation
The SPADE algorithm
The R implementation
Rule generation from sequential patterns
High-performance algorithms
3. Classification
Classification Generic decision tree induction
Attribute selection measures Tree pruning General algorithm for the decision tree generation The R implementation
High-value credit card customers classification using ID3
The ID3 algorithm The R implementation Web attack detection High-value credit card customers classification
Web spam detection using C4.5
The C4.5 algorithm The R implementation A parallel version with MapReduce Web spam detection
Web key resource page judgment using CART
The CART algorithm The R implementation Web key resource page judgment
Trojan traffic identification method and Bayes classification
Estimating
Prior probability estimation Likelihood estimation
The Bayes classification The R implementation Trojan traffic identification method
Identify spam e-mail and Naïve Bayes classification
The Naïve Bayes classification The R implementation Identify spam e-mail
Rule-based classification of player types in computer games and rule-based classification
Transformation from decision tree to decision rules Rule-based classification Sequential covering algorithm The RIPPER algorithm
The R implementation
Rule-based classification of player types in computer games
4. Advanced Classification
Ensemble (EM) methods
The bagging algorithm The boosting and AdaBoost algorithms The Random forests algorithm The R implementation Parallel version with MapReduce
Biological traits and the Bayesian belief network
The Bayesian belief network (BBN) algorithm The R implementation Biological traits
Protein classification and the k-Nearest Neighbors algorithm
The kNN algorithm The R implementation
Document retrieval and Support Vector Machine
The SVM algorithm The R implementation Parallel version with MapReduce Document retrieval
Classification using frequent patterns
The associative classification
CBA
Discriminative frequent pattern-based classification The R implementation Text classification using sentential frequent itemsets
Classification using the backpropagation algorithm
The BP algorithm The R implementation Parallel version with MapReduce
5. Cluster Analysis
Search engines and the k-means algorithm
The k-means clustering algorithm The kernel k-means algorithm The k-modes algorithm The R implementation Parallel version with MapReduce Search engine and web page clustering
Automatic abstraction of document texts and the k-medoids algorithm
The PAM algorithm The R implementation Automatic abstraction and summarization of document text
The CLARA algorithm
The CLARA algorithm The R implementation
CLARANS
The CLARANS algorithm The R implementation
Unsupervised image categorization and affinity propagation clustering
Affinity propagation clustering The R implementation Unsupervised image categorization The spectral clustering algorithm The R implementation
News categorization and hierarchical clustering
Agglomerative hierarchical clustering The BIRCH algorithm The chameleon algorithm The Bayesian hierarchical clustering algorithm The probabilistic hierarchical clustering algorithm The R implementation News categorization
6. Advanced Cluster Analysis
Customer categorization analysis of e-commerce and DBSCAN
The DBSCAN algorithm Customer categorization analysis of e-commerce
Clustering web pages and OPTICS
The OPTICS algorithm The R implementation Clustering web pages
Visitor analysis in the browser cache and DENCLUE
The DENCLUE algorithm The R implementation Visitor analysis in the browser cache
Recommendation system and STING
The STING algorithm The R implementation Recommendation systems
Web sentiment analysis and CLIQUE
The CLIQUE algorithm The R implementation Web sentiment analysis
Opinion mining and WAVE clustering
The WAVE cluster algorithm The R implementation Opinion mining
User search intent and the EM algorithm
The EM algorithm The R implementation The user search intent
Customer purchase data analysis and clustering high-dimensional data
The MAFIA algorithm The SURFING algorithm The R implementation Customer purchase data analysis
SNS and clustering graph and network data
The SCAN algorithm The R implementation Social networking service (SNS)
7. Outlier Detection
Credit card fraud detection and statistical methods
The likelihood-based outlier detection algorithm The R implementation Credit card fraud detection
Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
The NL algorithm The FindAllOutsM algorithm The FindAllOutsD algorithm The distance-based algorithm The Dolphin algorithm The R implementation Activity monitoring and the detection of mobile fraud
Intrusion detection and density-based methods
The OPTICS-OF algorithm The High Contrast Subspace algorithm The R implementation Intrusion detection
Intrusion detection and clustering-based methods
Hierarchical clustering to detect outliers The k-means-based algorithm The ODIN algorithm The R implementation
Monitoring the performance of the web server and classification-based methods
The OCSVM algorithm The one-class nearest neighbor algorithm The R implementation Monitoring the performance of the web server
Detecting novelty in text, topic detection, and mining contextual outliers
The conditional anomaly detection (CAD) algorithm The R implementation Detecting novelty in text and topic detection
Collective outliers on spatial data
The route outlier detection (ROD) algorithm The R implementation Characteristics of collective outliers
Outlier detection in high-dimensional data
The brute-force algorithm The HilOut algorithm The R implementation
8. Mining Stream, Time-series, and Sequence Data
The credit card transaction flow and STREAM algorithm
The STREAM algorithm The single-pass-any-time clustering algorithm The R implementation The credit card transaction flow
Predicting future prices and time-series analysis
The ARIMA algorithm Predicting future prices
Stock market data and time-series clustering and classification
The hError algorithm Time-series classification with the 1NN classifier The R implementation Stock market data
Web click streams and mining symbolic sequences
The TECNO-STREAMS algorithm The R implementation Web click streams
Mining sequence patterns in transactional databases
The PrefixSpan algorithm The R implementation
9. Graph Mining and Network Analysis
Graph mining
Graph Graph mining algorithms
Mining frequent subgraph patterns
The gPLS algorithm The GraphSig algorithm The gSpan algorithm Rightmost path extensions and their supports The subgraph isomorphism enumeration algorithm The canonical checking algorithm The R implementation
Social network mining
Community detection and the shingling algorithm The node classification and iterative classification algorithms The R implementation
10. Mining Text and Web Data
Text mining and TM packages Text summarization
Topic representation The multidocument summarization algorithm The Maximal Marginal Relevance algorithm The R implementation
The question answering system Genre categorization of web pages Categorizing newspaper articles and newswires into topics
The N-gram-based text categorization The R implementation
Web usage mining with web logs
The FCA-based association rule mining algorithm The R implementation
IV. Module 4: Mastering R for Quantitative Finance
1. Time Series Analysis
Multivariate time series analysis
Cointegration Vector autoregressive models
VAR implementation example
Cointegrated VAR and VECM
Volatility modeling
GARCH modeling with the rugarch package
The standard GARCH model The Exponential GARCH model (EGARCH) The Threshold GARCH model (TGARCH)
Simulation and forecasting
References and reading list
2. Factor Models
Arbitrage pricing theory
Implementation of APT Fama-French three-factor model
Modeling in R
Data selection Estimation of APT with principal component analysis Estimation of the Fama-French model
References
3. Forecasting Volume
Motivation The intensity of trading The volume forecasting model Implementation in R
The data
Loading the data The seasonal component AR(1) estimation and forecasting SETAR estimation and forecasting Interpreting the results
References
4. Big Data – Advanced Analytics
Getting data from open sources Introduction to big data analysis in R K-means clustering on big data
Loading big matrices Big data K-means clustering analysis
Big data linear regression analysis
Loading big data Fitting a linear regression model on large datasets
References
5. FX Derivatives
Terminology and notations Currency options Exchange options
Two-dimensional Wiener processes The Margrabe formula Application in R
Quanto options
Pricing formula for a call quanto Pricing a call quanto in R
References
6. Interest Rate Derivatives and Models
The Black model
Pricing a cap with Black's model
The Vasicek model The Cox-Ingersoll-Ross model Parameter estimation of interest rate models Using the SMFI5 package References
7. Exotic Options
A general pricing approach The role of dynamic hedging How R can help a lot A glance beyond vanillas Greeks – the link back to the vanilla world Pricing the Double-no-touch option Another way to price the Double-no-touch option The life of a Double-no-touch option – a simulation Exotic options embedded in structured products References
8. Optimal Hedging
Hedging of derivatives
Market risk of derivatives Static delta hedge Dynamic delta hedge Comparing the performance of delta hedging
Hedging in the presence of transaction costs
Optimization of the hedge Optimal hedging in the case of absolute transaction costs Optimal hedging in the case of relative transaction costs
Further extensions References
9. Fundamental Analysis
The basics of fundamental analysis Collecting data Revealing connections Including multiple variables Separating investment targets Setting classification rules Backtesting Industry-specific investment References
10. Technical Analysis, Neural Networks, and Logoptimal Portfolios
Market efficiency Technical analysis
The TA toolkit Markets Plotting charts - bitcoin Built-in indicators
SMA and EMA RSI MACD
Candle patterns: key reversal Evaluating the signals and managing the position A word on money management Wrapping up
Neural networks
Forecasting bitcoin prices
Evaluation of the strategy
Logoptimal portfolios
A universally consistent, non-parametric investment strategy Evaluation of the strategy
References
11. Asset and Liability Management
Data preparation
Data source at first glance Cash-flow generator functions Preparing the cash-flow
Interest rate risk measurement Liquidity risk measurement Modeling non-maturity deposits
A Model of deposit interest rate development Static replication of non-maturity deposits
References
12. Capital Adequacy
Principles of the Basel Accords
Basel I Basel II
Minimum capital requirements Supervisory review Transparency
Basel III
Risk measures
Analytical VaR Historical VaR Monte-Carlo simulation
Risk categories
Market risk Credit risk Operational risk
References
13. Systemic Risks
Systemic risk in a nutshell The dataset used in our examples Core-periphery decomposition
Implementation in R Results
The simulation method
The simulation Implementation in R Results
Possible interpretations and suggestions References
V. Module 5: Machine Learning with R
1. Introducing Machine Learning
The origins of machine learning Uses and abuses of machine learning
Machine learning successes The limits of machine learning Machine learning ethics
How machines learn
Data storage Abstraction Generalization Evaluation
Machine learning in practice
Types of input data Types of machine learning algorithms Matching input data to algorithms
Machine learning with R
Installing R packages Loading and unloading R packages
2. Managing and Understanding Data
R data structures
Vectors Factors Lists Data frames Matrixes and arrays
Managing data with R
Saving, loading, and removing R data structures Importing and saving data from CSV files
Exploring and understanding data
Exploring the structure of data Exploring numeric variables
Measuring the central tendency – mean and median Measuring spread – quartiles and the five-number summary Visualizing numeric variables – boxplots Visualizing numeric variables – histograms Understanding numeric data – uniform and normal distributions Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots Examining relationships – two-way cross-tabulations
3. Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance Choosing an appropriate k Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data Step 2 – exploring and preparing the data
Transformation – normalizing numeric data Data preparation – creating training and test datasets
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
Transformation – z-score standardization Testing alternative values of k
4. Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability Understanding joint probability Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes The Laplace estimator Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data Data preparation – splitting text documents into words Data preparation – creating training and test datasets Visualizing text data – word clouds Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
5. Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer The C5.0 decision tree algorithm
Choosing the best split Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
Boosting the accuracy of decision trees Making some mistakes more costly than others
Understanding classification rules
Separate and conquer The 1R algorithm The RIPPER algorithm Rules from decision trees What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
6. Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression Ordinary least squares estimation Correlations Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
Model specification – adding non-linear relationships Transformation – converting a numeric variable to a binary indicator Model specification – adding interaction effects Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
7. Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons Activation functions Network topology
The number of layers The direction of information travel The number of nodes in each layer
Training neural networks with backpropagation
Example – Modeling the strength of concrete with ANNs
Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
The case of linearly separable data The case of nonlinearly separable data
Using kernels for non-linear spaces
Example – performing OCR with SVMs
Step 1 – collecting data Step 2 – exploring and preparing the data Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
8. Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning Measuring rule interest – support and confidence Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data Visualizing item support – item frequency plots Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
Sorting the set of association rules Taking subsets of association rules Saving association rules to a file or data frame
9. Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task The k-means clustering algorithm
Using distance to assign and update clusters Choosing the appropriate number of clusters
Example – finding teen market segments using k-means clustering
Step 1 – collecting data Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values Data preparation – imputing the missing values
Step 3 – training a model on the data Step 4 – evaluating model performance Step 5 – improving model performance
10. Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R A closer look at confusion matrices Using confusion matrices to measure performance Beyond accuracy – other measures of performance
The kappa statistic Sensitivity and specificity Precision and recall The F-measure
Visualizing performance trade-offs
ROC curves
Estimating future performance
The holdout method
Cross-validation Bootstrap sampling
11. Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles Bagging Boosting Random forests
Training random forests Evaluating random forest performance
12. Specialized Machine Learning Topics
Working with proprietary files and databases
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files Querying data in SQL databases
Working with online data and services
Downloading the complete text of web pages Scraping data from web pages
Parsing XML documents Parsing JSON from web APIs
Working with domain-specific data
Analyzing bioinformatics data Analyzing and visualizing network data
Improving the performance of R
Managing very large datasets
Generalizing tabular data structures with dplyr Making data frames faster with data.table Creating disk-based data frames with ff Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time Working in parallel with multicore and snow Taking advantage of parallel with foreach and doParallel Parallel cloud computing with MapReduce and Hadoop
GPU computing Deploying optimized learning algorithms
Building bigger regression models with biglm Growing bigger and faster random forests with bigrf Training and evaluating models in parallel with caret
A. Reflect and Test Yourself Answers
Module 1: Data Analysis with R
Chapter 1: RefresheR Chapter 2: The Shape of Data Chapter 3: Describing Relationships Chapter 4: Probability Chapter 5: Using Data to Reason About the World Chapter 6: Testing Hypotheses Chapter 7: Bayesian Methods Chapter 8: Predicting Continuous Variables Chapter 9: Predicting Categorical Variables Chapter 10: Sources of Data Chapter 11: Dealing with Messy Data Chapter 12: Dealing with Large Data
Module 2: R Graphs
Chapter 1: R Graphics Chapter 2: Basic Graph Functions Chapter 3: Beyond the Basics – Adjusting Key Parameters Chapter 4: Creating Scatter Plots Chapter 5: Creating Line Graphs and Time Series Charts Chapter 6: Creating Bar, Dot, and Pie Charts Chapter 7: Creating Histograms Chapter 8: Box and Whisker Plots Chapter 9: Creating Heat Maps and Contour Plots
Module 4: Mastering R for Quantitative Finance
Chapter 1: Time Series Analysis Chapter 3: Forecasting Volume Chapter 4: Big Data – Advanced Analytics Chapter 5: FX Derivatives Chapter 6: Interest Rate Derivatives and Models Chapter 7: Exotic Options Chapter 8: Optimal Hedging Chapter 9: Fundamental Analysis
Module 5: Machine Learning with R
Chapter 1: Introducing Machine Learning Chapter 2: Managing and Understanding Data Chapter 3: Lazy Learning – Classification Using Nearest Neighbors Chapter 4: Probabilistic Learning – Classification Using Naive Bayes Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules Chapter 6: Forecasting Numeric Data – Regression Methods Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
B. Bibliography Index