Index
Title Page
Copyright
R Data Mining
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Why to Choose R for Your Data Mining and Where to Start
What is R?
A bit of history
R's points of strength
Open source inside
Plugin ready
Data visualization friendly
Installing R and writing R code
Downloading R
R installation for Windows and macOS
R installation for Linux OS
Main components of a base R installation
Possible alternatives to write and run R code
RStudio (all OSs)
The Jupyter Notebook (all OSs)
Visual Studio (Windows users only)
R foundational notions
A preliminary R session
Executing R interactively through the R console
Creating an R script
Executing an R script
Vectors
Lists
Creating lists
Subsetting lists
Data frames
Functions
R's weaknesses and how to overcome them
Learning R effectively and minimizing the effort
The tidyverse
Leveraging the R community to learn R
Where to find the R community
Engaging with the community to learn R
Handling large datasets with R
Further references
Summary
A First Primer on Data Mining - Analysing Your Bank Account Data
Acquiring and preparing your banking data
Data model
Summarizing your data with pivot-like tables
A gentle introduction to the pipe operator
An even more gentle introduction to the dplyr package
Installing the necessary packages and loading your data into R
Installing and loading the necessary packages
Importing your data into R
Defining the monthly and daily sum of expenses
Visualizing your data with ggplot2
Basic data visualization principles
Less but better
Not every chart is good for your message
Scatter plot
Line chart
Bar plot
Other advanced charts
Colors have to be chosen carefully
A bit of theory - chromatic circle, hue, and luminosity
Visualizing your data with ggplot
One more gentle introduction – the grammar of graphics
A layered grammar of graphics – ggplot2
Visualizing your banking movements with ggplot2
Visualizing the number of movements per day of the week
Further references
Summary
The Data Mining Process - CRISP-DM Methodology
The CRISP-DM methodology data mining cycle
Business understanding
Data understanding
Data collection
How to perform data collection with R
Data import from TXT and CSV files
Data import from different types of format already structured as tables
Data import from unstructured sources
Data description
How to perform data description with R
Data exploration
What to use in R to perform this task
The summary() function
Box plot
Histograms
Data preparation
Modelling
Defining a data modeling strategy
How similar problems were solved in the past
Emerging techniques
Classification of modeling problems
How to perform data modeling with R
Evaluation
Clustering evaluation
Classification evaluation
Regression evaluation
How to judge the adequacy of a model's performance
What to use in R to perform this task
Deployment
Deployment plan development
Maintenance plan development
Summary
Keeping the House Clean – The Data Mining Architecture
A general overview
Data sources
Types of data sources
Unstructured data sources
Structured data sources
Key issues of data sources
Databases and data warehouses
The third wheel – the data mart
One-level database
Two-level database
Three-level database
Technologies
SQL
MongoDB
Hadoop
The data mining engine
The interpreter
The interface between the engine and the data warehouse
The data mining algorithms
User interface
Clarity
Clarity and mystery
Clarity and simplicity
Efficiency
Consistency
Syntax highlight
Auto-completion
How to build a data mining architecture in R
Data sources
The data warehouse
The data mining engine
The interface between the engine and the data warehouse
The data mining algorithms
The user interface
Further references
Summary
How to Address a Data Mining Problem – Data Cleaning and Validation
On a quiet day
Data cleaning
Tidy data
Analysing the structure of our data
The str function
The describe function
head, tail, and View functions
Evaluating your data tidiness
Every row is a record
Every column shows an attribute
Every table represents an observational unit
Tidying our data
The tidyr package
Long versus wide data
The spread function
The gather function
The separate function
Applying tidyr to our dataset
Validating our data
Fitness for use
Conformance to standards
Data quality controls
Consistency checks
Data type checks
Logical checks
Domain checks
Uniqueness checks
Performing data validation on our data
Data type checks with str()
Domain checks
The final touch — data merging
left_join function
Moving beyond left_join
Further references
Summary
Looking into Your Data Eyes – Exploratory Data Analysis
Introducing summary EDA
Describing the population distribution
Quartiles and Median
Mean
The mean and phenomena going on within subpopulations
The mean being biased by outlier values
Computing the mean of our population
Variance
Standard deviation
Skewness
Measuring the relationship between variables
Correlation
The Pearson correlation coefficient
Distance correlation
Weaknesses of summary EDA - the Anscombe quartet
Graphical EDA
Visualizing a variable distribution
Histogram
Reporting date histogram
Geographical area histogram
Cash flow histogram
Boxplot
Checking for outliers
Visualizing relationships between variables
Scatterplots
Adding title, subtitle, and caption to the plot
Setting axis and legend
Adding explicative text to the plot
Final touches on colors
Further references
Summary
Our First Guess – a Linear Regression
Defining a data modelling strategy
Data modelling notions
Supervised learning
Unsupervised learning
The modeling strategy
Applying linear regression to our data
The intuition behind linear regression
The math behind the linear regression
Ordinary least squares technique
Model requirements – what to look for before applying the model
Residuals' uncorrelation
Residuals' homoscedasticity
How to apply linear regression in R
Fitting the linear regression model
Validating model assumptions
Visualizing fitted values
Preparing the data for visualization
Developing the data visualization
Further references
Summary
A Gentle Introduction to Model Performance Evaluation
Defining model performance
Fitting versus interpretability
Making predictions with models
Measuring performance in regression models
Mean squared error
R-squared
R-squared meaning and interpretation
R-squared computation in R
Adjusted R-squared
R-squared misconceptions
The R-squared doesn't measure the goodness of fit
A low R-squared doesn't mean your model is not statistically significant
Measuring the performance in classification problems
The confusion matrix
Confusion matrix in R
Accuracy
How to compute accuracy in R
Sensitivity
How to compute sensitivity in R
Specificity
How to compute specificity in R
How to choose the right performance statistics
A final general warning – training versus test datasets
Further references
Summary
Don't Give up – Power up Your Regression Including Multiple Variables
Moving from simple to multiple linear regression
Notation
Assumptions
Variables' collinearity
Tolerance
Variance inflation factors
Addressing collinearity
Dimensionality reduction
Stepwise regression
Backward stepwise regression
From the full model to the n-1 model
Forward stepwise regression
Double direction stepwise regression
Principal component regression
Fitting a multiple linear model with R
Model fitting
Variable assumptions validation
Residual assumptions validation
Dimensionality reduction
Principal component regression
Stepwise regression
Linear model cheat sheet
Further references
Summary
A Different Outlook to Problems with Classification Models
What is classification and why do we need it?
Linear regression limitations for categorical variables
Common classification algorithms and models
Logistic regression
The intuition behind logistic regression
The logistic function estimates a response variable enclosed within an upper and lower bound
The logistic function estimates the probability of an observation pertaining to one of the two available categories
The math behind logistic regression
Maximum likelihood estimator
Model assumptions
Absence of multicollinearity between variables
Linear relationship between explanatory variables and log odds
Large enough sample size
How to apply logistic regression in R
Fitting the model
Reading the glm() estimation output
The level of statistical significance of the association between the explanatory variable and the response variable
The AIC performance metric
Validating model assumptions
Fitting quadratic and cubic models to test for linearity of log odds
Visualizing and interpreting logistic regression results
Visualizing results
Interpreting results
Logistic regression cheat sheet
Support vector machines
The intuition behind support vector machines
The hyperplane
Maximal margin classifier
Support vector and support vector machines
Model assumptions
Independent and identically distributed random variables
Independent variables
Identically distributed
Applying support vector machines in R
The svm() function
Applying the svm function to our data
Interpreting support vector machine results
Understanding the meaning of hyperplane weights
Support Vector Machine cheat sheet
References
Summary
The Final Clash – Random Forests and Ensemble Learning
Random forest
Random forest building blocks – decision trees introduction
The intuition behind random forests
How to apply random forests in R
Evaluating the results of the model
Performance of the model
OOB estimate error rate
Confusion matrix
Importance of predictors
Mean decrease in accuracy
Gini index
Plotting relative importance of predictors
Random forest cheat sheet
Ensemble learning
Basic ensemble learning techniques
Applying ensemble learning to our data in R
The R caret package
Computing a confusion matrix with the caret package
Interpreting confusion matrix results
Applying a weighted majority vote to our data
Applying estimated models on new data
predict.glm() for prediction from the logistic model
predict.randomForest() for prediction from random forests
predict.svm() for prediction from support vector machines
A more structured approach to predictive analytics
Applying the majority vote ensemble technique on predicted data
Further references
Summary
Looking for the Culprit – Text Data Mining with R
Extracting data from a PDF file in R
Getting a list of documents in a folder
Reading PDF files into R via pdf_text()
Iteratively extracting text from a set of documents with a for loop
Sentiment analysis
Developing wordclouds from text
Looking for context in text – analyzing document n-grams
Performing network analysis on textual data
Obtaining an edge list from a data frame
Visualizing a network with the ggraph package
Tuning the appearance of nodes and edges
Computing the degree parameter in a network to highlight relevant nodes
Further references
Summary
Sharing Your Stories with Your Stakeholders through R Markdown
Principles of a good data mining report
Clearly state the objectives
Clearly state assumptions
Make the data treatments clear
Show consistent data
Provide data lineage
Set up an rmarkdown report
Develop an R markdown report in RStudio
A brief introduction to markdown
Inserting a chunk of code
How to show readable tables in rmarkdown reports
Reproducing R code output within text through inline code
Introduction to Shiny and the reactivity framework
Employing input and output to deal with changes in Shiny app parameters
Adding an interactive data lineage module
Adding an input panel to an R markdown report
Adding a data table to your report
Expanding Shiny beyond the basics
Rendering and sharing an R markdown report
Rendering an R markdown report
Sharing an R Markdown report
Render a static markdown report into different file formats
Render interactive Shiny apps on dedicated servers
Sharing a Shiny app through shinyapps.io
Further references
Summary
Epilogue
Dealing with Dates, Relative Paths and Functions
Dealing with dates in R
Working directories and relative paths in R
Conditional statements