R Data Mining by Cirillo, Andrea

Index

Title Page Copyright

R Data Mining

Credits About the Author About the Reviewers www.PacktPub.com

Why subscribe?

Customer Feedback Preface

What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support

Downloading the example code Downloading the color images of this book Errata Piracy Questions

Why to Choose R for Your Data Mining and Where to Start

What is R? A bit of history R's points of strength

Open source inside Plugin ready Data visualization friendly

Installing R and writing R code

Downloading R

R installation for Windows and macOS R installation for Linux OS

Main components of a base R installation

Possible alternatives to write and run R code

RStudio (all OSs) The Jupyter Notebook (all OSs) Visual Studio (Windows users only)

R foundational notions

A preliminary R session

Executing R interactively through the R console Creating an R script Executing an R script

Vectors Lists

Creating lists Subsetting lists

Data frames Functions

R's weaknesses and how to overcome them

Learning R effectively and minimizing the effort

The tidyverse Leveraging the R community to learn R

Where to find the R community Engaging with the community to learn R

Handling large datasets with R

Further references Summary

A First Primer on Data Mining Analysing Your Bank Account Data

Acquiring and preparing your banking data

Data model

Summarizing your data with pivot-like tables

A gentle introduction to the pipe operator An even more gentle introduction to the dplyr package Installing the necessary packages and loading your data into R

Installing and loading the necessary packages Importing your data into R

Defining the monthly and daily sum of expenses

Visualizing your data with ggplot2

Basic data visualization principles

Less but better Not every chart is good for your message

Scatter plot Line chart Bar plot Other advanced charts

Colors have to be chosen carefully

A bit of theory - chromatic circle, hue, and luminosity

Visualizing your data with ggplot

One more gentle introduction – the grammar of graphics A layered grammar of graphics – ggplot2 Visualizing your banking movements with ggplot2

Visualizing the number of movements per day of the week

Further references Summary

The Data Mining Process - CRISP-DM Methodology

The CRISP-DM methodology data mining cycle Business understanding Data understanding

Data collection

How to perform data collection with R

Data import from TXT and CSV files Data import from different types of format already structured as tables Data import from unstructured sources

Data description

How to perform data description with R

Data exploration

What to use in R to perform this task

The summary() function Box plot Histograms

Data preparation Modelling

Defining a data modeling strategy

How similar problems were solved in the past

Emerging techniques

Classification of modeling problems How to perform data modeling with R

Evaluation

Clustering evaluation Classification evaluation Regression evaluation How to judge the adequacy of a model's performance

What to use in R to perform this task

Deployment

Deployment plan development Maintenance plan development

Summary

Keeping the House Clean – The Data Mining Architecture

A general overview Data sources

Types of data sources

Unstructured data sources Structured data sources Key issues of data sources

Databases and data warehouses

The third wheel – the data mart One-level database Two-level database Three-level database Technologies

SQL MongoDB Hadoop

The data mining engine

The interpreter The interface between the engine and the data warehouse The data mining algorithms

User interface

Clarity

Clarity and mystery Clarity and simplicity Efficiency Consistency

Syntax highlight Auto-completion

How to build a data mining architecture in R

Data sources The data warehouse The data mining engine

The interface between the engine and the data warehouse The data mining algorithms

The user interface

Further references Summary

How to Address a Data Mining Problem – Data Cleaning and Validation

On a quiet day Data cleaning

Tidy data Analysing the structure of our data

The str function The describe function head, tail, and View functions Evaluating your data tidiness

Every row is a record Every column shows an attribute Every table represents an observational unit

Tidying our data

The tidyr package

Long versus wide data The spread function The gather function The separate function

Applying tidyr to our dataset

Validating our data

Fitness for use Conformance to standards Data quality controls

Consistency checks Data type checks Logical checks Domain checks Uniqueness checks

Performing data validation on our data

Data type checks with str() Domain checks

The final touch — data merging

left_join function Moving beyond left_join

Further references Summary

Looking into Your Data Eyes – Exploratory Data Analysis

Introducing summary EDA

Describing the population distribution

Quartiles and Median Mean

The mean and phenomena going on within subpopulations The mean being biased by outlier values Computing the mean of our population

Variance Standard deviation Skewness

Measuring the relationship between variables

Correlation

The Pearson correlation coefficient Distance correlation

Weaknesses of summary EDA - the Anscombe quartet

Graphical EDA

Visualizing a variable distribution

Histogram

Reporting date histogram Geographical area histogram Cash flow histogram

Boxplot Checking for outliers

Visualizing relationships between variables

Scatterplots

Adding title, subtitle, and caption to the plot Setting axis and legend Adding explicative text to the plot Final touches on colors

Further references Summary

Our First Guess – a Linear Regression

Defining a data modelling strategy

Data modelling notions

Supervised learning Unsupervised learning The modeling strategy

Applying linear regression to our data

The intuition behind linear regression The math behind the linear regression

Ordinary least squares technique Model requirements – what to look for before applying the model

Residuals' uncorrelation Residuals' homoscedasticity

How to apply linear regression in R

Fitting the linear regression model Validating model assumption Visualizing fitted values

Preparing the data for visualization Developing the data visualization

Further references Summary

A Gentle Introduction to Model Performance Evaluation

Defining model performance

Fitting versus interpretability Making predictions with models

Measuring performance in regression models

Mean squared error R-squared

R-squared meaning and interpretation R-squared computation in R Adjusted R-squared R-squared misconceptions

The R-squared doesn't measure the goodness of fit A low R-squared doesn't mean your model is not statistically significant

Measuring the performance in classification problems

The confusion matrix

Confusion matrix in R

Accuracy

How to compute accuracy in R

Sensitivity

How to compute sensitivity in R

Specificity

How to compute specificity in R

How to choose the right performance statistics

A final general warning – training versus test datasets Further references Summary

Don't Give up – Power up Your Regression Including Multiple Variables

Moving from simple to multiple linear regression

Notation Assumptions

Variables' collinearity

Tolerance Variance inflation factors Addressing collinearity

Dimensionality reduction

Stepwise regression

Backward stepwise regression

From the full model to the n-1 model

Forward stepwise regression Double direction stepwise regression

Principal component regression

Fitting a multiple linear model with R

Model fitting Variable assumptions validation Residual assumptions validation Dimensionality reduction

Principal component regression Stepwise regression

Linear model cheat sheet

Further references Summary

A Different Outlook to Problems with Classification Models

What is classification and why do we need it?

Linear regression limitations for categorical variables Common classification algorithms and models

Logistic regression

The intuition behind logistic regression

The logistic function estimates a response variable enclosed within an upper and lower bound The logistic function estimates the probability of an observation pertaining to one of the two available categories

The math behind logistic regression

Maximum likelihood estimator Model assumptions

Absence of multicollinearity between variables Linear relationship between explanatory variables and log odds Large enough sample size

How to apply logistic regression in R

Fitting the model

Reading the glm() estimation output The level of statistical significance of the association between the explanatory variable and the response variable The AIC performance metric

Validating model assumptions

Fitting quadratic and cubic models to test for linearity of log odds

Visualizing and interpreting logistic regression results

Visualizing results Interpreting results

Logistic regression cheat sheet

Support vector machines

The intuition behind support vector machines

The hyperplane Maximal margin classifier Support vector and support vector machines Model assumptions Independent and identically distributed random variables

Independent variables Identically distributed

Applying support vector machines in R

The svm() function Applying the svm function to our data

Interpreting support vector machine results

Understanding the meaning of hyperplane weights

Support Vector Machine cheat sheet

References Summary

The Final Clash – Random Forests and Ensemble Learning

Random forest

Random forest building blocks – decision trees introduction The intuition behind random forests How to apply random forests in R Evaluating the results of the model

Performance of the model

OOB estimate error rate Confusion matrix

Importance of predictors

Mean decrease in accuracy Gini index Plotting relative importance of predictors Random forest cheat sheet

Ensemble learning

Basic ensemble learning techniques Applying ensemble learning to our data in R

The R caret package Computing a confusion matrix with the caret package Interpreting confusion matrix results Applying a weighted majority vote to our data

Applying estimated models on new data

predict.glm() for prediction from the logistic model predict.randomForest() for prediction from random forests predict.svm() for prediction from support vector machines

A more structured approach to predictive analytics Applying the majority vote ensemble technique on predicted data Further references Summary

Looking for the Culprit – Text Data Mining with R

Extracting data from a PDF file in R

Getting a list of documents in a folder Reading PDF files into R via pdf_text() Iteratively extracting text from a set of documents with a for loop

Sentiment analysis Developing wordclouds from text Looking for context in text – analyzing document n-grams Performing network analysis on textual data

Obtaining an edge list from a data frame Visualizing a network with the ggraph package

Tuning the appearance of nodes and edges Computing the degree parameter in a network to highlight relevant nodes

Further references Summary

Sharing Your Stories with Your Stakeholders through R Markdown

Principles of a good data mining report

Clearly state the objectives Clearly state assumptions Make the data treatments clear Show consistent data Provide data lineage

Set up an rmarkdown report Develop an R markdown report in RStudio

A brief introduction to markdown Inserting a chunk of code

How to show readable tables in rmarkdown reports

Reproducing R code output within text through inline code Introduction to Shiny and the reactivity framework

Employing input and output to deal with changes in Shiny app parameters

Adding an interactive data lineage module

Adding an input panel to an R markdown report Adding a data table to your report Expanding Shiny beyond the basics

Rendering and sharing an R markdown report

Rendering an R markdown report Sharing an R Markdown report

Render a static markdown report into different file formats Render interactive Shiny apps on dedicated servers

Sharing a Shiny app through shinyapps.io

Further references Summary

Epilogue Dealing with Dates, Relative Paths and Functions

Dealing with dates in R Working directories and relative paths in R Conditional statements
