Clojure for Data Science by Henry Garner



Table of Contents
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?
Free access for Packt account holders

Preface

What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support

Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions

1. Statistics

Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics

The mean
Interpreting mathematical notation
The median

Variance
Quantiles
Binning data
Histograms
The normal distribution

The central limit theorem

Poincaré's baker

Generating distributions

Skewness

Quantile-quantile plots

Comparative visualizations

Box plots
Cumulative distribution functions

The importance of visualizations

Visualizing electorate data

Adding columns

Adding derived columns

Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations

Probability mass functions
Scatter plots
Scatter transparency

Summary

2. Inference

Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution

The distribution of daily means

The central limit theorem
Standard error
Samples and populations
Confidence intervals

Sample comparisons
Bias

Visualizing different populations
Hypothesis testing

Significance

Testing a new site design

Performing a z-test
Student's t-distribution
Degrees of freedom

The t-statistic
Performing the t-test

Two-tailed tests

One-sample t-test
Resampling
Testing multiple designs

Calculating sample means

Multiple comparisons

Introducing the simulation
Compile the simulation

The browser simulation
jStat
B1

Scalable Vector Graphics

Plotting probability densities
State and Reagent

Updating state
Binding the interface

Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size

Cohen's d

Summary

3. Correlation

About the data
Inspecting the data
Visualizing the data
The log-normal distribution

Visualizing correlation
Jittering

Covariance
Pearson's correlation

Sample r and population rho

Hypothesis testing
Confidence intervals
Regression

Linear equations
Residuals

Ordinary least squares

Slope and intercept
Interpretation
Visualization
Assumptions

Goodness-of-fit and R-square
Multiple linear regression
Matrices

Dimensions
Vectors
Construction
Addition and scalar multiplication
Matrix-vector multiplication
Matrix-matrix multiplication
Transposition
The identity matrix
Inversion

The normal equation

More features

Multiple R-squared
Adjusted R-squared

Incanter's linear model

The F-test of model significance

Categorical and dummy variables
Relative power

Collinearity

Multicollinearity

Prediction

The confidence interval of a prediction
Model scope
The final model

Summary

4. Classification

About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion

Estimation using bootstrapping

The binomial distribution

The standard error of a proportion formula

Significance testing proportions

Adjusting standard errors for large samples

Chi-squared multiple significance testing

Visualizing the categories
The chi-squared test
The chi-squared statistic
The chi-squared test

Classification with logistic regression

The sigmoid function
The logistic regression cost function
Parameter optimization with gradient descent
Gradient descent with Incanter
Convexity

Implementing logistic regression with Incanter

Creating a feature matrix
Evaluating the logistic regression classifier
The confusion matrix
The kappa statistic

Probability

Bayes theorem
Bayes theorem with multiple predictors

Naive Bayes classification

Implementing a naive Bayes classifier
Evaluating the naive Bayes classifier

Comparing the logistic regression and naive Bayes approaches

Decision trees

Information
Entropy
Information gain
Using information gain to identify the best predictor
Recursively building a decision tree
Using the decision tree for classification
Evaluating the decision tree classifier

Classification with clj-ml

Loading data with clj-ml
Building a decision tree in clj-ml

Bias and variance

Overfitting
Cross-validation
Addressing high bias

Ensemble learning and random forests

Bagging and boosting

Saving the classifier to a file
Summary

5. Big Data

Downloading the code and data

Inspecting the data
Counting the records

The reducers library

Parallel folds with reducers
Loading large files with iota
Creating a reducers processing pipeline
Curried reductions with reducers
Statistical folds with reducers
Associativity
Calculating the mean using fold
Calculating the variance using fold

Mathematical folds with Tesser

Calculating covariance with Tesser
Commutativity
Simple linear regression with Tesser
Calculating a correlation matrix

Multiple regression with gradient descent

The gradient descent update rule
The gradient descent learning rate
Feature scaling
Feature extraction
Creating a custom Tesser fold

Creating a matrix-sum fold

Calculating the total model error

Creating a matrix-mean fold

Applying a single step of gradient descent
Running iterative gradient descent

Scaling gradient descent with Hadoop

Gradient descent on Hadoop with Tesser and Parkour

Parkour distributed sources and sinks
Running a feature scale fold with Hadoop
Running gradient descent with Hadoop
Preparing our code for a Hadoop cluster
Building an uberjar
Submitting the uberjar to Hadoop

Stochastic gradient descent

Stochastic gradient descent with Parkour

Defining a mapper
Parkour shaping functions
Defining a reducer
Specifying Hadoop jobs with Parkour graph
Chaining mappers and reducers with Parkour graph

Summary

6. Clustering

Downloading the data
Extracting the data
Inspecting the data
Clustering text

Set-of-words and the Jaccard index
Tokenizing the Reuters files

Applying the Jaccard index to documents
The bag-of-words and Euclidean distance

Representing text as vectors
Creating a dictionary

Creating term frequency vectors

The vector space model and cosine distance
Removing stop words
Stemming

Clustering with k-means and Incanter

Clustering the Reuters documents

Better clustering with TF-IDF

Zipf's law
Calculating the TF-IDF weight
k-means clustering with TF-IDF
Better clustering with n-grams

Large-scale clustering with Mahout

Converting text documents to a sequence file
Using Parkour to create Mahout vectors
Creating distributed unique IDs
Distributed unique IDs with Hadoop
Sharing data with the distributed cache
Building Mahout vectors from input documents

Running k-means clustering with Mahout

Viewing k-means clustering results
Interpreting the clustered output

Cluster evaluation measures

Inter-cluster density
Intra-cluster density
Calculating the root mean square error with Parkour

Loading clustered points and centroids

Calculating the cluster RMSE
Determining optimal k with the elbow method
Determining optimal k with the Dunn index
Determining optimal k with the Davies-Bouldin index

The drawbacks of k-means

The Mahalanobis distance measure

The curse of dimensionality
Summary

7. Recommender Systems

Download the code and data
Inspect the data
Parse the data
Types of recommender systems

Collaborative filtering

Item-based and user-based recommenders
Slope One recommenders

Calculating the item differences
Making recommendations
Practical considerations for user and item recommenders

Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout

Evaluating distance measures

The Pearson correlation similarity
Spearman's rank similarity

Determining optimum neighborhood size
Information retrieval statistics

Precision
Recall

Mahout's information retrieval evaluator

F-measure and the harmonic mean
Fall-out
Normalized discounted cumulative gain
Plotting the information retrieval results

Recommendation with Boolean preferences

Implicit versus explicit feedback

Probabilistic methods for large sets

Testing set membership with Bloom filters

Jaccard similarity for large sets with MinHash

Reducing pair comparisons with locality-sensitive hashing

Bucketing signatures

Dimensionality reduction

Plotting the Iris dataset
Principal component analysis
Singular value decomposition

Large-scale machine learning with Apache Spark and MLlib

Loading data with Sparkling
Mapping data
Distributed datasets and tuples
Filtering data
Persistence and caching

Machine learning on Spark with MLlib

Movie recommendations with alternating least squares
ALS with Spark and MLlib
Making predictions with ALS
Evaluating ALS
Calculating the sum of squared errors

Summary

8. Network Analysis

Download the data

Inspecting the data
Visualizing graphs with Loom

Graph traversal with Loom

The seven bridges of Königsberg

Breadth-first and depth-first search
Finding the shortest path

Minimum spanning trees
Subgraphs and connected components
SCC and the bow-tie structure of the web

Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX

Creating RDGs with Glittering
Measuring graph density with triangle counting

GraphX partitioning strategies

Running the built-in triangle counting algorithm
Implement triangle counting with Glittering

Step one – collecting neighbor IDs
Steps two, three, and four – aggregate messages
Step five – dividing the counts

Running the custom triangle counting algorithm
The Pregel API
Connected components with the Pregel API

Step one – map vertices
Steps two and three – the message function
Step four – update the attributes
Step five – iterate to convergence

Running connected components
Calculating the size of the largest connected component
Detecting communities with label propagation

Step one – map vertices
Step two – send the vertex attribute
Step three – aggregate value
Step four – vertex function
Step five – set the maximum iterations count

Running label propagation
Measuring community influence using PageRank
The flow formulation

Implementing PageRank with Glittering
Sort by highest influence

Running PageRank to determine community influencers

Summary

9. Time Series

About the data

Loading the Longley data

Fitting curves with a linear model
Time series decomposition

Inspecting the airline data

Visualizing the airline data

Stationarity
De-trending and differencing

Discrete time models

Random walks
Autoregressive models
Determining autocorrelation in AR models
Moving-average models
Determining autocorrelation in MA models
Combining the AR and MA models
Calculating partial autocorrelation

Autocovariance
PACF with Durbin-Levinson recursion
Plotting partial autocorrelation
Determining ARMA model order with ACF and PACF

ACF and PACF of airline data
Removing seasonality with differencing

Maximum likelihood estimation

Calculating the likelihood
Estimating the maximum likelihood

Nelder-Mead optimization with Apache Commons Math

Identifying better models with Akaike Information Criterion

Time series forecasting

Forecasting with Monte Carlo simulation

Summary

10. Visualization

Download the code and data
Exploratory data visualization

Representing a two-dimensional histogram

Using Quil for visualization

Drawing to the sketch window
Quil's coordinate system
Plotting the grid
Specifying the fill color
Color and fill
Outputting an image file

Visualization for communication

Visualizing wealth distribution
Bringing data to life with Quil
Drawing bars of differing widths
Adding a title and axis labels
Improving the clarity with illustrations
Adding text to the bars
Incorporating additional data
Drawing complex shapes
Drawing curves
Plotting compound charts
Output to PDF

Summary

Index
