Data Science with Julia

Index

Aitken acceleration 142

Alternating expectation-conditional maximization (AECM) algorithm 152–153, 154–156, 159

Arrays 23, 45, 47, 48, 49

  built-in functions 24–25

  counting, in Julia 26

  fast computations 23

  generating 23

  slicing 25

ASCII 19

Bagging 112, 113

Bar charts 69, 70

Bayesian information criterion (BIC) 146, 147, 163

Bernoulli distribution 168

Big data 4–5

Bivariate plots 83–84

Boolean expression 29–30

Boolean values 16, 45, 58

Bootstrap 93, 108–111

Boxplots 67, 70, 74, 75, 83

Brier score 95

C 5, 6

C# 59

Cairo.jl package 92

CART. See classification and regression trees (CART)

Categorical data 47–49

Char 19

Chi-square test 167

Classification and regression trees (CART) 93

  classification trees 103–106

  overview 102

  regression trees 106–107

Clustering.jl 129

Compound expressions 28

Conditional evaluation 29–30

Constant matrix 44, 130

Covariance matrix 119, 130, 131, 141

Cross-entropy 106

CSV.jl library 48–49, 59

Data analysis 3, 4

Data matrices 43

Data science

  computing, role of 3

  history of 3

  statistics, role of 1–2, 3

Data, big. See big data

Data, labelled 94

Data, unlabelled 94

Dataframes

  arrays (see arrays)

  columns 44

  defining 43

  features 44

  functions 54–55

  matrix form 43

  slicing 45

  sorting 58

Datasets 6

  beer data 7, 50, 60, 75, 83, 86, 116–117, 120

  coffee data 7, 161–162, 165, 166, 171–172, 174, 176

  crabs data 7, 134, 146, 147, 167

  food preferences data 10, 51, 165, 176

  iris data 10

  wine data 166

  x2 data 10

DataStreams.jl 60

Deviance 95

Dictionaries 26, 27

Dot charts 69, 70

Eigenvalues 131

Eigenvectors 131

Empirical cumulative distribution function (ECDF) plot 80, 82, 83

Error bars 90

error() function 35

Exception handling 33–35

Expectation-maximization (EM) algorithm

  E-step 138–139

  implementing, for PPCA 142, 144

  initialization 141

  M-step 139–140

  overview 137

  stopping rule 141

  Woodbury identity 140–141

Facets 91

Factor analysis 135

Floating points 61

Floats 16, 17–18, 44

Fortran 5

Functions

  anonymous 38

  defining 36

  inputs 36–38

  naming conventions 36

  series of 40

  writing 40

Gadfly.jl 67, 68–69, 86, 88, 90

Gaussian mixture model 148

ggplot2 67

Gini index 106

Gradient boosting

  beer data example 116–117, 120

  food preferences example 121–123

  overview 113–115

Grammar of Graphics (GoG) 67–68, 77, 86

Hexbin plots 85, 86

Histograms 67, 71–72, 85, 86

IEEE 754 standard 18

Interquartile range (IQR) 74

Julia

  interoperability 5–6

  syntax 5

K-fold cross-validation 97–98, 183

K-means clustering 148–149, 159, 161

K-nearest neighbours (kNN) 99–100

Kernel density 71–72

Language-Integrated Query (LINQ) 59

Loess model 85

Log-likelihood 142, 144, 153

Logistic regression model 90

Loops

  continue keyword 33

  for loop 30, 32

  overview 30

  termination of 32

  while loop 31, 32

MASS package 7

Mean squared error (MSE) 94

Median absolute error (MAE) 94, 121, 125, 177–178, 180, 181, 182

Missing values 45–46

Mixture package 7

MLBase.jl 116

MM algorithm 39

Numeric literals 15

Numeric primitives 15

ODBC 59

Operators 14–15

Pareto distributions 80

Parsimonious Gaussian mixture models (PGMMs) 172–173, 174

Perl 21

Plots, saving 92

Principal components analysis (PCA) 132–134, 135, 147

Probabilistic principal components analysis (PPCA) 123–125, 132–134, 142, 144

  AECM algorithm for mixture model 159

  mixture models 151–152

  parameter estimation 152–153

Pseudocode 113

Python 5, 6, 14, 19, 37

QQ-plots 68, 77, 80, 84

Query.jl package

  descending() function 63–64

  @group statement 65

  @join statement 64

  @let 62

  @orderby 63

  @select statement 60

  arrays 60, 61

  overview 59, 60

  query statement 59

  syntax 60, 61

R 5, 6, 14, 19, 38, 49, 55, 56, 67, 166–171

Random forests 112, 113, 176–184

Random matrix 130

Random vector 43, 44, 130, 131

RCall 166–167

RDatasets.jl 165, 166

Regexes 21

Root mean squared error (RMSE) 95, 121, 125, 181

Scatterplots 83–84, 85

Scientific notation 16, 19

Split-apply-combine (SAC) strategy 56–58

SQLite 59

Statistical modelling, cultures within 2–3

Statistics, role of, in data science 1–2, 4

Strings 19–21, 47, 49

Supervised learning 93–96

Ternary operators 30

Training-test paradigm 97

Trees, combining 112

Try-catch statements 35

Tuples 22, 61

Type promotion 40

Unicode 13, 19, 36

Unsigned integers 16

Unsupervised learning 129, 130, 162

UTF-8 13, 19

Validation set 97

Variables

  global 40

  names 13, 44

  symbols 44

Velocity 4, 5

Violin plots 75, 76, 77, 83, 183

Visualizations, data

  custom 70

  Gadfly.jl (see Gadfly.jl)

  overview 69–72

VSCC technique 175

Woodbury identity 140–141, 144

XGBoost 93, 115, 116, 121, 122