A
Aitken acceleration 142
Alternating expectation-conditional maximization (AECM) algorithm 152–153, 154–156, 159
Arrays
counting, in Julia 26
fast computations 23
generating 23
slicing 25
ASCII 19
B
Bayesian information criterion (BIC) 146, 147, 163
Bernoulli distribution 168
Brier score 95
C
C# 59
Cairo.jl package 92
CART. See classification and regression trees (CART)
Char 19
Chi-square test 167
Classification and regression trees (CART) 93
overview 102
Clustering.jl 129
Compound expressions 28
Covariance matrix 119, 130, 131, 141
Cross-entropy 106
D
Data matrices 43
Data science
computing, role of 3
history of 3
Data, big. See big data
Data, labelled 94
Data, unlabelled 94
Dataframes
arrays (see arrays)
columns 44
defining 43
features 44
matrix form 43
slicing 45
sorting 58
Datasets 6
beer data 7, 50, 60, 75, 83, 86, 116–117, 120
coffee data 7, 161–162, 165, 166, 171–172, 174, 176
crabs data 7, 134, 146, 147, 167
food preferences data 10, 51, 165, 176
iris data 10
wine data 166
x2 data 10
DataStreams.jl 60
Deviance 95
E
Eigenvalues 131
Eigenvectors 131
Empirical cumulative distribution function (ECDF) plot 80, 82, 83
Error bars 90
error() function 35
Expectation-maximization (EM) algorithm
implementing, for PPCA 142, 144
initialization 141
overview 137
stopping rule 141
F
Facets 91
Factor analysis 135
Floating points 61
Fortran 5
Functions
anonymous 38
defining 36
naming conventions 36
series of 40
writing 40
G
Gadfly.jl 67, 68–69, 86, 88, 90
Gaussian mixture model 148
ggplot2 67
Gini index 106
Gradient boosting
beer data example 116–117, 120
food preferences example 121–123
Grammar of Graphics (GoG) 67–68, 77, 86
H
I
IEEE 754 standard 18
Interquartile range (IQR) 74
J
Julia
syntax 5
K
K-fold cross-validation 97–98, 183
K-means clustering 148–149, 159, 161
K-nearest neighbours (kNN) 99–100
L
Language-INtegrated Query (LINQ) 59
Loess model 85
Logistic regression model 90
Loops
continue keyword 33
overview 30
termination of 32
M
MASS package 7
Mean squared error (MSE) 94
Median absolute error (MAE) 94, 121, 125, 177–178, 180, 181, 182
Mixture package 7
MLBase.jl 116
MM algorithm 39
N
Numeric literals 15
Numeric primitives 15
O
ODBC 59
P
Pareto distributions 80
Parsimonious Gaussian mixture models (PGMMs) 172–173, 174
Perl 21
Plots, saving 92
Principal components analysis (PCA) 132–134, 135, 147
Probabilistic principal components analysis (PPCA) 123–125, 132–134, 142, 144
AECM algorithm for mixture model 159
Pseudocode 113
Q
Query.jl package
@group statement 65
@join statement 64
@let 62
@orderby 63
@select statement 60
query statement 59
R
R 5, 6, 14, 19, 38, 49, 55, 56, 67, 166–171
Random forests 112, 113, 176–184
Random matrix 130
Random vector 43, 44, 130, 131
Regexes 21
Root mean squared error (RMSE) 95, 121, 125, 181
S
Split-apply-combine (SAC) strategy 56–58
SQLite 59
Statistical modelling, cultures within 2–3
Statistics, role of, in data science 1–2, 4
T
Ternary operators 30
Training-test paradigm 97
Trees, combining 112
Try-catch statements 35
Type promotion 40
U
Unsigned integers 16
Unsupervised learning 129, 130, 162
V
Validation set 97
Variables
global 40
symbols 44
Violin plots 75, 76, 77, 83, 183
Visualizations, data
custom 70
Gadfly.jl (see Gadfly.jl)
VSCC technique 175
W
Woodbury identity 140–141, 144
X