Index

A

Aitken acceleration 142

Alternating expectation-conditional maximization (AECM) algorithm 152–153, 154–156, 159

Arrays 23, 45, 47, 48, 49

built-in functions 24–25

counting, in Julia 26

fast computations 23

generating 23

slicing 25

ASCII 19

B

Bagging 112, 113

Bar charts 69, 70

Bayesian information criterion (BIC) 146, 147, 163

Bernoulli distribution 168

Big data 45

Bivariate plots 83–84

Boolean expression 29–30

Boolean values 16, 45, 58

Bootstrap 93, 108–111

Boxplots 67, 70, 74, 75, 83

Brier score 95

C

C 5, 6

C# 59

Cairo.jl package 92

CART. See classification and regression trees (CART)

Categorical data 47–49

Char 19

Chi-square test 167

Classification and regression trees (CART) 93

classification trees 103–106

overview 102

regression trees 106–107

Clustering.jl 129

Compound expressions 28

Conditional evaluation 29–30

Constant matrix 44, 130

Covariance matrix 119, 130, 131, 141

Cross-entropy 106

CSV.jl library 48–49, 59

D

Data analysis 3, 4

Data matrices 43

Data science

computing, role of 3

history of 3

statistics, role of 1–2, 3

Data, big. See big data

Data, labelled 94

Data, unlabelled 94

Dataframes

arrays (see arrays)

columns 44

defining 43

features 44

functions 54–55

matrix form 43

slicing 45

sorting 58

Datasets 6

beer data 7, 50, 60, 75, 83, 86, 116–117, 120

coffee data 7, 161–162, 165, 166, 171–172, 174, 176

crabs data 7, 134, 146, 147, 167

food preferences data 10, 51, 165, 176

iris data 10

wine data 166

x2 data 10

DataStreams.jl 60

Deviance 95

Dictionaries 26, 27

Dot charts 69, 70

E

Eigenvalues 131

Eigenvectors 131

Empirical cumulative distribution function (ECDF) plot 80, 82, 83

Error bars 90

Error() function 35

Exception handling 33–35

Expectation-maximization (EM) algorithm

E-step 138–139

implementing, for PPCA 142, 144

initialization 141

M-step 139–140

overview 137

stopping rule 141

Woodbury identity 140–141

F

Facets 91

Factor analysis 135

Floating points 61

Floats 16, 17–18, 44

Fortran 5

Functions

anonymous 38

defining 36

inputs 36–38

naming conventions 36

series of 40

writing 40

G

GadFly.jl 67, 68–69, 86, 88, 90

Gaussian mixture model 148

ggplot2 67

Gini index 106

Gradient boosting

beer data example 116–117, 120

food preferences example 121–123

overview 113–115

Grammar of Graphics (GoG) 67–68, 77, 86

H

Hexbin plots 85, 86

Histograms 67, 71–72, 85, 86

I

IEEE 754 standard 18

Interquartile range (IQR) 74

J

Julia

interoperability 5–6

syntax 5

K

K-fold cross-validation 97–98, 183

K-means clustering 148–149, 159, 161

K-nearest neighbours (kNN) 99–100

Kernel density 71–72

L

Language-INtegrated Query (LINQ) 59

Loess model 85

Log-likelihood 142, 144, 153

Logistic regression model 90

Loops

continue keyword 33

for loop 30, 32

overview 30

termination of 32

while loop 31, 32

M

MASS package 7

Mean squared error (MSE) 94

Median absolute error (MAE) 94, 121, 125, 177–178, 180, 181, 182

Missing values 45–46

Mixture package 7

MLBase.jl 116

MM algorithm 39

N

Numeric literals 15

Numeric primitives 15

O

ODBC 59

Operators 14–15

P

Pareto distributions 80

Parsimonious Gaussian mixture models (PGMMs) 172–173, 174

Perl 21

Plots, saving 92

Principal components analysis (PCA) 132–134, 135, 147

Probabilistic principal components analysis (PPCA) 123–125, 132–134, 142, 144

AECM algorithm for mixture model 159

mixture models 151152

parameter estimation 152–153

Pseudocode 113

Python 5, 6, 14, 19, 37

Q

QQ-plots 68, 77, 80, 84

Query.jl package

descending() function 63–64

@group statement 65

@join statement 64

@let 62

@orderby 63

@select statement 60

arrays 60, 61

overview 59, 60

query statement 59

syntax 60, 61

R

R 5, 6, 14, 19, 38, 49, 55, 56, 67, 166–171

Random forests 112, 113, 176–184

Random matrix 130

Random vector 43, 44, 130, 131

RCall 166–167

RDatasets.jl 165, 166

Regexes 21

Root mean squared error (RMSE) 95, 121, 125, 181

S

Scatterplots 83–84, 85

Scientific notation 16, 19

Split-apply-combine (SAC) strategy 56–58

SQLite 59

Statistical modelling, cultures within 2–3

Statistics, role of, in data science 1–2, 4

Strings 19–21, 47, 49

Supervised learning 93–96

T

Ternary operators 30

Training-test paradigm 97

Trees, combining 112

Try-catch statements 35

Tuples 22, 61

Type promotion 40

U

Unicode 13, 19, 36

Unsigned integers 16

Unsupervised learning 129, 130, 162

UTF-8 13, 19

V

Validation set 97

Variables

global 40

names 13, 44

symbols 44

Velocity 4, 5

Violin plots 75, 76, 77, 83, 183

Visualizations, data

custom 70

GadFly.jl (see GadFly.jl)

overview 69–72

VSCC technique 175

W

Woodbury identity 140–141, 144

X

XGBoost 93, 115, 116, 121, 122