Contents

CHAPTER 1 ■ Introduction

1.1    DATA SCIENCE

1.2    BIG DATA

1.3    JULIA

1.4    JULIA AND R PACKAGES

1.5    DATASETS

1.5.1    Overview

1.5.2    Beer Data

1.5.3    Coffee Data

1.5.4    Leptograpsus Crabs Data

1.5.5    Food Preferences Data

1.5.6    x2 Data

1.5.7    Iris Data

1.6    OUTLINE OF THE CONTENTS OF THIS MONOGRAPH

CHAPTER 2 ■ Core Julia

2.1    VARIABLE NAMES

2.2    OPERATORS

2.3    TYPES

2.3.1    Numeric

2.3.2    Floats

2.3.3    Strings

2.3.4    Tuples

2.4    DATA STRUCTURES

2.4.1    Arrays

2.4.2    Dictionaries

2.5    CONTROL FLOW

2.5.1    Compound Expressions

2.5.2    Conditional Evaluation

2.5.3    Loops

2.5.3.1    Basics

2.5.3.2    Loop termination

2.5.3.3    Exception handling

2.6    FUNCTIONS

CHAPTER 3 ■ Working with Data

3.1    DATAFRAMES

3.2    CATEGORICAL DATA

3.3    INPUT/OUTPUT

3.4    USEFUL DATAFRAME FUNCTIONS

3.5    SPLIT-APPLY-COMBINE STRATEGY

3.6    QUERY.JL

CHAPTER 4 ■ Visualizing Data

4.1    GADFLY.JL

4.2    VISUALIZING UNIVARIATE DATA

4.3    DISTRIBUTIONS

4.4    VISUALIZING BIVARIATE DATA

4.5    ERROR BARS

4.6    FACETS

4.7    SAVING PLOTS

CHAPTER 5 ■ Supervised Learning

5.1    INTRODUCTION

5.2    CROSS-VALIDATION

5.2.1    Overview

5.2.2    K-Fold Cross-Validation

5.3    K-NEAREST NEIGHBOURS CLASSIFICATION

5.4    CLASSIFICATION AND REGRESSION TREES

5.4.1    Overview

5.4.2    Classification Trees

5.4.3    Regression Trees

5.4.4    Comments

5.5    BOOTSTRAP

5.6    RANDOM FORESTS

5.7    GRADIENT BOOSTING

5.7.1    Overview

5.7.2    Beer Data

5.7.3    Food Data

5.8    COMMENTS

CHAPTER 6 ■ Unsupervised Learning

6.1    INTRODUCTION

6.2    PRINCIPAL COMPONENTS ANALYSIS

6.3    PROBABILISTIC PRINCIPAL COMPONENTS ANALYSIS

6.4    EM ALGORITHM FOR PPCA

6.4.1    Background: EM Algorithm

6.4.2    E-step

6.4.3    M-step

6.4.4    Woodbury Identity

6.4.5    Initialization

6.4.6    Stopping Rule

6.4.7    Implementing the EM Algorithm for PPCA

6.4.8    Comments

6.5    K-MEANS CLUSTERING

6.6    MIXTURE OF PROBABILISTIC PRINCIPAL COMPONENTS ANALYZERS

6.6.1    Model

6.6.2    Parameter Estimation

6.6.3    Illustrative Example: Coffee Data

6.7    COMMENTS

CHAPTER 7 ■ R Interoperability

7.1    ACCESSING R DATASETS

7.2    INTERACTING WITH R

7.3    EXAMPLE: CLUSTERING AND DATA REDUCTION FOR THE COFFEE DATA

7.3.1    Coffee Data

7.3.2    PGMM Analysis

7.3.3    VSCC Analysis

7.4    EXAMPLE: FOOD DATA

7.4.1    Overview

7.4.2    Random Forests

APPENDIX A ■ Julia and R Packages Used Herein

APPENDIX B ■ Variables for Food Data

APPENDIX C ■ Useful Mathematical Results

C.1   BRIEF OVERVIEW OF EIGENVALUES

C.2   SELECTED LINEAR ALGEBRA RESULTS

C.3   MATRIX CALCULUS RESULTS

APPENDIX D ■ Performance Tips

D.1   FLOATING POINT NUMBERS

D.1.1   Do Not Test for Equality

D.1.2   Use Logarithms for Division

D.1.3   Subtracting Two Nearly Equal Numbers

D.2   JULIA PERFORMANCE

D.2.1   General Tips

D.2.2   Array Processing

D.2.3   Separate Core Computations

APPENDIX E ■ Linear Algebra Functions

E.1   VECTOR OPERATIONS

E.2   MATRIX OPERATIONS

E.3   MATRIX DECOMPOSITIONS

References

Index