Scala: Guide for Data Science Professionals by Bugnion, Pascal

Table of Contents

Credits Preface

What this learning path covers What you need for this learning path Who this learning path is for Reader feedback Customer support

Downloading the example code Errata Piracy Questions

I. Module 1

1. Scala and Data Science

Data science Programming in data science Why Scala?

Static typing and type inference Scala encourages immutability Scala and functional programs Null pointer uncertainty Easier parallelism Interoperability with Java

When not to use Scala Summary References

2. Manipulating Data with Breeze

Code examples Installing Breeze Getting help on Breeze Basic Breeze data types

Vectors Dense and sparse vectors and the vector trait Matrices Building vectors and matrices Advanced indexing and slicing Mutating vectors and matrices Matrix multiplication, transposition, and the orientation of vectors Data preprocessing and feature engineering Breeze – function optimization Numerical derivatives Regularization

An example – logistic regression Towards re-usable code Alternatives to Breeze Summary References

3. Plotting with breeze-viz

Diving into Breeze Customizing plots Customizing the line type More advanced scatter plots Multi-plot example – scatterplot matrix plots Managing without documentation Breeze-viz reference Data visualization beyond breeze-viz Summary

4. Parallel Collections and Futures

Parallel collections

Limitations of parallel collections Error handling Setting the parallelism level An example – cross-validation with parallel collections

Futures

Future composition – using a future's result Blocking until completion Controlling parallel execution with execution contexts Futures example – stock price fetcher

Summary References

5. Scala and SQL through JDBC

Interacting with JDBC First steps with JDBC

Connecting to a database server Creating tables Inserting data Reading data

JDBC summary Functional wrappers for JDBC Safer JDBC connections with the loan pattern Enriching JDBC statements with the "pimp my library" pattern Wrapping result sets in a stream Looser coupling with type classes

Type classes Coding against type classes When to use type classes Benefits of type classes

Creating a data access layer Summary References

6. Slick – A Functional Interface for SQL

FEC data

Importing Slick Defining the schema Connecting to the database Creating tables Inserting data Querying data

Invokers Operations on columns Aggregations with "Group by" Accessing database metadata Slick versus JDBC Summary References

7. Web APIs

A whirlwind tour of JSON Querying web APIs JSON in Scala – an exercise in pattern matching

JSON4S types Extracting fields using XPath

Extraction using case classes Concurrency and exception handling with futures Authentication – adding HTTP headers

HTTP – a whirlwind overview Adding headers to HTTP requests in Scala

Summary References

8. Scala and MongoDB

MongoDB Connecting to MongoDB with Casbah

Connecting with authentication

Inserting documents Extracting objects from the database Complex queries Casbah query DSL Custom type serialization Beyond Casbah Summary References

9. Concurrency with Akka

GitHub follower graph Actors as people Hello world with Akka Case classes as messages Actor construction Anatomy of an actor Follower network crawler Fetcher actors Routing Message passing between actors Queue control and the pull pattern Accessing the sender of a message Stateful actors Follower network crawler Fault tolerance Custom supervisor strategies Life-cycle hooks What we have not talked about Summary References

10. Distributed Batch Processing with Spark

Installing Spark Acquiring the example data Resilient distributed datasets

RDDs are immutable RDDs are lazy RDDs know their lineage RDDs are resilient RDDs are distributed Transformations and actions on RDDs Persisting RDDs Key-value RDDs Double RDDs

Building and running standalone programs

Running Spark applications locally Reducing logging output and Spark configuration Running Spark applications on EC2

Spam filtering Lifting the hood Data shuffling and partitions Summary Reference

11. Spark SQL and DataFrames

DataFrames – a whirlwind introduction Aggregation operations Joining DataFrames together Custom functions on DataFrames DataFrame immutability and persistence SQL statements on DataFrames Complex data types – arrays, maps, and structs

Structs Arrays Maps

Interacting with data sources

JSON files Parquet files

Standalone programs Summary References

12. Distributed Machine Learning with MLlib

Introducing MLlib – Spam classification Pipeline components

Transformers Estimators

Evaluation Regularization in logistic regression Cross-validation and model selection Beyond logistic regression Summary References

13. Web APIs with Play

Client-server applications Introduction to web frameworks Model-View-Controller architecture Single page applications Building an application The Play framework Dynamic routing Actions

Composing the response Understanding and parsing the request

Interacting with JSON Querying external APIs and consuming JSON

Calling external web services Parsing JSON Asynchronous actions

Creating APIs with Play: a summary Rest APIs: best practice Summary References

14. Visualization with D3 and the Play Framework

GitHub user data Do I need a backend? JavaScript dependencies through web-jars Towards a web application: HTML templates Modular JavaScript through RequireJS Bootstrapping the applications Client-side program architecture

Designing the model The event bus AJAX calls through JQuery Response views

Drawing plots with NVD3 Summary References

A. Pattern Matching and Extractors

Pattern matching in for comprehensions Pattern matching internals Extracting sequences Summary Reference

II. Module 2

1. Getting Started with Breeze

Introduction Getting Breeze – the linear algebra library

How to do it... There's more...

The org.scalanlp.breeze dependency The org.scalanlp.breeze-natives package

Working with vectors

Getting ready How to do it...

Creating vectors Constructing a vector from values

Creating a zero vector

Creating a vector out of a function Creating a vector of linearly spaced values Creating a vector with values in a specific range Creating an entire vector with a single value Slicing a sub-vector from a bigger vector Creating a Breeze Vector from a Scala Vector Vector arithmetic Scalar operations Calculating the dot product of two vectors Creating a new vector by adding two vectors together Appending vectors and converting a vector of one type to another Concatenating two vectors

Converting a vector of Int to a vector of Double Computing basic statistics Mean and variance

Standard deviation Find the largest value in a vector Finding the sum, square root and log of all the values in the vector

The Sqrt function The Log function

Working with matrices

How to do it...

Creating matrices

Creating a matrix from values Creating a zero matrix Creating a matrix out of a function Creating an identity matrix Creating a matrix from random numbers Creating from a Scala collection

Matrix arithmetic

Addition Multiplication

Appending and conversion

Concatenating matrices – vertically Concatenating matrices – horizontally Converting a matrix of Int to a matrix of Double

Data manipulation operations

Getting column vectors out of the matrix Getting row vectors out of the matrix Getting values inside the matrix Getting the inverse and transpose of a matrix

Computing basic statistics

Mean and variance Standard deviation Finding the largest value in a matrix Finding the sum, square root and log of all the values in the matrix Sqrt Log Calculating the eigenvectors and eigenvalues of a matrix

How it works...

Vectors and matrices with randomly distributed values

How it works...

Creating vectors with uniformly distributed random values Creating vectors with normally distributed random values Creating vectors with random values that have a Poisson distribution Creating a matrix with uniformly random values Creating a matrix with normally distributed random values Creating a matrix with random values that has a Poisson distribution

Reading and writing CSV files

How it works...

2. Getting Started with Apache Spark DataFrames

Introduction Getting Apache Spark

How to do it...

Creating a DataFrame from CSV

How to do it... How it works... There's more…

Manipulating DataFrames

How to do it...

Printing the schema of the DataFrame Sampling the data in the DataFrame Selecting DataFrame columns Filtering data by condition Sorting data in the frame Renaming columns Treating the DataFrame as a relational table Joining two DataFrames

Inner join Right outer join Left outer join

Saving the DataFrame as a file

Creating a DataFrame from Scala case classes

How to do it... How it works...

3. Loading and Preparing Data – DataFrame

Introduction Loading more than 22 features into classes

How to do it... How it works... There's more…

Loading JSON into DataFrames

How to do it…

Reading a JSON file using SQLContext.jsonFile Reading a text file and converting it to JSON RDD Explicitly specifying your schema

There's more…

Storing data as Parquet files

How to do it…

Load a simple CSV file, convert it to case classes, and create a DataFrame from it Save it as a Parquet file Install Parquet tools Using the tools to inspect the Parquet file Enable compression for the Parquet file

Using the Avro data model in Parquet

How to do it…

Creation of the Avro model Generation of Avro objects using the sbt-avro plugin Constructing an RDD of our generated object from Students.csv Saving RDD[StudentAvro] in a Parquet file Reading the file back for verification Using Parquet tools for verification

Loading from RDBMS

How to do it…

Preparing data in Dataframes

How to do it...

4. Data Visualization

Introduction Visualizing using Zeppelin

How to do it...

Installing Zeppelin Customizing Zeppelin's server and websocket port Visualizing data on HDFS – parameterizing inputs Running custom functions Adding external dependencies to Zeppelin Pointing to an external Spark cluster

Creating scatter plots with Bokeh-Scala

How to do it...

Preparing our data Creating Plot and Document objects Creating a marker object Setting the X and Y axes' data range for the plot Drawing the x and the y axes Viewing flower species with varying colors Adding grid lines Adding a legend to the plot

Creating a time series MultiPlot with Bokeh-Scala

How to do it...

Preparing our data Creating a plot Creating a line that joins all the data points Setting the x and y axes' data range for the plot Drawing the axes and the grids Adding tools Adding a legend to the plot Multiple plots in the document

5. Learning from Data

Introduction Supervised and unsupervised learning Gradient descent Predicting continuous values using linear regression

How to do it...

Importing the data Converting each instance into a LabeledPoint Preparing the training and test data Scaling the features Training the model Predicting against test data Evaluating the model Regularizing the parameters Mini batching

Binary classification using LogisticRegression and SVM

How to do it...

Importing the data Tokenizing the data and converting it into LabeledPoints Factoring the inverse document frequency Prepare the training and test data Constructing the algorithm Training the model and predicting the test data Evaluating the model

Binary classification using LogisticRegression with Pipeline API

How to do it...

Importing and splitting data as test and training sets Construct the participants of the Pipeline Preparing a pipeline and training a model Predicting against test data Evaluating a model without cross-validation Constructing parameters for cross-validation Constructing cross-validator and fit the best model Evaluating the model with cross-validation

Clustering using K-means

How to do it...

KMeans.RANDOM KMeans.PARALLEL

K-means++ K-means||

Max iterations Epsilon Importing the data and converting it into a vector Feature scaling the data Deriving the number of clusters Constructing the model Evaluating the model

Feature reduction using principal component analysis

How to do it...

Dimensionality reduction of data for supervised learning Mean-normalizing the training data Extracting the principal components Preparing the labeled data Preparing the test data Classify and evaluate the metrics Dimensionality reduction of data for unsupervised learning Mean-normalizing the training data Extracting the principal components Arriving at the number of components Evaluating the metrics

6. Scaling Up

Introduction Building the Uber JAR

How to do it...

Transitive dependency stated explicitly in the SBT dependency

Two different libraries depend on the same external library

Submitting jobs to the Spark cluster (local)

How to do it...

Downloading Spark Running HDFS on Pseudo-clustered mode Running the Spark master and slave locally Pushing data into HDFS Submitting the Spark application on the cluster

Running the Spark Standalone cluster on EC2

How to do it...

Creating the AccessKey and pem file Setting the environment variables Running the launch script Verifying installation Making changes to the code Transferring the data and job files Loading the dataset into HDFS Running the job Destroying the cluster

Running the Spark Job on Mesos (local)

How to do it...

Installing Mesos Starting the Mesos master and slave Uploading the Spark binary package and the dataset to HDFS Running the job

Running the Spark Job on YARN (local)

How to do it...

Installing the Hadoop cluster Starting HDFS and YARN Pushing Spark assembly and dataset to HDFS Running a Spark job in yarn-client mode Running Spark job in yarn-cluster mode

7. Going Further

Introduction Using Spark Streaming to subscribe to a Twitter stream

How to do it...

Using Spark as an ETL tool

How to do it...

Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream

How to do it...

Using GraphX to analyze Twitter data

How to do it...

III. Module 3

1. Getting Started

Mathematical notation for the curious Why machine learning?

Classification Prediction Optimization Regression

Why Scala?

Abstraction Scalability Configurability Maintainability Computation on demand

Model categorization Taxonomy of machine learning algorithms

Unsupervised learning

Clustering Dimension reduction

Supervised learning

Generative models Discriminative models

Reinforcement learning

Tools and frameworks

Java Scala Apache Commons Math

Description Licensing Installation

JFreeChart

Description Licensing Installation

Other libraries and frameworks

Source code

Context versus view bounds Presentation Primitives and implicits

Primitive types Type conversions Operators

Immutability Performance of Scala iterators

Let's kick the tires

Overview of computational workflows Writing a simple workflow

Selecting a dataset Loading the dataset Preprocessing the dataset

Basic statistics Normalization and Gauss distribution Plotting data

Creating a model (learning) Classify the data

Summary

2. Hello World!

Modeling

A model by any other name Model versus design Selecting a model's features Extracting features

Designing a workflow

The computational framework The pipe operator Monadic data transformation Dependency injection Workflow modules The workflow factory Examples of workflow components

The preprocessing module The clustering module

Assessing a model

Validation

Key metrics Implementation

K-fold cross-validation Bias-variance decomposition Overfitting

Summary

3. Data Preprocessing

Time series Moving averages

The simple moving average The weighted moving average The exponential moving average

Fourier analysis

Discrete Fourier transform (DFT) DFT-based filtering Detection of market cycles

The Kalman filter

The state space estimation

The transition equation The measurement equation

The recursive algorithm

Prediction Correction Kalman smoothing Experimentation

Alternative preprocessing techniques Summary

4. Unsupervised Learning

Clustering

K-means clustering

Measuring similarity Overview of the K-means algorithm Step 1 – cluster configuration

Defining clusters Defining K-means Initializing clusters

Step 2 – cluster assignment Step 3 – iterative reconstruction Curse of dimensionality Experiment Tuning the number of clusters Validation

Expectation-maximization (EM) algorithm

Gaussian mixture model EM overview Implementation Testing Online EM

Dimension reduction

Principal components analysis (PCA)

Algorithm Implementation Test case Evaluation

Other dimension reduction techniques

Performance considerations

K-means EM PCA

Summary

5. Naïve Bayes Classifiers

Probabilistic graphical models Naïve Bayes classifiers

Introducing the multinomial Naïve Bayes

Formalism The frequentist perspective The predictive model The zero-frequency problem

Implementation

Software design Training Classification Labeling Results

Multivariate Bernoulli classification

Model Implementation

Naïve Bayes and text mining

Basics of information retrieval Implementation

Extraction of terms Scoring of terms

Testing

Retrieving textual information Evaluation

Pros and cons Summary

6. Regression and Regularization

Linear regression

One-variate linear regression

Implementation Test case

Ordinary least squares (OLS) regression

Design Implementation Test case 1 – trending Test case 2 – features selection

Regularization

Ln roughness penalty The ridge regression

Implementation The test case

Numerical optimization The logistic regression

The logit function Binomial classification Software design The training workflow

Configuring the least squares optimizer Computing the Jacobian matrix Defining the exit conditions Defining the least squares problem Minimizing the loss function Test

Classification

Summary

7. Sequential Data Models

Markov decision processes

The Markov property The first-order discrete Markov chain

The hidden Markov model (HMM)

Notation The lambda model HMM execution state Evaluation (CF-1)

Alpha class (the forward variable) Beta class (the backward variable)

Training (CF-2)

Baum-Welch estimator (EM)

Decoding (CF-3)

The Viterbi algorithm

Putting it all together Test case The hidden Markov model for time series analysis

Conditional random fields

Introduction to CRF Linear chain CRF

CRF and text analytics

The feature functions model Software design Implementation

Building the training set Generating tags Extracting data sequences CRF control parameters Putting it all together

Tests

The training convergence profile Impact of the size of the training set Impact of the L2 regularization factor

Comparing CRF and HMM Performance consideration Summary

8. Kernel Models and Support Vector Machines

Kernel functions

Overview Common discriminative kernels

The support vector machine (SVM)

The linear SVM

The separable case (hard margin) The nonseparable case (soft margin)

The nonlinear SVM

Max-margin classification The kernel trick

Support vector classifier (SVC)

The binary SVC

LIBSVM Software design Configuration parameters

SVM Formulation The SVM kernel function SVM execution

SVM implementation C-penalty and margin Kernel evaluation Application to risk analysis

Features and labels

Anomaly detection with one-class SVC Support vector regression (SVR)

Overview SVR versus linear regression

Performance considerations Summary

9. Artificial Neural Networks

Feed-forward neural networks (FFNN)

The Biological background The mathematical background

The multilayer perceptron (MLP)

The activation function The network architecture Software design Model definition

Layers Synapses Connections

Training cycle/epoch

Step 1 – input forward propagation

The computational model Objective Softmax

Step 2 – sum of squared errors Step 3 – error backpropagation

Error propagation The computational model

Step 4 – synapse/weights adjustment

Momentum factor for gradient descent Implementation

Step 5 – convergence criteria Configuration Putting all together

Training strategies and classification

Online versus batch training Regularization Model instantiation Prediction

Evaluation

Impact of learning rate Impact of the momentum factor Test case

Implementation Models evaluation Impact of hidden layers architecture

Benefits and limitations Summary

10. Genetic Algorithms

Evolution

The origin NP problems Evolutionary computing

Genetic algorithms and machine learning Genetic algorithm components

Encodings

Value encoding Predicate encoding Solution encoding The encoding scheme

Flat encoding Hierarchical encoding

Genetic operators

Selection Crossover Mutation

Fitness score

Implementation

Software design Key components Selection Controlling population growth GA configuration Crossover

Population Chromosomes Genes

Mutation

Population Chromosomes Genes

The reproduction cycle

GA for trading strategies

Definition of trading strategies

Trading operators The cost/unfitness function Trading signals Trading strategies Signal encoding

Test case

Data extraction Initial population Configuration GA instantiation GA execution Tests

The unweighted score The weighted score

Advantages and risks of genetic algorithms Summary

11. Reinforcement Learning

Introduction

The problem A solution – Q-learning

Terminology Concept Value of policy Bellman optimality equations Temporal difference for model-free learning Action-value iterative update

Implementation

Software design States and actions Search space Policy and action-value The Q-learning training Tail recursion to the rescue Prediction

Option trading using Q-learning

Option property Option model Function approximation Constrained state-transition Putting it all together

Evaluation Pros and cons of reinforcement learning

Learning classifier systems

Introduction to LCS Why LCS Terminology Extended learning classifier systems (XCS) XCS components

Application to portfolio management XCS core data XCS rules Covering Example of implementation

Benefits and limitation of learning classifier systems

Summary

12. Scalable Frameworks

Overview Scala

Controlling object creation Parallel collections

Processing a parallel collection Benchmark framework Performance evaluation

Scalability with Actors

The Actor model Partitioning Beyond actors – reactive programming

Akka

Master-workers

Messages exchange Worker actors The workflow controller The master Actor Master with routing Distributed discrete Fourier transform Limitations

Futures

The Actor life cycle Blocking on futures Handling future callbacks Putting all together

Apache Spark

Why Spark Design principles

In-memory persistency Laziness Transforms and Actions Shared variables

Experimenting with Spark

Deploying Spark Using Spark shell MLlib RDD generation K-means using Spark

Performance evaluation

Tuning parameters Tests Performance considerations

Pros and cons 0xdata Sparkling Water

Summary

B. Basic Concepts

Scala programming

List of libraries Format of code snippets Encapsulation Class constructor template Companion objects versus case classes Enumerations versus case classes Overloading Design template for classifiers Data extraction Data sources Extraction of documents Matrix class

Mathematics

Linear algebra

QR Decomposition LU factorization LDL decomposition Cholesky factorization Singular value decomposition Eigenvalue decomposition Algebraic and numerical libraries

First order predicate logic Jacobian and Hessian matrices Summary of optimization techniques

Gradient descent methods

Steepest descent Conjugate gradient Stochastic gradient descent

Quasi-Newton algorithms

BFGS L-BFGS

Nonlinear least squares minimization

Gauss-Newton Levenberg-Marquardt

Lagrange multipliers

Overview of dynamic programming

Finances 101

Fundamental analysis Technical analysis

Terminology Trading signals and strategy Price patterns

Options trading Financial data sources

Suggested online courses References

C. Bibliography

Index
