Scala: Applied Machine Learning by Bugnion, Pascal

Scala: Applied Machine Learning

Table of Contents Scala: Applied Machine Learning Credits Preface

What this learning path covers What you need for this learning path

Module 1

Installing the JDK Installing and using SBT

Module 2 Module 3

Who this learning path is for Reader feedback Customer support

Downloading the example code Errata Piracy Questions

I. Module 1

1. Scala and Data Science

Data science Programming in data science Why Scala?

Static typing and type inference Scala encourages immutability Scala and functional programs Null pointer uncertainty Easier parallelism Interoperability with Java

When not to use Scala Summary References

2. Manipulating Data with Breeze

Code examples Installing Breeze Getting help on Breeze Basic Breeze data types

Vectors Dense and sparse vectors and the vector trait Matrices Building vectors and matrices Advanced indexing and slicing Mutating vectors and matrices Matrix multiplication, transposition, and the orientation of vectors Data preprocessing and feature engineering Breeze – function optimization Numerical derivatives Regularization

An example – logistic regression Towards re-usable code Alternatives to Breeze Summary References

3. Plotting with breeze-viz

Diving into Breeze Customizing plots Customizing the line type More advanced scatter plots Multi-plot example – scatterplot matrix plots Managing without documentation Breeze-viz reference Data visualization beyond breeze-viz Summary

4. Parallel Collections and Futures

Parallel collections

Limitations of parallel collections Error handling Setting the parallelism level An example – cross-validation with parallel collections

Futures

Future composition – using a future's result Blocking until completion Controlling parallel execution with execution contexts Futures example – stock price fetcher

Summary References

5. Scala and SQL through JDBC

Interacting with JDBC First steps with JDBC

Connecting to a database server Creating tables Inserting data Reading data

JDBC summary Functional wrappers for JDBC Safer JDBC connections with the loan pattern Enriching JDBC statements with the "pimp my library" pattern Wrapping result sets in a stream Looser coupling with type classes

Type classes Coding against type classes When to use type classes Benefits of type classes

Creating a data access layer Summary References

6. Slick – A Functional Interface for SQL

FEC data

Importing Slick Defining the schema Connecting to the database Creating tables Inserting data Querying data

Invokers Operations on columns Aggregations with "Group by" Accessing database metadata Slick versus JDBC Summary References

7. Web APIs

A whirlwind tour of JSON Querying web APIs JSON in Scala – an exercise in pattern matching

JSON4S types Extracting fields using XPath

Extraction using case classes Concurrency and exception handling with futures Authentication – adding HTTP headers

HTTP – a whirlwind overview Adding headers to HTTP requests in Scala

Summary References

8. Scala and MongoDB

MongoDB Connecting to MongoDB with Casbah

Connecting with authentication

Inserting documents Extracting objects from the database Complex queries Casbah query DSL Custom type serialization Beyond Casbah Summary References

9. Concurrency with Akka

GitHub follower graph Actors as people Hello world with Akka Case classes as messages Actor construction Anatomy of an actor Follower network crawler Fetcher actors Routing Message passing between actors Queue control and the pull pattern Accessing the sender of a message Stateful actors Follower network crawler Fault tolerance Custom supervisor strategies Life-cycle hooks What we have not talked about Summary References

10. Distributed Batch Processing with Spark

Installing Spark Acquiring the example data Resilient distributed datasets

RDDs are immutable RDDs are lazy RDDs know their lineage RDDs are resilient RDDs are distributed Transformations and actions on RDDs Persisting RDDs Key-value RDDs Double RDDs

Building and running standalone programs

Running Spark applications locally Reducing logging output and Spark configuration Running Spark applications on EC2

Spam filtering Lifting the hood Data shuffling and partitions Summary Reference

11. Spark SQL and DataFrames

DataFrames – a whirlwind introduction Aggregation operations Joining DataFrames together Custom functions on DataFrames DataFrame immutability and persistence SQL statements on DataFrames Complex data types – arrays, maps, and structs

Structs Arrays Maps

Interacting with data sources

JSON files Parquet files

Standalone programs Summary References

12. Distributed Machine Learning with MLlib

Introducing MLlib – Spam classification Pipeline components

Transformers Estimators

Evaluation Regularization in logistic regression Cross-validation and model selection Beyond logistic regression Summary References

13. Web APIs with Play

Client-server applications Introduction to web frameworks Model-View-Controller architecture Single page applications Building an application The Play framework Dynamic routing Actions

Composing the response Understanding and parsing the request

Interacting with JSON Querying external APIs and consuming JSON

Calling external web services Parsing JSON Asynchronous actions

Creating APIs with Play: a summary Rest APIs: best practice Summary References

14. Visualization with D3 and the Play Framework

GitHub user data Do I need a backend? JavaScript dependencies through web-jars Towards a web application: HTML templates Modular JavaScript through RequireJS Bootstrapping the applications Client-side program architecture

Designing the model The event bus AJAX calls through JQuery Response views

Drawing plots with NVD3 Summary References

A. Pattern Matching and Extractors

Pattern matching in for comprehensions Pattern matching internals Extracting sequences Summary Reference

II. Module 2

1. Getting Started

Mathematical notation for the curious Why machine learning?

Classification Prediction Optimization Regression

Why Scala?

Abstraction

Higher-kind projection Covariant functors for vectors Contravariant functors for co-vectors Monads

Scalability Configurability Maintainability Computation on demand

Model categorization Taxonomy of machine learning algorithms

Unsupervised learning

Clustering Dimension reduction

Supervised learning

Generative models Discriminative models

Semi-supervised learning Reinforcement learning

Don't reinvent the wheel! Tools and frameworks

Java Scala Apache Commons Math

Description Licensing Installation

JFreeChart

Description Licensing Installation

Other libraries and frameworks

Source code

Context versus view bounds Presentation Primitives and implicits

Primitive types Type conversions

Immutability Performance of Scala iterators

Let's kick the tires

An overview of computational workflows Writing a simple workflow

Step 1 – scoping the problem Step 2 – loading data Step 3 – preprocessing the data

Immutable normalization

Step 4 – discovering patterns

Analyzing data Plotting data

Step 5 – implementing the classifier

Selecting an optimizer Training the model Classifying observations

Step 6 – evaluating the model

Summary

2. Hello World!

Modeling

A model by any other name Model versus design Selecting features Extracting features

Defining a methodology Monadic data transformation

Error handling Explicit models Implicit models

A workflow computational model

Supporting mathematical abstractions

Step 1 – variable declaration Step 2 – model definition Step 3 – instantiation

Composing mixins to build a workflow

Understanding the problem Defining modules Instantiating the workflow

Modularization

Profiling data

Immutable statistics Z-Score and Gauss

Assessing a model

Validation

Key quality metrics F-score for binomial classification F-score for multinomial classification

Cross-validation

One-fold cross validation K-fold cross validation

Bias-variance decomposition Overfitting

Summary

3. Data Preprocessing

Time series in Scala

Types and operations The magnet pattern

The transpose operator The differential operator

Lazy views

Moving averages

The simple moving average The weighted moving average The exponential moving average

Fourier analysis

Discrete Fourier transform DFT-based filtering Detection of market cycles

The discrete Kalman filter

The state space estimation

The transition equation The measurement equation

The recursive algorithm

Prediction Correction Kalman smoothing Fixed lag smoothing Experimentation Benefits and drawbacks

Alternative preprocessing techniques Summary

4. Unsupervised Learning

Clustering

K-means clustering

Measuring similarity Defining the algorithm Step 1 – cluster configuration

Defining clusters Initializing clusters

Step 2 – cluster assignment Step 3 – reconstruction/error minimization

Creating K-means components Tail recursive implementation Iterative implementation

Step 4 – classification The curse of dimensionality Setting up the evaluation Evaluating the results Tuning the number of clusters Validation

The expectation-maximization algorithm

Gaussian mixture models Overview of EM Implementation Classification Testing The online EM algorithm

Dimension reduction

Principal components analysis

Algorithm Implementation Test case Evaluation

Non-linear models

Kernel PCA Manifolds

Performance considerations

K-means EM PCA

Summary

5. Naïve Bayes Classifiers

Probabilistic graphical models Naïve Bayes classifiers

Introducing the multinomial Naïve Bayes

Formalism The frequentist perspective The predictive model The zero-frequency problem

Implementation

Design Training

Class likelihood Binomial model The multinomial model Classifier components

Classification F1 validation Feature extraction Testing

The Multivariate Bernoulli classification

Model Implementation

Naïve Bayes and text mining

Basics of information retrieval Implementation

Analyzing documents Extracting the frequency of relative terms Generating the features

Testing

Retrieving the textual information Evaluating the text mining classifier

Pros and cons Summary

6. Regression and Regularization

Linear regression

One-variate linear regression

Implementation Test case

Ordinary least squares regression

Design Implementation Test case 1 – trending Test case 2 – feature selection

Regularization

Ln roughness penalty Ridge regression

Design Implementation Test case

Numerical optimization Logistic regression

Logistic function Binomial classification Design The training workflow

Step 1 – configuring the optimizer Step 2 – computing the Jacobian matrix Step 3 – managing the convergence of the optimizer Step 4 – defining the least squares problem Step 5 – minimizing the sum of square errors Test

Classification

Summary

7. Sequential Data Models

Markov decision processes

The Markov property The first order discrete Markov chain

The hidden Markov model

Notations The lambda model Design Evaluation – CF-1

Alpha – the forward pass Beta – the backward pass

Training – CF-2

The Baum-Welch estimator (EM)

Decoding – CF-3

The Viterbi algorithm

Putting it all together Test case 1 – training Test case 2 – evaluation HMM as a filtering technique

Conditional random fields

Introduction to CRF Linear chain CRF

Regularized CRFs and text analytics

The feature functions model Design Implementation

Configuring the CRF classifier Training the CRF model Applying the CRF model

Tests

The training convergence profile Impact of the size of the training set Impact of the L2 regularization factor

Comparing CRF and HMM Performance consideration Summary

8. Kernel Models and Support Vector Machines

Kernel functions

An overview Common discriminative kernels Kernel monadic composition

Support vector machines

The linear SVM

The separable case – the hard margin The nonseparable case – the soft margin

The nonlinear SVM

Max-margin classification The kernel trick

Support vector classifiers – SVC

The binary SVC

LIBSVM Design Configuration parameters

The SVM formulation The SVM kernel function The SVM execution

Interface to LIBSVM Training Classification C-penalty and margin Kernel evaluation Applications in risk analysis

Anomaly detection with one-class SVC Support vector regression

An overview SVR versus linear regression

Performance considerations Summary

9. Artificial Neural Networks

Feed-forward neural networks

The biological background Mathematical background

The multilayer perceptron

The activation function The network topology Design Configuration Network components

The network topology Input and hidden layers The output layer Synapses Connections The initialization weights

The model Problem types (modes) Online training versus batch training The training epoch

Step 1 – input forward propagation

The computational flow Error functions Operating modes Softmax

Step 2 – error backpropagation

Weights' adjustment The error propagation The computational model

Step 3 – exit condition Putting it all together

Training and classification

Regularization The model generation The Fast Fisher-Yates shuffle Prediction Model fitness

Evaluation

The execution profile Impact of the learning rate The impact of the momentum factor The impact of the number of hidden layers Test case

Implementation Evaluation of models Impact of the hidden layers' architecture

Convolution neural networks

Local receptive fields Sharing of weights Convolution layers Subsampling layers Putting it all together

Benefits and limitations Summary

10. Genetic Algorithms

Evolution

The origin NP problems Evolutionary computing

Genetic algorithms and machine learning Genetic algorithm components

Encoding

Value encoding Predicate encoding Solution encoding The encoding scheme

Flat encoding Hierarchical encoding

Genetic operators

Selection Crossover Mutation

The fitness score

Implementation

Software design Key components

Population Chromosomes Genes

Selection Controlling the population growth The GA configuration Crossover

Population Chromosomes Genes

Mutation

Population Chromosomes Genes

Reproduction Solver

GA for trading strategies

Definition of trading strategies

Trading operators The cost function Trading signals Trading strategies Trading signal encoding

A test case

Creating trading strategies Configuring the optimizer Finding the best trading strategy Tests

The weighted score The unweighted score

Advantages and risks of genetic algorithms Summary

11. Reinforcement Learning

Reinforcement learning

The problem A solution – Q-learning

Terminology Concepts Value of a policy The Bellman optimality equations Temporal difference for model-free learning Action-value iterative update

Implementation

Software design The states and actions The search space The policy and action-value The Q-learning components The Q-learning training Tail recursion to the rescue The validation The prediction

Option trading using Q-learning

The OptionProperty class The OptionModel class Quantization

Putting it all together Evaluation Pros and cons of reinforcement learning

Learning classifier systems

Introduction to LCS Why LCS? Terminology Extended learning classifier systems XCS components

Application to portfolio management The XCS core data XCS rules Covering An implementation example

Benefits and limitations of learning classifier systems

Summary

12. Scalable Frameworks

An overview Scala

Object creation Streams Parallel collections

Processing a parallel collection The benchmark framework Performance evaluation

Scalability with Actors

The Actor model Partitioning Beyond actors – reactive programming

Akka

Master-workers

Exchange of messages Worker actors The workflow controller The master actor Master with routing Distributed discrete Fourier transform Limitations

Futures

The Actor life cycle Blocking on futures Handling future callbacks Putting it all together

Apache Spark

Why Spark? Design principles

In-memory persistency Laziness Transforms and actions Shared variables

Experimenting with Spark

Deploying Spark Using Spark shell MLlib RDD generation K-means using Spark

Performance evaluation

Tuning parameters Tests Performance considerations

Pros and cons 0xdata Sparkling Water

Summary

A. Basic Concepts

Scala programming

List of libraries and tools Code snippets format Best practices

Encapsulation Class constructor template Companion objects versus case classes Enumerations versus case classes Overloading Design template for immutable classifiers

Utility classes

Data extraction Data sources Extraction of documents DMatrix class Counter Monitor

Mathematics

Linear algebra

QR decomposition LU factorization LDL decomposition Cholesky factorization Singular Value Decomposition Eigenvalue decomposition Algebraic and numerical libraries

First order predicate logic Jacobian and Hessian matrices Summary of optimization techniques

Gradient descent methods

Steepest descent Conjugate gradient Stochastic gradient descent

Quasi-Newton algorithms

BFGS L-BFGS

Nonlinear least squares minimization

Gauss-Newton Levenberg-Marquardt

Lagrange multipliers

Overview of dynamic programming

Finances 101

Fundamental analysis Technical analysis

Terminology Trading data Trading signals and strategy Price patterns

Options trading Financial data sources

Suggested online courses References

III. Module 3

1. Exploratory Data Analysis

Getting started with Scala Distinct values of a categorical field Summarization of a numeric field

Grepping across multiple fields

Basic, stratified, and consistent sampling Working with Scala and Spark Notebooks Basic correlations Summary

2. Data Pipelines and Modeling

Influence diagrams Sequential trials and dealing with risk Exploration and exploitation Unknown unknowns Basic components of a data-driven system

Data ingest Data transformation layer Data analytics and machine learning UI component Actions engine Correlation engine Monitoring

Optimization and interactivity

Feedback loops

Summary

3. Working with Spark and MLlib

Setting up Spark Understanding Spark architecture

Task scheduling Spark components MQTT, ZeroMQ, Flume, and Kafka HDFS, Cassandra, S3, and Tachyon Mesos, YARN, and Standalone

Applications

Word count Streaming word count Spark SQL and DataFrame

ML libraries

SparkR Graph algorithms – GraphX and GraphFrames

Spark performance tuning Running Hadoop HDFS Summary

4. Supervised and Unsupervised Learning

Records and supervised learning

Iris dataset Labeled point SVMWithSGD Logistic regression Decision tree Bagging and boosting – ensemble learning methods

Unsupervised learning Problem dimensionality Summary

5. Regression and Classification

What regression stands for? Continuous space and metrics Linear regression Logistic regression Regularization Multivariate regression Heteroscedasticity Regression trees Classification metrics Multiclass problems Perceptron Generalization error and overfitting Summary

6. Working with Unstructured Data

Nested data Other serialization formats Hive and Impala Sessionization Working with traits Working with pattern matching Other uses of unstructured data Probabilistic structures Projections Summary

7. Working with Graph Algorithms

A quick introduction to graphs SBT Graph for Scala

Adding nodes and edges Graph constraints JSON

GraphX

Who is getting e-mails? Connected components Triangle counting Strongly connected components PageRank SVD++

Summary

8. Integrating Scala with R and Python

Integrating with R

Setting up R and SparkR

Linux Mac OS Windows Running SparkR via scripts Running Spark via R's command line

DataFrames Linear models Generalized linear model Reading JSON files in SparkR Writing Parquet files in SparkR Invoking Scala from R

Using Rserve

Integrating with Python

Setting up Python PySpark Calling Python from Java/Scala

Using sys.process._ Spark pipe Jython and JSR 223

Summary

9. NLP in Scala

Text analysis pipeline

Simple text analysis

MLlib algorithms in Spark

TF-IDF LDA

Segmentation, annotation, and chunking POS tagging Using word2vec to find word relationships

A Porter Stemmer implementation of the code

Summary

10. Advanced Model Monitoring

System monitoring Process monitoring Model monitoring

Performance over time Criteria for model retiring A/B testing

Summary

A. Bibliography Index
