Index
Scala: Guide for Data Science Professionals
Table of Contents Scala: Guide for Data Science Professionals Credits Preface
What this learning path covers What you need for this learning path Who this learning path is for Reader feedback Customer support
Downloading the example code Errata Piracy Questions
I. Module 1
1. Scala and Data Science
Data science Programming in data science Why Scala?
Static typing and type inference Scala encourages immutability Scala and functional programs Null pointer uncertainty Easier parallelism Interoperability with Java
When not to use Scala Summary References
2. Manipulating Data with Breeze
Code examples Installing Breeze Getting help on Breeze Basic Breeze data types
Vectors Dense and sparse vectors and the vector trait Matrices Building vectors and matrices Advanced indexing and slicing Mutating vectors and matrices Matrix multiplication, transposition, and the orientation of vectors Data preprocessing and feature engineering Breeze – function optimization Numerical derivatives Regularization
An example – logistic regression Towards re-usable code Alternatives to Breeze Summary References
3. Plotting with breeze-viz
Diving into Breeze Customizing plots Customizing the line type More advanced scatter plots Multi-plot example – scatterplot matrix plots Managing without documentation Breeze-viz reference Data visualization beyond breeze-viz Summary
4. Parallel Collections and Futures
Parallel collections
Limitations of parallel collections Error handling Setting the parallelism level An example – cross-validation with parallel collections
Futures
Future composition – using a future's result Blocking until completion Controlling parallel execution with execution contexts Futures example – stock price fetcher
Summary References
5. Scala and SQL through JDBC
Interacting with JDBC First steps with JDBC
Connecting to a database server Creating tables Inserting data Reading data
JDBC summary Functional wrappers for JDBC Safer JDBC connections with the loan pattern Enriching JDBC statements with the "pimp my library" pattern Wrapping result sets in a stream Looser coupling with type classes
Type classes Coding against type classes When to use type classes Benefits of type classes
Creating a data access layer Summary References
6. Slick – A Functional Interface for SQL
FEC data
Importing Slick Defining the schema Connecting to the database Creating tables Inserting data Querying data
Invokers Operations on columns Aggregations with "Group by" Accessing database metadata Slick versus JDBC Summary References
7. Web APIs
A whirlwind tour of JSON Querying web APIs JSON in Scala – an exercise in pattern matching
JSON4S types Extracting fields using XPath
Extraction using case classes Concurrency and exception handling with futures Authentication – adding HTTP headers
HTTP – a whirlwind overview Adding headers to HTTP requests in Scala
Summary References
8. Scala and MongoDB
MongoDB Connecting to MongoDB with Casbah
Connecting with authentication
Inserting documents Extracting objects from the database Complex queries Casbah query DSL Custom type serialization Beyond Casbah Summary References
9. Concurrency with Akka
GitHub follower graph Actors as people Hello world with Akka Case classes as messages Actor construction Anatomy of an actor Follower network crawler Fetcher actors Routing Message passing between actors Queue control and the pull pattern Accessing the sender of a message Stateful actors Follower network crawler Fault tolerance Custom supervisor strategies Life-cycle hooks What we have not talked about Summary References
10. Distributed Batch Processing with Spark
Installing Spark Acquiring the example data Resilient distributed datasets
RDDs are immutable RDDs are lazy RDDs know their lineage RDDs are resilient RDDs are distributed Transformations and actions on RDDs Persisting RDDs Key-value RDDs Double RDDs
Building and running standalone programs
Running Spark applications locally Reducing logging output and Spark configuration Running Spark applications on EC2
Spam filtering Lifting the hood Data shuffling and partitions Summary Reference
11. Spark SQL and DataFrames
DataFrames – a whirlwind introduction Aggregation operations Joining DataFrames together Custom functions on DataFrames DataFrame immutability and persistence SQL statements on DataFrames Complex data types – arrays, maps, and structs
Structs Arrays Maps
Interacting with data sources
JSON files Parquet files
Standalone programs Summary References
12. Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification Pipeline components
Transformers Estimators
Evaluation Regularization in logistic regression Cross-validation and model selection Beyond logistic regression Summary References
13. Web APIs with Play
Client-server applications Introduction to web frameworks Model-View-Controller architecture Single page applications Building an application The Play framework Dynamic routing Actions
Composing the response Understanding and parsing the request
Interacting with JSON Querying external APIs and consuming JSON
Calling external web services Parsing JSON Asynchronous actions
Creating APIs with Play: a summary Rest APIs: best practice Summary References
14. Visualization with D3 and the Play Framework
GitHub user data Do I need a backend? JavaScript dependencies through web-jars Towards a web application: HTML templates Modular JavaScript through RequireJS Bootstrapping the applications Client-side program architecture
Designing the model The event bus AJAX calls through JQuery Response views
Drawing plots with NVD3 Summary References
A. Pattern Matching and Extractors
Pattern matching in for comprehensions Pattern matching internals Extracting sequences Summary Reference
II. Module 2
1. Getting Started with Breeze
Introduction Getting Breeze – the linear algebra library
How to do it... There's more...
The org.scalanlp.breeze dependency The org.scalanlp.breeze-natives package
Working with vectors
Getting ready How to do it...
Creating vectors Constructing a vector from values
Creating a zero vector
Creating a vector out of a function Creating a vector of linearly spaced values Creating a vector with values in a specific range Creating an entire vector with a single value Slicing a sub-vector from a bigger vector Creating a Breeze Vector from a Scala Vector Vector arithmetic Scalar operations Calculating the dot product of two vectors Creating a new vector by adding two vectors together Appending vectors and converting a vector of one type to another Concatenating two vectors
Converting a vector of Int to a vector of Double Computing basic statistics Mean and variance
Standard deviation Find the largest value in a vector Finding the sum, square root and log of all the values in the vector
The Sqrt function The Log function
Working with matrices
How to do it...
Creating matrices
Creating a matrix from values Creating a zero matrix Creating a matrix out of a function Creating an identity matrix Creating a matrix from random numbers Creating from a Scala collection
Matrix arithmetic
Addition Multiplication
Appending and conversion
Concatenating matrices – vertically Concatenating matrices – horizontally Converting a matrix of Int to a matrix of Double
Data manipulation operations
Getting column vectors out of the matrix Getting row vectors out of the matrix Getting values inside the matrix Getting the inverse and transpose of a matrix
Computing basic statistics
Mean and variance Standard deviation Finding the largest value in a matrix Finding the sum, square root and log of all the values in the matrix Sqrt Log Calculating the eigenvectors and eigenvalues of a matrix
How it works...
Vectors and matrices with randomly distributed values
How it works...
Creating vectors with uniformly distributed random values Creating vectors with normally distributed random values Creating vectors with random values that have a Poisson distribution Creating a matrix with uniformly random values Creating a matrix with normally distributed random values Creating a matrix with random values that has a Poisson distribution
Reading and writing CSV files
How it works...
2. Getting Started with Apache Spark DataFrames
Introduction Getting Apache Spark
How to do it...
Creating a DataFrame from CSV
How to do it... How it works... There's more…
Manipulating DataFrames
How to do it...
Printing the schema of the DataFrame Sampling the data in the DataFrame Selecting DataFrame columns Filtering data by condition Sorting data in the frame Renaming columns Treating the DataFrame as a relational table Joining two DataFrames
Inner join Right outer join Left outer join
Saving the DataFrame as a file
Creating a DataFrame from Scala case classes
How to do it... How it works...
3. Loading and Preparing Data – DataFrame
Introduction Loading more than 22 features into classes
How to do it... How it works... There's more…
Loading JSON into DataFrames
How to do it…
Reading a JSON file using SQLContext.jsonFile Reading a text file and converting it to JSON RDD Explicitly specifying your schema
There's more…
Storing data as Parquet files
How to do it…
Load a simple CSV file, convert it to case classes, and create a DataFrame from it Save it as a Parquet file Install Parquet tools Using the tools to inspect the Parquet file Enable compression for the Parquet file
Using the Avro data model in Parquet
How to do it…
Creation of the Avro model Generation of Avro objects using the sbt-avro plugin Constructing an RDD of our generated object from Students.csv Saving RDD[StudentAvro] in a Parquet file Reading the file back for verification Using Parquet tools for verification
Loading from RDBMS
How to do it…
Preparing data in Dataframes
How to do it...
4. Data Visualization
Introduction Visualizing using Zeppelin
How to do it...
Installing Zeppelin Customizing Zeppelin's server and websocket port Visualizing data on HDFS – parameterizing inputs Running custom functions Adding external dependencies to Zeppelin Pointing to an external Spark cluster
Creating scatter plots with Bokeh-Scala
How to do it...
Preparing our data Creating Plot and Document objects Creating a marker object Setting the X and Y axes' data range for the plot Drawing the x and the y axes Viewing flower species with varying colors Adding grid lines Adding a legend to the plot
Creating a time series MultiPlot with Bokeh-Scala
How to do it...
Preparing our data Creating a plot Creating a line that joins all the data points Setting the x and y axes' data range for the plot Drawing the axes and the grids Adding tools Adding a legend to the plot Multiple plots in the document
5. Learning from Data
Introduction Supervised and unsupervised learning Gradient descent Predicting continuous values using linear regression
How to do it...
Importing the data Converting each instance into a LabeledPoint Preparing the training and test data Scaling the features Training the model Predicting against test data Evaluating the model Regularizing the parameters Mini batching
Binary classification using LogisticRegression and SVM
How to do it...
Importing the data Tokenizing the data and converting it into LabeledPoints Factoring the inverse document frequency Prepare the training and test data Constructing the algorithm Training the model and predicting the test data Evaluating the model
Binary classification using LogisticRegression with Pipeline API
How to do it...
Importing and splitting data as test and training sets Construct the participants of the Pipeline Preparing a pipeline and training a model Predicting against test data Evaluating a model without cross-validation Constructing parameters for cross-validation Constructing cross-validator and fit the best model Evaluating the model with cross-validation
Clustering using K-means
How to do it...
KMeans.RANDOM KMeans.PARALLEL
K-means++ K-means||
Max iterations Epsilon Importing the data and converting it into a vector Feature scaling the data Deriving the number of clusters Constructing the model Evaluating the model
Feature reduction using principal component analysis
How to do it...
Dimensionality reduction of data for supervised learning Mean-normalizing the training data Extracting the principal components Preparing the labeled data Preparing the test data Classify and evaluate the metrics Dimensionality reduction of data for unsupervised learning Mean-normalizing the training data Extracting the principal components Arriving at the number of components Evaluating the metrics
6. Scaling Up
Introduction Building the Uber JAR
How to do it...
Transitive dependency stated explicitly in the SBT dependency
Two different libraries depend on the same external library
Submitting jobs to the Spark cluster (local)
How to do it...
Downloading Spark Running HDFS on Pseudo-clustered mode Running the Spark master and slave locally Pushing data into HDFS Submitting the Spark application on the cluster
Running the Spark Standalone cluster on EC2
How to do it...
Creating the AccessKey and pem file Setting the environment variables Running the launch script Verifying installation Making changes to the code Transferring the data and job files Loading the dataset into HDFS Running the job Destroying the cluster
Running the Spark Job on Mesos (local)
How to do it...
Installing Mesos Starting the Mesos master and slave Uploading the Spark binary package and the dataset to HDFS Running the job
Running the Spark Job on YARN (local)
How to do it...
Installing the Hadoop cluster Starting HDFS and YARN Pushing Spark assembly and dataset to HDFS Running a Spark job in yarn-client mode Running Spark job in yarn-cluster mode
7. Going Further
Introduction Using Spark Streaming to subscribe to a Twitter stream
How to do it...
Using Spark as an ETL tool
How to do it...
Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
How to do it...
Using GraphX to analyze Twitter data
How to do it...
III. Module 3
1. Getting Started
Mathematical notation for the curious Why machine learning?
Classification Prediction Optimization Regression
Why Scala?
Abstraction Scalability Configurability Maintainability Computation on demand
Model categorization Taxonomy of machine learning algorithms
Unsupervised learning
Clustering Dimension reduction
Supervised learning
Generative models Discriminative models
Reinforcement learning
Tools and frameworks
Java Scala Apache Commons Math
Description Licensing Installation
JFreeChart
Description Licensing Installation
Other libraries and frameworks
Source code
Context versus view bounds Presentation Primitives and implicits
Primitive types Type conversions Operators
Immutability Performance of Scala iterators
Let's kick the tires
Overview of computational workflows Writing a simple workflow
Selecting a dataset Loading the dataset Preprocessing the dataset
Basic statistics Normalization and Gauss distribution Plotting data
Creating a model (learning) Classify the data
Summary
2. Hello World!
Modeling
A model by any other name Model versus design Selecting a model's features Extracting features
Designing a workflow
The computational framework The pipe operator Monadic data transformation Dependency injection Workflow modules The workflow factory Examples of workflow components
The preprocessing module The clustering module
Assessing a model
Validation
Key metrics Implementation
K-fold cross-validation Bias-variance decomposition Overfitting
Summary
3. Data Preprocessing
Time series Moving averages
The simple moving average The weighted moving average The exponential moving average
Fourier analysis
Discrete Fourier transform (DFT) DFT-based filtering Detection of market cycles
The Kalman filter
The state space estimation
The transition equation The measurement equation
The recursive algorithm
Prediction Correction Kalman smoothing Experimentation
Alternative preprocessing techniques Summary
4. Unsupervised Learning
Clustering
K-means clustering
Measuring similarity Overview of the K-means algorithm Step 1 – cluster configuration
Defining clusters Defining K-means Initializing clusters
Step 2 – cluster assignment Step 3 – iterative reconstruction Curse of dimensionality Experiment Tuning the number of clusters Validation
Expectation-maximization (EM) algorithm
Gaussian mixture model EM overview Implementation Testing Online EM
Dimension reduction
Principal components analysis (PCA)
Algorithm Implementation Test case Evaluation
Other dimension reduction techniques
Performance considerations
K-means EM PCA
Summary
5. Naïve Bayes Classifiers
Probabilistic graphical models Naïve Bayes classifiers
Introducing the multinomial Naïve Bayes
Formalism The frequentist perspective The predictive model The zero-frequency problem
Implementation
Software design Training Classification Labeling Results
Multivariate Bernoulli classification
Model Implementation
Naïve Bayes and text mining
Basics of information retrieval Implementation
Extraction of terms Scoring of terms
Testing
Retrieving textual information Evaluation
Pros and cons Summary
6. Regression and Regularization
Linear regression
One-variate linear regression
Implementation Test case
Ordinary least squares (OLS) regression
Design Implementation Test case 1 – trending Test case 2 – features selection
Regularization
Ln roughness penalty The ridge regression
Implementation The test case
Numerical optimization The logistic regression
The logit function Binomial classification Software design The training workflow
Configuring the least squares optimizer Computing the Jacobian matrix Defining the exit conditions Defining the least squares problem Minimizing the loss function Test
Classification
Summary
7. Sequential Data Models
Markov decision processes
The Markov property The first-order discrete Markov chain
The hidden Markov model (HMM)
Notation The lambda model HMM execution state Evaluation (CF-1)
Alpha class (the forward variable) Beta class (the backward variable)
Training (CF-2)
Baum-Welch estimator (EM)
Decoding (CF-3)
The Viterbi algorithm
Putting it all together Test case The hidden Markov model for time series analysis
Conditional random fields
Introduction to CRF Linear chain CRF
CRF and text analytics
The feature functions model Software design Implementation
Building the training set Generating tags Extracting data sequences CRF control parameters Putting it all together
Tests
The training convergence profile Impact of the size of the training set Impact of the L2 regularization factor
Comparing CRF and HMM Performance consideration Summary
8. Kernel Models and Support Vector Machines
Kernel functions
Overview Common discriminative kernels
The support vector machine (SVM)
The linear SVM
The separable case (hard margin) The nonseparable case (soft margin)
The nonlinear SVM
Max-margin classification The kernel trick
Support vector classifier (SVC)
The binary SVC
LIBSVM Software design Configuration parameters
SVM Formulation The SVM kernel function SVM execution
SVM implementation C-penalty and margin Kernel evaluation Application to risk analysis
Features and labels
Anomaly detection with one-class SVC Support vector regression (SVR)
Overview SVR versus linear regression
Performance considerations Summary
9. Artificial Neural Networks
Feed-forward neural networks (FFNN)
The Biological background The mathematical background
The multilayer perceptron (MLP)
The activation function The network architecture Software design Model definition
Layers Synapses Connections
Training cycle/epoch
Step 1 – input forward propagation
The computational model Objective Softmax
Step 2 – sum of squared errors Step 3 – error backpropagation
Error propagation The computational model
Step 4 – synapse/weights adjustment
Momentum factor for gradient descent Implementation
Step 5 – convergence criteria Configuration Putting all together
Training strategies and classification
Online versus batch training Regularization Model instantiation Prediction
Evaluation
Impact of learning rate Impact of the momentum factor Test case
Implementation Models evaluation Impact of hidden layers architecture
Benefits and limitations Summary
10. Genetic Algorithms
Evolution
The origin NP problems Evolutionary computing
Genetic algorithms and machine learning Genetic algorithm components
Encodings
Value encoding Predicate encoding Solution encoding The encoding scheme
Flat encoding Hierarchical encoding
Genetic operators
Selection Crossover Mutation
Fitness score
Implementation
Software design Key components Selection Controlling population growth GA configuration Crossover
Population Chromosomes Genes
Mutation
Population Chromosomes Genes
The reproduction cycle
GA for trading strategies
Definition of trading strategies
Trading operators The cost/unfitness function Trading signals Trading strategies Signal encoding
Test case
Data extraction Initial population Configuration GA instantiation GA execution Tests
The unweighted score The weighted score
Advantages and risks of genetic algorithms Summary
11. Reinforcement Learning
Introduction
The problem A solution – Q-learning
Terminology Concept Value of policy Bellman optimality equations Temporal difference for model-free learning Action-value iterative update
Implementation
Software design States and actions Search space Policy and action-value The Q-learning training Tail recursion to the rescue Prediction
Option trading using Q-learning
Option property Option model Function approximation Constrained state-transition Putting it all together
Evaluation Pros and cons of reinforcement learning
Learning classifier systems
Introduction to LCS Why LCS Terminology Extended learning classifier systems (XCS) XCS components
Application to portfolio management XCS core data XCS rules Covering Example of implementation
Benefits and limitation of learning classifier systems
Summary
12. Scalable Frameworks
Overview Scala
Controlling object creation Parallel collections
Processing a parallel collection Benchmark framework Performance evaluation
Scalability with Actors
The Actor model Partitioning Beyond actors – reactive programming
Akka
Master-workers
Messages exchange Worker actors The workflow controller The master Actor Master with routing Distributed discrete Fourier transform Limitations
Futures
The Actor life cycle Blocking on futures Handling future callbacks Putting all together
Apache Spark
Why Spark Design principles
In-memory persistency Laziness Transforms and Actions Shared variables
Experimenting with Spark
Deploying Spark Using Spark shell MLlib RDD generation K-means using Spark
Performance evaluation
Tuning parameters Tests Performance considerations
Pros and cons 0xdata Sparkling Water
Summary
B. Basic Concepts
Scala programming
List of libraries Format of code snippets Encapsulation Class constructor template Companion objects versus case classes Enumerations versus case classes Overloading Design template for classifiers Data extraction Data sources Extraction of documents Matrix class
Mathematics
Linear algebra
QR Decomposition LU factorization LDL decomposition Cholesky factorization Singular value decomposition Eigenvalue decomposition Algebraic and numerical libraries
First order predicate logic Jacobian and Hessian matrices Summary of optimization techniques
Gradient descent methods
Steepest descent Conjugate gradient Stochastic gradient descent
Quasi-Newton algorithms
BFGS L-BFGS
Nonlinear least squares minimization
Gauss-Newton Levenberg-Marquardt
Lagrange multipliers
Overview of dynamic programming
Finances 101
Fundamental analysis Technical analysis
Terminology Trading signals and strategy Price patterns
Options trading Financial data sources
Suggested online courses References
C. Bibliography
Index