Practical Machine Learning
Table of Contents
Practical Machine Learning
Credits
Foreword
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Introduction to Machine learning
Machine learning
Definition
Core Concepts and Terminology
What is learning?
Data
Labeled and unlabeled data
Tasks
Algorithms
Models
Logical models
Geometric models
Probabilistic models
Data and inconsistencies in Machine learning
Under-fitting
Over-fitting
Data instability
Unpredictable data formats
Practical Machine learning examples
Types of learning problems
Classification
Clustering
Forecasting, prediction or regression
Simulation
Optimization
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Deep learning
Performance measures
Is the solution good?
Mean squared error (MSE)
Mean absolute error (MAE)
Normalized MSE and MAE (NMSE and NMAE)
Solving the errors: bias and variance
Some complementing fields of Machine learning
Data mining
Artificial intelligence (AI)
Statistical learning
Data science
Machine learning process lifecycle and solution architecture
Machine learning algorithms
Decision tree based algorithms
Bayesian method based algorithms
Kernel method based algorithms
Clustering methods
Artificial neural networks (ANN)
Dimensionality reduction
Ensemble methods
Instance based learning algorithms
Regression analysis based algorithms
Association rule based learning algorithms
Machine learning tools and frameworks
Summary
2. Machine learning and Large-scale datasets
Big data and the context of large-scale Machine learning
Functional versus Structural – A methodological mismatch
Commoditizing information
Theoretical limitations of RDBMS
Scaling-up versus Scaling-out storage
Distributed and parallel computing strategies
Machine learning: Scalability and Performance
Too many data points or instances
Too many attributes or features
Shrinking response time windows – need for real-time responses
Highly complex algorithm
Feed forward, iterative prediction cycles
Model selection process
Potential issues in large-scale Machine learning
Algorithms and Concurrency
Developing concurrent algorithms
Technology and implementation options for scaling-up Machine learning
MapReduce programming paradigm
High Performance Computing (HPC) with Message Passing Interface (MPI)
Language Integrated Queries (LINQ) framework
Manipulating datasets with LINQ
Graphics Processing Unit (GPU)
Field Programmable Gate Array (FPGA)
Multicore or multiprocessor systems
Summary
3. An Introduction to Hadoop's Architecture and Ecosystem
Introduction to Apache Hadoop
Evolution of Hadoop (the platform of choice)
Hadoop and its core elements
Machine learning solution architecture for big data (employing Hadoop)
The Data Source layer
The Ingestion layer
The Hadoop Storage layer
The Hadoop (Physical) Infrastructure layer – supporting appliance
Hadoop platform / Processing layer
The Analytics layer
The Consumption layer
Explaining and exploring data with Visualizations
Security and Monitoring layer
Hadoop core components framework
Hadoop Distributed File System (HDFS)
Secondary Namenode and Checkpoint process
Splitting large data files
Block loading to the cluster and replication
Writing to and reading from HDFS
Handling failures
HDFS command line
RESTful HDFS
MapReduce
MapReduce architecture
What makes MapReduce cater to the needs of large datasets?
MapReduce execution flow and components
Developing MapReduce components
InputFormat
OutputFormat
Mapper implementation
Hadoop 2.x
Hadoop ecosystem components
Hadoop installation and setup
Installing JDK 1.7
Creating a system user for Hadoop (dedicated)
Disable IPv6
Steps for installing Hadoop 2.6.0
Starting Hadoop
Hadoop distributions and vendors
Summary
4. Machine Learning Tools, Libraries, and Frameworks
Machine learning tools – A landscape
Apache Mahout
How does Mahout work?
Installing and setting up Apache Mahout
Setting up Maven
Setting up Apache Mahout using Eclipse IDE
Setting up Apache Mahout without Eclipse
Mahout Packages
Implementing vectors in Mahout
R
Installing and setting up R
Integrating R with Apache Hadoop
Approach 1 – Using R and Streaming APIs in Hadoop
Approach 2 – Using the Rhipe package of R
Approach 3 – Using RHadoop
Summary of R/Hadoop integration approaches
Implementing in R (using examples)
R Expressions
Assignments
Functions
R Vectors
Assigning, accessing, and manipulating vectors
R Matrices
R Factors
R Data Frames
R Statistical frameworks
Julia
Installing and setting up Julia
Downloading and using the command line version of Julia
Using Juno IDE for running Julia
Using Julia via the browser
Running the Julia code from the command line
Implementing in Julia (with examples)
Using variables and assignments
Numeric primitives
Data structures
Working with Strings and String manipulations
Packages
Interoperability
Integrating with C
Integrating with Python
Integrating with MATLAB
Graphics and plotting
Benefits of adopting Julia
Integrating Julia and Hadoop
Python
Toolkit options in Python
Implementation of Python (using examples)
Installing Python and setting up scikit-learn
Loading data
Apache Spark
Scala
Programming with Resilient Distributed Datasets (RDD)
Spring XD
Summary
5. Decision Tree based learning
Decision trees
Terminology
Purpose and uses
Constructing a Decision tree
Handling missing values
Considerations for constructing Decision trees
Choosing the appropriate attribute(s)
Information gain and Entropy
Gini index
Gain ratio
Termination Criteria / Pruning Decision trees
Decision trees in a graphical representation
Inducing Decision trees – Decision tree algorithms
CART
C4.5
Greedy Decision trees
Benefits of Decision trees
Specialized trees
Oblique trees
Random forests
Evolutionary trees
Hellinger trees
Implementing Decision trees
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
6. Instance and Kernel Methods Based Learning
Instance-based learning (IBL)
Nearest Neighbors
Value of k in KNN
Distance measures in KNN
Euclidean distance
Hamming distance
Minkowski distance
Case-based reasoning (CBR)
Locally weighted regression (LWR)
Implementing KNN
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Kernel methods-based learning
Kernel functions
Support Vector Machines (SVM)
Inseparable Data
Implementing SVM
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
7. Association Rules based learning
Association rules based learning
Association rule – a definition
Apriori algorithm
Rule generation strategy
Rules for defining appropriate minsup
Apriori – the downside
FP-growth algorithm
Apriori versus FP-growth
Implementing Apriori and FP-growth
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
8. Clustering based learning
Clustering-based learning
Types of clustering
Hierarchical clustering
Partitional clustering
The k-means clustering algorithm
Convergence or stopping criteria for the k-means clustering
K-means clustering on disk
Advantages of the k-means approach
Disadvantages of the k-means algorithm
Distance measures
Complexity measures
Implementing k-means clustering
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
9. Bayesian learning
Bayesian learning
Statistician's thinking
Important terms and definitions
Probability
Types of events
Mutually exclusive or disjoint events
Independent events
Dependent events
Types of probability
Distribution
Bernoulli distribution
Binomial distribution
Poisson probability distribution
Exponential distribution
Normal distribution
Relationship between the distributions
Bayes' theorem
Naïve Bayes classifier
Multinomial Naïve Bayes classifier
The Bernoulli Naïve Bayes classifier
Implementing Naïve Bayes algorithm
Using Mahout
Using R
Using Spark
Using scikit-learn
Using Julia
Summary
10. Regression based learning
Regression analysis
Revisiting statistics
Properties of expectation, variance, and covariance
Properties of variance
Properties of covariance
Example
ANOVA and F Statistics
Confounding
Effect modification
Regression methods
Simple regression or simple linear regression
Multiple regression
Polynomial (non-linear) regression
Generalized Linear Models (GLM)
Logistic regression (logit link)
Odds ratio in logistic regression
Model
Poisson regression
Implementing linear and logistic regression
Using Mahout
Using R
Using Spark
Using scikit-learn
Using Julia
Summary
11. Deep learning
Background
The human brain
Neural networks
Neuron
Synapses
Artificial neurons or perceptrons
Linear neurons
Rectified linear neurons / linear threshold neurons
Binary threshold neurons
Sigmoid neurons
Stochastic binary neurons
Neural Network size
An example
Neural network types
Multilayer fully connected feedforward networks or Multilayer Perceptrons (MLP)
Jordan networks
Elman networks
Radial Basis Function (RBF) networks
Hopfield networks
Dynamic Learning Vector Quantization (DLVQ) networks
Gradient descent method
Backpropagation algorithm
Softmax regression technique
Deep learning taxonomy
Convolutional neural networks (CNN/ConvNets)
Convolutional layer (CONV)
Pooling layer (POOL)
Fully connected layer (FC)
Recurrent Neural Networks (RNNs)
Restricted Boltzmann Machines (RBMs)
Deep Boltzmann Machines (DBMs)
Autoencoders
Implementing ANNs and Deep learning methods
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
12. Reinforcement learning
Reinforcement Learning (RL)
The context of Reinforcement Learning
Examples of Reinforcement Learning
Evaluative Feedback
n-Armed Bandit problem
Action-value methods
Reinforcement comparison methods
The Reinforcement Learning problem – the world grid example
Markov Decision Process (MDP)
Basic RL model – agent-environment interface
Delayed rewards
The policy
Reinforcement Learning – key features
Reinforcement learning solution methods
Dynamic Programming (DP)
Generalized Policy Iteration (GPI)
Monte Carlo methods
Temporal difference (TD) learning
Sarsa – on-Policy TD
Q-Learning – off-Policy TD
Actor-critic methods (on-policy)
R Learning (Off-policy)
Implementing Reinforcement Learning algorithms
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
13. Ensemble learning
Ensemble learning methods
The wisdom of the crowd
Key use cases
Recommendation systems
Anomaly detection
Transfer learning
Stream mining or classification
Ensemble methods
Supervised ensemble methods
Boosting
AdaBoost
Bagging
Wagging
Random forests
Gradient boosting machines (GBM)
Unsupervised ensemble methods
Implementing ensemble methods
Using Mahout
Using R
Using Spark
Using Python (scikit-learn)
Using Julia
Summary
14. New generation data architectures for Machine learning
Evolution of data architectures
Emerging perspectives & drivers for new age data architectures
Modern data architectures for Machine learning
Semantic data architecture
The business data lake
Semantic Web technologies
Ontology and data integration
Vendors
Multi-model database architecture / polyglot persistence
Vendors
Lambda Architecture (LA)
Vendors
Summary
Index