Imperial Library
Index
1. Introduction
Why machine learning?
Problems that machine learning can solve
Knowing your data
Why Python?
What this book will cover
What this book will not cover
Scikit-learn
Installing Scikit-learn
Essential Libraries and Tools
Jupyter Notebook
NumPy
SciPy
matplotlib
Pandas
Python 2 versus Python 3
Versions Used in this Book
A First Application: Classifying iris species
Meet the data
Measuring Success: Training and testing data
First things first: Look at your data
Building your first model: k nearest neighbors
Making predictions
Evaluating the model
Summary
2. Supervised Learning
Classification and Regression
Generalization, Overfitting and Underfitting
Supervised Machine Learning Algorithms
k-Nearest Neighbor
k-Neighbors Classification
Analyzing KNeighborsClassifier
k-Neighbors Regression
Analyzing k nearest neighbors regression
Strengths, weaknesses and parameters
Linear models
Linear models for regression
Linear Regression aka Ordinary Least Squares
Ridge regression
Lasso
Linear models for Classification
Linear Models for multiclass classification
Strengths, weaknesses and parameters
Naive Bayes Classifiers
Strengths, weaknesses and parameters
Decision trees
Building Decision Trees
Controlling complexity of Decision Trees
Analyzing Decision Trees
Feature Importance in trees
Strengths, weaknesses and parameters
Ensembles of Decision Trees
Random Forests
Building Random Forests
Analyzing Random Forests
Strengths, weaknesses and parameters
Gradient Boosted Regression Trees (Gradient Boosting Machines)
Strengths, weaknesses and parameters
Kernelized Support Vector Machines
Linear Models and Non-linear Features
The Kernel Trick
Understanding SVMs
Tuning SVM parameters
Preprocessing Data for SVMs
Strengths, weaknesses and parameters
Neural Networks (Deep Learning)
The Neural Network Model
Tuning Neural Networks
Strengths, weaknesses and parameters
Estimating complexity in neural networks
Uncertainty estimates from classifiers
The Decision Function
Predicting probabilities
Uncertainty in multi-class classification
Summary and Outlook
3. Unsupervised Learning and Preprocessing
Types of unsupervised learning
Challenges in unsupervised learning
Preprocessing and Scaling
Different kinds of preprocessing
Applying data transformations
Scaling training and test data the same way
The effect of preprocessing on supervised learning
Dimensionality Reduction, Feature Extraction and Manifold Learning
Principal Component Analysis (PCA)
Applying PCA to the cancer dataset for visualization
Eigenfaces for feature extraction
Non-Negative Matrix Factorization (NMF)
Applying NMF to synthetic data
Applying NMF to face images
Manifold learning with t-SNE
Clustering
k-Means clustering
Failure cases of k-Means
Vector Quantization - Or Seeing k-Means as Decomposition
Agglomerative Clustering
Hierarchical Clustering and Dendrograms
DBSCAN
Comparing and evaluating clustering algorithms
Evaluating clustering with ground truth
Evaluating clustering without ground truth
Comparing algorithms on the faces dataset
Analyzing the faces dataset with DBSCAN
Analyzing the faces dataset with k-Means
Analyzing the faces dataset with agglomerative clustering
Summary of Clustering Methods
Summary and Outlook
4. Summary of scikit-learn methods and usage
The Estimator Interface
Fit resets a model
Method chaining
Shortcuts and efficient alternatives
Important Attributes
Summary and outlook
5. Representing Data and Engineering Features
Categorical Variables
One-Hot-Encoding (Dummy variables)
Checking string-encoded categorical data
Binning, Discretization, Linear Models and Trees
Interactions and Polynomials
Univariate Non-linear transformations
Automatic Feature Selection
Univariate statistics
Model-based Feature Selection
Iterative feature selection
Utilizing Expert Knowledge
Summary and outlook
6. Model evaluation and improvement
Cross-validation
Cross-validation in scikit-learn
Benefits of cross-validation
Stratified K-Fold cross-validation and other strategies
More control over cross-validation
Leave-One-Out cross-validation
Shuffle-Split cross-validation
Cross-validation with groups
Grid Search
Simple Grid-Search
The danger of overfitting the parameters and the validation set
Grid-search with cross-validation
Analyzing the result of cross-validation
Using different cross-validation strategies with grid-search
Nested cross-validation
Parallelizing cross-validation and grid-search
Evaluation Metrics and scoring
Keep the end-goal in mind
Metrics for binary classification
Kinds of Errors
Imbalanced datasets
Confusion matrices
Relation to accuracy
Precision, recall and f-score
Taking uncertainty into account
Precision-Recall curves and ROC curves
Receiver Operating Characteristics (ROC) and AUC
Multi-class classification
Regression metrics
Using evaluation metrics in model selection
Summary and outlook
7. Algorithm Chains and Pipelines
Parameter Selection with Preprocessing
Building Pipelines
Using Pipelines in Grid-searches
The General Pipeline Interface
Convenient Pipeline creation with make_pipeline
Accessing step attributes
Accessing attributes in a grid-searched pipeline
Grid-searching preprocessing steps and model parameters
Summary and Outlook
8. Working with Text Data
Types of data represented as strings
Example application: Sentiment analysis of movie reviews
Representing text data as Bag of Words
Applying bag-of-words to a toy dataset
Bag-of-words for movie reviews
Stop-words
Rescaling the data with TFIDF
Investigating model coefficients
Bag of words with more than one word (n-grams)
Advanced tokenization, stemming and lemmatization
Topic Modeling and Document Clustering
Summary and Outlook