Machine Learning With Python by Mueller, Andreas C.

Index

1. Introduction

Why machine learning?

Problems that machine learning can solve
Knowing your data

Why Python?

What this book will cover
What this book will not cover

Scikit-learn

Installing Scikit-learn
Essential Libraries and Tools

Jupyter Notebook
NumPy
SciPy
matplotlib
Pandas

Python2 versus Python3
Versions Used in this Book

A First Application: Classifying iris species

Meet the data
Measuring Success: Training and testing data
First things first: Look at your data
Building your first model: k nearest neighbors
Making predictions
Evaluating the model
Summary

2. Supervised Learning

Classification and Regression
Generalization, Overfitting and Underfitting
Supervised Machine Learning Algorithms
k-Nearest Neighbor

k-Neighbors Classification
Analyzing KNeighborsClassifier
k-Neighbors Regression
Analyzing k nearest neighbors regression
Strengths, weaknesses and parameters

Linear models

Linear models for regression
Linear Regression aka Ordinary Least Squares
Ridge regression
Lasso
Linear models for Classification
Linear Models for multiclass classification
Strengths, weaknesses and parameters

Naive Bayes Classifiers

Strengths, weaknesses and parameters

Decision trees

Building Decision Trees
Controlling complexity of Decision Trees
Analyzing Decision Trees
Feature Importance in trees
Strengths, weaknesses and parameters

Ensembles of Decision Trees

Random Forests

Building Random Forests
Analyzing Random Forests
Strengths, weaknesses and parameters

Gradient Boosted Regression Trees (Gradient Boosting Machines)

Strengths, weaknesses and parameters

Kernelized Support Vector Machines

Linear Models and Non-linear Features
The Kernel Trick
Understanding SVMs
Tuning SVM parameters
Preprocessing Data for SVMs
Strengths, weaknesses and parameters

Neural Networks (Deep Learning)

The Neural Network Model
Tuning Neural Networks
Strengths, weaknesses and parameters

Estimating complexity in neural networks

Uncertainty estimates from classifiers

The Decision Function
Predicting probabilities
Uncertainty in multi-class classification

Summary and Outlook

3. Unsupervised Learning and Preprocessing

Types of unsupervised learning

Challenges in unsupervised learning

Preprocessing and Scaling

Different kinds of preprocessing
Applying data transformations
Scaling training and test data the same way
The effect of preprocessing on supervised learning

Dimensionality Reduction, Feature Extraction and Manifold Learning

Principal Component Analysis (PCA)

Applying PCA to the cancer dataset for visualization
Eigenfaces for feature extraction

Non-Negative Matrix Factorization (NMF)

Applying NMF to synthetic data
Applying NMF to face images

Manifold learning with t-SNE

Clustering

k-Means clustering

Failure cases of k-Means
Vector Quantization - Or Seeing k-Means as Decomposition

Agglomerative Clustering

Hierarchical Clustering and Dendrograms

DBSCAN

Comparing and evaluating clustering algorithms
Evaluating clustering with ground truth
Evaluating clustering without ground truth
Comparing algorithms on the faces dataset
Analyzing the faces dataset with DBSCAN
Analyzing the faces dataset with k-Means
Analyzing the faces dataset with agglomerative clustering

Summary of Clustering Methods

Summary and Outlook

4. Summary of scikit-learn methods and usage

The Estimator Interface
Fit resets a model
Method chaining
Shortcuts and efficient alternatives
Important Attributes
Summary and outlook

5. Representing Data and Engineering Features

Categorical Variables

One-Hot-Encoding (Dummy variables)

Checking string-encoded categorical data

Binning, Discretization, Linear Models and Trees
Interactions and Polynomials
Univariate Non-linear transformations
Automatic Feature Selection

Univariate statistics
Model-based Feature Selection
Iterative feature selection

Utilizing Expert Knowledge
Summary and outlook

6. Model evaluation and improvement

Cross-validation

Cross-validation in scikit-learn

Benefits of cross-validation
Stratified K-Fold cross-validation and other strategies
More control over cross-validation

Leave-One-Out cross-validation
Shuffle-Split cross-validation
Cross-validation with groups

Grid Search

Simple Grid-Search
The danger of overfitting the parameters and the validation set
Grid-search with cross-validation
Analyzing the result of cross-validation
Using different cross-validation strategies with grid-search
Nested cross-validation
Parallelizing cross-validation and grid-search

Evaluation Metrics and scoring

Keep the end-goal in mind
Metrics for binary classification

Kinds of Errors
Imbalanced datasets
Confusion matrices

Relation to accuracy

Precision, recall and f-score
Taking uncertainty into account
Precision-Recall curves and ROC curves
Receiver Operating Characteristics (ROC) and AUC

Multi-class classification
Regression metrics

Using evaluation metrics in model selection
Summary and outlook

7. Algorithm Chains and Pipelines

Parameter Selection with Preprocessing

Building Pipelines
Using Pipelines in Grid-searches
The General Pipeline Interface
Convenient Pipeline creation with make_pipeline

Accessing step attributes
Accessing attributes in grid-searched pipeline

Grid-searching preprocessing steps and model parameters

Summary and Outlook

8. Working with Text Data

Types of data represented as strings

Example application: Sentiment analysis of movie reviews
Representing text data as Bag of Words

Applying bag-of-words to a toy dataset

Bag-of-words for movie reviews
Stop-words

Rescaling the data with TFIDF

Investigating model coefficients
Bag of words with more than one word (n-grams)
Advanced tokenization, stemming and lemmatization

Topic Modeling and Document Clustering
Summary and Outlook
