Imperial Library

Index
Preface
  Who Should Read This Book
  Why We Wrote This Book
  Navigating This Book
  Online Resources
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Safari
  How to Contact Us
  Acknowledgments
    From Andreas
    From Sarah
1. Introduction
  Why Machine Learning?
    Problems Machine Learning Can Solve
    Knowing Your Task and Knowing Your Data
  Why Python?
  scikit-learn
    Installing scikit-learn
  Essential Libraries and Tools
    Jupyter Notebook
    NumPy
    SciPy
    matplotlib
    pandas
    mglearn
  Python 2 Versus Python 3
  Versions Used in this Book
  A First Application: Classifying Iris Species
    Meet the Data
    Measuring Success: Training and Testing Data
    First Things First: Look at Your Data
    Building Your First Model: k-Nearest Neighbors
    Making Predictions
    Evaluating the Model
  Summary and Outlook
2. Supervised Learning
  Classification and Regression
  Generalization, Overfitting, and Underfitting
    Relation of Model Complexity to Dataset Size
  Supervised Machine Learning Algorithms
    Some Sample Datasets
    k-Nearest Neighbors
      k-Neighbors classification
      Analyzing KNeighborsClassifier
      k-neighbors regression
      Analyzing KNeighborsRegressor
      Strengths, weaknesses, and parameters
    Linear Models
      Linear models for regression
        Linear regression (aka ordinary least squares)
        Ridge regression
        Lasso
      Linear models for classification
      Linear models for multiclass classification
      Strengths, weaknesses, and parameters
    Naive Bayes Classifiers
      Strengths, weaknesses, and parameters
    Decision Trees
      Building decision trees
      Controlling complexity of decision trees
      Analyzing decision trees
      Feature importance in trees
      Strengths, weaknesses, and parameters
    Ensembles of Decision Trees
      Random forests
        Building random forests
        Analyzing random forests
        Strengths, weaknesses, and parameters
      Gradient boosted regression trees (gradient boosting machines)
        Strengths, weaknesses, and parameters
    Kernelized Support Vector Machines
      Linear models and nonlinear features
      The kernel trick
      Understanding SVMs
      Tuning SVM parameters
      Preprocessing data for SVMs
      Strengths, weaknesses, and parameters
    Neural Networks (Deep Learning)
      The neural network model
      Tuning neural networks
      Strengths, weaknesses, and parameters
        Estimating complexity in neural networks
  Uncertainty Estimates from Classifiers
    The Decision Function
    Predicting Probabilities
    Uncertainty in Multiclass Classification
  Summary and Outlook
3. Unsupervised Learning and Preprocessing
  Types of Unsupervised Learning
  Challenges in Unsupervised Learning
  Preprocessing and Scaling
    Different Kinds of Preprocessing
    Applying Data Transformations
    Scaling Training and Test Data the Same Way
    The Effect of Preprocessing on Supervised Learning
  Dimensionality Reduction, Feature Extraction, and Manifold Learning
    Principal Component Analysis (PCA)
      Applying PCA to the cancer dataset for visualization
      Eigenfaces for feature extraction
    Non-Negative Matrix Factorization (NMF)
      Applying NMF to synthetic data
      Applying NMF to face images
    Manifold Learning with t-SNE
  Clustering
    k-Means Clustering
      Failure cases of k-means
      Vector quantization, or seeing k-means as decomposition
    Agglomerative Clustering
      Hierarchical clustering and dendrograms
    DBSCAN
    Comparing and Evaluating Clustering Algorithms
      Evaluating clustering with ground truth
      Evaluating clustering without ground truth
      Comparing algorithms on the faces dataset
        Analyzing the faces dataset with DBSCAN
        Analyzing the faces dataset with k-means
        Analyzing the faces dataset with agglomerative clustering
    Summary of Clustering Methods
  Summary and Outlook
4. Representing Data and Engineering Features
  Categorical Variables
    One-Hot-Encoding (Dummy Variables)
    Checking string-encoded categorical data
    Numbers Can Encode Categoricals
  Binning, Discretization, Linear Models, and Trees
  Interactions and Polynomials
  Univariate Nonlinear Transformations
  Automatic Feature Selection
    Univariate Statistics
    Model-Based Feature Selection
    Iterative Feature Selection
  Utilizing Expert Knowledge
  Summary and Outlook
5. Model Evaluation and Improvement
  Cross-Validation
    Cross-Validation in scikit-learn
    Benefits of Cross-Validation
    Stratified k-Fold Cross-Validation and Other Strategies
      More control over cross-validation
      Leave-one-out cross-validation
      Shuffle-split cross-validation
      Cross-validation with groups
  Grid Search
    Simple Grid Search
    The Danger of Overfitting the Parameters and the Validation Set
    Grid Search with Cross-Validation
      Analyzing the result of cross-validation
      Search over spaces that are not grids
      Using different cross-validation strategies with grid search
      Nested cross-validation
      Parallelizing cross-validation and grid search
  Evaluation Metrics and Scoring
    Keep the End Goal in Mind
    Metrics for Binary Classification
      Kinds of errors
      Imbalanced datasets
      Confusion matrices
        Relation to accuracy
        Precision, recall, and f-score
      Taking uncertainty into account
      Precision-recall curves and ROC curves
        Receiver operating characteristics (ROC) and AUC
    Metrics for Multiclass Classification
    Regression Metrics
    Using Evaluation Metrics in Model Selection
  Summary and Outlook
6. Algorithm Chains and Pipelines
  Parameter Selection with Preprocessing
  Building Pipelines
  Using Pipelines in Grid Searches
  The General Pipeline Interface
    Convenient Pipeline Creation with make_pipeline
    Accessing Step Attributes
    Accessing Attributes in a Pipeline inside GridSearchCV
  Grid-Searching Preprocessing Steps and Model Parameters
  Grid-Searching Which Model To Use
  Summary and Outlook
7. Working with Text Data
  Types of Data Represented as Strings
  Example Application: Sentiment Analysis of Movie Reviews
  Representing Text Data as a Bag of Words
    Applying Bag-of-Words to a Toy Dataset
    Bag-of-Words for Movie Reviews
  Stopwords
  Rescaling the Data with tf–idf
  Investigating Model Coefficients
  Bag-of-Words with More Than One Word (n-Grams)
  Advanced Tokenization, Stemming, and Lemmatization
  Topic Modeling and Document Clustering
    Latent Dirichlet Allocation
  Summary and Outlook
8. Wrapping Up
  Approaching a Machine Learning Problem
    Humans in the Loop
  From Prototype to Production
  Testing Production Systems
  Building Your Own Estimator
  Where to Go from Here
    Theory
    Other Machine Learning Frameworks and Packages
    Ranking, Recommender Systems, and Other Kinds of Learning
    Probabilistic Modeling, Inference, and Probabilistic Programming
    Neural Networks
    Scaling to Larger Datasets
    Honing Your Skills
  Conclusion
Index
