Python: Data Analytics and Visualization by V.o.T.H., Phuong

Index

Python: Data Analytics and Visualization

Table of Contents Python: Data Analytics and Visualization Credits Preface

What this learning path covers What you need for this learning path Who this learning path is for Reader feedback Customer support

Downloading the example code Errata Piracy Questions

1. Module 1

1. Introducing Data Analysis and Libraries

Data analysis and processing An overview of the libraries in data analysis Python libraries in data analysis

NumPy Pandas Matplotlib PyMongo The scikit-learn library

Summary

2. NumPy Arrays and Vectorized Computation

NumPy arrays

Data types Array creation Indexing and slicing Fancy indexing Numerical operations on arrays

Array functions Data processing using arrays

Loading and saving data Saving an array Loading an array

Linear algebra with NumPy NumPy random numbers Summary

3. Data Analysis with Pandas

An overview of the Pandas package The Pandas data structure

Series The DataFrame

The essential basic functionality

Reindexing and altering labels Head and tail Binary operations Functional statistics Function application Sorting

Indexing and selecting data Computational tools Working with missing data Advanced uses of Pandas for data analysis

Hierarchical indexing The Panel data

Summary

4. Data Visualization

The matplotlib API primer

Line properties Figures and subplots

Exploring plot types

Scatter plots Bar plots Contour plots Histogram plots

Legends and annotations Plotting functions with Pandas Additional Python data visualization tools

Bokeh MayaVi

Summary

5. Time Series

Time series primer Working with date and time objects Resampling time series Downsampling time series data Upsampling time series data Time zone handling Timedeltas Time series plotting Summary

6. Interacting with Databases

Interacting with data in text format

Reading data from text format Writing data to text format

Interacting with data in binary format

HDF5

Interacting with data in MongoDB Interacting with data in Redis

The simple value List Set Ordered set

Summary

7. Data Analysis Application Examples

Data munging

Cleaning data Filtering Merging data Reshaping data

Data aggregation Grouping data Summary

8. Machine Learning Models with scikit-learn

An overview of machine learning models The scikit-learn modules for different models Data representation in scikit-learn Supervised learning – classification and regression Unsupervised learning – clustering and dimensionality reduction Measuring prediction performance Summary

2. Module 2

1. Getting Started with Predictive Modelling

Introducing predictive modelling

Scope of predictive modelling

Ensemble of statistical algorithms Statistical tools Historical data Mathematical function Business context

Knowledge matrix for predictive modelling Task matrix for predictive modelling

Applications and examples of predictive modelling

LinkedIn's "People also viewed" feature

What it does? How is it done?

Correct targeting of online ads

How is it done?

Santa Cruz predictive policing

How is it done?

Determining the activity of a smartphone user using accelerometer data

How is it done?

Sport and fantasy leagues

How was it done?

Python and its packages – download and installation

Anaconda Standalone Python Installing a Python package

Installing pip Installing Python packages with pip

Python and its packages for predictive modelling IDEs for Python Summary

2. Data Cleaning

Reading the data – variations and examples

Data frames Delimiters

Various methods of importing data in Python

Case 1 – reading a dataset using the read_csv method

The read_csv method Use cases of the read_csv method

Passing the directory address and filename as variables Reading a .txt dataset with a comma delimiter Specifying the column names of a dataset from a list

Case 2 – reading a dataset using the open method of Python

Reading a dataset line by line Changing the delimiter of a dataset

Case 3 – reading data from a URL Case 4 – miscellaneous cases

Reading from an .xls or .xlsx file Writing to a CSV or Excel file

Basics – summary, dimensions, and structure Handling missing values

Checking for missing values What constitutes missing data?

How missing values are generated and propagated

Treating missing values

Deletion Imputation

Creating dummy variables Visualizing a dataset by basic plotting

Scatter plots Histograms Boxplots

Summary

3. Data Wrangling

Subsetting a dataset

Selecting columns Selecting rows Selecting a combination of rows and columns Creating new columns

Generating random numbers and their usage

Various methods for generating random numbers Seeding a random number Generating random numbers following probability distributions

Probability density function Cumulative density function Uniform distribution Normal distribution

Using the Monte-Carlo simulation to find the value of pi

Geometry and mathematics behind the calculation of pi

Generating a dummy data frame

Grouping the data – aggregation, filtering, and transformation

Aggregation Filtering Transformation Miscellaneous operations

Random sampling – splitting a dataset in training and testing datasets

Method 1 – using the Customer Churn Model Method 2 – using sklearn Method 3 – using the shuffle function

Concatenating and appending data Merging/joining datasets

Inner Join Left Join Right Join An example of the Inner Join An example of the Left Join An example of the Right Join Summary of Joins in terms of their length

Summary

4. Statistical Concepts for Predictive Modelling

Random sampling and the central limit theorem Hypothesis testing

Null versus alternate hypothesis Z-statistic and t-statistic Confidence intervals, significance levels, and p-values Different kinds of hypothesis test A step-by-step guide to do a hypothesis test An example of a hypothesis test

Chi-square tests Correlation Summary

5. Linear Regression with Python

Understanding the maths behind linear regression

Linear regression using simulated data

Fitting a linear regression model and checking its efficacy Finding the optimum value of variable coefficients

Making sense of result parameters

p-values F-statistics Residual Standard Error

Implementing linear regression with Python

Linear regression using the statsmodels library Multiple linear regression Multi-collinearity

Variance Inflation Factor

Model validation

Training and testing data split Summary of models Linear regression with scikit-learn Feature selection with scikit-learn

Handling other issues in linear regression

Handling categorical variables Transforming a variable to fit non-linear relations Handling outliers Other considerations and assumptions for linear regression

Summary

6. Logistic Regression with Python

Linear regression versus logistic regression Understanding the math behind logistic regression

Contingency tables Conditional probability Odds ratio Moving on to logistic regression from linear regression Estimation using the Maximum Likelihood Method

Likelihood function: Log likelihood function: Building the logistic regression model from scratch

Making sense of logistic regression parameters

Wald test Likelihood Ratio Test statistic Chi-square test

Implementing logistic regression with Python

Processing the data Data exploration Data visualization Creating dummy variables for categorical variables Feature selection Implementing the model

Model validation and evaluation

Cross validation

Model validation

The ROC curve

Confusion matrix

Summary

7. Clustering with Python

Introduction to clustering – what, why, and how?

What is clustering? How is clustering used? Why do we do clustering?

Mathematics behind clustering

Distances between two observations

Euclidean distance Manhattan distance Minkowski distance The distance matrix

Normalizing the distances Linkage methods

Single linkage Complete linkage Average linkage Centroid linkage Ward's method

Hierarchical clustering K-means clustering

Implementing clustering using Python

Importing and exploring the dataset Normalizing the values in the dataset Hierarchical clustering using scikit-learn K-Means clustering using scikit-learn

Interpreting the cluster

Fine-tuning the clustering

The elbow method Silhouette Coefficient

Summary

8. Trees and Random Forests with Python

Introducing decision trees

A decision tree

Understanding the mathematics behind decision trees

Homogeneity Entropy Information gain ID3 algorithm to create a decision tree Gini index Reduction in Variance Pruning a tree Handling a continuous numerical variable Handling a missing value of an attribute

Implementing a decision tree with scikit-learn

Visualizing the tree Cross-validating and pruning the decision tree

Understanding and implementing regression trees

Regression tree algorithm Implementing a regression tree using Python

Understanding and implementing random forests

The random forest algorithm Implementing a random forest using Python Why do random forests work? Important parameters for random forests

Summary

9. Best Practices for Predictive Modelling

Best practices for coding

Commenting the codes Defining functions for substantial individual tasks

Example 1 Example 2 Example 3

Avoid hard-coding of variables as much as possible Version control Using standard libraries, methods, and formulas

Best practices for data handling Best practices for algorithms Best practices for statistics Best practices for business contexts Summary

A. A List of Links

3. Module 3

1. A Conceptual Framework for Data Visualization

Data, information, knowledge, and insight

Data Information Knowledge Data analysis and insight

The transformation of data

Transforming data into information

Data collection Data preprocessing Data processing Organizing data Getting datasets

Transforming information into knowledge Transforming knowledge into insight

Data visualization history

Visualization before computers

Minard's Russian campaign (1812) The Cholera epidemics in London (1831-1855) Statistical graphics (1850-1915) Later developments in data visualization

How does visualization help decision-making?

Where does visualization fit in? Data visualization today

What is a good visualization?

Visualization plots

Bar graphs and pie charts

Bar graphs Pie charts

Box plots Scatter plots and bubble charts

Scatter plots Bubble charts

KDE plots

Summary

2. Data Analysis and Visualization

Why does visualization require planning? The Ebola example A sports example

Visually representing the results

Creating interesting stories with data

Why are stories so important? Reader-driven narratives

Gapminder The State of the Union address Mortality rate in the USA A few other example narratives

Author-driven narratives

Perception and presentation methods

The Gestalt principles of perception

Some best practices for visualization

Comparison and ranking Correlation Distribution Location-specific or geodata Part-to-whole relationships Trends over time

Visualization tools in Python

Development tools

Canopy from Enthought Anaconda from Continuum Analytics

Interactive visualization

Event listeners Layouts

Circular layout Radial layout Balloon layout

Summary

3. Getting Started with the Python IDE

The IDE tools in Python

Python 3.x versus Python 2.7 Types of interactive tools

IPython Plotly

Types of Python IDE

PyCharm PyDev Interactive Editor for Python (IEP) Canopy from Enthought Anaconda from Continuum Analytics

An overview of Spyder An overview of conda

Visualization plots with Anaconda

The surface-3D plot The square map plot

Interactive visualization packages

Bokeh VisPy

Summary

4. Numerical Computing and Interactive Plotting

NumPy, SciPy, and MKL functions

NumPy

NumPy universal functions Shape and reshape manipulation An example of interpolation Vectorizing functions Summary of NumPy linear algebra

SciPy

An example of linear equations The vectorized numerical derivative

MKL functions The performance of Python

Scalar selection Slicing

Slice using flat

Array indexing

Numerical indexing Logical indexing

Other data structures

Stacks Tuples Sets Queues Dictionaries Dictionaries for matrix representation

Sparse matrices

Visualizing sparseness

Dictionaries for memoization

Tries

Visualization using matplotlib

Word clouds Installing word clouds Input for word clouds

Web feeds The Twitter text

Plotting the stock price chart

Obtaining data

The visualization example in sports Summary

5. Financial and Statistical Models

The deterministic model

Gross returns

The stochastic model

Monte Carlo simulation

What exactly is Monte Carlo simulation? An inventory problem in Monte Carlo simulation Monte Carlo simulation in basketball The volatility plot Implied volatilities

The portfolio valuation The simulation model Geometric Brownian simulation The diffusion-based simulation

The threshold model

Schelling's Segregation Model

An overview of statistical and machine learning

K-nearest neighbors Generalized linear models

Bayesian linear regression

Creating animated and interactive plots Summary

6. Statistical and Machine Learning

Classification methods Understanding linear regression Linear regression Decision tree

An example

The Bayes theorem The Naïve Bayes classifier The Naïve Bayes classifier using TextBlob

Installing TextBlob Downloading corpora The Naïve Bayes classifier using TextBlob

Viewing positive sentiments using word clouds k-nearest neighbors Logistic regression Support vector machines Principal component analysis

Installing scikit-learn

k-means clustering Summary

7. Bioinformatics, Genetics, and Network Models

Directed graphs and multigraphs

Storing graph data Displaying graphs

igraph NetworkX Graph-tool

PageRank

The clustering coefficient of graphs Analysis of social networks The planar graph test The directed acyclic graph test Maximum flow and minimum cut A genetic programming example Stochastic block models Summary

8. Advanced Visualization

Computer simulation

Python's random package SciPy's random functions Simulation examples Signal processing Animation Visualization methods using HTML5 How is Julia different from Python? D3.js for visualization Dashboards

Summary

B. Go Forth and Explore Visualization

An overview of conda Packages installed with Anaconda Packages websites About matplotlib

Bibliography Index
