Python: End-to-end Data Analysis by Martins, Luiz Felipe


Index

Cover
Table of Contents
Python: End-to-end Data Analysis
Credits
Preface
    What you need for this learning path
    Who this learning path is for
    Reader feedback
    Customer support
    Downloading the example code
    Errata
    Piracy

1. Module 1
    1. Introducing Data Analysis and Libraries
        An overview of the libraries in data analysis
        Python libraries in data analysis
        Summary
    2. NumPy Arrays and Vectorized Computation
        Array functions
        Data processing using arrays
        Linear algebra with NumPy
        NumPy random numbers
        Summary
    3. Data Analysis with Pandas
        The Pandas data structure
        The essential basic functionality
        Indexing and selecting data
        Computational tools
        Working with missing data
        Advanced uses of Pandas for data analysis
        Summary
    4. Data Visualization
        Exploring plot types
        Legends and annotations
        Plotting functions with Pandas
        Additional Python data visualization tools
        Summary
    5. Time Series
        Working with date and time objects
        Resampling time series
        Downsampling time series data
        Upsampling time series data
        Time zone handling
        Timedeltas
        Time series plotting
        Summary
    6. Interacting with Databases
        Interacting with data in binary format
        Interacting with data in MongoDB
        Interacting with data in Redis
        Summary
    7. Data Analysis Application Examples
        Data aggregation
        Grouping data
        Summary
    8. Machine Learning Models with scikit-learn
        The scikit-learn modules for different models
        Data representation in scikit-learn
        Supervised learning – classification and regression
        Unsupervised learning – clustering and dimensionality reduction
        Measuring prediction performance
        Summary

2. Module 2
    1. Laying the Foundation for Reproducible Data Analysis
        Setting up Anaconda
        Installing the Data Science Toolbox
        Creating a virtual environment with virtualenv and virtualenvwrapper
        Sandboxing Python applications with Docker images
        Keeping track of package versions and history in IPython Notebook
        Configuring IPython
        Learning to log for robust error checking
        Unit testing your code
        Configuring pandas
        Configuring matplotlib
        Seeding random number generators and NumPy print options
        Standardizing reports, code style, and data access
    2. Creating Attractive Data Visualizations
        Graphing Anscombe's quartet
        Choosing seaborn color palettes
        Choosing matplotlib color maps
        Interacting with IPython Notebook widgets
        Viewing a matrix of scatterplots
        Visualizing with d3.js via mpld3
        Creating heatmaps
        Combining box plots and kernel density plots with violin plots
        Visualizing network graphs with hive plots
        Displaying geographical maps
        Using ggplot2-like plots
        Highlighting data points with influence plots
    3. Statistical Data Analysis and Probability
        Fitting data to the exponential distribution
        Fitting aggregated data to the gamma distribution
        Fitting aggregated counts to the Poisson distribution
        Determining bias
        Estimating kernel density
        Determining confidence intervals for mean, variance, and standard deviation
        Sampling with probability weights
        Exploring extreme values
        Correlating variables with Pearson's correlation
        Correlating variables with the Spearman rank correlation
        Correlating a binary and a continuous variable with the point biserial correlation
        Evaluating relations between variables with ANOVA
    4. Dealing with Data and Numerical Issues
        Clipping and filtering outliers
        Winsorizing data
        Measuring central tendency of noisy data
        Normalizing with the Box-Cox transformation
        Transforming data with the power ladder
        Transforming data with logarithms
        Rebinning data
        Applying logit() to transform proportions
        Fitting a robust linear model
        Taking variance into account with weighted least squares
        Using arbitrary precision for optimization
        Using arbitrary precision for linear algebra
    5. Web Mining, Databases, and Big Data
        Simulating web browsing
        Scraping the Web
        Dealing with non-ASCII text and HTML entities
        Implementing association tables
        Setting up database migration scripts
        Adding a table column to an existing table
        Adding indices after table creation
        Setting up a test web server
        Implementing a star schema with fact and dimension tables
        Using HDFS
        Setting up Spark
        Clustering data with Spark
    6. Signal Processing and Timeseries
        Spectral analysis with periodograms
        Estimating power spectral density with the Welch method
        Analyzing peaks
        Measuring phase synchronization
        Exponential smoothing
        Evaluating smoothing
        Using the Lomb-Scargle periodogram
        Analyzing the frequency spectrum of audio
        Analyzing signals with the discrete cosine transform
        Block bootstrapping time series data
        Moving block bootstrapping time series data
        Applying the discrete wavelet transform
    7. Selecting Stocks with Financial Data Analysis
        Computing simple and log returns
        Ranking stocks with the Sharpe ratio and liquidity
        Ranking stocks with the Calmar and Sortino ratios
        Analyzing returns statistics
        Correlating individual stocks with the broader market
        Exploring risk and return
        Examining the market with the non-parametric runs test
        Testing for random walks
        Determining market efficiency with autoregressive models
        Creating tables for a stock prices database
        Populating the stock prices database
        Optimizing an equal weights two-asset portfolio
    8. Text Mining and Social Network Analysis
        Creating a categorized corpus
        Tokenizing news articles in sentences and words
        Stemming, lemmatizing, filtering, and TF-IDF scores
        Recognizing named entities
        Extracting topics with non-negative matrix factorization
        Implementing a basic terms database
        Computing social network density
        Calculating social network closeness centrality
        Determining the betweenness centrality
        Estimating the average clustering coefficient
        Calculating the assortativity coefficient of a graph
        Getting the clique number of a graph
        Creating a document graph with cosine similarity
    9. Ensemble Learning and Dimensionality Reduction
        Recursively eliminating features
        Applying principal component analysis for dimension reduction
        Applying linear discriminant analysis for dimension reduction
        Stacking and majority voting for multiple models
        Learning with random forests
        Fitting noisy data with the RANSAC algorithm
        Bagging to improve results
        Boosting for better learning
        Nesting cross-validation
        Reusing models with joblib
        Hierarchically clustering data
        Taking a Theano tour
    10. Evaluating Classifiers, Regressors, and Clusters
        Getting classification straight with the confusion matrix
        Computing precision, recall, and F1-score
        Examining a receiver operating characteristic and the area under a curve
        Visualizing the goodness of fit
        Computing MSE and median absolute error
        Evaluating clusters with the mean silhouette coefficient
        Comparing results with a dummy classifier
        Determining MAPE and MPE
        Comparing with a dummy regressor
        Calculating the mean absolute error and the residual sum of squares
        Examining the kappa of classification
        Taking a look at the Matthews correlation coefficient
    11. Analyzing Images
        Setting up OpenCV
        Applying Scale-Invariant Feature Transform (SIFT)
        Detecting features with SURF
        Quantizing colors
        Denoising images
        Extracting patches from an image
        Detecting faces with Haar cascades
        Searching for bright stars
        Extracting metadata from images
        Extracting texture features from images
        Applying hierarchical clustering on images
        Segmenting images with spectral clustering
    12. Parallelism and Performance
        Just-in-time compiling with Numba
        Speeding up numerical expressions with Numexpr
        Running multiple threads with the threading module
        Launching multiple tasks with the concurrent.futures module
        Accessing resources asynchronously with the asyncio module
        Distributed processing with execnet
        Profiling memory usage
        Calculating the mean, variance, skewness, and kurtosis on the fly
        Caching with a least recently used cache
        Caching HTTP requests
        Streaming counting with the Count-min sketch
        Harnessing the power of the GPU with OpenCL
    A. Glossary
    B. Function Reference
        Matplotlib
        NumPy
        pandas
        Scikit-learn
        SciPy
        Seaborn
        Statsmodels
    C. Online Resources
        Mathematics and statistics
    D. Tips and Tricks for Command-Line and Miscellaneous Tools
        Command-line tools
        The alias command
        Command-line history
        Reproducible sessions
        Docker tips

3. Module 3
    1. Tools of the Trade
        Using the notebook interface
        Imports
        An example using the Pandas library
        Summary
    2. Exploring Data
        Univariate data
        Relationships between variables – scatterplots
        Summary
    3. Learning About Models
        The cumulative distribution function
        Working with distributions
        The probability density function
        Where do models come from?
        Multivariate distributions
        Summary
    4. Regression
        Multivariate regression
        Logistic regression
        Summary
    5. Clustering
        K-means clustering
        Hierarchical clustering analysis
        Summary
    6. Bayesian Methods
        U.S. air travel safety record
        Climate change - CO2 in the atmosphere
        Summary
    7. Supervised and Unsupervised Learning
        Scikit-learn
        Linear regression
        Clustering
        Seeds classification
        Summary
    8. Time Series Analysis
        Pandas and time series data
        Indexing and slicing
        Resampling, smoothing, and other estimates
        Stationarity
        Patterns and components
        Time series models
        Summary
    E. More on Jupyter Notebook and matplotlib Styles
        Matplotlib styles
        Useful resources
        Summary

A. Bibliography
Index

