Practical Data Analysis · 2nd Edition by Cuesta, Hector

Index

Practical Data Analysis - Second Edition

Practical Data Analysis - Second Edition Credits About the Authors About the Reviewers www.PacktPub.com

eBooks, discount offers, and more

Why subscribe? Free access for Packt account holders

Preface

What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support

Downloading the example code Downloading the color images of this book Errata Piracy Questions

1. Getting Started

Computer science Artificial intelligence Machine learning Statistics Mathematics Knowledge domain Data, information, and knowledge

Inter-relationship between data, information, and knowledge The nature of data

The data analysis process

The problem Data preparation Data exploration Predictive modeling Visualization of results

Quantitative versus qualitative data analysis Importance of data visualization What about big data? Quantified self

Sensors and cameras Social network analysis

Tools and toys for this book

Why Python? Why mlpy? Why D3.js? Why MongoDB?

Summary

2. Preprocessing Data

Data sources

Open data Text files Excel files SQL databases NoSQL databases Multimedia Web scraping

Data scrubbing

Statistical methods Text parsing Data transformation

Data formats

Parsing a CSV file with the CSV module

Parsing a CSV file using NumPy

JSON

Parsing a JSON file using the JSON module

XML

Parsing XML in Python using the XML module

YAML

Data reduction methods

Filtering and sampling Binned algorithm Dimensionality reduction

Getting started with OpenRefine

Text facet Clustering Text filters Numeric facets Transforming data Exporting data Operation history

Summary

3. Getting to Grips with Visualization

What is visualization? Working with web-based visualization Exploring scientific visualization Visualization in art The visualization life cycle Visualizing different types of data

HTML DOM CSS JavaScript SVG

Getting started with D3.js

Bar chart Pie chart Scatter plots Single line chart Multiple line chart

Interaction and animation Data from social networks An overview of visual analytics Summary

4. Text Classification

Learning and classification Bayesian classification

Naïve Bayes

E-mail subject line tester The data The algorithm Classifier accuracy Summary

5. Similarity-Based Image Retrieval

Image similarity search Dynamic time warping Processing the image dataset Implementing DTW Analyzing the results Summary

6. Simulation of Stock Prices

Financial time series Random Walk simulation Monte Carlo methods Generating random numbers Implementation in D3.js Quantitative analyst Summary

7. Predicting Gold Prices

Working with time series data

Components of a time series

Smoothing time series Linear regression The data - historical gold prices Nonlinear regressions

Kernel Ridge Regressions Smoothing the gold prices time series Predicting in the smoothed time series Contrasting the predicted value

Summary

8. Working with Support Vector Machines

Understanding the multivariate dataset Dimensionality reduction

Linear Discriminant Analysis (LDA) Principal Component Analysis (PCA)

Getting started with SVM

Kernel functions The double spiral problem SVM implemented on mlpy

Summary

9. Modeling Infectious Diseases with Cellular Automata

Introduction to epidemiology

The epidemiology triangle

The epidemic models

The SIR model Solving the ordinary differential equation for the SIR model with SciPy The SIRS model

Modeling with Cellular Automaton

Cell, state, grid, neighborhood Global stochastic contact model

Simulation of the SIRS model in CA with D3.js Summary

10. Working with Social Graphs

Structure of a graph

Undirected graph Directed graph

Social network analysis Acquiring the Facebook graph Working with graphs using Gephi Statistical analysis

Male to female ratio

Degree distribution

Histogram of a graph Centrality

Transforming GDF to JSON Graph visualization with D3.js Summary

11. Working with Twitter Data

The anatomy of Twitter data

Tweet Followers Trending topics

Using OAuth to access Twitter API Getting started with Twython

Simple search using Twython Working with timelines Working with followers Working with places and trends Working with user data Streaming API

Summary

12. Data Processing and Aggregation with MongoDB

Getting started with MongoDB

Database Collection Document Mongo shell Insert/Update/Delete Queries

Data preparation

Data transformation with OpenRefine Inserting documents with PyMongo

Group Aggregation framework

Pipelines Expressions

Summary

13. Working with MapReduce

An overview of MapReduce Programming model Using MapReduce with MongoDB

Map function Reduce function Using mongo shell Using Jupyter Using PyMongo

Filtering the input collection Grouping and aggregation Counting the most common words in tweets Summary

14. Online Data Analysis with Jupyter and Wakari

Getting started with Wakari

Creating an account in Wakari

Getting started with IPython notebook

Data visualization

Introduction to image processing with PIL

Opening an image Working with an image histogram Filtering Operations Transformations

Getting started with pandas

Working with Time Series Working with multivariate datasets with DataFrame Grouping, Aggregation, and Correlation

Sharing your Notebook

The data

Summary

15. Understanding Data Processing using Apache Spark

Platform for data processing

The Cloudera platform Installing Cloudera VM

An introduction to the distributed file system

First steps with Hadoop Distributed File System - HDFS File management with HUE - web interface

An introduction to Apache Spark

The Spark ecosystem The Spark programming model An introductory working example of Apache Spark

Summary
