Data Science in R by Nolan, Deborah

Index

Preliminaries Series Dedication Preface

Goals of the Book Using These Case Studies in Statistical Computing Courses Broad Topics Target Audience The Themes of the Three Parts Typographic Conventions Available Materials

Acknowledgments Authors Co-Authors

Part I Data Manipulation and Modeling

Chapter 1 Predicting Location via Indoor Positioning Systems

1.1 Introduction

1.1.1 Computational Topics

1.2 The Raw Data

1.2.1 Processing the Raw Data

1.3 Cleaning the Data and Building a Representation for Analysis

1.3.1 Exploring Orientation 1.3.2 Exploring MAC Addresses 1.3.3 Exploring the Position of the Hand-Held Device 1.3.4 Creating a Function to Prepare the Data

1.4 Signal Strength Analysis

1.4.1 Distribution of Signal Strength 1.4.2 The Relationship between Signal and Distance

1.5 Nearest Neighbor Methods to Predict Location

1.5.1 Preparing the Test Data 1.5.2 Choice of Orientation 1.5.3 Finding the Nearest Neighbors 1.5.4 Cross-Validation and Choice of k

1.6 Exercises Bibliography Figure 1.1 Figure 1.2 Figure 1.3 Figure 1.4 Figure 1.5 Figure 1.6 Figure 1.7 Figure 1.8 Figure 1.9 Figure 1.10 Figure 1.11 Figure 1.12 Figure 1.13

Chapter 2 Modeling Runners' Times in the Cherry Blossom Race

2.1 Introduction

2.1.1 Computational Topics

2.2 Reading Tables of Race Results into R 2.3 Data Cleaning and Reformatting Variables 2.4 Exploring the Run Time for All Male Runners

2.4.1 Making Plots with Many Observations 2.4.2 Fitting Models to Average Performance 2.4.3 Cross-Sectional Data and Covariates

2.5 Constructing a Record for an Individual Runner across Years 2.6 Modeling the Change in Running Time for Individuals 2.7 Scraping Race Results from the Web 2.8 Exercises Bibliography Figure 2.1 Figure 2.2 Figure 2.3 Figure 2.4 Figure 2.5 Figure 2.6 Figure 2.7 Figure 2.8 Figure 2.9 Figure 2.10 Figure 2.11 Figure 2.12 Figure 2.13 Figure 2.14 Figure 2.15 Figure 2.16 Figure 2.17 Figure 2.18 Figure 2.19 Figure 2.20 Figure 2.21 Table 1.1

Chapter 3 Using Statistics to Identify Spam

3.1 Introduction

3.1.1 Computational Topics

3.2 Anatomy of an email Message 3.3 Reading the email Messages 3.4 Text Mining and Naïve Bayes Classification 3.5 Finding the Words in a Message

3.5.1 Splitting the Message into Its Header and Body 3.5.2 Removing Attachments from the Message Body 3.5.3 Extracting Words from the Message Body 3.5.4 Completing the Data Preparation Process

3.6 Implementing the Naïve Bayes Classifier

3.6.1 Test and Training Data 3.6.2 Probability Estimates from Training Data 3.6.3 Classifying New Messages 3.6.4 Computational Considerations

3.7 Recursive Partitioning and Classification Trees 3.8 Organizing an email Message into an R Data Structure

3.8.1 Processing the Header 3.8.2 Processing Attachments 3.8.3 Testing Our Code on More email Data 3.8.4 Completing the Process

3.9 Deriving Variables from the email Message

3.9.1 Checking Our Code for Errors

3.10 Exploring the email Feature Set 3.11 Fitting the rpart() Model to the email Data 3.12 Exercises Bibliography Figure 3.1 Figure 3.2 Figure 3.3 Figure 3.4 Figure 3.5 Figure 3.6 Figure 3.7 Figure 3.8 Figure 3.9 Table 3.1

Chapter 4 Processing Robot and Sensor Log Files: Seeking a Circular Target

4.1 Description

4.1.1 Computational Topics

4.2 The Data

4.2.1 Reading an Entire Log File 4.2.2 Exploring Log Files 4.2.3 Visualizing the Path 4.2.4 Exploring a "Look" 4.2.5 The Error Distribution for Range Values

4.3 Detecting a Circular Target

4.3.1 Connecting Segments Behind the Robot 4.3.2 Determining If a Segment Corresponds to a Circle

4.4 Detecting the Target with Streaming Data in Real Time Bibliography Figure 4.1 Figure 4.2 Figure 4.3 Figure 4.4 Figure 4.5 Figure 4.6 Figure 4.7 Figure 4.8 Figure 4.9 Figure 4.10 Figure 4.11 Figure 4.12 Figure 4.13 Figure 4.14 Figure 4.15 Figure 4.16

Chapter 5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays

5.1 Introduction

5.1.1 Computational Topics

5.2 Acquiring the Airline Data Set 5.3 Computing with Massive Data: Getting Flight Delay Counts

5.3.1 The R Programming Environment 5.3.2 The UNIX Shell 5.3.3 An SQL Database with R 5.3.4 The bigmemory Package with R

5.4 Explorations Using Parallel Computing: The Distribution of Flight Delays

5.4.1 Writing a Parallelizable Loop with foreach 5.4.2 Using the Split-Apply-Combine Approach for Better Performance 5.4.3 Using Split-Apply-Combine to Find the Best Time to Fly

5.5 From Exploration to Model: Do Older Planes Suffer Greater Delays? Bibliography Figure 5.1

Part II Simulation Studies

Chapter 6 Pairs Trading

6.1 The Problem

6.1.1 Computational Topics

6.2 The Data Format 6.3 Reading the Financial Data 6.4 Visualizing the Time Series 6.5 Finding Opening and Closing Positions

6.5.1 Identifying a Position 6.5.2 Displaying Positions 6.5.3 Finding All Positions 6.5.4 Computing the Profit for a Position 6.5.5 Finding the Optimal Value for k

6.6 Simulation Study

6.6.1 Simulating the Stock Price Series 6.6.2 Making stockSim() Faster Bibliography Figure 6.1 Figure 6.2 Figure 6.3 Figure 6.4 Figure 6.5 Figure 6.6 Figure 6.7 Figure 6.8 Figure 6.9

Chapter 7 Simulation Study of a Branching Process

7.1 Introduction

7.1.1 The Monte Carlo Method 7.1.2 Computational Topics

7.2 Exploring the Random Process 7.3 Generating Offspring

7.3.1 Checking the Results 7.3.2 Considering Alternative Implementations

7.4 Profiling and Improving Our Code 7.5 From One Job's Offspring to an Entire Generation 7.6 Unit Testing 7.7 A Structure for the Function's Return Value 7.8 The Family Tree: Simulating the Branching Process 7.9 Replicating the Simulation

7.9.1 Analyzing the Simulation Results

7.10 Exercises Bibliography Figure 7.1 Figure 7.2 Figure 7.3 Figure 7.4 Figure 7.5 Figure 7.6 Figure 7.7 Figure 7.8 Figure 7.9

Chapter 8 A Self-Organizing Dynamic System with a Phase Transition

8.1 Introduction and Motivation

8.1.1 Computational Topics

8.2 The Model

8.2.1 The Order Cars Move

8.3 Implementing the BML Model

8.3.1 Creating the Initial Grid Configuration 8.3.2 Testing the Grid Creation Function 8.3.3 Displaying the Grid 8.3.4 Visualizing the Grid 8.3.5 Simple and Convenient Object-Oriented Programming 8.3.6 Moving the Cars

8.4 Evaluating the Performance of the Code 8.5 Implementing the BML Model in C

8.5.1 The Algorithm in C 8.5.2 Compiling, Loading, and Calling the C Code

8.6 Running the Simulations

8.6.1 Exploring Car Velocity

8.7 Experimental Compilation Bibliography Figure 8.1 Figure 8.2 Figure 8.3 Figure 8.4 Figure 8.5 Figure 8.6 Figure 8.7 Figure 8.8 Figure 8.9 Figure 8.10 Figure 8.11 Figure 8.12 Figure 8.13

Chapter 9 Simulating Blackjack

9.1 Introduction

9.1.1 Computational Topics

9.2 Blackjack Basics

9.2.1 Testing Functions

9.3 Playing a Hand of Blackjack

9.3.1 Creating Functions for the Player's Actions

9.4 Strategies for Playing

9.4.1 Developing the Optimal Strategy

9.5 Playing Many Games 9.6 A More Accurate Card Dealer Shoe 9.7 Counting Cards 9.8 Putting It All Together 9.9 Exercises Bibliography Figure 9.1 Figure 9.2 Figure 9.3 Figure 9.4 Table 9.1 Table 9.2

Part III Data and Web Technologies

Chapter 10 Baseball: Exploring Data in a Relational Database

10.1 Introduction

10.1.1 Computational Topics

10.2 Sean Lahman's Database

10.2.1 Connecting to the Baseball Database from within R

10.3 Aggregating Salaries into Payroll 10.4 Merging Payroll Data with Information in Other Tables

10.4.1 Adding Team Names to the Payroll Data 10.4.2 Adding World Series Records to the Payroll Data

10.5 Exploring the Extreme Salaries 10.6 Exercises Bibliography Figure 10.1 Figure 10.2 Table 10.1

Chapter 11 CIA Factbook Mashup

11.1 Introduction

11.1.1 Computational Topics

11.2 Acquiring the Data

11.2.1 Extracting Latitude and Longitude from a CSV File

11.3 Integrating Data from Different Sources 11.4 Preparing the Data for Plotting

11.4.1 Redoing the Merge of the Factbook and Location Data

11.5 Plotting with Google Earth™ 11.6 Extracting Demographic Information from the CIA XML File 11.7 Generating KML Directly 11.8 Additional Computational Tasks

11.8.1 Creating Plotting Symbols 11.8.2 Efficiency in Generating KML from Strings 11.8.3 Extracting Latitude and Longitude from an HTML File

11.9 Exercises Bibliography Figure 11.1 Figure 11.2 Figure 11.3 Figure 11.4 Figure 11.5 Figure 11.6 Figure 11.7 Figure 11.8 Figure 11.9 Figure 11.10

Chapter 12 Exploring Data Science Jobs with Web Scraping and Text Mining

12.1 Introduction and Motivation

12.1.1 Computational Topics

12.2 Exploring Different Web Sites 12.3 Preliminary/Exploratory Scraping: The Kaggle Job List

12.3.1 Processing the Text 12.3.2 Generalizing to Other Posts 12.3.3 Scraping the Kaggle Post List

12.4 Scraping CyberCoders.com

12.4.1 Getting the Skill List from a Job Post 12.4.2 Finding the Links to Job Postings in the Search Results 12.4.3 Finding the Next Page of Job Post Search Results 12.4.4 Putting It All Together

12.5 A Reusable Generic Framework for Arbitrary Sites 12.6 Scraping Career Builder 12.7 Scraping Monster.com 12.8 Analyzing the Results: The Important Skills 12.9 Note on Web Scraping 12.10 Exercises Bibliography Figure 12.1 Figure 12.2 Figure 12.3 Figure 12.4 Figure 12.5 Figure 12.6 Figure 12.7 Figure 12.8 Figure 12.9 Figure 12.10 Figure 12.11

Colophon
