Imperial Library
Index
Preliminaries
Series
Dedication
Preface
Goals of the Book
Using These Case Studies in Statistical Computing Courses
Broad Topics
Target Audience
The Themes of the Three Parts
Typographic Conventions
Available Materials
Acknowledgments
Authors
Co-Authors
Part I Data Manipulation and Modeling
Chapter 1 Predicting Location via Indoor Positioning Systems
1.1 Introduction
1.1.1 Computational Topics
1.2 The Raw Data
1.2.1 Processing the Raw Data
1.3 Cleaning the Data and Building a Representation for Analysis
1.3.1 Exploring Orientation
1.3.2 Exploring MAC Addresses
1.3.3 Exploring the Position of the Hand-Held Device
1.3.4 Creating a Function to Prepare the Data
1.4 Signal Strength Analysis
1.4.1 Distribution of Signal Strength
1.4.2 The Relationship between Signal and Distance
1.5 Nearest Neighbor Methods to Predict Location
1.5.1 Preparing the Test Data
1.5.2 Choice of Orientation
1.5.3 Finding the Nearest Neighbors
1.5.4 Cross-Validation and Choice of k
1.6 Exercises
Bibliography
Figure 1.1
Figure 1.2
Figure 1.3
Figure 1.4
Figure 1.5
Figure 1.6
Figure 1.7
Figure 1.8
Figure 1.9
Figure 1.10
Figure 1.11
Figure 1.12
Figure 1.13
Chapter 2 Modeling Runners' Times in the Cherry Blossom Race
2.1 Introduction
2.1.1 Computational Topics
2.2 Reading Tables of Race Results into R
2.3 Data Cleaning and Reformatting Variables
2.4 Exploring the Run Time for All Male Runners
2.4.1 Making Plots with Many Observations
2.4.2 Fitting Models to Average Performance
2.4.3 Cross-Sectional Data and Covariates
2.5 Constructing a Record for an Individual Runner across Years
2.6 Modeling the Change in Running Time for Individuals
2.7 Scraping Race Results from the Web
2.8 Exercises
Bibliography
Figure 2.1
Figure 2.2
Figure 2.3
Figure 2.4
Figure 2.5
Figure 2.6
Figure 2.7
Figure 2.8
Figure 2.9
Figure 2.10
Figure 2.11
Figure 2.12
Figure 2.13
Figure 2.14
Figure 2.15
Figure 2.16
Figure 2.17
Figure 2.18
Figure 2.19
Figure 2.20
Figure 2.21
Table 1.1
Chapter 3 Using Statistics to Identify Spam
3.1 Introduction
3.1.1 Computational Topics
3.2 Anatomy of an email Message
3.3 Reading the email Messages
3.4 Text Mining and Naïve Bayes Classification
3.5 Finding the Words in a Message
3.5.1 Splitting the Message into Its Header and Body
3.5.2 Removing Attachments from the Message Body
3.5.3 Extracting Words from the Message Body
3.5.4 Completing the Data Preparation Process
3.6 Implementing the Naïve Bayes Classifier
3.6.1 Test and Training Data
3.6.2 Probability Estimates from Training Data
3.6.3 Classifying New Messages
3.6.4 Computational Considerations
3.7 Recursive Partitioning and Classification Trees
3.8 Organizing an email Message into an R Data Structure
3.8.1 Processing the Header
3.8.2 Processing Attachments
3.8.3 Testing Our Code on More email Data
3.8.4 Completing the Process
3.9 Deriving Variables from the email Message
3.9.1 Checking Our Code for Errors
3.10 Exploring the email Feature Set
3.11 Fitting the rpart() Model to the email Data
3.12 Exercises
Bibliography
Figure 3.1
Figure 3.2
Figure 3.3
Figure 3.4
Figure 3.5
Figure 3.6
Figure 3.7
Figure 3.8
Figure 3.9
Table 3.1
Chapter 4 Processing Robot and Sensor Log Files: Seeking a Circular Target
4.1 Description
4.1.1 Computational Topics
4.2 The Data
4.2.1 Reading an Entire Log File
4.2.2 Exploring Log Files
4.2.3 Visualizing the Path
4.2.4 Exploring a "Look"
4.2.5 The Error Distribution for Range Values
4.3 Detecting a Circular Target
4.3.1 Connecting Segments Behind the Robot
4.3.2 Determining If a Segment Corresponds to a Circle
4.4 Detecting the Target with Streaming Data in Real Time
Bibliography
Figure 4.1
Figure 4.2
Figure 4.3
Figure 4.4
Figure 4.5
Figure 4.6
Figure 4.7
Figure 4.8
Figure 4.9
Figure 4.10
Figure 4.11
Figure 4.12
Figure 4.13
Figure 4.14
Figure 4.15
Figure 4.16
Chapter 5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays
5.1 Introduction
5.1.1 Computational Topics
5.2 Acquiring the Airline Data Set
5.3 Computing with Massive Data: Getting Flight Delay Counts
5.3.1 The R Programming Environment
5.3.2 The UNIX Shell
5.3.3 An SQL Database with R
5.3.4 The bigmemory Package with R
5.4 Explorations Using Parallel Computing: The Distribution of Flight Delays
5.4.1 Writing a Parallelizable Loop with foreach
5.4.2 Using the Split-Apply-Combine Approach for Better Performance
5.4.3 Using Split-Apply-Combine to Find the Best Time to Fly
5.5 From Exploration to Model: Do Older Planes Suffer Greater Delays?
Bibliography
Figure 5.1
Part II Simulation Studies
Chapter 6 Pairs Trading
6.1 The Problem
6.1.1 Computational Topics
6.2 The Data Format
6.3 Reading the Financial Data
6.4 Visualizing the Time Series
6.5 Finding Opening and Closing Positions
6.5.1 Identifying a Position
6.5.2 Displaying Positions
6.5.3 Finding All Positions
6.5.4 Computing the Profit for a Position
6.5.5 Finding the Optimal Value for k
6.6 Simulation Study
6.6.1 Simulating the Stock Price Series
6.6.2 Making stockSim() Faster
Bibliography
Figure 6.1
Figure 6.2
Figure 6.3
Figure 6.4
Figure 6.5
Figure 6.6
Figure 6.7
Figure 6.8
Figure 6.9
Chapter 7 Simulation Study of a Branching Process
7.1 Introduction
7.1.1 The Monte Carlo Method
7.1.2 Computational Topics
7.2 Exploring the Random Process
7.3 Generating Offspring
7.3.1 Checking the Results
7.3.2 Considering Alternative Implementations
7.4 Profiling and Improving Our Code
7.5 From One Job's Offspring to an Entire Generation
7.6 Unit Testing
7.7 A Structure for the Function's Return Value
7.8 The Family Tree: Simulating the Branching Process
7.9 Replicating the Simulation
7.9.1 Analyzing the Simulation Results
7.10 Exercises
Bibliography
Figure 7.1
Figure 7.2
Figure 7.3
Figure 7.4
Figure 7.5
Figure 7.6
Figure 7.7
Figure 7.8
Figure 7.9
Chapter 8 A Self-Organizing Dynamic System with a Phase Transition
8.1 Introduction and Motivation
8.1.1 Computational Topics
8.2 The Model
8.2.1 The Order Cars Move
8.3 Implementing the BML Model
8.3.1 Creating the Initial Grid Configuration
8.3.2 Testing the Grid Creation Function
8.3.3 Displaying the Grid
8.3.4 Visualizing the Grid
8.3.5 Simple and Convenient Object-Oriented Programming
8.3.6 Moving the Cars
8.4 Evaluating the Performance of the Code
8.5 Implementing the BML Model in C
8.5.1 The Algorithm in C
8.5.2 Compiling, Loading, and Calling the C Code
8.6 Running the Simulations
8.6.1 Exploring Car Velocity
8.7 Experimental Compilation
Bibliography
Figure 8.1
Figure 8.2
Figure 8.3
Figure 8.4
Figure 8.5
Figure 8.6
Figure 8.7
Figure 8.8
Figure 8.9
Figure 8.10
Figure 8.11
Figure 8.12
Figure 8.13
Chapter 9 Simulating Blackjack
9.1 Introduction
9.1.1 Computational Topics
9.2 Blackjack Basics
9.2.1 Testing Functions
9.3 Playing a Hand of Blackjack
9.3.1 Creating Functions for the Player's Actions
9.4 Strategies for Playing
9.4.1 Developing the Optimal Strategy
9.5 Playing Many Games
9.6 A More Accurate Card Dealer Shoe
9.7 Counting Cards
9.8 Putting It All Together
9.9 Exercises
Bibliography
Figure 9.1
Figure 9.2
Figure 9.3
Figure 9.4
Table 9.1
Table 9.2
Part III Data and Web Technologies
Chapter 10 Baseball: Exploring Data in a Relational Database
10.1 Introduction
10.1.1 Computational Topics
10.2 Sean Lahman's Database
10.2.1 Connecting to the Baseball Database from within R
10.3 Aggregating Salaries into Payroll
10.4 Merging Payroll Data with Information in Other Tables
10.4.1 Adding Team Names to the Payroll Data
10.4.2 Adding World Series Records to the Payroll Data
10.5 Exploring the Extreme Salaries
10.6 Exercises
Bibliography
Figure 10.1
Figure 10.2
Table 10.1
Chapter 11 CIA Factbook Mashup
11.1 Introduction
11.1.1 Computational Topics
11.2 Acquiring the Data
11.2.1 Extracting Latitude and Longitude from a CSV File
11.3 Integrating Data from Different Sources
11.4 Preparing the Data for Plotting
11.4.1 Redoing the Merge of the Factbook and Location Data
11.5 Plotting with Google Earth™
11.6 Extracting Demographic Information from the CIA XML File
11.7 Generating KML Directly
11.8 Additional Computational Tasks
11.8.1 Creating Plotting Symbols
11.8.2 Efficiency in Generating KML from Strings
11.8.3 Extracting Latitude and Longitude from an HTML File
11.9 Exercises
Bibliography
Figure 11.1
Figure 11.2
Figure 11.3
Figure 11.4
Figure 11.5
Figure 11.6
Figure 11.7
Figure 11.8
Figure 11.9
Figure 11.10
Chapter 12 Exploring Data Science Jobs with Web Scraping and Text Mining
12.1 Introduction and Motivation
12.1.1 Computational Topics
12.2 Exploring Different Web Sites
12.3 Preliminary/Exploratory Scraping: The Kaggle Job List
12.3.1 Processing the Text
12.3.2 Generalizing to Other Posts
12.3.3 Scraping the Kaggle Post List
12.4 Scraping CyberCoders.com
12.4.1 Getting the Skill List from a Job Post
12.4.2 Finding the Links to Job Postings in the Search Results
12.4.3 Finding the Next Page of Job Post Search Results
12.4.4 Putting It All Together
12.5 A Reusable Generic Framework for Arbitrary Sites
12.6 Scraping Career Builder
12.7 Scraping Monster.com
12.8 Analyzing the Results: The Important Skills
12.9 Note on Web Scraping
12.10 Exercises
Bibliography
Figure 12.1
Figure 12.2
Figure 12.3
Figure 12.4
Figure 12.5
Figure 12.6
Figure 12.7
Figure 12.8
Figure 12.9
Figure 12.10
Figure 12.11
Colophon