Index
Preface
About the Authors
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark?
Apache Spark’s Philosophy
Context: The Big Data Problem
History of Spark
The Present and Future of Spark
Running Spark
Downloading Spark Locally
Launching Spark’s Interactive Consoles
Running Spark in the Cloud
Data Used in This Book
2. A Gentle Introduction to Spark
Spark’s Basic Architecture
Spark Applications
Spark’s Language APIs
Spark’s APIs
Starting Spark
The SparkSession
DataFrames
Partitions
Transformations
Lazy Evaluation
Actions
Spark UI
An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark’s Toolset
Running Production Applications
Datasets: Type-Safe Structured APIs
Structured Streaming
Machine Learning and Advanced Analytics
Lower-Level APIs
SparkR
Spark’s Ecosystem and Packages
Conclusion
II. Structured APIs—DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets
Schemas
Overview of Structured Spark Types
DataFrames Versus Datasets
Columns
Rows
Spark Types
Overview of Structured API Execution
Logical Planning
Physical Planning
Execution
Conclusion
5. Basic Structured Operations
Schemas
Columns and Expressions
Columns
Expressions
Records and Rows
Creating Rows
DataFrame Transformations
Creating DataFrames
select and selectExpr
Converting to Spark Types (Literals)
Adding Columns
Renaming Columns
Reserved Characters and Keywords
Case Sensitivity
Removing Columns
Changing a Column’s Type (cast)
Filtering Rows
Getting Unique Rows
Random Samples
Random Splits
Concatenating and Appending Rows (Union)
Sorting Rows
Limit
Repartition and Coalesce
Collecting Rows to the Driver
Conclusion
6. Working with Different Types of Data
Where to Look for APIs
Converting to Spark Types
Working with Booleans
Working with Numbers
Working with Strings
Regular Expressions
Working with Dates and Timestamps
Working with Nulls in Data
Coalesce
ifnull, nullIf, nvl, and nvl2
drop
fill
replace
Ordering
Working with Complex Types
Structs
Arrays
split
Array Length
array_contains
explode
Maps
Working with JSON
User-Defined Functions
Conclusion
7. Aggregations
Aggregation Functions
count
countDistinct
approx_count_distinct
first and last
min and max
sum
sumDistinct
avg
Variance and Standard Deviation
skewness and kurtosis
Covariance and Correlation
Aggregating to Complex Types
Grouping
Grouping with Expressions
Grouping with Maps
Window Functions
Grouping Sets
Rollups
Cube
Grouping Metadata
Pivot
User-Defined Aggregation Functions
Conclusion
8. Joins
Join Expressions
Join Types
Inner Joins
Outer Joins
Left Outer Joins
Right Outer Joins
Left Semi Joins
Left Anti Joins
Natural Joins
Cross (Cartesian) Joins
Challenges When Using Joins
Joins on Complex Types
Handling Duplicate Column Names
How Spark Performs Joins
Communication Strategies
Conclusion
9. Data Sources
The Structure of the Data Sources API
Read API Structure
Basics of Reading Data
Write API Structure
Basics of Writing Data
CSV Files
CSV Options
Reading CSV Files
Writing CSV Files
JSON Files
JSON Options
Reading JSON Files
Writing JSON Files
Parquet Files
Reading Parquet Files
Writing Parquet Files
ORC Files
Reading ORC Files
Writing ORC Files
SQL Databases
Reading from SQL Databases
Query Pushdown
Writing to SQL Databases
Text Files
Reading Text Files
Writing Text Files
Advanced I/O Concepts
Splittable File Types and Compression
Reading Data in Parallel
Writing Data in Parallel
Writing Complex Types
Managing File Size
Conclusion
10. Spark SQL
What Is SQL?
Big Data and SQL: Apache Hive
Big Data and SQL: Spark SQL
Spark’s Relationship to Hive
How to Run Spark SQL Queries
Spark SQL CLI
Spark’s Programmatic SQL Interface
SparkSQL Thrift JDBC/ODBC Server
Catalog
Tables
Spark-Managed Tables
Creating Tables
Creating External Tables
Inserting into Tables
Describing Table Metadata
Refreshing Table Metadata
Dropping Tables
Caching Tables
Views
Creating Views
Dropping Views
Databases
Creating Databases
Setting the Database
Dropping Databases
Select Statements
case…when…then Statements
Advanced Topics
Complex Types
Functions
Subqueries
Miscellaneous Features
Configurations
Setting Configuration Values in SQL
Conclusion
11. Datasets
When to Use Datasets
Creating Datasets
In Java: Encoders
In Scala: Case Classes
Actions
Transformations
Filtering
Mapping
Joins
Grouping and Aggregations
Conclusion
III. Low-Level APIs
12. Resilient Distributed Datasets (RDDs)
What Are the Low-Level APIs?
When to Use the Low-Level APIs?
How to Use the Low-Level APIs?
About RDDs
Types of RDDs
When to Use RDDs?
Datasets and RDDs of Case Classes
Creating RDDs
Interoperating Between DataFrames, Datasets, and RDDs
From a Local Collection
From Data Sources
Manipulating RDDs
Transformations
distinct
filter
map
sort
Random Splits
Actions
reduce
count
first
max and min
take
Saving Files
saveAsTextFile
SequenceFiles
Hadoop Files
Caching
Checkpointing
Pipe RDDs to System Commands
mapPartitions
foreachPartition
glom
Conclusion
13. Advanced RDDs
Key-Value Basics (Key-Value RDDs)
keyBy
Mapping over Values
Extracting Keys and Values
lookup
sampleByKey
Aggregations
countByKey
Understanding Aggregation Implementations
Other Aggregation Methods
CoGroups
Joins
Inner Join
zips
Controlling Partitions
coalesce
repartition
repartitionAndSortWithinPartitions
Custom Partitioning
Custom Serialization
Conclusion
14. Distributed Shared Variables
Broadcast Variables
Accumulators
Basic Example
Custom Accumulators
Conclusion
IV. Production Applications
15. How Spark Runs on a Cluster
The Architecture of a Spark Application
Execution Modes
The Life Cycle of a Spark Application (Outside Spark)
Client Request
Launch
Execution
Completion
The Life Cycle of a Spark Application (Inside Spark)
The SparkSession
Logical Instructions
A Spark Job
Stages
Tasks
Execution Details
Pipelining
Shuffle Persistence
Conclusion
16. Developing Spark Applications
Writing Spark Applications
A Simple Scala-Based App
Writing Python Applications
Writing Java Applications
Testing Spark Applications
Strategic Principles
Tactical Takeaways
Connecting to Unit Testing Frameworks
Connecting to Data Sources
The Development Process
Launching Applications
Application Launch Examples
Configuring Applications
The SparkConf
Application Properties
Runtime Properties
Execution Properties
Configuring Memory Management
Configuring Shuffle Behavior
Environmental Variables
Job Scheduling Within an Application
Conclusion
17. Deploying Spark
Where to Deploy Your Cluster to Run Spark Applications
On-Premises Cluster Deployments
Spark in the Cloud
Cluster Managers
Standalone Mode
Spark on YARN
Configuring Spark on YARN Applications
Spark on Mesos
Secure Deployment Configurations
Cluster Networking Configurations
Application Scheduling
Miscellaneous Considerations
Conclusion
18. Monitoring and Debugging
The Monitoring Landscape
What to Monitor
Driver and Executor Processes
Queries, Jobs, Stages, and Tasks
Spark Logs
The Spark UI
Spark REST API
Spark UI History Server
Debugging and Spark First Aid
Spark Jobs Not Starting
Errors Before Execution
Errors During Execution
Slow Tasks or Stragglers
Slow Aggregations
Slow Joins
Slow Reads and Writes
Driver OutOfMemoryError or Driver Unresponsive
Executor OutOfMemoryError or Executor Unresponsive
Unexpected Nulls in Results
No Space Left on Disk Errors
Serialization Errors
Conclusion
19. Performance Tuning
Indirect Performance Enhancements
Design Choices
Object Serialization in RDDs
Cluster Configurations
Scheduling
Data at Rest
Shuffle Configurations
Memory Pressure and Garbage Collection
Direct Performance Enhancements
Parallelism
Improved Filtering
Repartitioning and Coalescing
User-Defined Functions (UDFs)
Temporary Data Storage (Caching)
Joins
Aggregations
Broadcast Variables
Conclusion
V. Streaming
20. Stream Processing Fundamentals
What Is Stream Processing?
Stream Processing Use Cases
Advantages of Stream Processing
Challenges of Stream Processing
Stream Processing Design Points
Record-at-a-Time Versus Declarative APIs
Event Time Versus Processing Time
Continuous Versus Micro-Batch Execution
Spark’s Streaming APIs
The DStream API
Structured Streaming
Conclusion
21. Structured Streaming Basics
Structured Streaming Basics
Core Concepts
Transformations and Actions
Input Sources
Sinks
Output Modes
Triggers
Event-Time Processing
Structured Streaming in Action
Transformations on Streams
Selections and Filtering
Aggregations
Joins
Input and Output
Where Data Is Read and Written (Sources and Sinks)
Reading from the Kafka Source
Writing to the Kafka Sink
How Data Is Output (Output Modes)
When Data Is Output (Triggers)
Streaming Dataset API
Conclusion
22. Event-Time and Stateful Processing
Event Time
Stateful Processing
Arbitrary Stateful Processing
Event-Time Basics
Windows on Event Time
Tumbling Windows
Handling Late Data with Watermarks
Dropping Duplicates in a Stream
Arbitrary Stateful Processing
Time-Outs
Output Modes
mapGroupsWithState
flatMapGroupsWithState
Conclusion
23. Structured Streaming in Production
Fault Tolerance and Checkpointing
Updating Your Application
Updating Your Streaming Application Code
Updating Your Spark Version
Sizing and Rescaling Your Application
Metrics and Monitoring
Query Status
Recent Progress
Spark UI
Alerting
Advanced Monitoring with the Streaming Listener
Conclusion
VI. Advanced Analytics and Machine Learning
24. Advanced Analytics and Machine Learning Overview
A Short Primer on Advanced Analytics
Supervised Learning
Recommendation
Unsupervised Learning
Graph Analytics
The Advanced Analytics Process
Spark’s Advanced Analytics Toolkit
What Is MLlib?
High-Level MLlib Concepts
MLlib in Action
Feature Engineering with Transformers
Estimators
Pipelining Our Workflow
Training and Evaluation
Persisting and Applying Models
Deployment Patterns
Conclusion
25. Preprocessing and Feature Engineering
Formatting Models According to Your Use Case
Transformers
Estimators for Preprocessing
Transformer Properties
High-Level Transformers
RFormula
SQL Transformers
VectorAssembler
Working with Continuous Features
Bucketing
Scaling and Normalization
StandardScaler
Working with Categorical Features
StringIndexer
Converting Indexed Values Back to Text
Indexing in Vectors
One-Hot Encoding
Text Data Transformers
Tokenizing Text
Removing Common Words
Creating Word Combinations
Converting Words into Numerical Representations
Word2Vec
Feature Manipulation
PCA
Interaction
Polynomial Expansion
Feature Selection
ChiSqSelector
Advanced Topics
Persisting Transformers
Writing a Custom Transformer
Conclusion
26. Classification
Use Cases
Types of Classification
Binary Classification
Multiclass Classification
Multilabel Classification
Classification Models in MLlib
Model Scalability
Logistic Regression
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Model Summary
Decision Trees
Model Hyperparameters
Training Parameters
Prediction Parameters
Random Forest and Gradient-Boosted Trees
Model Hyperparameters
Training Parameters
Prediction Parameters
Naive Bayes
Model Hyperparameters
Training Parameters
Prediction Parameters
Evaluators for Classification and Automating Model Tuning
Detailed Evaluation Metrics
One-vs-Rest Classifier
Multilayer Perceptron
Conclusion
27. Regression
Use Cases
Regression Models in MLlib
Model Scalability
Linear Regression
Model Hyperparameters
Training Parameters
Example
Training Summary
Generalized Linear Regression
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Training Summary
Decision Trees
Model Hyperparameters
Training Parameters
Example
Random Forests and Gradient-Boosted Trees
Model Hyperparameters
Training Parameters
Example
Advanced Methods
Survival Regression (Accelerated Failure Time)
Isotonic Regression
Evaluators and Automating Model Tuning
Metrics
Conclusion
28. Recommendation
Use Cases
Collaborative Filtering with Alternating Least Squares
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Evaluators for Recommendation
Metrics
Regression Metrics
Ranking Metrics
Frequent Pattern Mining
Conclusion
29. Unsupervised Learning
Use Cases
Model Scalability
k-means
Model Hyperparameters
Training Parameters
Example
k-means Metrics Summary
Bisecting k-means
Model Hyperparameters
Training Parameters
Example
Bisecting k-means Summary
Gaussian Mixture Models
Model Hyperparameters
Training Parameters
Example
Gaussian Mixture Model Summary
Latent Dirichlet Allocation
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Conclusion
30. Graph Analytics
Building a Graph
Querying the Graph
Subgraphs
Motif Finding
Graph Algorithms
PageRank
In-Degree and Out-Degree Metrics
Breadth-First Search
Connected Components
Strongly Connected Components
Advanced Tasks
Conclusion
31. Deep Learning
What Is Deep Learning?
Ways of Using Deep Learning in Spark
Deep Learning Libraries
MLlib Neural Network Support
TensorFrames
BigDL
TensorFlowOnSpark
DeepLearning4J
Deep Learning Pipelines
A Simple Example with Deep Learning Pipelines
Setup
Images and DataFrames
Transfer Learning
Applying Popular Models
Conclusion
VII. Ecosystem
32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
PySpark
Fundamental PySpark Differences
Pandas Integration
R on Spark
SparkR
sparklyr
Conclusion
33. Ecosystem and Community
Spark Packages
An Abridged List of Popular Packages
Using Spark Packages
External Packages
Community
Spark Summit
Local Meetups
Conclusion
Index