Table of Contents
Preface
About the Authors
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark?
Apache Spark’s Philosophy
Context: The Big Data Problem
History of Spark
The Present and Future of Spark
Running Spark
Downloading Spark Locally
Downloading Spark for a Hadoop cluster
Building Spark from source
Launching Spark’s Interactive Consoles
Launching the Python console
Launching the Scala console
Launching the SQL console
Running Spark in the Cloud
Data Used in This Book
2. A Gentle Introduction to Spark
Spark’s Basic Architecture
Spark Applications
Spark’s Language APIs
Spark’s APIs
Starting Spark
The SparkSession
DataFrames
Partitions
Transformations
Lazy Evaluation
Actions
Spark UI
An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark’s Toolset
Running Production Applications
Datasets: Type-Safe Structured APIs
Structured Streaming
Machine Learning and Advanced Analytics
Lower-Level APIs
SparkR
Spark’s Ecosystem and Packages
Conclusion
II. Structured APIs—DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets
Schemas
Overview of Structured Spark Types
DataFrames Versus Datasets
Columns
Rows
Spark Types
Overview of Structured API Execution
Logical Planning
Physical Planning
Execution
Conclusion
5. Basic Structured Operations
Schemas
Columns and Expressions
Columns
Explicit column references
Expressions
Columns as expressions
Accessing a DataFrame’s columns
Records and Rows
Creating Rows
DataFrame Transformations
Creating DataFrames
select and selectExpr
Converting to Spark Types (Literals)
Adding Columns
Renaming Columns
Reserved Characters and Keywords
Case Sensitivity
Removing Columns
Changing a Column’s Type (cast)
Filtering Rows
Getting Unique Rows
Random Samples
Random Splits
Concatenating and Appending Rows (Union)
Sorting Rows
Limit
Repartition and Coalesce
Collecting Rows to the Driver
Conclusion
6. Working with Different Types of Data
Where to Look for APIs
Converting to Spark Types
Working with Booleans
Working with Numbers
Working with Strings
Regular Expressions
Working with Dates and Timestamps
Working with Nulls in Data
Coalesce
ifnull, nullIf, nvl, and nvl2
drop
fill
replace
Ordering
Working with Complex Types
Structs
Arrays
split
Array Length
array_contains
explode
Maps
Working with JSON
User-Defined Functions
Conclusion
7. Aggregations
Aggregation Functions
count
countDistinct
approx_count_distinct
first and last
min and max
sum
sumDistinct
avg
Variance and Standard Deviation
skewness and kurtosis
Covariance and Correlation
Aggregating to Complex Types
Grouping
Grouping with Expressions
Grouping with Maps
Window Functions
Grouping Sets
Rollups
Cube
Grouping Metadata
Pivot
User-Defined Aggregation Functions
Conclusion
8. Joins
Join Expressions
Join Types
Inner Joins
Outer Joins
Left Outer Joins
Right Outer Joins
Left Semi Joins
Left Anti Joins
Natural Joins
Cross (Cartesian) Joins
Challenges When Using Joins
Joins on Complex Types
Handling Duplicate Column Names
Approach 1: Different join expression
Approach 2: Dropping the column after the join
Approach 3: Renaming a column before the join
How Spark Performs Joins
Communication Strategies
Big table–to–big table
Big table–to–small table
Little table–to–little table
Conclusion
9. Data Sources
The Structure of the Data Sources API
Read API Structure
Basics of Reading Data
Read modes
Write API Structure
Basics of Writing Data
Save modes
CSV Files
CSV Options
Reading CSV Files
Writing CSV Files
JSON Files
JSON Options
Reading JSON Files
Writing JSON Files
Parquet Files
Reading Parquet Files
Parquet options
Writing Parquet Files
ORC Files
Reading ORC Files
Writing ORC Files
SQL Databases
Reading from SQL Databases
Query Pushdown
Reading from databases in parallel
Partitioning based on a sliding window
Writing to SQL Databases
Text Files
Reading Text Files Writing Text Files
Advanced I/O Concepts
Splittable File Types and Compression
Reading Data in Parallel
Writing Data in Parallel
Partitioning
Bucketing
Writing Complex Types
Managing File Size
Conclusion
10. Spark SQL
What Is SQL?
Big Data and SQL: Apache Hive
Big Data and SQL: Spark SQL
Spark’s Relationship to Hive
The Hive metastore
How to Run Spark SQL Queries
Spark SQL CLI
Spark’s Programmatic SQL Interface
SparkSQL Thrift JDBC/ODBC Server
Catalog
Tables
Spark-Managed Tables
Creating Tables
Creating External Tables
Inserting into Tables
Describing Table Metadata
Refreshing Table Metadata
Dropping Tables
Dropping unmanaged tables
Caching Tables
Views
Creating Views
Dropping Views
Databases
Creating Databases
Setting the Database
Dropping Databases
Select Statements
case…when…then Statements
Advanced Topics
Complex Types
Structs
Lists
Functions
User-defined functions
Subqueries
Uncorrelated predicate subqueries
Correlated predicate subqueries
Uncorrelated scalar queries
Miscellaneous Features
Configurations
Setting Configuration Values in SQL
Conclusion
11. Datasets
When to Use Datasets
Creating Datasets
In Java: Encoders
In Scala: Case Classes
Actions
Transformations
Filtering
Mapping
Joins
Grouping and Aggregations
Conclusion
III. Low-Level APIs
12. Resilient Distributed Datasets (RDDs)
What Are the Low-Level APIs?
When to Use the Low-Level APIs?
How to Use the Low-Level APIs?
About RDDs
Types of RDDs
When to Use RDDs?
Datasets and RDDs of Case Classes
Creating RDDs
Interoperating Between DataFrames, Datasets, and RDDs
From a Local Collection
From Data Sources
Manipulating RDDs
Transformations
distinct
filter
map
flatMap
sort
Random Splits
Actions
reduce
count
countApprox
countApproxDistinct
countByValue
countByValueApprox
first
max and min
take
Saving Files
saveAsTextFile
SequenceFiles
Hadoop Files
Caching
Checkpointing
Pipe RDDs to System Commands
mapPartitions
foreachPartition
glom
Conclusion
13. Advanced RDDs
Key-Value Basics (Key-Value RDDs)
keyBy
Mapping over Values
Extracting Keys and Values
lookup
sampleByKey
Aggregations
countByKey
Understanding Aggregation Implementations
groupByKey
reduceByKey
Other Aggregation Methods
aggregate
aggregateByKey
combineByKey
foldByKey
CoGroups
Joins
Inner Join
zips
Controlling Partitions
coalesce
repartition
repartitionAndSortWithinPartitions
Custom Partitioning
Custom Serialization
Conclusion
14. Distributed Shared Variables
Broadcast Variables
Accumulators
Basic Example
Custom Accumulators
Conclusion
IV. Production Applications
15. How Spark Runs on a Cluster
The Architecture of a Spark Application
Execution Modes
Cluster mode
Client mode
Local mode
The Life Cycle of a Spark Application (Outside Spark)
Client Request
Launch
Execution
Completion
The Life Cycle of a Spark Application (Inside Spark)
The SparkSession
The SparkContext
Logical Instructions
Logical instructions to physical execution
A Spark Job
Stages
Tasks
Execution Details
Pipelining
Shuffle Persistence
Conclusion
16. Developing Spark Applications
Writing Spark Applications
A Simple Scala-Based App
Running the application
Writing Python Applications
Running the application
Writing Java Applications
Running the application
Testing Spark Applications
Strategic Principles
Input data resilience
Business logic resilience and evolution
Resilience in output and atomicity
Tactical Takeaways
Managing SparkSessions
Which Spark API to Use?
Connecting to Unit Testing Frameworks
Connecting to Data Sources
The Development Process
Launching Applications
Application Launch Examples
Configuring Applications
The SparkConf
Application Properties
Runtime Properties
Execution Properties
Configuring Memory Management
Configuring Shuffle Behavior
Environmental Variables
Job Scheduling Within an Application
Conclusion
17. Deploying Spark
Where to Deploy Your Cluster to Run Spark Applications
On-Premises Cluster Deployments
Spark in the Cloud
Cluster Managers
Standalone Mode
Starting a standalone cluster
Cluster launch scripts
Standalone cluster configurations
Submitting applications
Spark on YARN
Submitting applications
Configuring Spark on YARN Applications
Hadoop configurations
Application properties for YARN
Spark on Mesos
Submitting applications
Configuring Mesos
Secure Deployment Configurations
Cluster Networking Configurations
Application Scheduling
Dynamic allocation
Miscellaneous Considerations
Conclusion
18. Monitoring and Debugging
The Monitoring Landscape
What to Monitor
Driver and Executor Processes
Queries, Jobs, Stages, and Tasks
Spark Logs
The Spark UI
Other Spark UI tabs
Configuring the Spark user interface
Spark REST API
Spark UI History Server
Debugging and Spark First Aid
Spark Jobs Not Starting
Signs and symptoms
Potential treatments
Errors Before Execution
Signs and symptoms
Potential treatments
Errors During Execution
Signs and symptoms
Potential treatments
Slow Tasks or Stragglers
Signs and symptoms
Potential treatments
Slow Aggregations
Signs and symptoms
Potential treatments
Slow Joins
Signs and symptoms
Potential treatments
Slow Reads and Writes
Signs and symptoms
Potential treatments
Driver OutOfMemoryError or Driver Unresponsive
Signs and symptoms
Potential treatments
Executor OutOfMemoryError or Executor Unresponsive
Signs and symptoms
Potential treatments
Unexpected Nulls in Results
Signs and symptoms
Potential treatments
No Space Left on Disk Errors
Signs and symptoms
Potential treatments
Serialization Errors
Signs and symptoms
Potential treatments
Conclusion
19. Performance Tuning
Indirect Performance Enhancements
Design Choices
Scala versus Java versus Python versus R
DataFrames versus SQL versus Datasets versus RDDs
Object Serialization in RDDs
Cluster Configurations
Cluster/application sizing and sharing
Dynamic allocation
Scheduling
Data at Rest
File-based long-term data storage
Splittable file types and compression
Table partitioning
Bucketing
The number of files
Data locality
Statistics collection
Shuffle Configurations
Memory Pressure and Garbage Collection
Measuring the impact of garbage collection
Garbage collection tuning
Direct Performance Enhancements
Parallelism
Improved Filtering
Repartitioning and Coalescing
Custom partitioning
User-Defined Functions (UDFs)
Temporary Data Storage (Caching)
Joins
Aggregations
Broadcast Variables
Conclusion
V. Streaming
20. Stream Processing Fundamentals
What Is Stream Processing?
Stream Processing Use Cases
Notifications and alerting
Real-time reporting
Incremental ETL
Update data to serve in real time
Real-time decision making
Online machine learning
Advantages of Stream Processing
Challenges of Stream Processing
Stream Processing Design Points
Record-at-a-Time Versus Declarative APIs
Event Time Versus Processing Time
Continuous Versus Micro-Batch Execution
Spark’s Streaming APIs
The DStream API
Structured Streaming
Conclusion
21. Structured Streaming Basics
Structured Streaming Basics
Core Concepts
Transformations and Actions
Input Sources
Sinks
Output Modes
Triggers
Event-Time Processing
Event-time data
Watermarks
Structured Streaming in Action
Transformations on Streams
Selections and Filtering
Aggregations
Joins
Input and Output
Where Data Is Read and Written (Sources and Sinks)
File source and sink
Kafka source and sink
Reading from the Kafka Source
Writing to the Kafka Sink
Foreach sink
Sources and sinks for testing
How Data Is Output (Output Modes)
Append mode
Complete mode
Update mode
When can you use each mode?
When Data Is Output (Triggers)
Processing time trigger
Once trigger
Streaming Dataset API
Conclusion
22. Event-Time and Stateful Processing
Event Time
Stateful Processing
Arbitrary Stateful Processing
Event-Time Basics
Windows on Event Time
Tumbling Windows
Sliding windows
Handling Late Data with Watermarks
Dropping Duplicates in a Stream
Arbitrary Stateful Processing
Time-Outs
Output Modes
mapGroupsWithState
flatMapGroupsWithState
Conclusion
23. Structured Streaming in Production
Fault Tolerance and Checkpointing
Updating Your Application
Updating Your Streaming Application Code
Updating Your Spark Version
Sizing and Rescaling Your Application
Metrics and Monitoring
Query Status
Recent Progress
Input rate and processing rate
Batch duration
Spark UI
Alerting
Advanced Monitoring with the Streaming Listener
Conclusion
VI. Advanced Analytics and Machine Learning
24. Advanced Analytics and Machine Learning Overview
A Short Primer on Advanced Analytics
Supervised Learning
Classification
Regression
Recommendation
Unsupervised Learning
Graph Analytics
The Advanced Analytics Process
Data collection
Data cleaning
Feature engineering
Training models
Model tuning and evaluation
Leveraging the model and/or insights
Spark’s Advanced Analytics Toolkit
What Is MLlib?
When and why should you use MLlib (versus scikit-learn, TensorFlow, or foo package)
High-Level MLlib Concepts
Low-level data types
MLlib in Action
Feature Engineering with Transformers
Estimators
Pipelining Our Workflow
Training and Evaluation
Persisting and Applying Models
Deployment Patterns
Conclusion
25. Preprocessing and Feature Engineering
Formatting Models According to Your Use Case
Transformers
Estimators for Preprocessing
Transformer Properties
High-Level Transformers
RFormula
SQL Transformers
VectorAssembler
Working with Continuous Features
Bucketing
Advanced bucketing techniques
Scaling and Normalization
StandardScaler
MinMaxScaler
MaxAbsScaler
ElementwiseProduct
Normalizer
Working with Categorical Features
StringIndexer
Converting Indexed Values Back to Text
Indexing in Vectors
One-Hot Encoding
Text Data Transformers
Tokenizing Text
Removing Common Words
Creating Word Combinations
Converting Words into Numerical Representations
Term frequency–inverse document frequency
Word2Vec
Feature Manipulation
PCA
Interaction
Polynomial Expansion
Feature Selection
ChiSqSelector
Advanced Topics
Persisting Transformers
Writing a Custom Transformer Conclusion
26. Classification
Use Cases
Types of Classification
Binary Classification
Multiclass Classification
Multilabel Classification
Classification Models in MLlib
Model Scalability
Logistic Regression
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Model Summary
Decision Trees
Model Hyperparameters
Training Parameters
Prediction Parameters
Random Forest and Gradient-Boosted Trees
Model Hyperparameters
Random forest only
Gradient-boosted trees (GBT) only
Training Parameters
Prediction Parameters
Naive Bayes
Model Hyperparameters
Training Parameters
Prediction Parameters
Evaluators for Classification and Automating Model Tuning
Detailed Evaluation Metrics
One-vs-Rest Classifier
Multilayer Perceptron
Conclusion
27. Regression
Use Cases
Regression Models in MLlib
Model Scalability
Linear Regression
Model Hyperparameters
Training Parameters
Example
Training Summary
Generalized Linear Regression
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Training Summary
Decision Trees
Model Hyperparameters
Training Parameters
Example
Random Forests and Gradient-Boosted Trees
Model Hyperparameters
Training Parameters
Example
Advanced Methods
Survival Regression (Accelerated Failure Time)
Isotonic Regression
Evaluators and Automating Model Tuning
Metrics
Conclusion
28. Recommendation
Use Cases
Collaborative Filtering with Alternating Least Squares
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Evaluators for Recommendation
Metrics
Regression Metrics
Ranking Metrics
Frequent Pattern Mining
Conclusion
29. Unsupervised Learning
Use Cases
Model Scalability
k-means
Model Hyperparameters
Training Parameters
Example
k-means Metrics Summary
Bisecting k-means
Model Hyperparameters
Training Parameters
Example
Bisecting k-means Summary
Gaussian Mixture Models
Model Hyperparameters
Training Parameters
Example
Gaussian Mixture Model Summary
Latent Dirichlet Allocation
Model Hyperparameters
Training Parameters
Prediction Parameters
Example
Conclusion
30. Graph Analytics
Building a Graph
Querying the Graph
Subgraphs
Motif Finding
Graph Algorithms
PageRank
In-Degree and Out-Degree Metrics
Breadth-First Search
Connected Components
Strongly Connected Components
Advanced Tasks
Conclusion
31. Deep Learning
What Is Deep Learning?
Ways of Using Deep Learning in Spark
Deep Learning Libraries
MLlib Neural Network Support
TensorFrames
BigDL
TensorFlowOnSpark
DeepLearning4J
Deep Learning Pipelines
A Simple Example with Deep Learning Pipelines
Setup
Images and DataFrames
Transfer Learning
Applying deep learning models at scale
Applying Popular Models
Applying custom Keras models
Applying TensorFlow models
Deploying models as SQL functions
Conclusion
VII. Ecosystem
32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
PySpark
Fundamental PySpark Differences
Pandas Integration
R on Spark
SparkR
Pros and cons of using SparkR instead of other languages
Setup
Key Concepts
Function masking
SparkR functions only apply to SparkDataFrames
Data manipulation
Data sources
Machine learning
User-defined functions
sparklyr
Key concepts
No DataFrames
Data manipulation
Executing SQL
Data sources
Machine learning
Conclusion
33. Ecosystem and Community
Spark Packages
An Abridged List of Popular Packages
Using Spark Packages
In Scala
In Python
At runtime
External Packages
Community
Spark Summit
Local Meetups
Conclusion
Index