Index
Preface
About the Authors Who This Book Is For Conventions Used in This Book Using Code Examples O’Reilly Safari How to Contact Us Acknowledgments
I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark?
Apache Spark’s Philosophy Context: The Big Data Problem History of Spark The Present and Future of Spark Running Spark
Downloading Spark Locally Launching Spark’s Interactive Consoles Running Spark in the Cloud Data Used in This Book
2. A Gentle Introduction to Spark
Spark’s Basic Architecture
Spark Applications
Spark’s Language APIs Spark’s APIs Starting Spark The SparkSession DataFrames
Partitions
Transformations
Lazy Evaluation
Actions Spark UI An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark’s Toolset
Running Production Applications Datasets: Type-Safe Structured APIs Structured Streaming Machine Learning and Advanced Analytics Lower-Level APIs SparkR Spark’s Ecosystem and Packages Conclusion
II. Structured APIs—DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets Schemas Overview of Structured Spark Types
DataFrames Versus Datasets Columns Rows Spark Types
Overview of Structured API Execution
Logical Planning Physical Planning Execution
Conclusion
5. Basic Structured Operations
Schemas Columns and Expressions
Columns Expressions
Records and Rows
Creating Rows
DataFrame Transformations
Creating DataFrames select and selectExpr Converting to Spark Types (Literals) Adding Columns Renaming Columns Reserved Characters and Keywords Case Sensitivity Removing Columns Changing a Column’s Type (cast) Filtering Rows Getting Unique Rows Random Samples Random Splits Concatenating and Appending Rows (Union) Sorting Rows Limit Repartition and Coalesce Collecting Rows to the Driver
Conclusion
6. Working with Different Types of Data
Where to Look for APIs Converting to Spark Types Working with Booleans Working with Numbers Working with Strings
Regular Expressions
Working with Dates and Timestamps Working with Nulls in Data
Coalesce ifnull, nullIf, nvl, and nvl2 drop fill replace
Ordering Working with Complex Types
Structs Arrays split Array Length array_contains explode Maps
Working with JSON User-Defined Functions Conclusion
7. Aggregations
Aggregation Functions
count countDistinct approx_count_distinct first and last min and max sum sumDistinct avg Variance and Standard Deviation skewness and kurtosis Covariance and Correlation Aggregating to Complex Types
Grouping
Grouping with Expressions Grouping with Maps
Window Functions Grouping Sets
Rollups Cube Grouping Metadata Pivot
User-Defined Aggregation Functions Conclusion
8. Joins
Join Expressions Join Types Inner Joins Outer Joins Left Outer Joins Right Outer Joins Left Semi Joins Left Anti Joins Natural Joins Cross (Cartesian) Joins Challenges When Using Joins
Joins on Complex Types Handling Duplicate Column Names
How Spark Performs Joins
Communication Strategies
Conclusion
9. Data Sources
The Structure of the Data Sources API
Read API Structure Basics of Reading Data Write API Structure Basics of Writing Data
CSV Files
CSV Options Reading CSV Files Writing CSV Files
JSON Files
JSON Options Reading JSON Files Writing JSON Files
Parquet Files
Reading Parquet Files Writing Parquet Files
ORC Files
Reading Orc Files Writing Orc Files
SQL Databases
Reading from SQL Databases Query Pushdown Writing to SQL Databases
Text Files
Reading Text Files Writing Text Files
Advanced I/O Concepts
Splittable File Types and Compression Reading Data in Parallel Writing Data in Parallel Writing Complex Types Managing File Size
Conclusion
10. Spark SQL
What Is SQL? Big Data and SQL: Apache Hive Big Data and SQL: Spark SQL
Spark’s Relationship to Hive
How to Run Spark SQL Queries
Spark SQL CLI Spark’s Programmatic SQL Interface SparkSQL Thrift JDBC/ODBC Server
Catalog Tables
Spark-Managed Tables Creating Tables Creating External Tables Inserting into Tables Describing Table Metadata Refreshing Table Metadata Dropping Tables Caching Tables
Views
Creating Views Dropping Views
Databases
Creating Databases Setting the Database Dropping Databases
Select Statements
case…when…then Statements
Advanced Topics
Complex Types Functions Subqueries
Miscellaneous Features
Configurations Setting Configuration Values in SQL
Conclusion
11. Datasets
When to Use Datasets Creating Datasets
In Java: Encoders In Scala: Case Classes
Actions Transformations
Filtering Mapping
Joins Grouping and Aggregations Conclusion
III. Low-Level APIs
12. Resilient Distributed Datasets (RDDs)
What Are the Low-Level APIs?
When to Use the Low-Level APIs? How to Use the Low-Level APIs?
About RDDs
Types of RDDs When to Use RDDs? Datasets and RDDs of Case Classes
Creating RDDs
Interoperating Between DataFrames, Datasets, and RDDs From a Local Collection From Data Sources
Manipulating RDDs Transformations
distinct filter map sort Random Splits
Actions
reduce count first max and min take
Saving Files
saveAsTextFile SequenceFiles Hadoop Files
Caching Checkpointing Pipe RDDs to System Commands
mapPartitions foreachPartition glom
Conclusion
13. Advanced RDDs
Key-Value Basics (Key-Value RDDs)
keyBy Mapping over Values Extracting Keys and Values lookup sampleByKey
Aggregations
countByKey Understanding Aggregation Implementations Other Aggregation Methods
CoGroups Joins
Inner Join zips
Controlling Partitions
coalesce repartition repartitionAndSortWithinPartitions Custom Partitioning
Custom Serialization Conclusion
14. Distributed Shared Variables
Broadcast Variables Accumulators
Basic Example Custom Accumulators
Conclusion
IV. Production Applications
15. How Spark Runs on a Cluster
The Architecture of a Spark Application
Execution Modes
The Life Cycle of a Spark Application (Outside Spark)
Client Request Launch Execution Completion
The Life Cycle of a Spark Application (Inside Spark)
The SparkSession Logical Instructions A Spark Job Stages Tasks
Execution Details
Pipelining Shuffle Persistence
Conclusion
16. Developing Spark Applications
Writing Spark Applications
A Simple Scala-Based App Writing Python Applications Writing Java Applications
Testing Spark Applications
Strategic Principles Tactical Takeaways Connecting to Unit Testing Frameworks Connecting to Data Sources
The Development Process Launching Applications
Application Launch Examples
Configuring Applications
The SparkConf Application Properties Runtime Properties Execution Properties Configuring Memory Management Configuring Shuffle Behavior Environmental Variables Job Scheduling Within an Application
Conclusion
17. Deploying Spark
Where to Deploy Your Cluster to Run Spark Applications
On-Premises Cluster Deployments Spark in the Cloud
Cluster Managers
Standalone Mode Spark on YARN Configuring Spark on YARN Applications Spark on Mesos Secure Deployment Configurations Cluster Networking Configurations Application Scheduling
Miscellaneous Considerations Conclusion
18. Monitoring and Debugging
The Monitoring Landscape What to Monitor
Driver and Executor Processes Queries, Jobs, Stages, and Tasks
Spark Logs The Spark UI
Spark REST API Spark UI History Server
Debugging and Spark First Aid
Spark Jobs Not Starting Errors Before Execution Errors During Execution Slow Tasks or Stragglers Slow Aggregations Slow Joins Slow Reads and Writes Driver OutOfMemoryError or Driver Unresponsive Executor OutOfMemoryError or Executor Unresponsive Unexpected Nulls in Results No Space Left on Disk Errors Serialization Errors
Conclusion
19. Performance Tuning
Indirect Performance Enhancements
Design Choices Object Serialization in RDDs Cluster Configurations Scheduling Data at Rest Shuffle Configurations Memory Pressure and Garbage Collection
Direct Performance Enhancements
Parallelism Improved Filtering Repartitioning and Coalescing User-Defined Functions (UDFs) Temporary Data Storage (Caching) Joins Aggregations Broadcast Variables
Conclusion
V. Streaming
20. Stream Processing Fundamentals
What Is Stream Processing?
Stream Processing Use Cases Advantages of Stream Processing Challenges of Stream Processing
Stream Processing Design Points
Record-at-a-Time Versus Declarative APIs Event Time Versus Processing Time Continuous Versus Micro-Batch Execution
Spark’s Streaming APIs
The DStream API Structured Streaming
Conclusion
21. Structured Streaming Basics
Structured Streaming Basics Core Concepts
Transformations and Actions Input Sources Sinks Output Modes Triggers Event-Time Processing
Structured Streaming in Action Transformations on Streams
Selections and Filtering Aggregations Joins
Input and Output
Where Data Is Read and Written (Sources and Sinks) Reading from the Kafka Source Writing to the Kafka Sink How Data Is Output (Output Modes) When Data Is Output (Triggers)
Streaming Dataset API Conclusion
22. Event-Time and Stateful Processing
Event Time Stateful Processing Arbitrary Stateful Processing Event-Time Basics Windows on Event Time
Tumbling Windows Handling Late Data with Watermarks
Dropping Duplicates in a Stream Arbitrary Stateful Processing
Time-Outs Output Modes mapGroupsWithState flatMapGroupsWithState
Conclusion
23. Structured Streaming in Production
Fault Tolerance and Checkpointing Updating Your Application
Updating Your Streaming Application Code Updating Your Spark Version Sizing and Rescaling Your Application
Metrics and Monitoring
Query Status Recent Progress Spark UI
Alerting Advanced Monitoring with the Streaming Listener Conclusion
VI. Advanced Analytics and Machine Learning
24. Advanced Analytics and Machine Learning Overview
A Short Primer on Advanced Analytics
Supervised Learning Recommendation Unsupervised Learning Graph Analytics The Advanced Analytics Process
Spark’s Advanced Analytics Toolkit
What Is MLlib?
High-Level MLlib Concepts MLlib in Action
Feature Engineering with Transformers Estimators Pipelining Our Workflow Training and Evaluation Persisting and Applying Models
Deployment Patterns Conclusion
25. Preprocessing and Feature Engineering
Formatting Models According to Your Use Case Transformers Estimators for Preprocessing
Transformer Properties
High-Level Transformers
RFormula SQL Transformers VectorAssembler
Working with Continuous Features
Bucketing Scaling and Normalization StandardScaler
Working with Categorical Features
StringIndexer Converting Indexed Values Back to Text Indexing in Vectors One-Hot Encoding
Text Data Transformers
Tokenizing Text Removing Common Words Creating Word Combinations Converting Words into Numerical Representations Word2Vec
Feature Manipulation
PCA Interaction Polynomial Expansion
Feature Selection
ChiSqSelector
Advanced Topics
Persisting Transformers
Writing a Custom Transformer Conclusion
26. Classification
Use Cases Types of Classification
Binary Classification Multiclass Classification Multilabel Classification
Classification Models in MLlib
Model Scalability
Logistic Regression
Model Hyperparameters Training Parameters Prediction Parameters Example Model Summary
Decision Trees
Model Hyperparameters Training Parameters Prediction Parameters
Random Forest and Gradient-Boosted Trees
Model Hyperparameters Training Parameters Prediction Parameters
Naive Bayes
Model Hyperparameters Training Parameters Prediction Parameters
Evaluators for Classification and Automating Model Tuning Detailed Evaluation Metrics One-vs-Rest Classifier Multilayer Perceptron Conclusion
27. Regression
Use Cases Regression Models in MLlib
Model Scalability
Linear Regression
Model Hyperparameters Training Parameters Example Training Summary
Generalized Linear Regression
Model Hyperparameters Training Parameters Prediction Parameters Example Training Summary
Decision Trees
Model Hyperparameters Training Parameters Example
Random Forests and Gradient-Boosted Trees
Model Hyperparameters Training Parameters Example
Advanced Methods
Survival Regression (Accelerated Failure Time) Isotonic Regression
Evaluators and Automating Model Tuning Metrics Conclusion
28. Recommendation
Use Cases Collaborative Filtering with Alternating Least Squares
Model Hyperparameters Training Parameters Prediction Parameters Example
Evaluators for Recommendation Metrics
Regression Metrics Ranking Metrics
Frequent Pattern Mining Conclusion
29. Unsupervised Learning
Use Cases Model Scalability k-means
Model Hyperparameters Training Parameters Example k-means Metrics Summary
Bisecting k-means
Model Hyperparameters Training Parameters Example Bisecting k-means Summary
Gaussian Mixture Models
Model Hyperparameters Training Parameters Example Gaussian Mixture Model Summary
Latent Dirichlet Allocation
Model Hyperparameters Training Parameters Prediction Parameters Example
Conclusion
30. Graph Analytics
Building a Graph Querying the Graph
Subgraphs
Motif Finding Graph Algorithms
PageRank In-Degree and Out-Degree Metrics Breadth-First Search Connected Components Strongly Connected Components Advanced Tasks
Conclusion
31. Deep Learning
What Is Deep Learning? Ways of Using Deep Learning in Spark Deep Learning Libraries
MLlib Neural Network Support TensorFrames BigDL TensorFlowOnSpark DeepLearning4J Deep Learning Pipelines
A Simple Example with Deep Learning Pipelines
Setup Images and DataFrames Transfer Learning Applying Popular Models
Conclusion
VII. Ecosystem
32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
PySpark
Fundamental PySpark Differences Pandas Integration
R on Spark
SparkR sparklyr
Conclusion
33. Ecosystem and Community
Spark Packages
An Abridged List of Popular Packages Using Spark Packages External Packages
Community
Spark Summit Local Meetups
Conclusion
Index