
Index
Foreword
Preface
  Audience
  How This Book Is Organized
  Supporting Books
  Conventions Used in This Book
  Code Examples
  Safari® Books Online
  How to Contact Us
  Acknowledgments
1. Introduction to Data Analysis with Spark
  What Is Apache Spark?
  A Unified Stack
    Spark Core
    Spark SQL
    Spark Streaming
    MLlib
    GraphX
    Cluster Managers
  Who Uses Spark, and for What?
    Data Science Tasks
    Data Processing Applications
  A Brief History of Spark
  Spark Versions and Releases
  Storage Layers for Spark
2. Downloading Spark and Getting Started
  Downloading Spark
  Introduction to Spark’s Python and Scala Shells
  Introduction to Core Spark Concepts
  Standalone Applications
    Initializing a SparkContext
    Building Standalone Applications
  Conclusion
3. Programming with RDDs
  RDD Basics
  Creating RDDs
  RDD Operations
    Transformations
    Actions
    Lazy Evaluation
  Passing Functions to Spark
    Python
    Scala
    Java
  Common Transformations and Actions
    Basic RDDs
      Element-wise transformations
      Pseudo set operations
      Actions
    Converting Between RDD Types
      Scala
      Java
      Python
  Persistence (Caching)
  Conclusion
4. Working with Key/Value Pairs
  Motivation
  Creating Pair RDDs
  Transformations on Pair RDDs
    Aggregations
      Tuning the level of parallelism
    Grouping Data
    Joins
    Sorting Data
  Actions Available on Pair RDDs
  Data Partitioning (Advanced)
    Determining an RDD’s Partitioner
    Operations That Benefit from Partitioning
    Operations That Affect Partitioning
    Example: PageRank
    Custom Partitioners
  Conclusion
5. Loading and Saving Your Data
  Motivation
  File Formats
    Text Files
      Loading text files
      Saving text files
    JSON
      Loading JSON
      Saving JSON
    Comma-Separated Values and Tab-Separated Values
      Loading CSV
      Saving CSV
    SequenceFiles
      Loading SequenceFiles
      Saving SequenceFiles
    Object Files
    Hadoop Input and Output Formats
      Loading with other Hadoop input formats
      Saving with Hadoop output formats
      Non-filesystem data sources
      Example: Protocol buffers
    File Compression
  Filesystems
    Local/“Regular” FS
    Amazon S3
    HDFS
  Structured Data with Spark SQL
    Apache Hive
    JSON
  Databases
    Java Database Connectivity
    Cassandra
    HBase
    Elasticsearch
  Conclusion
6. Advanced Spark Programming
  Introduction
  Accumulators
    Accumulators and Fault Tolerance
    Custom Accumulators
  Broadcast Variables
    Optimizing Broadcasts
  Working on a Per-Partition Basis
  Piping to External Programs
  Numeric RDD Operations
  Conclusion
7. Running on a Cluster
  Introduction
  Spark Runtime Architecture
    The Driver
    Executors
    Cluster Manager
    Launching a Program
    Summary
  Deploying Applications with spark-submit
  Packaging Your Code and Dependencies
    A Java Spark Application Built with Maven
    A Scala Spark Application Built with sbt
    Dependency Conflicts
  Scheduling Within and Between Spark Applications
  Cluster Managers
    Standalone Cluster Manager
      Launching the Standalone cluster manager
      Submitting applications
      Configuring resource usage
      High availability
    Hadoop YARN
      Configuring resource usage
    Apache Mesos
      Mesos scheduling modes
      Client and cluster mode
      Configuring resource usage
    Amazon EC2
      Launching a cluster
      Logging in to a cluster
      Destroying a cluster
      Pausing and restarting clusters
      Storage on the cluster
  Which Cluster Manager to Use?
  Conclusion
8. Tuning and Debugging Spark
  Configuring Spark with SparkConf
  Components of Execution: Jobs, Tasks, and Stages
  Finding Information
    Spark Web UI
      Jobs: Progress and metrics of stages, tasks, and more
      Storage: Information for RDDs that are persisted
      Executors: A list of executors present in the application
      Environment: Debugging Spark’s configuration
    Driver and Executor Logs
  Key Performance Considerations
    Level of Parallelism
    Serialization Format
    Memory Management
    Hardware Provisioning
  Conclusion
9. Spark SQL
  Linking with Spark SQL
  Using Spark SQL in Applications
    Initializing Spark SQL
    Basic Query Example
    SchemaRDDs
      Working with Row objects
    Caching
  Loading and Saving Data
    Apache Hive
    Parquet
    JSON
    From RDDs
  JDBC/ODBC Server
    Working with Beeline
    Long-Lived Tables and Queries
  User-Defined Functions
    Spark SQL UDFs
    Hive UDFs
  Spark SQL Performance
    Performance Tuning Options
  Conclusion
10. Spark Streaming
  A Simple Example
  Architecture and Abstraction
  Transformations
    Stateless Transformations
    Stateful Transformations
      Windowed transformations
      UpdateStateByKey transformation
  Output Operations
  Input Sources
    Core Sources
      Stream of files
      Akka actor stream
    Additional Sources
      Apache Kafka
      Apache Flume
        Push-based receiver
        Pull-based receiver
      Custom input sources
    Multiple Sources and Cluster Sizing
  24/7 Operation
    Checkpointing
    Driver Fault Tolerance
    Worker Fault Tolerance
    Receiver Fault Tolerance
    Processing Guarantees
  Streaming UI
  Performance Considerations
    Batch and Window Sizes
    Level of Parallelism
    Garbage Collection and Memory Usage
  Conclusion
11. Machine Learning with MLlib
  Overview
  System Requirements
  Machine Learning Basics
    Example: Spam Classification
  Data Types
    Working with Vectors
  Algorithms
    Feature Extraction
      TF-IDF
      Scaling
      Normalization
      Word2Vec
    Statistics
    Classification and Regression
      Linear regression
      Logistic regression
      Support Vector Machines
      Naive Bayes
      Decision trees and random forests
    Clustering
      K-means
    Collaborative Filtering and Recommendation
      Alternating Least Squares
    Dimensionality Reduction
      Principal component analysis
      Singular value decomposition
    Model Evaluation
  Tips and Performance Considerations
    Preparing Features
    Configuring Algorithms
    Caching RDDs to Reuse
    Recognizing Sparsity
    Level of Parallelism
  Pipeline API
  Conclusion
Index
