Index
About This E-Book Title Page Copyright Page Contents at a Glance Table of Contents Preface
Why Should I Learn Spark? How This Book Is Organized Data Used in the Exercises Conventions Used in This Book
About the Author Dedication Acknowledgments We Want to Hear from You Reader Services Part I: Getting Started with Apache Spark
Hour 1. Introducing Apache Spark
What Is Spark?
Spark and Hadoop Spark as an Abstraction Spark Is Fast, Efficient, and Scalable
What Sort of Applications Use Spark? Programming Interfaces to Spark Ways to Use Spark
Interactive Use Non-interactive Use Input/Output Types
Summary Q&A Workshop
Quiz Answers
Hour 2. Understanding Hadoop
Hadoop and a Brief History of Big Data Hadoop Explained Introducing HDFS
HDFS Overview HDFS Architecture
Introducing YARN
What Is YARN? Running an Application on YARN Other Resource Managers
Anatomy of a Hadoop Cluster How Spark Works with Hadoop
HDFS as a Data Source for Spark YARN as a Resource Scheduler for Spark
Summary Q&A Workshop
Quiz Answers
Hour 3. Installing Spark
Spark Deployment Modes Preparing to Install Spark Installing Spark in Standalone Mode
Getting Spark Installing a Multi-node Spark Standalone Cluster
Exploring the Spark Install Deploying Spark on Hadoop
Using a Management Console or Interface Installing Manually
Summary Q&A Workshop
Quiz Answers
Exercises
Hour 4. Understanding the Spark Application Architecture
Anatomy of a Spark Application Spark Driver
The Spark Context Application Planning Application Scheduling Other Driver Functions
Spark Executors and Workers Spark Master and Cluster Manager
Spark Master Cluster Manager
Spark Applications Running on YARN
ResourceManager as the Cluster Manager ApplicationMaster as the Spark Master yarn-cluster Mode yarn-client Mode Log File Management with Spark on YARN
Local Mode Summary Q&A Workshop
Quiz Answers
Hour 5. Deploying Spark in the Cloud
Amazon Web Services Primer
Elastic Compute Cloud (EC2) Simple Storage Service (S3) Elastic MapReduce (EMR) AWS Pricing and Getting Started
Spark on EC2 Spark on EMR Hosted Spark with Databricks Summary Q&A Workshop
Quiz Answers
Part II: Programming with Apache Spark
Hour 6. Learning the Basics of Spark Programming with RDDs
Introduction to RDDs Loading Data into RDDs
Creating an RDD from a File or Files Creating an RDD from a Datasource Creating an RDD Programmatically
Operations on RDDs
Coarse-Grained versus Fine-Grained Transformations Transformations, Actions, and Lazy Evaluation RDD Persistence and Re-use RDD Lineage Fault Tolerance with RDDs
Types of RDDs Summary Q&A Workshop
Quiz Answers
Hour 7. Understanding MapReduce Concepts
MapReduce History and Background
The Motivation for MapReduce The Design Goals for MapReduce
Records and Key Value Pairs
Key Value Pairs and Records
MapReduce Explained
Map Phase Partitioning Function Shuffle Reduce Phase Fault Tolerance Combiner Functions Asymmetry and Speculative Execution Map-only MapReduce Applications An Election Analogy for MapReduce
Word Count: The “Hello, World” of MapReduce
Why Count Words? How It Works Map and Reduce Functions in Spark
Summary Q&A Workshop
Quiz Answers
Hour 8. Getting Started with Scala
Scala History and Background
Scala Beginnings
Scala Basics
Scala’s Compile Time and Run Time Architecture Variables and Primitives in Scala Data Structures in Scala Control Structures in Scala
Object-Oriented Programming in Scala
Classes and Inheritance Mixin Composition Singleton Objects Polymorphism
Functional Programming in Scala
First-class Functions Anonymous Functions Higher-order Functions Closures Currying Lazy Evaluation Immutable Data Structures
Spark Programming in Scala Summary Q&A Workshop
Quiz Answers
Hour 9. Functional Programming with Python
Python Overview
Python Background Python Runtime Architecture
Data Structures and Serialization in Python
Lists Sets Tuples Dictionaries Python Object Serialization
Python Functional Programming Basics
Anonymous Functions and lambda Higher-order Functions Tail Calls Short-circuiting Parallelization Closures in Python
Interactive Programming Using IPython
IPython History and Background Using IPython with Spark Jupyter, the IPython Notebook
Summary Q&A Workshop
Quiz Answers
Hour 10. Working with the Spark API (Transformations and Actions)
RDDs and Data Sampling
RDD Refresher Data Sampling with Spark
Spark Transformations
Functional Transformations Grouping, Sorting, and Distinct Functions Set Operations
Spark Actions
The count Action The collect, take, top, and first Actions The reduce and fold Actions The foreach Action
Key Value Pair Operations
Key Value Pair RDD Dictionary Functions Functional Key Value Pair RDD Transformations Grouping, Aggregation, Sorting, and Set Operations
Join Functions
Join Types Join Transformations
Numerical RDD Operations
min() max() mean() sum() stdev() variance() stats()
Summary Q&A Workshop
Quiz Answers
Hour 11. Using RDDs: Caching, Persistence, and Output
RDD Storage Levels
RDD Lineage Revisited RDD Storage Levels
Caching, Persistence, and Checkpointing
Caching RDDs Persisting RDDs Choosing When to Persist or Cache RDDs Checkpointing RDDs
Saving RDD Output
External Storage Systems Storage Formats
Introduction to Alluxio (Tachyon)
Alluxio Background Alluxio Architecture Alluxio as a Filesystem Alluxio for Off Heap RDD Persistence Other Alluxio Features and Usages
Summary Q&A Workshop
Quiz Answers
Hour 12. Advanced Spark Programming
Broadcast Variables
Broadcast Variable Creation and Usage Advantages of Broadcast Variables
Accumulators
Using Accumulators Custom Accumulators Uses for Accumulators
Partitioning and Repartitioning
Partitioning Overview Controlling Partitions Repartitioning Functions Partition-specific API Methods
Processing RDDs with External Programs
pipe()
Summary Q&A Workshop
Quiz Answers
Part III: Extensions to Spark
Hour 13. Using SQL with Spark
Introduction to Spark SQL
Background Hive Overview SQL on Hadoop Spark SQL Architecture HiveContext and SQLContext
Getting Started with Spark SQL DataFrames
Creating a DataFrame from an Existing RDD Creating a DataFrame from a Hive Table Creating a DataFrame from JSON Objects Creating DataFrames from Files Using the DataFrameReader Converting DataFrames to RDDs DataFrame Data Model DataFrame Schemas
Using Spark SQL DataFrames
DataFrame Metadata Operations Basic DataFrame Operations DataFrame Built-in Functions and UDFs DataFrame Set Operations Caching, Persisting, and Repartitioning DataFrames Saving DataFrame Output Using the DataFrameWriter
Accessing Spark SQL
Accessing Spark SQL Using the spark-sql Shell Running the Thrift JDBC/ODBC server
Summary Q&A Workshop
Quiz Answers
Hour 14. Stream Processing with Spark
Introduction to Spark Streaming
Streaming, Spark Style Spark Streaming Architecture The StreamingContext
Using DStreams
DStream Sources DStream Transformations DStream Output Operations
State Operations
updateStateByKey()
Sliding Window Operations
window() reduceByKeyAndWindow()
Summary Q&A Workshop
Quiz Answers
Hour 15. Getting Started with Spark and R
Introduction to R
Getting Started with the R Language
Introducing SparkR
The SparkR Shell Creating Data Frames in SparkR
Using SparkR
Building Predictive Models with SparkR
Using SparkR with RStudio Summary Q&A Workshop
Quiz Answers
Hour 16. Machine Learning with Spark
Introduction to Machine Learning and MLlib
Machine Learning Primer Machine Learning with Spark
Classification Using Spark MLlib
Decision Trees Naive Bayes
Collaborative Filtering Using Spark MLlib Clustering Using Spark MLlib
k-means Clustering
Summary Q&A Workshop
Quiz Answers
Hour 17. Introducing Sparkling Water (H2O and Spark)
Introduction to H2O
H2O Deep Learning H2O Flow H2O Architecture Running H2O on Hadoop
Sparkling Water—H2O on Spark
Sparkling Water Architecture
Summary Q&A Workshop
Quiz Answers
Hour 18. Graph Processing with Spark
Introduction to Graphs Graph Processing in Spark
Google, Pregel, and PageRank GraphX: Spark’s Graph Processing System
Introduction to GraphFrames
Accessing the GraphFrames Library Creating a GraphFrame GraphFrame Operations Using Graphing Algorithms with GraphFrames
Summary Q&A Workshop
Quiz Answers
Hour 19. Using Spark with NoSQL Systems
Introduction to NoSQL
Bigtable: The Beginnings of the NoSQL Movement NoSQL System Characteristics Types of NoSQL Systems
Using Spark with HBase
HBase Data Model and Shell Data Distribution in HBase HBase and Spark
Using Spark with Cassandra
Cassandra Data Model Cassandra Query Language (CQL) Accessing Cassandra Using Spark
Using Spark with DynamoDB and More
Amazon DynamoDB Other NoSQL Implementations The Future for NoSQL
Summary Q&A Workshop
Quiz Answers
Hour 20. Using Spark with Messaging Systems
Overview of Messaging Systems
Pub-Sub Messaging Exchange Pattern
Using Spark with Apache Kafka
Kafka Overview Spark and Kafka
Spark, MQTT, and the Internet of Things
MQTT Overview Using Spark with MQTT
Using Spark with Amazon Kinesis
Kinesis Streams Using Spark with Kinesis
Summary Q&A Workshop
Quiz Answers
Part IV: Managing Spark
Hour 21. Administering Spark
Spark Configuration
Spark Environment Variables Spark Configuration
Administering Spark Standalone
Spark Standalone Revisited Deploying Spark Standalone Clusters Scheduling with Spark Standalone
Administering Spark on YARN
Spark on YARN Revisited Deploying Spark on YARN Managing Spark Applications Running on YARN YARN Scheduling
Summary Q&A Workshop
Quiz Answers
Hour 22. Monitoring Spark
Exploring the Spark Application UI
Jobs Stages Storage Environment Executors Viewing the Status of All Running Applications
Spark History Server
Deploying the Spark History Server Exploring the Spark History Server UI Spark History Server API Access
Spark Metrics Logging in Spark
Log4j
Summary Q&A Workshop
Quiz Answers
Hour 23. Extending and Securing Spark
Isolating Spark
Perimeter Security Gateway Services Authentication and Authorization
Securing Spark Communication
Spark Authentication Using a Shared Secret Encrypting Spark Communication Securing the Spark Web UI
Securing Spark with Kerberos
Kerberos Overview Kerberos with Hadoop Kerberos Configuration with Spark
Summary Q&A Workshop
Quiz Answers
Hour 24. Improving Spark Performance
Benchmarking Spark
Benchmarks Canary Queries Performance Monitoring Solutions
Application Development Best Practices
Application Development Optimizations System, Configuration, or Job Submission Optimizations
Optimizing Partitions
Inefficient Partitioning
Diagnosing Application Performance Issues
Using the Application UI to Diagnose Performance Issues Using the Spark History UI to Diagnose Performance Issues
Summary Q&A Workshop
Quiz Answers
Index Code Snippets