Apache Spark in 24 Hours, Sams Teach Yourself by Aven, Jeffrey

Index

About This E-Book Title Page Copyright Page Contents at a Glance Table of Contents Preface

Why Should I Learn Spark? How This Book Is Organized Data Used in the Exercises Conventions Used in This Book

About the Author Dedication Acknowledgments We Want to Hear from You Reader Services Part I: Getting Started with Apache Spark

Hour 1. Introducing Apache Spark

What Is Spark?

Spark and Hadoop Spark as an Abstraction Spark Is Fast, Efficient, and Scalable

What Sort of Applications Use Spark? Programming Interfaces to Spark Ways to Use Spark

Interactive Use Non-interactive Use Input/Output Types

Summary Q&A Workshop

Quiz Answers

Hour 2. Understanding Hadoop

Hadoop and a Brief History of Big Data Hadoop Explained Introducing HDFS

HDFS Overview HDFS Architecture

Introducing YARN

What Is YARN? Running an Application on YARN Other Resource Managers

Anatomy of a Hadoop Cluster How Spark Works with Hadoop

HDFS as a Data Source for Spark YARN as a Resource Scheduler for Spark

Summary Q&A Workshop

Quiz Answers

Hour 3. Installing Spark

Spark Deployment Modes Preparing to Install Spark Installing Spark in Standalone Mode

Getting Spark Installing a Multi-node Spark Standalone Cluster

Exploring the Spark Install Deploying Spark on Hadoop

Using a Management Console or Interface Installing Manually

Summary Q&A Workshop

Quiz Answers

Exercises

Hour 4. Understanding the Spark Application Architecture

Anatomy of a Spark Application Spark Driver

The Spark Context Application Planning Application Scheduling Other Driver Functions

Spark Executors and Workers Spark Master and Cluster Manager

Spark Master Cluster Manager

Spark Applications Running on YARN

ResourceManager as the Cluster Manager ApplicationMaster as the Spark Master yarn-cluster Mode yarn-client Mode Log File Management with Spark on YARN

Local Mode Summary Q&A Workshop

Quiz Answers

Hour 5. Deploying Spark in the Cloud

Amazon Web Services Primer

Elastic Compute Cloud (EC2) Simple Storage Service (S3) Elastic MapReduce (EMR) AWS Pricing and Getting Started

Spark on EC2 Spark on EMR Hosted Spark with Databricks Summary Q&A Workshop

Quiz Answers

Part II: Programming with Apache Spark

Hour 6. Learning the Basics of Spark Programming with RDDs

Introduction to RDDs Loading Data into RDDs

Creating an RDD from a File or Files Creating an RDD from a Datasource Creating an RDD Programmatically

Operations on RDDs

Coarse-Grained versus Fine-Grained Transformations Transformations, Actions, and Lazy Evaluation RDD Persistence and Re-use RDD Lineage Fault Tolerance with RDDs

Types of RDDs Summary Q&A Workshop

Quiz Answers

Hour 7. Understanding MapReduce Concepts

MapReduce History and Background

The Motivation for MapReduce The Design Goals for MapReduce

Records and Key Value Pairs

Key Value Pairs and Records

MapReduce Explained

Map Phase Partitioning Function Shuffle Reduce Phase Fault Tolerance Combiner Functions Asymmetry and Speculative Execution Map-only MapReduce Applications An Election Analogy for MapReduce

Word Count: The “Hello, World” of MapReduce

Why Count Words? How It Works Map and Reduce Functions in Spark

Summary Q&A Workshop

Quiz Answers

Hour 8. Getting Started with Scala

Scala History and Background

Scala Beginnings

Scala Basics

Scala’s Compile Time and Run Time Architecture Variables and Primitives in Scala Data Structures in Scala Control Structures in Scala

Object-Oriented Programming in Scala

Classes and Inheritance Mixin Composition Singleton Objects Polymorphism

Functional Programming in Scala

First-class Functions Anonymous Functions Higher-order Functions Closures Currying Lazy Evaluation Immutable Data Structures

Spark Programming in Scala Summary Q&A Workshop

Quiz Answers

Hour 9. Functional Programming with Python

Python Overview

Python Background Python Runtime Architecture

Data Structures and Serialization in Python

Lists Sets Tuples Dictionaries Python Object Serialization

Python Functional Programming Basics

Anonymous Functions and lambda Higher-order Functions Tail Calls Short-circuiting Parallelization Closures in Python

Interactive Programming Using IPython

IPython History and Background Using IPython with Spark Jupyter, the IPython Notebook

Summary Q&A Workshop

Quiz Answers

Hour 10. Working with the Spark API (Transformations and Actions)

RDDs and Data Sampling

RDD Refresher Data Sampling with Spark

Spark Transformations

Functional Transformations Grouping, Sorting, and Distinct Functions Set Operations

Spark Actions

The count Action The collect, take, top, and first Actions The reduce and fold Actions The foreach Action

Key Value Pair Operations

Key Value Pair RDD Dictionary Functions Functional Key Value Pair RDD Transformations Grouping, Aggregation, Sorting, and Set Operations

Join Functions

Join Types Join Transformations

Numerical RDD Operations

min() max() mean() sum() stdev() variance() stats()

Summary Q&A Workshop

Quiz Answers

Hour 11. Using RDDs: Caching, Persistence, and Output

RDD Storage Levels

RDD Lineage Revisited RDD Storage Levels

Caching, Persistence, and Checkpointing

Caching RDDs Persisting RDDs Choosing When to Persist or Cache RDDs Checkpointing RDDs

Saving RDD Output

External Storage Systems Storage Formats

Introduction to Alluxio (Tachyon)

Alluxio Background Alluxio Architecture Alluxio as a Filesystem Alluxio for Off Heap RDD Persistence Other Alluxio Features and Usages

Summary Q&A Workshop

Quiz Answers

Hour 12. Advanced Spark Programming

Broadcast Variables

Broadcast Variable Creation and Usage Advantages of Broadcast Variables

Accumulators

Using Accumulators Custom Accumulators Uses for Accumulators

Partitioning and Repartitioning

Partitioning Overview Controlling Partitions Repartitioning Functions Partition-specific API Methods

Processing RDDs with External Programs

pipe()

Summary Q&A Workshop

Quiz Answers

Part III: Extensions to Spark

Hour 13. Using SQL with Spark

Introduction to Spark SQL

Background Hive Overview SQL on Hadoop Spark SQL Architecture HiveContext and SQLContext

Getting Started with Spark SQL DataFrames

Creating a DataFrame from an Existing RDD Creating a DataFrame from a Hive Table Creating a DataFrame from JSON Objects Creating DataFrames from Files Using the DataFrameReader Converting DataFrames to RDDs DataFrame Data Model DataFrame Schemas

Using Spark SQL DataFrames

DataFrame Metadata Operations Basic DataFrame Operations DataFrame Built-in Functions and UDFs DataFrame Set Operations Caching, Persisting, and Repartitioning DataFrames Saving DataFrame Output Using the DataFrameWriter

Accessing Spark SQL

Accessing Spark SQL Using the spark-sql Shell Running the Thrift JDBC/ODBC server

Summary Q&A Workshop

Quiz Answers

Hour 14. Stream Processing with Spark

Introduction to Spark Streaming

Streaming, Spark Style Spark Streaming Architecture The StreamingContext

Using DStreams

DStream Sources DStream Transformations DStream Output Operations

State Operations

updateStateByKey()

Sliding Window Operations

window() reduceByKeyAndWindow()

Summary Q&A Workshop

Quiz Answers

Hour 15. Getting Started with Spark and R

Introduction to R

Getting Started with the R Language

Introducing SparkR

The SparkR Shell Creating Data Frames in SparkR

Using SparkR

Building Predictive Models with SparkR

Using SparkR with RStudio Summary Q&A Workshop

Quiz Answers

Hour 16. Machine Learning with Spark

Introduction to Machine Learning and MLlib

Machine Learning Primer Machine Learning with Spark

Classification Using Spark MLlib

Decision Trees Naive Bayes

Collaborative Filtering Using Spark MLlib Clustering Using Spark MLlib

k-means Clustering

Summary Q&A Workshop

Quiz Answers

Hour 17. Introducing Sparkling Water (H2O and Spark)

Introduction to H2O

H2O Deep Learning H2O Flow H2O Architecture Running H2O on Hadoop

Sparkling Water—H2O on Spark

Sparkling Water Architecture

Summary Q&A Workshop

Quiz Answers

Hour 18. Graph Processing with Spark

Introduction to Graphs Graph Processing in Spark

Google, Pregel, and PageRank GraphX: Spark’s Graph Processing System

Introduction to GraphFrames

Accessing the GraphFrames Library Creating a GraphFrame GraphFrame Operations Using Graphing Algorithms with GraphFrames

Summary Q&A Workshop

Quiz Answers

Hour 19. Using Spark with NoSQL Systems

Introduction to NoSQL

Bigtable: The Beginnings of the NoSQL Movement NoSQL System Characteristics Types of NoSQL Systems

Using Spark with HBase

HBase Data Model and Shell Data Distribution in HBase HBase and Spark

Using Spark with Cassandra

Cassandra Data Model Cassandra Query Language (CQL) Accessing Cassandra Using Spark

Using Spark with DynamoDB and More

Amazon DynamoDB Other NoSQL Implementations The Future for NoSQL

Summary Q&A Workshop

Quiz Answers

Hour 20. Using Spark with Messaging Systems

Overview of Messaging Systems

Pub-Sub Messaging Exchange Pattern

Using Spark with Apache Kafka

Kafka Overview Spark and Kafka

Spark, MQTT, and the Internet of Things

MQTT Overview Using Spark with MQTT

Using Spark with Amazon Kinesis

Kinesis Streams Using Spark with Kinesis

Summary Q&A Workshop

Quiz Answers

Part IV: Managing Spark

Hour 21. Administering Spark

Spark Configuration

Spark Environment Variables Spark Configuration

Administering Spark Standalone

Spark Standalone Revisited Deploying Spark Standalone Clusters Scheduling with Spark Standalone

Administering Spark on YARN

Spark on YARN Revisited Deploying Spark on YARN Managing Spark Applications Running on YARN YARN Scheduling

Summary Q&A Workshop

Quiz Answers

Hour 22. Monitoring Spark

Exploring the Spark Application UI

Jobs Stages Storage Environment Executors Viewing the Status of All Running Applications

Spark History Server

Deploying the Spark History Server Exploring the Spark History Server UI Spark History Server API Access

Spark Metrics Logging in Spark

Log4j

Summary Q&A Workshop

Quiz Answers

Hour 23. Extending and Securing Spark

Isolating Spark

Perimeter Security Gateway Services Authentication and Authorization

Securing Spark Communication

Spark Authentication Using a Shared Secret Encrypting Spark Communication Securing the Spark Web UI

Securing Spark with Kerberos

Kerberos Overview Kerberos with Hadoop Kerberos Configuration with Spark

Summary Q&A Workshop

Quiz Answers

Hour 24. Improving Spark Performance

Benchmarking Spark

Benchmarks Canary Queries Performance Monitoring Solutions

Application Development Best Practices

Application Development Optimizations System, Configuration, or Job Submission Optimizations

Optimizing Partitions

Inefficient Partitioning

Diagnosing Application Performance Issues

Using the Application UI to Diagnose Performance Issues Using the Spark History UI to Diagnose Performance Issues

Summary Q&A Workshop

Quiz Answers

Index Code Snippets
