Mastering Apache Spark 2.x by Kienzler, Romeo

Index

Title Page

Second Edition

Mastering Apache Spark 2.x

Second Edition

About the Author
About the Reviewer
www.PacktPub.com

Why subscribe?

Customer Feedback
Preface

What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support

Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions

A First Taste and What’s New in Apache Spark V2

Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What's new in Apache Spark V2?
Cluster design
Cluster management

Local
Standalone
Apache YARN
Apache Mesos

Cloud-based deployments
Performance

The cluster structure
Hadoop Distributed File System
Data locality
Memory
Coding

Cloud
Summary

Apache Spark SQL

The SparkSession--your gateway to structured data processing
Importing and saving data

Processing the text files
Processing JSON files
Processing the Parquet files

Understanding the DataSource API

Implicit schema discovery
Predicate push-down on smart data sources

DataFrames
Using SQL

Defining schemas manually
Using SQL subqueries
Applying SQL table joins

Using Datasets

The Dataset API in action

User-defined functions
RDDs versus DataFrames versus Datasets
Summary

The Catalyst Optimizer

Understanding the workings of the Catalyst Optimizer
Managing temporary views with the catalog API
The SQL abstract syntax tree
How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan

Internal class and object representations of LEPs
How to optimize the Resolved Logical Execution Plan

Physical Execution Plan generation and selection

Code generation

Practical examples
Using the explain method to obtain the PEP
How smart data sources work internally

Summary

Project Tungsten

Memory management beyond the Java Virtual Machine Garbage Collector

Understanding the UnsafeRow object

The null bit set region
The fixed length values region
The variable length values region

Understanding the BytesToBytesMap
A practical example on memory usage and performance

Cache-friendly layout of data in memory

Cache eviction strategies and pre-fetching

Code generation

Understanding columnar storage
Understanding whole stage code generation

A practical example on whole stage code generation performance
Operator fusing versus the volcano iterator model

Summary

Apache Spark Streaming

Overview
Errors and recovery

Checkpointing

Streaming sources

TCP stream
File streams
Flume
Kafka

Summary

Structured Streaming

The concept of continuous applications

True unification - same code, same engine

Windowing

How streaming engines use windowing
How Apache Spark improves windowing

Increased performance with good old friends
How transparent fault tolerance and exactly-once delivery guarantee is achieved

Replayable sources can replay streams from a given offset
Idempotent sinks prevent data duplication
State versioning guarantees consistent results after reruns

Example - connection to a MQTT message broker

Controlling continuous applications
More on stream life cycle management

Summary

Apache Spark MLlib

Architecture

The development environment

Classification with Naive Bayes

Theory on Classification
Naive Bayes in practice

Clustering with K-Means

Theory on Clustering
K-Means in practice

Artificial neural networks

ANN in practice

Summary

Apache SparkML

What does the new API look like?
The concept of pipelines

Transformers

String indexer
OneHotEncoder
VectorAssembler

Pipelines
Estimators

RandomForestClassifier

Model evaluation
CrossValidation and hyperparameter tuning

CrossValidation
Hyperparameter tuning

Winning a Kaggle competition with Apache SparkML

Data preparation
Feature engineering
Testing the feature engineering pipeline
Training the machine learning model
Model evaluation
CrossValidation and hyperparameter tuning
Using the evaluator to assess the quality of the cross-validated and tuned model

Summary

Apache SystemML

Why do we need just another library?

Why on Apache Spark?
The history of Apache SystemML

A cost-based optimizer for machine learning algorithms

An example - alternating least squares
Apache SystemML architecture

Language parsing
High-level operators are generated
How low-level operators are optimized on

Performance measurements
Apache SystemML in action
Summary

Deep Learning on Apache Spark with DeepLearning4j and H2O

H2O

Overview

The build environment
Architecture
Sourcing the data
Data quality
Performance tuning
Deep Learning

Example code – income

The example code – MNIST
H2O Flow

Deeplearning4j

ND4J - high performance linear algebra for the JVM
Deeplearning4j
Example: an IoT real-time anomaly detector

Mastering chaos: the Lorenz attractor model

Deploying the test data generator

Deploy the Node-RED IoT Starter Boilerplate to the IBM Cloud
Deploying the test data generator flow
Testing the test data generator

Install the Deeplearning4j example within Eclipse
Running the examples in Eclipse
Run the examples in Apache Spark

Summary

Apache Spark GraphX

Overview
Graph analytics/processing with GraphX

The raw data
Creating a graph
Example 1 – counting
Example 2 – filtering
Example 3 – PageRank
Example 4 – triangle counting
Example 5 – connected components

Summary

Apache Spark GraphFrames

Architecture

Graph-relational translation
Materialized views
Join elimination
Join reordering

Examples

Example 1 – counting
Example 2 – filtering
Example 3 – page rank
Example 4 – triangle counting
Example 5 – connected components

Summary

Apache Spark with Jupyter Notebooks on IBM DataScience Experience

Why notebooks are the new standard
Learning by example

The IEEE PHM 2012 data challenge bearing dataset
ETL with Scala
Interactive, exploratory analysis using Python and Pixiedust
Real data science work with SparkR

Summary

Apache Spark on Kubernetes

Bare metal, virtual machines, and containers

Containerization

Namespaces
Control groups
Linux containers

Understanding the core concepts of Docker
Understanding Kubernetes
Using Kubernetes for provisioning containerized Spark applications
Example--Apache Spark on Kubernetes

Prerequisites
Deploying the Apache Spark master
Deploying the Apache Spark workers
Deploying the Zeppelin notebooks

Summary
