Index
Title Page
Second Edition
Copyright
Mastering Apache Spark 2.x
Second Edition
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
A First Taste and What’s New in Apache Spark V2
Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What's new in Apache Spark V2?
Cluster design
Cluster management
Local
Standalone
Apache YARN
Apache Mesos
Cloud-based deployments
Performance
The cluster structure
Hadoop Distributed File System
Data locality
Memory
Coding
Cloud
Summary
Apache Spark SQL
The SparkSession--your gateway to structured data processing
Importing and saving data
Processing the text files
Processing JSON files
Processing the Parquet files
Understanding the DataSource API
Implicit schema discovery
Predicate push-down on smart data sources
DataFrames
Using SQL
Defining schemas manually
Using SQL subqueries
Applying SQL table joins
Using Datasets
The Dataset API in action
User-defined functions
RDDs versus DataFrames versus Datasets
Summary
The Catalyst Optimizer
Understanding the workings of the Catalyst Optimizer
Managing temporary views with the catalog API
The SQL abstract syntax tree
How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
Internal class and object representations of LEPs
How to optimize the Resolved Logical Execution Plan
Physical Execution Plan generation and selection
Code generation
Practical examples
Using the explain method to obtain the PEP
How smart data sources work internally
Summary
Project Tungsten
Memory management beyond the Java Virtual Machine Garbage Collector
Understanding the UnsafeRow object
The null bit set region
The fixed length values region
The variable length values region
Understanding the BytesToBytesMap
A practical example on memory usage and performance
Cache-friendly layout of data in memory
Cache eviction strategies and pre-fetching
Code generation
Understanding columnar storage
Understanding whole stage code generation
A practical example on whole stage code generation performance
Operator fusing versus the volcano iterator model
Summary
Apache Spark Streaming
Overview
Errors and recovery
Checkpointing
Streaming sources
TCP stream
File streams
Flume
Kafka
Summary
Structured Streaming
The concept of continuous applications
True unification - same code, same engine
Windowing
How streaming engines use windowing
How Apache Spark improves windowing
Increased performance with good old friends
How transparent fault tolerance and exactly-once delivery guarantee is achieved
Replayable sources can replay streams from a given offset
Idempotent sinks prevent data duplication
State versioning guarantees consistent results after reruns
Example - connection to a MQTT message broker
Controlling continuous applications
More on stream life cycle management
Summary
Apache Spark MLlib
Architecture
The development environment
Classification with Naive Bayes
Theory on Classification
Naive Bayes in practice
Clustering with K-Means
Theory on Clustering
K-Means in practice
Artificial neural networks
ANN in practice
Summary
Apache SparkML
What does the new API look like?
The concept of pipelines
Transformers
String indexer
OneHotEncoder
VectorAssembler
Pipelines
Estimators
RandomForestClassifier
Model evaluation
CrossValidation and hyperparameter tuning
CrossValidation
Hyperparameter tuning
Winning a Kaggle competition with Apache SparkML
Data preparation
Feature engineering
Testing the feature engineering pipeline
Training the machine learning model
Model evaluation
CrossValidation and hyperparameter tuning
Using the evaluator to assess the quality of the cross-validated and tuned model
Summary
Apache SystemML
Why do we need just another library?
Why on Apache Spark?
The history of Apache SystemML
A cost-based optimizer for machine learning algorithms
An example - alternating least squares
Apache SystemML architecture
Language parsing
High-level operators are generated
How low-level operators are optimized on
Performance measurements
Apache SystemML in action
Summary
Deep Learning on Apache Spark with DeepLearning4j and H2O
H2O
Overview
The build environment
Architecture
Sourcing the data
Data quality
Performance tuning
Deep Learning
Example code – income
The example code – MNIST
H2O Flow
Deeplearning4j
ND4J - high performance linear algebra for the JVM
Deeplearning4j
Example: an IoT real-time anomaly detector
Mastering chaos: the Lorenz attractor model
Deploying the test data generator
Deploy the Node-RED IoT Starter Boilerplate to the IBM Cloud
Deploying the test data generator flow
Testing the test data generator
Install the Deeplearning4j example within Eclipse
Running the examples in Eclipse
Run the examples in Apache Spark
Summary
Apache Spark GraphX
Overview
Graph analytics/processing with GraphX
The raw data
Creating a graph
Example 1 – counting
Example 2 – filtering
Example 3 – PageRank
Example 4 – triangle counting
Example 5 – connected components
Summary
Apache Spark GraphFrames
Architecture
Graph-relational translation
Materialized views
Join elimination
Join reordering
Examples
Example 1 – counting
Example 2 – filtering
Example 3 – page rank
Example 4 – triangle counting
Example 5 – connected components
Summary
Apache Spark with Jupyter Notebooks on IBM DataScience Experience
Why notebooks are the new standard
Learning by example
The IEEE PHM 2012 data challenge bearing dataset
ETL with Scala
Interactive, exploratory analysis using Python and Pixiedust
Real data science work with SparkR
Summary
Apache Spark on Kubernetes
Bare metal, virtual machines, and containers
Containerization
Namespaces
Control groups
Linux containers
Understanding the core concepts of Docker
Understanding Kubernetes
Using Kubernetes for provisioning containerized Spark applications
Example--Apache Spark on Kubernetes
Prerequisites
Deploying the Apache Spark master
Deploying the Apache Spark workers
Deploying the Zeppelin notebooks
Summary