
Index
Mastering Spark for Data Science
Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The Big Data Science Ecosystem
Introducing the Big Data ecosystem
Data management
Data management responsibilities
The right tool for the job
Overall architecture
Data Ingestion
Data Lake
Reliable storage
Scalable data processing capability
Data science platform
Data Access
Data technologies
The role of Apache Spark
Companion tools
Apache HDFS
Advantages
Disadvantages
Installation
Amazon S3
Advantages
Disadvantages
Installation
Apache Kafka
Advantages
Disadvantages
Installation
Apache Parquet
Advantages
Disadvantages
Installation
Apache Avro
Advantages
Disadvantages
Installation
Apache NiFi
Advantages
Disadvantages
Installation
Apache YARN
Advantages
Disadvantages
Installation
Apache Lucene
Advantages
Disadvantages
Installation
Kibana
Advantages
Disadvantages
Installation
Elasticsearch
Advantages
Disadvantages
Installation
Accumulo
Advantages
Disadvantages
Installation
Summary
2. Data Acquisition
Data pipelines
Universal ingestion framework
Introducing the GDELT news stream
Discovering GDELT in real-time
Our first GDELT feed
Improving with publish and subscribe
Content registry
Choices and more choices
Going with the flow
Metadata model
Kibana dashboard
Quality assurance
Example 1 - Basic quality checking, no contending users
Example 2 - Advanced quality checking, no contending users
Example 3 - Basic quality checking, 50% utility due to contending users
Summary
3. Input Formats and Schema
A structured life is a good life
GDELT dimensional modeling
GDELT model
First look at the data
Core global knowledge graph model
Hidden complexity
Denormalized models
Challenges with flattened data
Issue 1 - Loss of contextual information
Issue 2 - Re-establishing dimensions
Issue 3 - Including reference data
Loading your data
Schema agility
Reality check
GKG ELT
Position matters
Avro
Spark-Avro method
Pedagogical method
When to perform Avro transformation
Parquet
Summary
4. Exploratory Data Analysis
The problem, principles and planning
Understanding the EDA problem
Design principles
General plan of exploration
Preparation
Introducing mask based data profiling
Introducing character class masks
Building a mask based profiler
Setting up Apache Zeppelin
Constructing a reusable notebook
Exploring GDELT
GDELT GKG datasets
The files
Special collections
Reference data
Exploring the GKG v2.1
The Translingual files
A configurable GCAM time series EDA
Plot.ly charting on Apache Zeppelin
Exploring translation sourced GCAM sentiment with plot.ly
Concluding remarks
A configurable GCAM Spatio-Temporal EDA
Introducing GeoGCAM
Does our spatial pivot work?
Summary
5. Spark for Geographic Analysis
GDELT and oil
GDELT events
GDELT GKG
Formulating a plan of action
GeoMesa
Installing
GDELT Ingest
GeoMesa Ingest
MapReduce to Spark
Geohash
GeoServer
Map layers
CQL
Gauging oil prices
Using the GeoMesa query API
Data preparation
Machine learning
Naive Bayes
Results
Analysis
Summary
6. Scraping Link-Based External Data
Building a web scale news scanner
Accessing the web content
The Goose library
Integration with Spark
Scala compatibility
Serialization issues
Creating a scalable, production-ready library
Build once, read many
Exception handling
Performance tuning
Named entity recognition
Scala libraries
NLP walkthrough
Extracting entities
Abstracting methods
Building scalable code
Build once, read many
Scalability is also a state of mind
Performance tuning
GIS lookup
GeoNames dataset
Building an efficient join
Offline strategy - Bloom filtering
Online strategy - Hash partitioning
Content deduplication
Context learning
Location scoring
Names de-duplication
Functional programming with Scalaz
Our de-duplication strategy
Using the mappend operator
Simple clean
DoubleMetaphone
News index dashboard
Summary
7. Building Communities
Building a graph of persons
Contact chaining
Extracting data from Elasticsearch
Using the Accumulo database
Setup Accumulo
Cell security
Iterators
Elasticsearch to Accumulo
A graph data model in Accumulo
Hadoop input and output formats
Reading from Accumulo
AccumuloGraphxInputFormat and EdgeWritable
Building a graph
Community detection algorithm
Louvain algorithm
Weighted Community Clustering (WCC)
Description
Preprocessing stage
Initial communities
Message passing
Community back propagation
WCC iteration
Gathering community statistics
WCC Computation
WCC iteration
GDELT dataset
The Bowie effect
Smaller communities
Using Accumulo cell level security
Summary
8. Building a Recommendation System
Different approaches
Collaborative filtering
Content-based filtering
Custom approach
Uninformed data
Processing bytes
Creating scalable code
From time to frequency domain
Fast Fourier transform
Sampling by time window
Extracting audio signatures
Building a song analyzer
Selling data science is all about selling cupcakes
Using Cassandra
Using the Play framework
Building a recommender
The PageRank algorithm
Building a Graph of Frequency Co-occurrence
Running PageRank
Building personalized playlists
Expanding our cupcake factory
Building a playlist service
Leveraging the Spark job server
User interface
Summary
9. News Dictionary and Real-Time Tagging System
The mechanical Turk
Human intelligence tasks
Bootstrapping a classification model
Learning from Stack Exchange
Building text features
Training a Naive Bayes model
Laziness, impatience, and hubris
Designing a Spark Streaming application
A tale of two architectures
The CAP theorem
The Greeks are here to help
Importance of the Lambda architecture
Importance of the Kappa architecture
Consuming data streams
Creating a GDELT data stream
Creating a Kafka topic
Publishing content to a Kafka topic
Consuming Kafka from Spark Streaming
Creating a Twitter data stream
Processing Twitter data
Extracting URLs and hashtags
Keeping popular hashtags
Expanding shortened URLs
Fetching HTML content
Using Elasticsearch as a caching layer
Classifying data
Training a Naive Bayes model
Thread safety
Predicting the GDELT data
Our Twitter mechanical Turk
Summary
10. Story De-duplication and Mutation
Detecting near duplicates
First steps with hashing
Standing on the shoulders of the Internet giants
Simhashing
The Hamming weight
Detecting near duplicates in GDELT
Indexing the GDELT database
Persisting our RDDs
Building a REST API
Area of improvement
Building stories
Building term frequency vectors
The curse of dimensionality, the data science plague
Optimizing KMeans
Story mutation
The equilibrium state
Tracking stories over time
Building a streaming application
Streaming KMeans
Visualization
Building story connections
Summary
11. Anomaly Detection on Sentiment Analysis
Following the US elections on Twitter
Acquiring data in stream
Acquiring data in batch
The search API
Rate limit
Analysing sentiment
Massaging Twitter data
Using the Stanford NLP
Building the Pipeline
Using Timely as a time series database
Storing data
Using Grafana to visualize sentiment
Number of processed tweets
Give me my Twitter account back
Identifying the swing states
Twitter and the Godwin point
Learning context
Visualizing our model
Word2Graph and Godwin point
Building a Word2Graph
Random walks
A small step into sarcasm detection
Building features
#LoveTrumpsHates
Scoring Emojis
Training a KMeans model
Detecting anomalies
Summary
12. TrendCalculus
Studying trends
The TrendCalculus algorithm
Trend windows
Simple trend
User Defined Aggregate Functions
Simple trend calculation
Reversal rule
Introducing the FHLS bar structure
Visualize the data
FHLS with reversals
Edge cases
Zero values
Completing the gaps
Stackable processing
Practical applications
Algorithm characteristics
Advantages
Disadvantages
Possible use cases
Chart annotation
Co-trending
Data reduction
Indexing
Fractal dimension
Streaming proxy for piecewise linear regression
Summary
13. Secure Data
Data security
The problem
The basics
Authentication and authorization
Access control lists (ACL)
Role-based access control (RBAC)
Access
Encryption
Data at rest
Java KeyStore
S3 encryption
Data in transit
Obfuscation/Anonymizing
Masking
Tokenization
Using a Hybrid approach
Data disposal
Kerberos authentication
Use case 1: Apache Spark accessing data in secure HDFS
Use case 2: extending to automated authentication
Use case 3: connecting to secure databases from Spark
Security ecosystem
Apache Sentry
RecordService
Apache Ranger
Apache Knox
Your Secure Responsibility
Summary
14. Scalable Algorithms
General principles
Spark architecture
History of Spark
Moving parts
Driver
SparkSession
Resilient distributed datasets (RDDs)
Executor
Shuffle operation
Cluster Manager
Task
DAG
DAG scheduler
Transformations
Stages
Actions
Task scheduler
Challenges
Algorithmic complexity
Numerical anomalies
Shuffle
Data schemes
Plotting your course
Be iterative
Data preparation
Scale up slowly
Estimate performance
Step through carefully
Tune your analytic
Design patterns and techniques
Spark APIs
Problem
Solution
Example
Summary pattern
Problem
Solution
Example
Expand and Conquer Pattern
Problem
Solution
Lightweight Shuffle
Problem
Solution
Wide Table pattern
Problem
Solution
Example
Broadcast variables pattern
Problem
Solution
Creating a broadcast variable
Accessing a broadcast variable
Removing a broadcast variable
Example
Combiner pattern
Problem
Solution
Example
Optimized cluster
Problem
Solution
Redistribution pattern
Problem
Solution
Example
Salting key pattern
Problem
Solution
Secondary sort pattern
Problem
Solution
Example
Filter overkill pattern
Problem
Solution
Probabilistic algorithms
Problem
Solution
Example
Selective caching
Problem
Solution
Garbage collection
Problem
Solution
Graph traversal
Problem
Solution
Example
Summary