
Index
Mastering Spark for Data Science
Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The Big Data Science Ecosystem
Introducing the Big Data ecosystem
Data management
Data management responsibilities
The right tool for the job
Overall architecture
Data Ingestion
Data Lake
Reliable storage
Scalable data processing capability
Data science platform
Data Access
Data technologies
The role of Apache Spark
Companion tools
Apache HDFS
Advantages
Disadvantages
Installation
Amazon S3
Advantages
Disadvantages
Installation
Apache Kafka
Advantages
Disadvantages
Installation
Apache Parquet
Advantages
Disadvantages
Installation
Apache Avro
Advantages
Disadvantages
Installation
Apache NiFi
Advantages
Disadvantages
Installation
Apache YARN
Advantages
Disadvantages
Installation
Apache Lucene
Advantages
Disadvantages
Installation
Kibana
Advantages
Disadvantages
Installation
Elasticsearch
Advantages
Disadvantages
Installation
Accumulo
Advantages
Disadvantages
Installation
Summary
2. Data Acquisition
Data pipelines
Universal ingestion framework
Introducing the GDELT news stream
Discovering GDELT in real-time
Our first GDELT feed
Improving with publish and subscribe
Content registry
Choices and more choices
Going with the flow
Metadata model
Kibana dashboard
Quality assurance
Example 1 - Basic quality checking, no contending users
Example 2 - Advanced quality checking, no contending users
Example 3 - Basic quality checking, 50% utility due to contending users
Summary
3. Input Formats and Schema
A structured life is a good life
GDELT dimensional modeling
GDELT model
First look at the data
Core global knowledge graph model
Hidden complexity
Denormalized models
Challenges with flattened data
Issue 1 - Loss of contextual information
Issue 2 - Re-establishing dimensions
Issue 3 - Including reference data
Loading your data
Schema agility
Reality check
GKG ELT
Position matters
Avro
Spark-Avro method
Pedagogical method
When to perform Avro transformation
Parquet
Summary
4. Exploratory Data Analysis
The problem, principles and planning
Understanding the EDA problem
Design principles
General plan of exploration
Preparation
Introducing mask based data profiling
Introducing character class masks
Building a mask based profiler
Setting up Apache Zeppelin
Constructing a reusable notebook
Exploring GDELT
GDELT GKG datasets
The files
Special collections
Reference data
Exploring the GKG v2.1
The Translingual files
A configurable GCAM time series EDA
Plot.ly charting on Apache Zeppelin
Exploring translation sourced GCAM sentiment with plot.ly
Concluding remarks
A configurable GCAM Spatio-Temporal EDA
Introducing GeoGCAM
Does our spatial pivot work?
Summary
5. Spark for Geographic Analysis
GDELT and oil
GDELT events
GDELT GKG
Formulating a plan of action
GeoMesa
Installing
GDELT Ingest
GeoMesa Ingest
MapReduce to Spark
Geohash
GeoServer
Map layers
CQL
Gauging oil prices
Using the GeoMesa query API
Data preparation
Machine learning
Naive Bayes
Results
Analysis
Summary
6. Scraping Link-Based External Data
Building a web scale news scanner
Accessing the web content
The Goose library
Integration with Spark
Scala compatibility
Serialization issues
Creating a scalable, production-ready library
Build once, read many
Exception handling
Performance tuning
Named entity recognition
Scala libraries
NLP walkthrough
Extracting entities
Abstracting methods
Building scalable code
Build once, read many
Scalability is also a state of mind
Performance tuning
GIS lookup
GeoNames dataset
Building an efficient join
Offline strategy - Bloom filtering
Online strategy - Hash partitioning
Content deduplication
Context learning
Location scoring
Names de-duplication
Functional programming with Scalaz
Our de-duplication strategy
Using the mappend operator
Simple clean
DoubleMetaphone
News index dashboard
Summary
7. Building Communities
Building a graph of persons
Contact chaining
Extracting data from Elasticsearch
Using the Accumulo database
Setup Accumulo
Cell security
Iterators
Elasticsearch to Accumulo
A graph data model in Accumulo
Hadoop input and output formats
Reading from Accumulo
AccumuloGraphxInputFormat and EdgeWritable
Building a graph
Community detection algorithm
Louvain algorithm
Weighted Community Clustering (WCC)
Description
Preprocessing stage
Initial communities
Message passing
Community back propagation
WCC iteration
Gathering community statistics
WCC Computation
WCC iteration
GDELT dataset
The Bowie effect
Smaller communities
Using Accumulo cell level security
Summary
8. Building a Recommendation System
Different approaches
Collaborative filtering
Content-based filtering
Custom approach
Uninformed data
Processing bytes
Creating scalable code
From time to frequency domain
Fast Fourier transform
Sampling by time window
Extracting audio signatures
Building a song analyzer
Selling data science is all about selling cupcakes
Using Cassandra
Using the Play framework
Building a recommender
The PageRank algorithm
Building a Graph of Frequency Co-occurrence
Running PageRank
Building personalized playlists
Expanding our cupcake factory
Building a playlist service
Leveraging the Spark job server
User interface
Summary
9. News Dictionary and Real-Time Tagging System
The mechanical Turk
Human intelligence tasks
Bootstrapping a classification model
Learning from Stack Exchange
Building text features
Training a Naive Bayes model
Laziness, impatience, and hubris
Designing a Spark Streaming application
A tale of two architectures
The CAP theorem
The Greeks are here to help
Importance of the Lambda architecture
Importance of the Kappa architecture
Consuming data streams
Creating a GDELT data stream
Creating a Kafka topic
Publishing content to a Kafka topic
Consuming Kafka from Spark Streaming
Creating a Twitter data stream
Processing Twitter data
Extracting URLs and hashtags
Keeping popular hashtags
Expanding shortened URLs
Fetching HTML content
Using Elasticsearch as a caching layer
Classifying data
Training a Naive Bayes model
Thread safety
Predicting the GDELT data
Our Twitter mechanical Turk
Summary
10. Story De-duplication and Mutation
Detecting near duplicates
First steps with hashing
Standing on the shoulders of the Internet giants
Simhashing
The Hamming weight
Detecting near duplicates in GDELT
Indexing the GDELT database
Persisting our RDDs
Building a REST API
Area of improvement
Building stories
Building term frequency vectors
The curse of dimensionality, the data science plague
Optimizing KMeans
Story mutation
The equilibrium state
Tracking stories over time
Building a streaming application
Streaming KMeans
Visualization
Building story connections
Summary
11. Anomaly Detection on Sentiment Analysis
Following the US elections on Twitter
Acquiring data in stream
Acquiring data in batch
The search API
Rate limit
Analysing sentiment
Massaging Twitter data
Using the Stanford NLP
Building the Pipeline
Using Timely as a time series database
Storing data
Using Grafana to visualize sentiment
Number of processed tweets
Give me my Twitter account back
Identifying the swing states
Twitter and the Godwin point
Learning context
Visualizing our model
Word2Graph and Godwin point
Building a Word2Graph
Random walks
A small step into sarcasm detection
Building features
#LoveTrumpsHates
Scoring Emojis
Training a KMeans model
Detecting anomalies
Summary
12. TrendCalculus
Studying trends
The TrendCalculus algorithm
Trend windows
Simple trend
User Defined Aggregate Functions
Simple trend calculation
Reversal rule
Introducing the FHLS bar structure
Visualize the data
FHLS with reversals
Edge cases
Zero values
Completing the gaps
Stackable processing
Practical applications
Algorithm characteristics
Advantages
Disadvantages
Possible use cases
Chart annotation
Co-trending
Data reduction
Indexing
Fractal dimension
Streaming proxy for piecewise linear regression
Summary
13. Secure Data
Data security
The problem
The basics
Authentication and authorization
Access control lists (ACL)
Role-based access control (RBAC)
Access
Encryption
Data at rest
Java KeyStore
S3 encryption
Data in transit
Obfuscation/Anonymizing
Masking
Tokenization
Using a Hybrid approach
Data disposal
Kerberos authentication
Use case 1: Apache Spark accessing data in secure HDFS
Use case 2: extending to automated authentication
Use case 3: connecting to secure databases from Spark
Security ecosystem
Apache Sentry
RecordService
Apache Ranger
Apache Knox
Your Secure Responsibility
Summary
14. Scalable Algorithms
General principles
Spark architecture
History of Spark
Moving parts
Driver
SparkSession
Resilient distributed datasets (RDDs)
Executor
Shuffle operation
Cluster Manager
Task
DAG
DAG scheduler
Transformations
Stages
Actions
Task scheduler
Challenges
Algorithmic complexity
Numerical anomalies
Shuffle
Data schemes
Plotting your course
Be iterative
Data preparation
Scale up slowly
Estimate performance
Step through carefully
Tune your analytic
Design patterns and techniques
Spark APIs
Problem
Solution
Example
Summary pattern
Problem
Solution
Example
Expand and Conquer Pattern
Problem
Solution
Lightweight Shuffle
Problem
Solution
Wide Table pattern
Problem
Solution
Example
Broadcast variables pattern
Problem
Solution
Creating a broadcast variable
Accessing a broadcast variable
Removing a broadcast variable
Example
Combiner pattern
Problem
Solution
Example
Optimized cluster
Problem
Solution
Redistribution pattern
Problem
Solution
Example
Salting key pattern
Problem
Solution
Secondary sort pattern
Problem
Solution
Example
Filter overkill pattern
Problem
Solution
Probabilistic algorithms
Problem
Solution
Example
Selective caching
Problem
Solution
Garbage collection
Problem
Solution
Graph traversal
Problem
Solution
Example
Summary