Storm Blueprints: Patterns for Distributed Real-Time Computation by Goetz, P. Taylor


Storm Blueprints: Patterns for Distributed Real-time Computation

Table of Contents
Storm Blueprints: Patterns for Distributed Real-time Computation
Credits
About the Authors
About the Reviewers
www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?
Free Access for Packt account holders

Preface

What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support

Downloading the example code
Errata
Piracy
Questions

1. Distributed Word Count

Introducing elements of a Storm topology – streams, spouts, and bolts

Streams
Spouts
Bolts

Introducing the word count topology data flow

Sentence spout

Introducing the split sentence bolt
Introducing the word count bolt
Introducing the report bolt

Implementing the word count topology

Setting up a development environment
Implementing the sentence spout
Implementing the split sentence bolt
Implementing the word count bolt
Implementing the report bolt
Implementing the word count topology

Introducing parallelism in Storm

WordCountTopology parallelism

Adding workers to a topology
Configuring executors and tasks

Understanding stream groupings
Guaranteed processing

Reliability in spouts
Reliability in bolts
Reliable word count

Summary

2. Configuring Storm Clusters

Introducing the anatomy of a Storm cluster

Understanding the nimbus daemon
Working with the supervisor daemon
Introducing Apache ZooKeeper
Working with Storm's DRPC server
Introducing the Storm UI

Introducing the Storm technology stack

Java and Clojure
Python

Installing Storm on Linux

Installing the base operating system
Installing Java
ZooKeeper installation
Storm installation
Running the Storm daemons
Configuring Storm
Mandatory settings
Optional settings
The Storm executable
Setting up the Storm executable on a workstation
The daemon commands

Nimbus
Supervisor
UI
DRPC

The management commands

Jar
Kill
Deactivate
Activate
Rebalance
Remoteconfvalue

Local debug/development commands

REPL
Classpath
Localconfvalue

Submitting topologies to a Storm cluster
Automating the cluster configuration
A rapid introduction to Puppet

Puppet manifests
Puppet classes and modules
Puppet templates
Managing environments with Puppet Hiera
Introducing Hiera

Summary

3. Trident Topologies and Sensor Data

Examining our use case
Introducing Trident topologies
Introducing Trident spouts
Introducing Trident operations – filters and functions

Introducing Trident filters
Introducing Trident functions

Introducing Trident aggregators – Combiners and Reducers

CombinerAggregator
ReducerAggregator
Aggregator

Introducing the Trident state

The Repeat Transactional state
The Opaque state

Executing the topology
Summary

4. Real-time Trend Analysis

Use case
Architecture

The source application
The logback Kafka appender
Apache Kafka
Kafka spout
The XMPP server

Installing the required software

Installing Kafka
Installing OpenFire

Introducing the sample application

Sending log messages to Kafka

Introducing the log analysis topology

Kafka spout
The JSON project function
Calculating a moving average
Adding a sliding window
Implementing the moving average function
Filtering on thresholds
Sending notifications with XMPP

The final topology
Running the log analysis topology
Summary

5. Real-time Graph Analysis

Use case
Architecture

The Twitter client
Kafka spout
A titan-distributed graph database

A brief introduction to graph databases

Accessing the graph – the TinkerPop stack
Manipulating the graph with the Blueprints API
Manipulating the graph with the Gremlin shell

Software installation

Titan installation

Setting up Titan to use the Cassandra storage backend

Installing Cassandra
Starting Titan with the Cassandra backend

Graph data model
Connecting to the Twitter stream

Setting up the Twitter4J client
The OAuth configuration

The TwitterStreamConsumer class
The TwitterStatusListener class

Twitter graph topology

The JSONProjectFunction class

Implementing GraphState

GraphFactory
GraphTupleProcessor
GraphStateFactory
GraphState
GraphUpdater

Implementing GraphFactory
Implementing GraphTupleProcessor
Putting it all together – the TwitterGraphTopology class

The TwitterGraphTopology class

Querying the graph with Gremlin
Summary

6. Artificial Intelligence

Designing for our use case
Establishing the architecture

Examining the design challenges
Implementing the recursion

Accessing the function's return values
Immutable tuple field values
Upfront field declaration
Tuple acknowledgement in recursion
Output to multiple streams
Read-before-write

Solving the challenges

Implementing the architecture

The data model
Examining the recursive topology
The queue interaction
Functions and filters
Examining the Scoring Topology

Addressing read-before-write

Distributed locking
Retry when stale
Executing the topology

Enumerating the game tree

Distributed Remote Procedure Call (DRPC)

Remote deployment

Summary

7. Integrating Druid for Financial Analytics

Use case
Integrating a non-transactional system
The topology

The spout
The filter
The state design

Implementing the architecture

DruidState
Implementing the StormFirehose object
Implementing the partition status in ZooKeeper

Executing the implementation
Examining the analytics
Summary

8. Natural Language Processing

Motivating a Lambda architecture
Examining our use case
Realizing a Lambda architecture
Designing the topology for our use case
Implementing the design

TwitterSpout/TweetEmitter
Functions

TweetSplitterFunction
WordFrequencyFunction
PersistenceFunction

Examining the analytics
Batch processing / historical analysis
Hadoop

An overview of MapReduce
The Druid setup

HadoopDruidIndexer

Summary

9. Deploying Storm on Hadoop for Advertising Analysis

Examining the use case
Establishing the architecture

Examining HDFS
Examining YARN

Configuring the infrastructure

The Hadoop infrastructure
Configuring HDFS

Configuring the NameNode
Configuring the DataNode
Configuring YARN

Configuring the ResourceManager

Configuring the NodeManager

Deploying the analytics

Performing a batch analysis with the Pig infrastructure
Performing a real-time analysis with the Storm-YARN infrastructure

Performing the analytics

Executing the batch analysis
Executing real-time analysis

Deploying the topology
Executing the topology
Summary

10. Storm in the Cloud

Introducing Amazon Elastic Compute Cloud (EC2)

Setting up an AWS account
The AWS Management Console

Creating an SSH key pair

Launching an EC2 instance manually

Logging in to the EC2 instance

Introducing Apache Whirr

Installing Whirr

Configuring a Storm cluster with Whirr

Launching the cluster

Introducing Whirr Storm

Setting up Whirr Storm

Cluster configuration
Customizing Storm's configuration
Customizing firewall rules

Introducing Vagrant

Installing Vagrant
Launching your first virtual machine

The Vagrantfile and shared filesystem
Vagrant provisioning
Configuring multimachine clusters with Vagrant

Creating Storm-provisioning scripts

ZooKeeper
Storm
Supervisord

The Storm Vagrantfile
Launching the Storm cluster

Summary

Index
