CONTENTS

Foreword

Acknowledgments

About this Book

PART I
Big Data: From the Business Perspective

1 What Is Big Data? Hint: You’re a Part of It Every Day

Characteristics of Big Data

Can There Be Enough? The Volume of Data

Variety Is the Spice of Life

How Fast Is Fast? The Velocity of Data

Data in the Warehouse and Data in Hadoop (It’s Not a Versus Thing)

Wrapping It Up

2 Why Is Big Data Important?

When to Consider a Big Data Solution

Big Data Use Cases: Patterns for Big Data Deployment

IT for IT Log Analytics

The Fraud Detection Pattern

They Said What? The Social Media Pattern

The Call Center Mantra: “This Call May Be Recorded for Quality Assurance Purposes”

Risk: Patterns for Modeling and Management

Big Data and the Energy Sector

3 Why IBM for Big Data?

Big Data Has No Big Brother: It’s Ready, but Still Young

What Can Your Big Data Partner Do for You?

The IBM $100 Million Big Data Investment

A History of Big Data Innovation

Domain Expertise Matters

PART II
Big Data: From the Technology Perspective

4 All About Hadoop: The Big Data Lingo Chapter

Just the Facts: The History of Hadoop

Components of Hadoop

The Hadoop Distributed File System

The Basics of MapReduce

Hadoop Common Components

Application Development in Hadoop

Pig and PigLatin

Hive

Jaql

Getting Your Data into Hadoop

Basic Copy Data

Flume

Other Hadoop Components

ZooKeeper

HBase

Oozie

Lucene

Avro

Wrapping It Up

5 InfoSphere BigInsights: Analytics for Big Data at Rest

Ease of Use: A Simple Installation Process

Hadoop Components Included in BigInsights 1.2

A Hadoop-Ready Enterprise-Quality File System: GPFS-SNC

Extending GPFS for Hadoop: GPFS Shared Nothing Cluster

What Does a GPFS-SNC Cluster Look Like?

GPFS-SNC Failover Scenarios

GPFS-SNC POSIX-Compliance

GPFS-SNC Performance

GPFS-SNC Hadoop Gives Enterprise Qualities

Compression

Splittable Compression

Compression and Decompression

Administrative Tooling

Security

Enterprise Integration

Netezza

DB2 for Linux, UNIX, and Windows

JDBC Module

InfoSphere Streams

InfoSphere DataStage

R Statistical Analysis Applications

Improved Workload Scheduling: Intelligent Scheduler

Adaptive MapReduce

Data Discovery and Visualization: BigSheets

Advanced Text Analytics Toolkit

Machine Learning Analytics

Large-Scale Indexing

BigInsights Summed Up

6 IBM InfoSphere Streams: Analytics for Big Data in Motion

InfoSphere Streams Basics

Industry Use Cases for InfoSphere Streams

How InfoSphere Streams Works

What’s a Stream?

The Streams Processing Language

Source and Sink Adapters

Operators

Streams Toolkits

Enterprise Class

High Availability

Consumability: Making the Platform Easy to Use

Integration Is the Apex of Enterprise Class Analysis