HBase: The Definitive Guide, by Lars George

Index

HBase: The Definitive Guide

Foreword
Preface

General Information

HBase Version
Building the Examples
Hush: The HBase URL Shortener
Running Hush

Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments

1. Introduction

The Dawn of Big Data
The Problem with Relational Database Systems
Nonrelational Database Systems, Not-Only SQL or NoSQL?

Dimensions
Scalability
Database (De-)Normalization

Building Blocks

Backdrop
Tables, Rows, Columns, and Cells
Auto-Sharding
Storage API
Implementation
Summary

HBase: The Hadoop Database

History
Nomenclature
Summary

2. Installation

Quick-Start Guide
Requirements

Hardware

Servers
Networking

Software

Operating system
Filesystem
Java
Hadoop
SSH
Domain Name Service
Synchronized time
File handles and process limits
Datanode handlers
Swappiness
Windows

Filesystems for HBase

Local
HDFS
S3
Other Filesystems

Installation Choices

Apache Binary Release
Building from Source

Run Modes

Standalone Mode
Distributed Mode

Pseudodistributed mode
Fully distributed mode

Specifying region servers
ZooKeeper setup
Using the existing ZooKeeper ensemble

Configuration

hbase-site.xml and hbase-default.xml
hbase-env.sh
regionserver
log4j.properties
Example Configuration

hbase-site.xml
regionservers
hbase-env.sh

Client Configuration

Deployment

Script-Based
Apache Whirr
Puppet and Chef

Operating a Cluster

Running and Confirming Your Installation
Web-based UI Introduction
Shell Introduction
Stopping the Cluster

3. Client API: The Basics

General Notes
CRUD Operations

Put Method

Single Puts
The KeyValue class
Client-side write buffer
List of Puts
Atomic compare-and-set

Get Method

Single Gets
The Result class
List of Gets
Related retrieval methods

Delete Method

Single Deletes
List of Deletes
Atomic compare-and-delete

Batch Operations
Row Locks
Scans

Introduction
The ResultScanner Class
Caching Versus Batching

Miscellaneous Features

The HTable Utility Methods
The Bytes Class

4. Client API: Advanced Features

Filters

Introduction to Filters

The filter hierarchy
Comparison operators
Comparators

Comparison Filters

RowFilter
FamilyFilter
QualifierFilter
ValueFilter
DependentColumnFilter

Dedicated Filters

SingleColumnValueFilter
SingleColumnValueExcludeFilter
PrefixFilter
PageFilter
KeyOnlyFilter
FirstKeyOnlyFilter
InclusiveStopFilter
TimestampsFilter
ColumnCountGetFilter
ColumnPaginationFilter
ColumnPrefixFilter
RandomRowFilter

Decorating Filters

SkipFilter
WhileMatchFilter

FilterList
Custom Filters
Filters Summary

Counters

Introduction to Counters
Single Counters
Multiple Counters

Coprocessors

Introduction to Coprocessors
The Coprocessor Class
Coprocessor Loading

Loading from the configuration
Loading from the table descriptor

The RegionObserver Class

Handling region life-cycle events

State: pending open
State: open
State: pending close

Handling client API events
The RegionCoprocessorEnvironment class
The ObserverContext class
The BaseRegionObserver class

The MasterObserver Class

The MasterCoprocessorEnvironment class
The BaseMasterObserver class

Endpoints

The CoprocessorProtocol interface
The BaseEndpointCoprocessor class

HTablePool
Connection Handling

5. Client API: Administrative Features

Schema Definition

Tables
Table Properties
Column Families

HBaseAdmin

Basic Operations
Table Operations
Schema Operations
Cluster Operations
Cluster Status Information

6. Available Clients

Introduction to REST, Thrift, and Avro
Interactive Clients

Native Java
REST

Operation
Supported formats

Plain (text/plain)
XML (text/xml)
JSON (application/json)
Protocol Buffer (application/x-protobuf)
Raw binary (application/octet-stream)

REST Java client

Thrift

Installation
Operation
Example: PHP

Avro

Installation
Operation

Other Clients

Batch Clients

MapReduce

Native Java
Clojure

Hive
Pig
Cascading

Shell

Basics
Commands

General
Data definition
Data manipulation
Tools
Replication

Scripting

Web-based UI

Master UI

Main page
User Table page
ZooKeeper page

Region Server UI

Main page

Shared Pages

7. MapReduce Integration

Framework

MapReduce Introduction
Classes

InputFormat
Mapper
Reducer
OutputFormat

Supporting Classes
MapReduce Locality
Table Splits

MapReduce over HBase

Preparation

Static Provisioning
Dynamic Provisioning

Data Sink
Data Source
Data Source and Sink
Custom Processing

8. Architecture

Seek Versus Transfer

B+ Trees
Log-Structured Merge-Trees

Storage

Overview
Write Path
Files

Root-level files
Table-level files
Region-level files
Region splits
Compactions

HFile Format
KeyValue Format

Write-Ahead Log

Overview
HLog Class
HLogKey Class
WALEdit Class
LogSyncer Class
LogRoller Class
Replay

Single log
Log splitting
Edits recovery

Durability

Read Path
Region Lookups
The Region Life Cycle
ZooKeeper
Replication

Life of a Log Edit

Normal processing
Non-Responding slave clusters

Internals

Choosing region servers to replicate to
Keeping track of logs
Reading, filtering, and sending edits
Cleaning logs
Region server failover

9. Advanced Usage

Key Design

Concepts
Tall-Narrow Versus Flat-Wide Tables
Partial Key Scans
Pagination
Time Series Data
Time-Ordered Relations

Advanced Schemas
Secondary Indexes
Search Integration
Transactions
Bloom Filters
Versioning

Implicit Versioning
Custom Versioning

10. Cluster Monitoring

Introduction
The Metrics Framework

Contexts, Records, and Metrics
Master Metrics
Region Server Metrics
RPC Metrics
JVM Metrics
Info Metrics

Ganglia

Installation

Ganglia-related steps

Ganglia monitoring daemon
Ganglia meta daemon
Ganglia web frontend

HBase-related steps

Usage

JMX

JConsole
JMX Remote API

Nagios

11. Performance Tuning

Garbage Collection Tuning
Memstore-Local Allocation Buffer
Compression

Available Codecs

Snappy
LZO
GZIP

Verifying Installation

Compression test tool
Startup check

Enabling Compression

Optimizing Splits and Compactions

Managed Splitting
Region Hotspotting
Presplitting Regions

Load Balancing
Merging Regions
Client API: Best Practices
Configuration
Load Tests

Performance Evaluation
YCSB

12. Cluster Administration

Operational Tasks

Node Decommissioning
Rolling Restarts
Adding Servers

Pseudodistributed mode

Adding a local backup master
Adding a local region server

Fully distributed cluster

Adding a backup master
Adding a region server

Data Tasks

Import and Export Tools
CopyTable Tool
Bulk Import

Bulk load procedure
Using the importtsv tool
Using the completebulkload Tool
Advanced usage

Replication

Additional Tasks

Coexisting Clusters
Required Ports

Changing Logging Levels
Troubleshooting

HBase Fsck
Analyzing the Logs
Common Issues

Basic setup checklist

File handles
DataNode connections
Compression
Garbage collection/memory tuning

Stability issues

ZooKeeper problems
“Could not obtain block” errors

A. HBase Configuration Properties
B. Road Map

HBase 0.92.0
HBase 0.94.0

C. Upgrade from Previous Releases

Upgrading to HBase 0.90.x

From 0.20.x or 0.89.x
Within 0.90.x

Upgrading to HBase 0.92.0

D. Distributions

Cloudera’s Distribution Including Apache Hadoop

E. Hush SQL Schema
F. HBase Versus Bigtable
Index
About the Author
Colophon
