Index
Foreword
Preface
Goals and Audience
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Architecture and Data Model
Recent Trends
The Role of Databases
Distributed Applications
Fast Random Access
Accessing Sorted Versus Unsorted Data
Versions
History
Data Model
Rows and Columns
Data Modification and Timestamps
Advanced Data Model Components
Column Families
Column Visibility
Full Data Model
Tables
Introduction to the Client API
Approach to Rows
Exploiting Sort Order
Architecture Overview
ZooKeeper
Hadoop
Accumulo
Tablet servers
Master
Garbage collector
Monitor
Client
Thrift proxy
A Typical Cluster
Additional Features
Automatic Data Partitioning
High Consistency
Automatic Load Balancing
Massive Scalability
Failure Tolerance and Automatic Recovery
Support for Analysis: Iterators
Support for Analysis: MapReduce Integration
Data Lifecycle Management
Compression
Robust Timestamps
Accumulo and Other Data Management Systems
Comparisons to Relational Databases
SQL
Transactions
Normalization
Comparisons to Other NoSQL Databases
Data model
Key ordering
Tight Hadoop integration
High versus eventual consistency
Column visibility and access control
Iterators
Dynamic column families and locality groups
Support for very large rows
Parallelized BatchScanners
Namespaces
Use Cases Suited for Accumulo
A New Kind of Flexible Analytical Warehouse
Building the Next Gmail
Massive Graph or Machine-Learning Problems
Relieving Relational Databases
Massive Search Applications
Applications with a Long History of Versioned Data
2. Quick Start
Demo of the Shell
The help Command
Creating a Table and Inserting Some Data
Scanning for Data
Using Authorizations
Using a Simple Iterator
Demo of Java Code
Creating a Table and Inserting Some Data
Scanning for Data
Using Authorizations
Using a Simple Iterator
A More Complete Installation
Other Important Resources
One Last Example with a Unit Test
Additional Resources
3. Basic API
Development Environment
Obtaining the Client Library
Using Maven
Using Maven with an IDE
Configuring the Classpath
Introduction to the Example Application: Wikipedia Pages
Wikipedia Data
Data Modeling
Obtaining Example Code
Downloading Sample Wikipedia Pages
Downloading All English Wikipedia Articles
Connect
Insert
Committing Mutations
Handling Errors
Insert Example
Using Lexicoders
Writing to Multiple Tables
Lookups and Scanning
Lookup Example
Crafting Ranges
Grouping by Rows
Reusing Scanners
Isolated Row Views
Tuning Scanners
Batch Scanning
Update: Overwrite
Overwrite Example
Allowing Multiple Versions
Update: Appending or Incrementing
Update: Read-Modify-Write and Conditional Mutations
Conditional Mutation API
Conditional Mutation Batch API
Conditional Mutation Example
Delete
Deleting and Reinserting
Removing Deleted Data from Disk
Batch Deleter
Testing
MockAccumulo
MiniAccumuloCluster
4. Table API
Basic Table Operations
Creating Tables
Options for creating tables
Renaming
Deleting Tables
Deleting Ranges of Rows
Deleting Entries Returned from a Scan
Configuring Table Properties
Locality Groups
Locality groups example
Bloom Filters
Key functors
Caching
Tablet Splits
Quickly and automatically splitting
Merging tablets
Compacting
Compaction properties
Additional Properties
Online Status
Cloning
Using cloning as a snapshotting mechanism
Importing and Exporting Tables
Additional Administrative Methods
Table Namespaces
Creating
Renaming
Setting Namespace Properties
Deleting
Configuring Iterators
Configuring Constraints
Testing Class Loading for a Namespace
Instance Operations
Setting Properties
Configuration
Cluster Information
Precedence of Properties
5. Security API
Authentication
Permissions
System Permissions
Namespace Permissions
Table Permissions
Authorizations
Column Visibilities
Limiting Authorizations Written
An Example of Using Authorizations
Using a Default Visibility
Making Authorizations Work
Auditing Security Operations
Custom Authentication, Permissions, and Authorization
Custom Authentication Example
Other Security Considerations
Using an Application Account for Multiple Users
Network
Disk Encryption
6. Server-Side Functionality and External Clients
Constraints
Constraint Configuration API
Constraint Configuration Example
Creating Custom Constraints
Custom Constraint Example
Iterators
Iterator Configuration API
VersioningIterator
Iterator Configuration Example
Adding Iterators by Setting Properties
Filtering Iterators
Built-in filters
Custom filters
Custom filtering iterator example
Combiners
Combiners for incrementing or appending updates
Built-in combiners
Custom combiners
Custom combiner example
Other Built-in Iterators
WholeRowIterator example
Low-level iterator API
Thrift Proxy
Starting a Proxy
Python Example
Generating Client Code
Language-Specific Clients
Integration with Other Tools
Apache Hive
Table options
Serializing values
Additional options
Hive example
Optimizing Hive queries
Apache Pig
Pig example
Apache Kafka
Integration with Analytical Tools
7. MapReduce API
Formats
Writing Worker Classes
MapReduce Example
MapReduce over Underlying RFiles
Example of Running a MapReduce Job over RFiles
Delivering Rows to Map Workers
Ingesters and Combiners as MapReduce Computations
MapReduce and Bulk Import
Bulk Ingest to Avoid Duplicates
8. Table Design
Single-Table Designs
Implementing Paging
Secondary Indexing
Index Partitioned by Term
Querying a Term-Partitioned Index
Combining query terms
Querying for a term in a specific field
Maintaining Consistency Across Tables
Using MultiTableBatchWriter for consistency
Index Partitioned by Document
Querying a Document-Partitioned Index
Indexing Data Types
Using Lexicoders in indexing
Custom Lexicoder example: Inet4AddressLexicoder
Full-Text Search
wikipediaMetadata
wikipediaIndex
wikipedia
wikipediaReverseIndex
Ingesting WikiSearch Data
Querying the WikiSearch Data
Designing Row IDs
Lexicoders
Composite Row IDs
Key Size
Avoiding Hotspots
Designing Row IDs for Consistent Updates
Designing Values
Storing Files and Large Values
Human-Readable Versus Binary Values and Formatters
Designing Authorizations
Designing Column Visibilities
9. Advanced Table Designs
Time-Ordered Data
Graphs
Building an Example Graph: Twitter
Traversing Graph Tables
Traversing the Example Twitter Graph
Blueprints for Accumulo
Titan
Semantic Triples
Semantic Triples Example
Spatial Data
Open Source Projects
Space-Filling Curves
Multidimensional Data
D4M and Matlab
D4M Example
Adding D4M to Octave or Matlab
Loading example data
Load example data using Java
Machine Learning
Storing Feature Vectors
A Machine-Learning Example
Approximating Relational and SQL Database Properties
Schema Constraints
SQL Operations
SELECT
WHERE
JOIN, GROUP BY, and ORDER BY
Strategies for Joins
GROUP BY and ORDER BY
10. Internals
Tablet Server
Write Path
Read Path
Resource Manager
Minor compaction
Major compaction
Merging minor compaction
Splits
Write-Ahead Logs
Recovery
File formats
RFile optimizations
Relative key encoding
Locality groups
Bloom filters
Caching
Master
FATE
Load Balancer
Garbage Collector
Monitor
Tracer
Client
Locating Keys
Metadata Table
Uses of ZooKeeper
Accumulo and the CAP Theorem
11. Administration: Setup
Preinstallation
Operating Systems
Kernel Tweaks
Swappiness
Number of open files
Native Libraries
User Accounts
Linux Filesystem
System Services
Software Dependencies
Apache Hadoop
Apache ZooKeeper
Installation
Tarball Distribution Install
Installing on Cloudera’s CDH
Installing on Hortonworks’ HDP
Installing on MapR
Running via Amazon Web Services
Building from Source
Building a tarball distribution
Building native libraries
Configuration
File Permissions
Server Configuration Files
accumulo-env.sh
accumulo-site.xml
Client Configuration
Deploying JARs
Using lib/ext/
Custom JAR loading example
Using HDFS
Setting Up Automatic Failover
Initialization
To reinitialize
Multiple instances
Running Very Large-Scale Clusters
Networking
Limits
Metadata Table
Tablet Sizing
File Sizing
Using Multiple HDFS Volumes
Handling NameNode hostname changes
Security
Column Visibilities and Accumulo Clients
Supporting Software Security
Network Security
Configuring SSL
Encryption of Data at Rest
Kerberized Hadoop
Application Permissions
12. Administration: Running
Starting Accumulo
Via the start-all.sh Script
Via init.d Scripts
Stopping Accumulo
Via the stop-all.sh Script
Via init.d Scripts
Stopping Individual Processes
Starting After a Crash
Monitoring
Monitor Web Service
Overview
Master Server View
Tablet Servers View
Server Activity View
Garbage Collector View
Tables View
Recent Traces View
Documentation View
Recent Logs View
JMX Metrics
Logging
Tracing
Tracing in the shell
Cluster Changes
Adding New Worker Nodes
Removing Worker Nodes
Adding New Control Nodes
Removing Control Nodes
Table Operations
Changing Settings
Altering load balancing
Configuring iterators
Safely deploying custom iterators
Changing Online Status
Cloning
Altering cloned table properties
Cloning for MapReduce
Import, Export, and Backups
Exporting a table
Importing an exported table
Bulk-loading files from a MapReduce job
Data Lifecycle
Versioning
Data Age-off
Ensuring that deletes are removed from tables
Compactions
Using major compaction to apply changes
Compacting specific ranges
Merging Tablets
Garbage Collection
Failure Recovery
Typical Failures
Single machine failure
Single machine unresponsiveness
Network partitions
More-Serious Failures
All NameNodes failing simultaneously
All ZooKeeper servers failing simultaneously
Power loss to the data center
Loss of all replicas of an HDFS data block
Tips for Restoring a Cluster
Replay data
Back up NameNode metadata
Back up table configuration, users, and split points
Turn on HDFS trash
Create an empty RFile
Take Hadoop out of safe mode manually
Troubleshooting
Ensure that processes are running
Check log messages
Understand network partitions
Exception when scanning a table in the shell
Graphs on the monitor are “blocky”
Tablets not balancing across tablet servers
Calculate the size of changes to a cloned table
Unexpected or unexplained query results
Slow queries
Look at ZooKeeper
Use the listscans command
Look at user-initiated compactions
Inspect RFiles
13. Performance
Understanding Read Performance
Understanding Write Performance
BatchWriters
Bulk Loading
Hardware Selection
Storage Devices
Hard disk drives
Storage-area networks
Solid-state disks
Networking
Virtualization
Running in a Public Cloud Environment
Cluster Sizing
Modeling Required Write Performance
Cluster Planning Example
Estimated total volume of data
Types of user requests and indexes required
Compactions
Rate of incoming data
Age-off strategy
Analyzing Performance
Using Tracing
Using the Monitor
Using Local Logs
Tablet Server Tuning
External Settings
HDFS threads used to transfer data
HDFS durable sync
Memory Settings
tserver.memory.maps.max
tserver.memory.maps.native.enabled
Cache settings
Java heap size
tserver.mutation.queue.max
Write-Ahead Log Settings
tserver.wal.replication
tserver.wal.sync
tserver.wal.sync.method
Resource Settings
tserver.compaction.major.concurrent.max
tserver.compaction.minor.concurrent.max
tserver.readahead.concurrent.max
Timeouts
Scaling Vertically
Cluster Tuning
Splitting Tables
Balancing Tablets
Balancing Reads and Writes
Data Locality
Sharing ZooKeeper
A. Shell Commands Quick Reference
Debugging
Exiting
Help
Iterator
Permissions Administration
Shell Execution
Shell State
Table Administration
Table Control
User Administration
Writing, Reading, and Removing Data
B. Metadata Table
Row ID
File Column Family
Scan Column Family
future, last, and loc Column Families
log Column Family
srv Column Family
~tab:~pr Column
Other Columns
C. Data Stored in ZooKeeper
masters, tservers, gc, monitor, and tracers Nodes
problems/problem_info Nodes
root_tablet Node
tables/table_id Nodes
config/system_property_name Node
users/username Nodes
Other Nodes
Index