Log In
Or create an account -> 
Imperial Library
  • Home
  • About
  • News
  • Upload
  • Forum
  • Help
  • Login/SignUp

Index
Hadoop: The Definitive Guide Dedication Foreword Preface
Administrative Notes What’s New in the Fourth Edition? What’s New in the Third Edition? What’s New in the Second Edition? Conventions Used in This Book Using Code Examples Safari® Books Online How to Contact Us Acknowledgments
I. Hadoop Fundamentals
1. Meet Hadoop
Data! Data Storage and Analysis Querying All Your Data Beyond Batch Comparison with Other Systems
Relational Database Management Systems Grid Computing Volunteer Computing
A Brief History of Apache Hadoop What’s in This Book?
2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools Analyzing the Data with Hadoop
Map and Reduce Java MapReduce
A test run
Scaling Out
Data Flow Combiner Functions
Specifying a combiner function
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby Python
3. The Hadoop Distributed Filesystem
The Design of HDFS HDFS Concepts
Blocks Namenodes and Datanodes Block Caching HDFS Federation HDFS High Availability
Failover and fencing
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
HTTP C NFS FUSE
The Java Interface
Reading Data from a Hadoop URL Reading Data Using the FileSystem API
FSDataInputStream
Writing Data
FSDataOutputStream
Directories Querying the Filesystem
File metadata: FileStatus Listing files File patterns PathFilter
Deleting Data
Data Flow
Anatomy of a File Read Anatomy of a File Write Coherency Model
Consequences for application design
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
4. YARN
Anatomy of a YARN Application Run
Resource Requests Application Lifespan Building YARN Applications
YARN Compared to MapReduce 1 Scheduling in YARN
Scheduler Options Capacity Scheduler Configuration
Queue placement
Fair Scheduler Configuration
Enabling the Fair Scheduler Queue configuration Queue placement Preemption
Delay Scheduling Dominant Resource Fairness
Further Reading
5. Hadoop I/O
Data Integrity
Data Integrity in HDFS LocalFileSystem ChecksumFileSystem
Compression
Codecs
Compressing and decompressing streams with CompressionCodec Inferring CompressionCodecs using CompressionCodecFactory Native libraries
CodecPool
Compression and Input Splits Using Compression in MapReduce
Compressing map output
Serialization
The Writable Interface
WritableComparable and comparators
Writable Classes
Writable wrappers for Java primitives Text
Indexing Unicode Iteration Mutability Resorting to String
BytesWritable NullWritable ObjectWritable and GenericWritable Writable collections
Implementing a Custom Writable
Implementing a RawComparator for speed Custom comparators
Serialization Frameworks
Serialization IDL
File-Based Data Structures
SequenceFile
Writing a SequenceFile Reading a SequenceFile Displaying a SequenceFile with the command-line interface Sorting and merging SequenceFiles The SequenceFile format
MapFile
MapFile variants
Other File Formats and Column-Oriented Formats
II. MapReduce
6. Developing a MapReduce Application
The Configuration API
Combining Resources Variable Expansion
Setting Up the Development Environment
Managing Configuration GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test with MRUnit
Mapper Reducer
Running Locally on Test Data
Running a Job in a Local Job Runner Testing the Driver
Running on a Cluster
Packaging a Job
The client classpath The task classpath Packaging dependencies Task classpath precedence
Launching a Job The MapReduce Web UI
The resource manager page The MapReduce job page
Retrieving the Results Debugging a Job
The tasks and task attempts pages Handling malformed data
Hadoop Logs Remote Debugging
Tuning a Job
Profiling Tasks
The HPROF profiler
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs JobControl Apache Oozie
Defining an Oozie workflow Packaging and deploying an Oozie workflow application Running an Oozie workflow job
7. How MapReduce Works
Anatomy of a MapReduce Job Run
Job Submission Job Initialization Task Assignment Task Execution
Streaming
Progress and Status Updates Job Completion
Failures
Task Failure Application Master Failure Node Manager Failure Resource Manager Failure
Shuffle and Sort
The Map Side The Reduce Side Configuration Tuning
Task Execution
The Task Execution Environment
Streaming environment variables
Speculative Execution Output Committers
Task side-effect files
8. MapReduce Types and Formats
MapReduce Types
The Default MapReduce Job
The default Streaming job Keys and values in Streaming
Input Formats
Input Splits and Records
FileInputFormat FileInputFormat input paths FileInputFormat input splits Small files and CombineFileInputFormat Preventing splitting File information in the mapper Processing a whole file as a record
Text Input
TextInputFormat
Controlling the maximum line length
KeyValueTextInputFormat NLineInputFormat XML
Binary Input
SequenceFileInputFormat SequenceFileAsTextInputFormat SequenceFileAsBinaryInputFormat FixedLengthInputFormat
Multiple Inputs Database Input (and Output)
Output Formats
Text Output Binary Output
SequenceFileOutputFormat SequenceFileAsBinaryOutputFormat MapFileOutputFormat
Multiple Outputs
An example: Partitioning data MultipleOutputs
Lazy Output Database Output
9. MapReduce Features
Counters
Built-in Counters
Task counters Job counters
User-Defined Java Counters
Dynamic counters Retrieving counters
User-Defined Streaming Counters
Sorting
Preparation Partial Sort Total Sort Secondary Sort
Java code Streaming
Joins
Map-Side Joins Reduce-Side Joins
Side Data Distribution
Using the Job Configuration Distributed Cache
Usage How it works The distributed cache API
MapReduce Library Classes
III. Hadoop Operations
10. Setting Up a Hadoop Cluster
Cluster Specification
Cluster Sizing
Master node scenarios
Network Topology
Rack awareness
Cluster Setup and Installation
Installing Java Creating Unix User Accounts Installing Hadoop Configuring SSH Configuring Hadoop Formatting the HDFS Filesystem Starting and Stopping the Daemons Creating User Directories
Hadoop Configuration
Configuration Management Environment Settings
Java Memory heap size System logfiles SSH settings
Important Hadoop Daemon Properties
HDFS YARN Memory settings in YARN and MapReduce CPU settings in YARN and MapReduce
Hadoop Daemon Addresses and Ports Other Hadoop Properties
Cluster membership Buffer size HDFS block size Reserved storage space Trash Job scheduler Reduce slow start Short-circuit local reads
Security
Kerberos and Hadoop
An example
Delegation Tokens Other Security Enhancements
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
Benchmarking MapReduce with TeraSort Other benchmarks
User Jobs
11. Administering Hadoop
HDFS
Persistent Data Structures
Namenode directory structure The filesystem image and edit log Secondary namenode directory structure Datanode directory structure
Safe Mode
Entering and leaving safe mode
Audit Logging Tools
dfsadmin Filesystem check (fsck)
Finding the blocks for a file
Datanode block scanner Balancer
Monitoring
Logging
Setting log levels Getting stack traces
Metrics and JMX
Maintenance
Routine Administration Procedures
Metadata backups Data backups Filesystem check (fsck) Filesystem balancer
Commissioning and Decommissioning Nodes
Commissioning new nodes Decommissioning old nodes
Upgrades
HDFS data and metadata upgrades
Start the upgrade Wait until the upgrade is complete Check the upgrade Roll back the upgrade (optional) Finalize the upgrade (optional)
IV. Related Projects
12. Avro
Avro Data Types and Schemas In-Memory Serialization and Deserialization
The Specific API
Avro Datafiles Interoperability
Python API Avro Tools
Schema Resolution Sort Order Avro MapReduce Sorting Using Avro MapReduce Avro in Other Languages
13. Parquet
Data Model
Nested Encoding
Parquet File Format Parquet Configuration Writing and Reading Parquet Files
Avro, Protocol Buffers, and Thrift
Projection and read schemas
Parquet MapReduce
14. Flume
Installing Flume An Example Transactions and Reliability
Batching
The HDFS Sink
Partitioning and Interceptors File Formats
Fan Out
Delivery Guarantees Replicating and Multiplexing Selectors
Distribution: Agent Tiers
Delivery Guarantees
Sink Groups Integrating Flume with Applications Component Catalog Further Reading
15. Sqoop
Getting Sqoop Sqoop Connectors A Sample Import
Text and Binary File Formats
Generated Code
Additional Serialization Systems
Imports: A Deeper Look
Controlling the Import Imports and Consistency Incremental Imports Direct-Mode Imports
Working with Imported Data
Imported Data and Hive
Importing Large Objects Performing an Export Exports: A Deeper Look
Exports and Transactionality Exports and SequenceFiles
Further Reading
16. Pig
Installing and Running Pig
Execution Types
Local mode MapReduce mode
Running Pig Programs Grunt Pig Latin Editors
An Example
Generating Examples
Comparison with Databases Pig Latin
Structure Statements Expressions Types Schemas
Using Hive tables with HCatalog Validation and nulls Schema merging
Functions
Other libraries
Macros
User-Defined Functions
A Filter UDF
Leveraging types
An Eval UDF
Dynamic invokers
A Load UDF
Using a schema
Data Processing Operators
Loading and Storing Data Filtering Data
FOREACH...GENERATE STREAM
Grouping and Joining Data
JOIN COGROUP CROSS GROUP
Sorting Data Combining and Splitting Data
Pig in Practice
Parallelism Anonymous Relations Parameter Substitution
Dynamic parameters Parameter substitution processing
Further Reading
17. Hive
Installing Hive
The Hive Shell
An Example Running Hive
Configuring Hive
Execution engines Logging
Hive Services
Hive clients
The Metastore
Comparison with Traditional Databases
Schema on Read Versus Schema on Write Updates, Transactions, and Indexes SQL-on-Hadoop Alternatives
HiveQL
Data Types
Primitive types Complex types
Operators and Functions
Conversions
Tables
Managed Tables and External Tables Partitions and Buckets
Partitions Buckets
Storage Formats
The default storage format: Delimited text Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles Using a custom SerDe: RegexSerDe Storage handlers
Importing Data
Inserts Multitable insert CREATE TABLE...AS SELECT
Altering Tables Dropping Tables
Querying Data
Sorting and Aggregating MapReduce Scripts Joins
Inner joins Outer joins Semi joins Map joins
Subqueries Views
User-Defined Functions
Writing a UDF Writing a UDAF
A more complex UDAF
Further Reading
18. Crunch
An Example The Core Crunch API
Primitive Operations
union() parallelDo() groupByKey() combineValues()
Types
Records and tuples
Sources and Targets
Reading from a source Writing to a target Existing outputs Combined sources and targets
Functions
Serialization of functions Object reuse
Materialization
PObject
Pipeline Execution
Running a Pipeline
Asynchronous execution Debugging
Stopping a Pipeline Inspecting a Crunch Plan Iterative Algorithms Checkpointing a Pipeline
Crunch Libraries Further Reading
19. Spark
Installing Spark An Example
Spark Applications, Jobs, Stages, and Tasks A Scala Standalone Application A Java Example A Python Example
Resilient Distributed Datasets
Creation Transformations and Actions
Aggregation transformations
Persistence
Persistence levels
Serialization
Data Functions
Shared Variables
Broadcast Variables Accumulators
Anatomy of a Spark Job Run
Job Submission DAG Construction Task Scheduling Task Execution
Executors and Cluster Managers
Spark on YARN
YARN client mode YARN cluster mode
Further Reading
20. HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
Regions Locking
Implementation
HBase in operation
Installation
Test Drive
Clients
Java MapReduce REST and Thrift
Building an Online Query Application
Schema Design Loading Data
Load distribution Bulk load
Online Queries
Station queries Observation queries
HBase Versus RDBMS
Successful Service HBase
Praxis
HDFS UI Metrics Counters
Further Reading
21. ZooKeeper
Installing and Running ZooKeeper An Example
Group Membership in ZooKeeper Creating the Group Joining a Group Listing Members in a Group
ZooKeeper command-line tools
Deleting a Group
The ZooKeeper Service
Data Model
Ephemeral znodes Sequence numbers Watches
Operations
Multiupdate APIs Watch triggers ACLs
Implementation Consistency Sessions
Time
States
Building Applications with ZooKeeper
A Configuration Service The Resilient ZooKeeper Application
InterruptedException KeeperException
State exceptions Recoverable exceptions Unrecoverable exceptions
A reliable configuration service
A Lock Service
The herd effect Recoverable exceptions Unrecoverable exceptions Implementation
More Distributed Data Structures and Protocols
BookKeeper and Hedwig
ZooKeeper in Production
Resilience and Performance Configuration
Further Reading
V. Case Studies
22. Composable Data at Cerner
From CPUs to Semantic Integration Enter Apache Crunch Building a Complete Picture Integrating Healthcare Data Composability over Frameworks Moving Forward
23. Biological Data Science: Saving Lives with Software
The Structure of DNA The Genetic Code: Turning DNA Letters into Proteins Thinking of DNA as Source Code The Human Genome Project and Reference Genomes Sequencing and Aligning DNA ADAM, A Scalable Genome Analysis Platform
Literate programming with the Avro interface description language (IDL) Column-oriented access with Parquet A simple example: k-mer counting using Spark and ADAM
From Personalized Ads to Personalized Medicine Join In
24. Cascading
Fields, Tuples, and Pipes Operations Taps, Schemes, and Flows Cascading in Practice Flexibility Hadoop and Cascading at ShareThis Summary
A. Installing Apache Hadoop
Prerequisites Installation Configuration
Standalone Mode Pseudodistributed Mode
Configuring SSH Formatting the HDFS filesystem Starting and stopping the daemons Creating a user directory
Fully Distributed Mode
B. Cloudera’s Distribution Including Apache Hadoop C. Preparing the NCDC Weather Data D. The Old and New Java MapReduce APIs Index Colophon Copyright
  • ← Prev
  • Back
  • Next →
  • ← Prev
  • Back
  • Next →

Chief Librarian: Las Zenow <zenow@riseup.net>
Fork the source code from gitlab
.

This is a mirror of the Tor onion service:
http://kx5thpx2olielkihfyo4jgjqfb7zx7wxr3sd4xzt26ochei4m6f7tayd.onion