Mastering ElasticSearch by Kuć, Rafał

Mastering ElasticSearch

Table of Contents Mastering ElasticSearch Credits About the Authors About the Reviewers www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe? Free Access for Packt account holders

Preface

What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support

Downloading the example code Errata Piracy Questions

1. Introduction to ElasticSearch

Introducing Apache Lucene

Getting familiar with Lucene Overall architecture Analyzing your data

Indexing and querying

Lucene query language

Understanding the basics Querying fields Term modifiers Handling special characters

Introducing ElasticSearch

Basic concepts

Index Document Mapping Type Node Cluster Shard Replica Gateway

Key concepts behind ElasticSearch architecture Working of ElasticSearch

The bootstrap process Failure detection Communicating with ElasticSearch

Indexing data Querying data Index configuration Administration and monitoring

Summary

2. Power User Query DSL

Default Apache Lucene scoring explained

When a document is matched The TF/IDF scoring formula

The Lucene conceptual formula The Lucene practical formula

The ElasticSearch point of view

Query rewrite explained

Prefix query as an example Getting back to Apache Lucene Query rewrite properties

Rescore

Understanding rescore Example Data Query Structure of the rescore query Rescore parameters To sum up

Bulk Operations

MultiGet MultiSearch

Sorting data

Sorting with multivalued fields Sorting with multivalued geo fields Sorting with nested objects

Update API

Simple field update Conditional modifications using scripting Creating and deleting documents using the Update API

Using filters to optimize your queries

Filters and caching

Not all filters are cached by default Changing ElasticSearch caching behavior Why bother naming the key for the cache? When to change the ElasticSearch filter caching behavior

The terms lookup filter

How does it work? Performance considerations Loading terms from inner objects Terms lookup filter cache settings

Filter and scopes in ElasticSearch faceting mechanism

Example data Faceting and filtering Filter as a part of the query The Facet filter Global scope

Summary

3. Low-level Index Control

Altering Apache Lucene scoring

Available similarity models Setting per-field similarity

Similarity model configuration

Choosing the default similarity model Configuring the chosen similarity models

Configuring TF/IDF similarity Configuring Okapi BM25 similarity Configuring DFR similarity Configuring IB similarity

Using codecs

Simple use cases Let's see how it works Available posting formats Configuring the codec behavior

Default codec properties Direct codec properties Memory codec properties Pulsing codec properties Bloom filter-based codec properties

NRT, flush, refresh, and transaction log

Updating index and committing changes

Changing the default refresh time

The transaction log

The transaction log configuration

Near Real Time GET

Looking deeper into data handling

Input is not always analyzed Example usage Changing the analyzer during indexing Changing the analyzer during searching The pitfall and default analysis

Segment merging under control

Choosing the right merge policy

The tiered merge policy The log byte size merge policy The log doc merge policy

Merge policies configuration

The tiered merge policy The log byte size merge policy The log doc merge policy

Scheduling

The concurrent merge scheduler The serial merge scheduler Setting the desired merge scheduler

Summary

4. Index Distribution Architecture

Choosing the right amount of shards and replicas

Sharding and over allocation A positive example of over allocation Multiple shards versus multiple indices Replicas

Routing explained

Shards and data Let's test routing

Indexing with routing

Querying

Aliases Multiple routing values

Altering the default shard allocation behavior

Introducing ShardAllocator The even_shard ShardAllocator The balanced ShardAllocator The custom ShardAllocator Deciders

SameShardAllocationDecider ShardsLimitAllocationDecider FilterAllocationDecider ReplicaAfterPrimaryActiveAllocationDecider ClusterRebalanceAllocationDecider ConcurrentRebalanceAllocationDecider DisableAllocationDecider AwarenessAllocationDecider ThrottlingAllocationDecider RebalanceOnlyWhenActiveAllocationDecider DiskThresholdDecider

Adjusting shard allocation

Allocation awareness

Forcing allocation awareness

Filtering

But what do those properties mean?

Runtime allocation updating

Index-level updates Cluster-level updates

Defining total shards allowed per node

Inclusion Requirements Exclusion

Additional shard allocation properties

Query execution preference

Introducing the preference parameter

Using our knowledge

Assumptions

Data volume and queries specification

Configuration

Node-level configuration Indices configuration The directories layout Gateway configuration Recovery Discovery Logging slow queries Logging garbage collector work Memory setup One more thing

Changes are coming

Reindexing Routing Multiple Indices

Summary

5. ElasticSearch Administration

Choosing the right directory implementation – the store module

Store type

The simple file system store The new IO filesystem store The MMap filesystem store The memory store

Additional properties

The default store type

Discovery configuration

Zen discovery

Multicast Unicast Minimum master nodes Zen discovery fault detection

Amazon EC2 discovery

EC2 plugin's installation

EC2 plugin's configuration Optional EC2 discovery configuration options EC2 nodes scanning configuration

Gateway and recovery configuration Gateway recovery process Configuration properties Expectations on nodes

Local gateway

Backing up the local gateway

Recovery configuration

Cluster-level recovery configuration Index-level recovery settings

Segments statistics

Introducing the segments API

The response

Visualizing segments information

Understanding ElasticSearch caching

The filter cache

Filter cache types Index-level filter cache configuration Node-level filter cache configuration

The field data cache

Index-level field data cache configuration Node-level field data cache configuration Filtering

Adding field data filtering information Filtering by term frequency Filtering by regex Filtering by regex and term frequency The filtering example

Clearing the caches

Index, indices, and all caches clearing Clearing specific caches Clearing fields-related caches

Summary

6. Fighting with Fire

Knowing the garbage collector

Java memory

The life cycle of Java object and garbage collections

Dealing with garbage collection problems

Turning on logging of garbage collection work Using JStat Creating memory dumps More information on garbage collector work Adjusting garbage collector work in ElasticSearch

Using standard startup script Service wrapper

Avoiding swapping on Unix-like systems

When it is too much for I/O – throttling explained

Controlling I/O throttling Configuration

Throttling type Maximum throughput per second Node throttling defaults Configuration example

Speeding up queries using warmers

Reason for using warmers Manipulating warmers

Using the PUT Warmer API Adding warmers during index creation Adding warmers to templates Retrieving warmers Deleting warmers Disabling warmers

Testing the warmers

Querying without warmers present Querying with warmer present

Very hot threads

Hot Threads API usage clarification Hot Threads API response

Real-life scenarios

Slower and slower performance Heterogeneous environment and load imbalance My server is under fire

Summary

7. Improving the User Search Experience

Correcting user spelling mistakes

Test data Getting into technical details

Suggesters Using the _suggest REST endpoint

Understanding the REST endpoint suggester response

Including suggestions requests in a query

Suggester response

The term suggester

Configuration

Common term suggester options Additional term suggester options

The phrase suggester

The usage example Configuration

Basic configuration Configuring smoothing models

Stupid backoff Laplace Linear interpolation

Configuring candidate generators

Direct generators Configuring direct generators

Completion suggester

The logic behind completion suggester Using completion suggester

Indexing data Querying data Custom weights Additional parameters

Improving query relevance

The data The quest for improving relevance

The standard query The Multi match query Phrases come into play Let's throw the garbage away And now we boost Making a misspelling-proof search Drill downs with faceting

Summary

8. ElasticSearch Java APIs

Introducing the ElasticSearch Java API The code Connecting to your cluster

Becoming the ElasticSearch node Using the transport connection method Choosing the right connection method

Anatomy of the API CRUD operations

Fetching documents

Handling errors

Indexing documents Updating documents Deleting documents

Querying ElasticSearch

Preparing a query Building queries

Using the match all documents query The match query Using the geo shape query

Paging Sorting Filtering Faceting Highlighting Suggestions Counting Scrolling

Performing multiple actions

Bulk The delete by query Multi GET Multi Search

Percolator

ElasticSearch 1.0 and higher

The explain API Building JSON queries and documents The administration API

The cluster administration API

The cluster and indices health API The cluster state API The update settings API The reroute API The nodes information API The node statistics API The nodes hot threads API The nodes shutdown API The search shards API

The Indices administration API

The index existence API The Type existence API The indices stats API Index status Segments information API Creating an index API Deleting an index Closing an index Opening an index The Refresh API The Flush API The Optimize API The put mapping API The delete mapping API The gateway snapshot API The aliases API The get aliases API The aliases exists API The clear cache API The update settings API The analyze API The put template API The delete template API The validate query API The put warmer API The delete warmer API

Summary

9. Developing ElasticSearch Plugins

Creating the Apache Maven project structure

Understanding the basics Structure of the Maven Java project The idea of POM Running the build process Introducing the assembly Maven plugin

Creating a custom river plugin

Implementation details

Implementing the URLChecker class Implementing the JSONRiver class Implementing the JSONRiverModule class Implementing the JSONRiverPlugin class Informing ElasticSearch about the JSONRiver plugin class

Testing our river

Building our river Installing our river Initializing our river Checking if our JSON river works

Creating custom analysis plugin

Implementation details

Implementing TokenFilter Implementing the TokenFilter factory Implementing custom analyzer Implementing analyzer provider Implementing analysis binder Implementing analyzer indices component Implementing analyzer module Implementing analyzer plugin Informing ElasticSearch about our custom analyzer

Testing our custom analysis plugin

Building our custom analysis plugin Installing the custom analysis plugin Checking if our analysis plugin works

Summary

Index
