Mastering Elasticsearch · 2nd Edition by Kuc, Rafal

Index

Mastering Elasticsearch Second Edition

Table of Contents Mastering Elasticsearch Second Edition Credits About the Author Acknowledgments About the Author Acknowledgments About the Reviewers www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe? Free access for Packt account holders

Preface

What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support

Downloading the example code Errata Piracy Questions

1. Introduction to Elasticsearch

Introducing Apache Lucene

Getting familiar with Lucene Overall architecture

Getting deeper into Lucene index

Norms Term vectors Posting formats Doc values

Analyzing your data

Indexing and querying

Lucene query language

Understanding the basics Querying fields Term modifiers Handling special characters

Introducing Elasticsearch

Basic concepts

Index Document Type Mapping Node Cluster Shard Replica

Key concepts behind Elasticsearch architecture Workings of Elasticsearch

The startup process Failure detection

Communicating with Elasticsearch

Indexing data Querying data

The story Summary

2. Power User Query DSL

Default Apache Lucene scoring explained

When a document is matched TF/IDF scoring formula

Lucene conceptual scoring formula Lucene practical scoring formula

Elasticsearch point of view An example

Query rewrite explained

Prefix query as an example Getting back to Apache Lucene Query rewrite properties

Query templates

Introducing query templates

Templates as strings

The Mustache template engine

Conditional expressions Loops Default values

Storing templates in files

Handling filters and why it matters

Filters and query relevance How filters work

Bool or and/or/not filters

Performance considerations Post filtering and filtered query Choosing the right filtering method

Choosing the right query for the job

Query categorization

Basic queries Compound queries Not analyzed queries Full text search queries Pattern queries Similarity supporting queries Score altering queries Position aware queries Structure aware queries

The use cases

Example data Basic queries use cases

Searching for values in range Simplified query for multiple terms

Compound queries use cases

Boosting some of the matched documents Ignoring lower scoring partial queries

Not analyzed queries use cases

Limiting results to given tags Efficient query time stopwords handling

Full text search queries use cases

Using Lucene query syntax in queries Handling user queries without errors

Pattern queries use cases

Autocomplete using prefixes Pattern matching

Similarity supporting queries use cases

Finding terms similar to a given one Finding documents with similar field values

Score altering queries use cases

Favoring newer books Decreasing importance of books with certain value

Position aware queries use cases

Matching phrases Spans, spans everywhere

Structure aware queries use cases

Returning parent documents having a certain nested document Affecting parent document score with the score of nested documents

Summary

3. Not Only Full Text Search

Query rescoring

What is query rescoring? An example query Structure of the rescore query Rescore parameters

Choosing the scoring mode

To sum up

Controlling multimatching

Multimatch types

Best fields matching Cross fields matching Most fields matching Phrase matching Phrase with prefixes matching

Significant terms aggregation

An example Choosing significant terms Multiple values analysis

Significant terms aggregation and full text search fields

Additional configuration options

Controlling the number of returned buckets Background set filtering Minimum document count Execution hint More options

There are limits

Memory consumption Shouldn't be used as top-level aggregation Counts are approximated Floating point fields are not allowed

Documents grouping

Top hits aggregation An example

Additional parameters

Relations between documents

The object type The nested documents Parent–child relationship

Parent–child relationship in the cluster

A few words about alternatives

Scripting changes between Elasticsearch versions

Scripting changes

Security issues Groovy – the new default scripting language Removal of MVEL language

Short Groovy introduction

Using Groovy as your scripting language Variable definition in scripts Conditionals Loops An example There is more

Scripting in full text context

Field-related information Shard level information Term level information

More advanced term information

Lucene expressions explained

The basics An example There is more

Summary

4. Improving the User Search Experience

Correcting user spelling mistakes

Testing data Getting into technical details Suggesters

Using the _suggest REST endpoint Understanding the REST endpoint suggester response Including suggestion requests in query The term suggester

Configuration Common term suggester options Additional term suggester options

The phrase suggester

Usage example Configuration Basic configuration Configuring smoothing models Configuring candidate generators Configuring direct generators

The completion suggester

The logic behind the completion suggester Using the completion suggester Indexing data Querying data Custom weights Additional parameters

Improving the query relevance

Data The quest for relevance improvement

The standard query The multi match query Phrases come into play Let's throw the garbage away Now, we boost Performing a misspelling-proof search Drill downs with faceting

Summary

5. The Index Distribution Architecture

Choosing the right amount of shards and replicas

Sharding and overallocation A positive example of overallocation Multiple shards versus multiple indices Replicas

Routing explained

Shards and data Let's test routing

Indexing with routing

Routing in practice

Querying

Aliases Multiple routing values

Altering the default shard allocation behavior

Allocation awareness

Forcing allocation awareness

Filtering

What include, exclude, and require mean

Runtime allocation updating

Index level updates Cluster level updates

Defining total shards allowed per node Defining total shards allowed per physical server

Inclusion Requirement Exclusion Disk-based allocation

Query execution preference

Introducing the preference parameter

Summary

6. Low-level Index Control

Altering Apache Lucene scoring

Available similarity models Setting a per-field similarity Similarity model configuration Choosing the default similarity model

Configuring the chosen similarity model

Configuring the TF/IDF similarity Configuring the Okapi BM25 similarity Configuring the DFR similarity Configuring the IB similarity Configuring the LM Dirichlet similarity Configuring the LM Jelinek Mercer similarity

Choosing the right directory implementation – the store module

The store type

The simple filesystem store The new I/O filesystem store The MMap filesystem store The hybrid filesystem store The memory store

Additional properties

The default store type The default store type for Elasticsearch 1.3.0 and higher The default store type for Elasticsearch versions older than 1.3.0

NRT, flush, refresh, and transaction log

Updating the index and committing changes

Changing the default refresh time

The transaction log

The transaction log configuration

Near real-time GET

Segment merging under control

Choosing the right merge policy

The tiered merge policy The log byte size merge policy The log doc merge policy

Merge policies' configuration

The tiered merge policy The log byte size merge policy The log doc merge policy

Scheduling

The concurrent merge scheduler The serial merge scheduler Setting the desired merge scheduler

When it is too much for I/O – throttling explained

Controlling I/O throttling Configuration

The throttling type Maximum throughput per second Node throttling defaults Performance considerations The configuration example

Understanding Elasticsearch caching

The filter cache

Filter cache types Node-level filter cache configuration Index-level filter cache configuration

The field data cache

Field data or doc values Node-level field data cache configuration Index-level field data cache configuration The field data cache filtering

Adding field data filtering information Filtering by term frequency Filtering by regex Filtering by regex and term frequency The filtering example

Field data formats

String-based fields Numeric fields Geographical-based fields

Field data loading

The shard query cache

Setting up the shard query cache

Using circuit breakers

The field data circuit breaker The request circuit breaker The total circuit breaker

Clearing the caches Index, indices, and all caches clearing

Clearing specific caches

Summary

7. Elasticsearch Administration

Discovery and recovery modules

Discovery configuration

Zen discovery

Multicast Zen discovery configuration The unicast Zen discovery configuration

Master node

Configuring master and data nodes

Configuring data-only nodes Configuring master-only nodes Configuring the query processing-only nodes

The master election configuration

Zen discovery fault detection and configuration

The Amazon EC2 discovery

The EC2 plugin installation The EC2 plugin's generic configuration Optional EC2 discovery configuration options The EC2 nodes scanning configuration

Other discovery implementations

The gateway and recovery configuration

The gateway recovery process Configuration properties Expectations on nodes The local gateway Low-level recovery configuration

Cluster-level recovery configuration Index-level recovery settings

The indices recovery API

The human-friendly status API – using the Cat API

The basics Using the Cat API

Common arguments

The examples

Getting information about the master node Getting information about the nodes

Backing up

Saving backups in the cloud

The S3 repository The HDFS repository The Azure repository

Federated search

The test clusters Creating the tribe node

Using the unicast discovery for tribes

Reading data with the tribe node

Master-level read operations

Writing data with the tribe node

Master-level write operations

Handling indices conflicts Blocking write operations

Summary

8. Improving Performance

Using doc values to optimize your queries

The problem with field data cache The example of doc values usage

Knowing about garbage collector

Java memory

The life cycle of Java objects and garbage collections

Dealing with garbage collection problems

Turning on logging of garbage collection work Using JStat Creating memory dumps More information on the garbage collector work Adjusting the garbage collector work in Elasticsearch

Using a standard start up script Service wrapper

Avoid swapping on Unix-like systems

Benchmarking queries

Preparing your cluster configuration for benchmarking Running benchmarks Controlling currently run benchmarks

Very hot threads

Usage clarification for the Hot Threads API The Hot Threads API response

Scaling Elasticsearch

Vertical scaling Horizontal scaling

Automatically creating replicas Redundancy and high availability Cost and performance flexibility Continuous upgrades Multiple Elasticsearch instances on a single physical machine

Preventing the shard and its replicas from being on the same node

Designated nodes' roles for larger clusters

Query aggregator nodes Data nodes Master eligible nodes

Using Elasticsearch for high load scenarios

General Elasticsearch-tuning advices

Choosing the right store The index refresh rate Thread pools tuning Adjusting the merge process Data distribution

Advices for high query rate scenarios

Filter caches and shard query caches Think about the queries Using routing Parallelize your queries Field data cache and breaking the circuit Keeping size and shard_size under control

High indexing throughput scenarios and Elasticsearch

Bulk indexing Doc values versus indexing speed Keep your document fields under control The index architecture and replication Tuning write-ahead log Think about storage RAM buffer for indexing

Summary

9. Developing Elasticsearch Plugins

Creating the Apache Maven project structure Understanding the basics

The structure of the Maven Java project The idea of POM Running the build process Introducing the assembly Maven plugin

Creating custom REST action

The assumptions Implementation details

Using the REST action class

The constructor Handling requests Writing response

The plugin class Informing Elasticsearch about our REST action Time for testing Building the REST action plugin Installing the REST action plugin Checking whether the REST action plugin works

Creating the custom analysis plugin

Implementation details

Implementing TokenFilter Implementing the TokenFilter factory Implementing the class custom analyzer Implementing the analyzer provider Implementing the analysis binder Implementing the analyzer indices component Implementing the analyzer module Implementing the analyzer plugin Informing Elasticsearch about our custom analyzer

Testing our custom analysis plugin

Building our custom analysis plugin Installing the custom analysis plugin Checking whether our analysis plugin works

Summary

Index
