Apache Solr 3 Enterprise Search Server by Smiley, David

Index

Apache Solr 3 Enterprise Search Server

Apache Solr 3 Enterprise Search Server
Credits
About the Authors
Acknowledgement
Acknowledgement
About the Reviewers
www.PacktPub.com

Discounts
Free eBooks
Newsletters
Code Downloads, Errata and Support

PacktLib.PacktPub.com
Preface

What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support

Downloading the example code
Errata
Piracy
Questions

1. Quick Starting Solr

An introduction to Solr

Lucene, the underlying engine
Solr, a Lucene-based search server
Comparison to database technology

Getting started

Solr's installation directory structure
Solr's home directory and Solr cores
Running Solr

A quick tour of Solr

Loading sample data
A simple query
Some statistics
The sample browse interface

Configuration files
Resources outside this book
Summary

2. Schema and Text Analysis

MusicBrainz.org
One combined index or separate indices

One combined index

Problems with using a single combined index

Separate indices

Schema design

Step 1: Determine which searches are going to be powered by Solr
Step 2: Determine the entities returned from each search
Step 3: Denormalize related data

Denormalizing: 'one-to-one' associated data
Denormalizing: 'one-to-many' associated data

Step 4: (Optional) Omit the inclusion of fields only used in search results

The schema.xml file

Defining field types
Built-in field type classes

Numbers and dates
Geospatial

Field options
Field definitions

Dynamic field definitions

Our MusicBrainz field definitions
Copying fields
The unique key
The default search field and query operator

Text analysis

Configuration
Experimenting with text analysis
Character filters
Tokenization
WordDelimiterFilter
Stemming

Correcting and augmenting stemming

Synonyms

Index-time versus query-time, and to expand or not

Stop words
Phonetic sounds-like analysis
Substring indexing and wildcards

ReversedWildcardFilter
N-grams
N-gram costs

Sorting Text
Miscellaneous token filters

Summary

3. Indexing Data

Communicating with Solr

Direct HTTP or a convenient client API
Push data to Solr or have Solr pull it
Data formats
HTTP POSTing options to Solr
Remote streaming

Solr's Update-XML format

Deleting documents

Commit, optimize, and rollback
Sending CSV formatted data to Solr

Configuration options

The Data Import Handler Framework

Setup
The development console
Writing a DIH configuration file

Data Sources
Entity processors
Fields and transformers

Example DIH configurations

Importing from databases
Importing XML from a file with XSLT
Importing multiple rich document files (crawling)

Importing commands

Delta imports

Indexing documents with Solr Cell

Extracting text and metadata from files
Configuring Solr
Solr Cell parameters
Extracting karaoke lyrics
Indexing richer documents

Update request processors
Summary

4. Searching

Your first search, a walk-through
Solr's generic XML structured data representation
Solr's XML response format

Parsing the URL

Request handlers
Query parameters

Search criteria related parameters
Result pagination related parameters
Output related parameters
Diagnostic related parameters

Query parsers and local-params
Query syntax (the lucene query parser)

Matching all the documents
Mandatory, prohibited, and optional clauses

Boolean operators

Sub-queries

Limitations of prohibited clauses in sub-queries

Field qualifier
Phrase queries and term proximity
Wildcard queries

Fuzzy queries

Range queries

Date math

Score boosting
Existence (and non-existence) queries
Escaping special characters

The Dismax query parser (part 1)

Searching multiple fields
Limited query syntax
Min-should-match

Basic rules
Multiple rules
What to choose

A default search

Filtering
Sorting
Geospatial search

Indexing locations
Filtering by distance
Sorting by distance

Summary

5. Search Relevancy

Scoring

Query-time and index-time boosting
Troubleshooting queries and scoring

Dismax query parser (part 2)

Lucene's DisjunctionMaxQuery
Boosting: Automatic phrase boosting

Configuring automatic phrase boosting
Phrase slop configuration
Partial phrase boosting

Boosting: Boost queries
Boosting: Boost functions

Add or multiply boosts?

Function queries

Field references
Function reference

Mathematical primitives
Other math
ord and rord
Miscellaneous functions

Function query boosting

Formula: Logarithm
Formula: Inverse reciprocal
Formula: Reciprocal
Formula: Linear

How to boost based on an increasing numeric field

Step by step…
External field values

How to boost based on recent dates

Step by step…

Summary

6. Faceting

A quick example: Faceting release types

MusicBrainz schema changes

Field requirements
Types of faceting
Faceting field values

Alphabetic range bucketing

Faceting numeric and date ranges

Range facet parameters

Facet queries
Building a filter query from a facet

Field value filter queries
Facet range filter queries

Excluding filters (multi-select faceting)
Hierarchical faceting
Summary

7. Search Components

About components
The Highlight component

A highlighting example
Highlighting configuration

The regex fragmenter
The fast vector highlighter with multi-colored highlighting

The SpellCheck component

Schema configuration
Configuration in solrconfig.xml

Configuring spellcheckers (dictionaries)

IndexBasedSpellChecker options
FileBasedSpellChecker options

Processing of the q parameter
Processing of the spellcheck.q parameter

Building the dictionary from its source
Issuing spellcheck requests
Example usage for a misspelled query

Query complete / suggest

Query term completion via facet.prefix
Query term completion via the Suggester
Query term completion via the Terms component

The QueryElevation component

Configuration

The MoreLikeThis component

Configuration parameters

Parameters specific to the MLT search component
Parameters specific to the MLT request handler
Common MLT parameters

MLT results example

The Stats component

Configuring the stats component
Statistics on track durations

The Clustering component
Result grouping/Field collapsing

Configuring result grouping

The TermVector component
Summary

8. Deployment

Deployment methodology for Solr

Questions to ask

Installing Solr into a Servlet container

Differences between Servlet containers

Defining solr.home property

Logging

HTTP server request access logs
Solr application logging

Configuring logging output
Logging using Log4j
Jetty startup integration
Managing log levels at runtime

A SearchHandler per search interface?
Leveraging Solr cores

Configuring solr.xml

Property substitution
Include fragments of XML with XInclude

Managing cores
Why use multicore?

Monitoring Solr performance

Stats.jsp
JMX

Starting Solr with JMX

Take a walk on the wild side! Use JRuby to extract JMX information

Securing Solr from prying eyes

Limiting server access

Securing public searches
Controlling JMX access

Securing index data

Controlling document access
Other things to look at

Summary

9. Integrating Solr

Working with included examples

Inventory of examples

Solritas, the integrated search UI

Pros and Cons of Solritas

SolrJ: Simple Java interface

Using Heritrix to download artist pages
SolrJ-based client for Indexing HTML
SolrJ client API

Embedding Solr
Searching with SolrJ
Indexing

Indexing POJOs

When should I use embedded Solr?

In-process indexing
Standalone desktop applications
Upgrading from legacy Lucene

Using JavaScript with Solr

Wait, what about security?
Building a Solr powered artists autocomplete widget with jQuery and JSONP
AJAX Solr

Using XSLT to expose Solr via OpenSearch

OpenSearch based Browse plugin

Installing the Search MBArtists plugin

Accessing Solr from PHP applications

solr-php-client
Drupal options

Apache Solr Search integration module
Hosted Solr by Acquia

Ruby on Rails integrations

The Ruby query response writer
sunspot_rails gem

Setting up MyFaves project
Populating MyFaves relational database from Solr
Build Solr indexes from a relational database
Complete MyFaves website

Which Rails/Ruby library should I use?

Nutch for crawling web pages
Maintaining document security with ManifoldCF

Connectors
Putting ManifoldCF to use

Summary

10. Scaling Solr

Tuning complex systems
Testing Solr performance with SolrMeter
Optimizing a single Solr server (Scale up)

Configuring JVM settings to improve memory usage

MMapDirectoryFactory to leverage additional virtual memory

Enabling downstream HTTP caching
Solr caching

Tuning caches

Indexing performance

Designing the schema
Sending data to Solr in bulk
Don't overlap commits
Disabling unique key checking
Index optimization factors

Enhancing faceting performance
Using term vectors
Improving phrase search performance

Moving to multiple Solr servers (Scale horizontally)

Replication
Starting multiple Solr servers

Configuring replication

Load balancing searches across slaves

Indexing into the master server
Configuring slaves

Configuring load balancing
Sharding indexes

Assigning documents to shards
Searching across shards (distributed search)

Combining replication and sharding (Scale deep)

Near real time search

Where next for scaling Solr?
Summary

A. Search Quick Reference

Quick reference
