Introduction to The Solr Enterprise Search Server
Solr in a Nutshell
Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called "indexing") via XML over HTTP. You query it via HTTP GET and receive XML results.
- Advanced Full-Text Search Capabilities
- Optimized for High Volume Web Traffic
- Standards Based Open Interfaces - XML and HTTP
- Comprehensive HTML Administration Interfaces
- Server statistics exposed over JMX for monitoring
- Scalability - Efficient Replication to other Solr Search Servers
- Flexible and Adaptable with XML configuration
- Extensible Plugin Architecture
Solr Uses the Lucene Search Library and Extends it!
- A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
- Powerful Extensions to the Lucene Query Language
- Support for Dynamic Faceted Browsing and Filtering
- Advanced, Configurable Text Analysis
- Highly Configurable and User Extensible Caching
- Performance Optimizations
- External Configuration via XML
- An Administration Interface
- Monitorable Logging
- Fast Incremental Updates and Snapshot Distribution
- Distributed search with sharded index on multiple hosts
- XML and CSV/delimited-text update formats
- Easy ways to pull in data from databases and XML files from local disk and HTTP sources
- Multiple search indices
Detailed Features
Schema
- Defines the field types and fields of documents
- Can drive more intelligent processing
- Declarative Lucene Analyzer specification
- Dynamic Fields enables on-the-fly addition of new fields
- CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
- Explicit types eliminates the need for guessing types of fields
- External file-based configuration of stopword lists, synonym lists, and protected word lists
- Many additional text analysis components including word splitting, regex and sounds-like filters
Query
- HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
- Sort by any number of fields
- Advanced DisMax query parser for high relevancy results from user-entered queries
- Highlighted context snippets
- Faceted Searching based on unique field values and explicit queries
- Spelling suggestions for user queries
- More Like This suggestions for given document
- Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
- Function Query - influence the score by a function of a field's numeric value or ordinal
- Date Math - specify dates relative to "NOW" in queries and updates
- Performance Optimizations
Core
- Pluggable query handlers and extensible XML data format
- Document uniqueness enforcement based on unique key field
- Batches updates and deletes for high performance
- User configurable commands triggered on index changes
- Searcher concurrency control
- Correct handling of numeric types for both sorting and range queries
- Ability to control where docs with the sort field missing will be placed
- "Luke" request handler for corpus information
Caching
- Configurable Query Result, Filter, and Document cache instances
- Pluggable Cache implementations
- Cache warming in background
- When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
- Autowarming in background
- The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabing high cache hit rates across index/searcher changes.
- Fast/small filter implementation
- User level caching with autowarming support
Replication
- Efficient distribution of index parts that have changed via rsync transport
- Pull strategy allows for easy addition of searchers
- Configurable distribution interval allows tradeoff between timeliness and cache utilization
Admin Interface
- Comprehensive statistics on cache utilization, updates, and queries
- Text analysis debugger, showing result of every stage in an analyzer
- Web Query Interface w/ debugging output
- parsed query output
- Lucene explain() document score detailing
- explain score for documents outside of the requested range to debug why a given document wasn't ranked higher.