Building a High Performance Search Engine with Perl and Swish3
Peter Karman
http://peknet.com/slides/swish3/
In the beginning was SWISH
Simple Web Indexing System for Humans
Circa 1995, by Kevin Hughes
Open-sourced in 1997 under the (L)GPL
Used by Apache, O'Reilly, Intel, Xerox, Texas Instruments, academic libraries, et al.
A patchy project
Patches coalesced into a new project, adopted by Roy Tennant, hosted at UC Berkeley: SWISH-E (the "E" is for "Enhanced")
Version 2.0, circa 2000
Version 2.4 with C and Perl libraries, circa 2003
Spelling changed to "Swish-e", circa 2004
License reworked in 2005: GPL with library exception
Current version 2.4.7, circa 2009
Limitations of 2.4.x
No Unicode support (single-byte encodings only)
Does not scale well past a few million documents
No stable incremental index format
Opaque (undocumented, binary) index format
Swish3
Five years in the making:
http://blog.peknet.com/projects/swish/original_idea
Manifesto:
http://blog.peknet.com/projects/swish/whySwish3
UTF-8 support
Pluggable into other IR libraries (Xapian, KinoSearch, et al.)
Built around libxml2 (GNOME XML parser)
Swish3 is like Perl6
Primarily a specification, with multiple possible implementations
It's taken a loooong time to reach a stable official release that you can pick up and use
Swish3 is like DBI or CHI
High-level
Defines an internal API, which backend engines must implement
Defines an external API, hiding individual engine syntax
Can still use the native engine's code directly
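The DBI-style split between an internal and an external API can be sketched roughly as follows. This is a minimal, engine-neutral illustration in Python, not the Swish3 API: the names `Backend`, `InMemoryBackend`, and `Engine` are hypothetical, and the toy backend stands in for a real engine such as Xapian or KinoSearch.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Backend(ABC):
    """Internal API: what every (hypothetical) engine adapter must implement."""

    @abstractmethod
    def add(self, doc_id: str, text: str) -> None: ...

    @abstractmethod
    def search(self, query: str) -> list[str]: ...


class InMemoryBackend(Backend):
    """Toy engine: an inverted index held in a dict."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query: str) -> list[str]:
        terms = query.lower().split()
        if not terms:
            return []
        # A document matches only if it contains every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


class Engine:
    """External API: callers never see backend-specific syntax."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend  # still reachable for native, engine-specific calls

    def index(self, docs: dict[str, str]) -> None:
        for doc_id, text in docs.items():
            self.backend.add(doc_id, text)

    def search(self, query: str) -> list[str]:
        return self.backend.search(query)


engine = Engine(InMemoryBackend())
engine.index({"a": "Swish3 speaks UTF-8", "b": "Swish-e speaks Latin-1"})
results = engine.search("speaks utf-8")
```

Swapping in a different engine would mean writing another `Backend` subclass; code written against `Engine` would not change.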
Aside: Anatomy of a search engine
Every search application has these five basic components:
1. aggregator
2. normalizer
3. analyzer
4. indexer
5. searcher
Swish3 does (1) and (2), and optionally (3), deferring (4) and (5) to the backend engines.
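To make the five components concrete, here is a toy pipeline in Python with one function per stage. It is a sketch under simplifying assumptions (documents arrive as UTF-8 bytes in a dict, the index is an in-memory dict, matching is boolean AND); all function names are hypothetical and none of this is Swish3 code.

```python
from __future__ import annotations

import re
import unicodedata


def aggregate(source: dict[str, bytes]):
    """(1) Aggregator: fetch raw documents; here, from an in-memory dict."""
    yield from source.items()


def normalize(raw: bytes) -> str:
    """(2) Normalizer: decode to Unicode text in one canonical form."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))


def analyze(text: str) -> list[str]:
    """(3) Analyzer: split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(source: dict[str, bytes]) -> dict[str, set[str]]:
    """(4) Indexer: map each term to the set of documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, raw in aggregate(source):
        for term in analyze(normalize(raw)):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """(5) Searcher: return the documents that contain every query term."""
    terms = analyze(query)
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))


docs = {"1": "Café search".encode("utf-8"), "2": b"search engines"}
index = build_index(docs)
hits = search(index, "SEARCH café")
```

In this sketch the query goes through the same analyzer as the documents, which is why a differently cased query still matches.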
One implementation: SWISH::Prog
Started as an OO wrapper around Swish-e 2.4 with extra aggregators
Added libswish3 Perl bindings in SWISH::3
Currently has Native (2.4.x), KinoSearch and Xapian backend options
Another implementation: swish_xapian
C++ program distributed with libswish3
Automatically built if Xapian is already installed
Built-in facet support
SWISH::Prog
Aggregators for:
filesystem (File::Find)
web (WWW::Mechanize)
database (DBI)
email (Mail::Box)
Perl objects (JSON)
SWISH::Prog (cont...)
Normalization via SWISH::Filter for:
pdf
Office (.doc, .xls, .ppt)
gzip
images (IPTC)
mp3 (ID3)
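The idea behind filter-based normalization is to pick a converter by document type and hand plain text to the indexer. The sketch below shows that dispatch pattern in Python using only the standard library; it is not SWISH::Filter, the `register` and `normalize` names are hypothetical, and only plain text and gzip are handled (PDF, Office, and the rest would need external converters).

```python
import gzip
import mimetypes

# Registry mapping a MIME type to the function that turns bytes into text.
FILTERS = {}


def register(mime_type):
    """Decorator that registers a filter function for one MIME type."""
    def decorator(func):
        FILTERS[mime_type] = func
        return func
    return decorator


@register("text/plain")
def plain_text(data: bytes) -> str:
    return data.decode("utf-8")


def normalize(filename: str, data: bytes) -> str:
    """Undo gzip compression if present, then dispatch on the MIME type."""
    mime_type, encoding = mimetypes.guess_type(filename)
    if encoding == "gzip":
        data = gzip.decompress(data)
    handler = FILTERS.get(mime_type)
    if handler is None:
        raise ValueError(f"no filter registered for {filename!r}")
    return handler(data)


text = normalize("notes.txt.gz", gzip.compress(b"hello swish"))
```

Adding support for another format in this sketch means registering one more function; the dispatch code stays the same.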
Real World Project
Replace database fulltext search
Drop search times from ~10 seconds to <1 second
Database changes reflected in search within 5 minutes
Faceted search results
Database heavily normalized
JavaScript client for the Cool Factor
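For readers new to the term, a facet is a count of matching documents per value of some field, shown next to the results. A minimal Python sketch, using made-up result records and a hypothetical `facet_counts` helper rather than any engine's built-in facet support:

```python
from collections import Counter

# Made-up search hits; in a real system these would come from the engine.
results = [
    {"title": "libswish3", "language": "C"},
    {"title": "SWISH::Prog", "language": "Perl"},
    {"title": "Search::Tools", "language": "Perl"},
    {"title": "swish_xapian", "language": "C++"},
]


def facet_counts(hits, field):
    """Count how many hits carry each value of `field`, most frequent first."""
    return Counter(hit[field] for hit in hits).most_common()


language_facet = facet_counts(results, "language")
```

Because these counts depend on the full result set, they are a natural thing to cache.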
What I Did
Denormalize database records
Serialize as XML and write to disk
Index XML, specifying "interesting" fields as MetaNames+PropertyNames
Query database every N minutes, serialize new/changed records, add them to index
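The denormalize, serialize, and poll-for-changes steps can be sketched as below. This is a Python illustration with an in-memory SQLite database, not the code used in the project: the `author` and `article` tables, the `updated_at` column, and the XML element names are all invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT,
                          author_id INTEGER, updated_at INTEGER);
    INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO article VALUES (10, 'Old news', 1, 100),
                               (11, 'Fresh news', 2, 200);
""")


def changed_records(conn, since):
    """Denormalize: join the tables into flat rows changed after `since`."""
    return conn.execute(
        """SELECT article.id, article.title, author.name, article.updated_at
           FROM article JOIN author ON author.id = article.author_id
           WHERE article.updated_at > ?
           ORDER BY article.updated_at""",
        (since,),
    ).fetchall()


def to_xml(row):
    """Serialize one flat row as a small XML document, one element per field."""
    article_id, title, author, _updated_at = row
    doc = ET.Element("doc", id=str(article_id))
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "author").text = author
    return ET.tostring(doc, encoding="unicode")


last_run = 150  # timestamp of the previous indexing pass (hypothetical)
rows = changed_records(db, last_run)
documents = [to_xml(row) for row in rows]  # these would be handed to the indexer
if rows:
    last_run = rows[-1][3]  # remember the newest change for the next pass
```

Running this every N minutes from a scheduler would pick up only the records changed since the previous pass.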
What I Used
For indexing:
Rose::DBx::Object::Indexed (see also DBIx::Class::Indexed)
Search::Tools
SWISH::Prog
What I Used (cont...)
For searching:
SWISH::Prog
Search::Tools
Lingua::Stem::Snowball (used by all 3 engines for stemming)
Search::Query
CHI (for caching facets)
DBI+SQLite (for logging and caching terms for autosuggest, spelling, thesaurus, etc)
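Logging query terms to SQLite and reading them back for autosuggest might look something like this. It is a minimal Python sketch with an invented `term_log` table and hypothetical `record_query` and `suggest` helpers; spelling and thesaurus lookups are not covered.

```python
import sqlite3

log = sqlite3.connect(":memory:")
log.execute("CREATE TABLE term_log (term TEXT PRIMARY KEY, hits INTEGER NOT NULL)")


def record_query(conn, query):
    """Bump the counter of every term in a user's query."""
    for term in query.lower().split():
        conn.execute(
            "INSERT OR IGNORE INTO term_log (term, hits) VALUES (?, 0)", (term,)
        )
        conn.execute("UPDATE term_log SET hits = hits + 1 WHERE term = ?", (term,))


def suggest(conn, prefix, limit=5):
    """Return logged terms starting with `prefix`, most frequently used first."""
    prefix = prefix.lower()
    rows = conn.execute(
        "SELECT term FROM term_log "
        "WHERE substr(term, 1, length(?)) = ? "
        "ORDER BY hits DESC, term LIMIT ?",
        (prefix, prefix, limit),
    )
    return [term for (term,) in rows]


for q in ["perl search", "perl swish", "peknet"]:
    record_query(log, q)

suggestions = suggest(log, "pe")
```

Comparing with `substr` instead of `LIKE` avoids having to escape `%` and `_` in user input.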
Demo Search::OpenSearch
http://opensearch.org/
ExtJS client
Plack server
JSON or XML (Atom) response formats
Further Reading
my $perl_projects = http://perl.peknet.com/
http://dev.swish-e.org/wiki/swish3
Slides and demo code:
http://svn.peknet.com/perl/slides/fp/swish3/