Perl and Search: Where are We?

KinoSearch and Xapian compared

Peter Karman

http://www.peknet.com/~karpet/slides/fp/search

Anatomy of a Search Application

Every decent search application has these five basic components:
  • aggregator
  • normalizer
  • parser/analyzer
  • indexer
  • searcher

Aggregator

Gather a document collection. Document collections might originate from:

Normalizer

Documents come in a variety of formats, many of them with MIME types that are not text/*.
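As a rough illustration of the normalizing step, the sketch below dispatches on a guessed MIME type and reduces HTML to plain text using only the Python standard library. The function names (`html_to_text`, `normalize`) are hypothetical and are not part of KinoSearch or Xapian; a real normalizer would also need external filters for binary formats such as PDF.

```python
import mimetypes
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(markup):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def normalize(filename, raw):
    """Return plain text for a document, or None if no filter is known."""
    mime, _ = mimetypes.guess_type(filename)
    if mime == "text/html":
        return html_to_text(raw)
    if mime is not None and mime.startswith("text/"):
        return raw
    return None  # assumption: non-text types need an external converter
```

Returning None for unknown types keeps the decision about skipping or converting a document with the caller.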

Parser/Analyzer

Documents are tokenized into "words" with attention to position, context, length and linguistic quality (stemming, case, stopwords, etc.).
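A minimal sketch of what such an analyzer might do, assuming English text: lowercase each word, drop stopwords, apply a crude suffix-stripping stemmer, and keep each term's position. The stopword list and `naive_stem` are illustrative stand-ins, not the analysis chain of either library.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list


def naive_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Return (term, position) pairs.

    Positions count every word in the text, including dropped stopwords,
    so phrase and proximity information is not distorted.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        if word in STOPWORDS:
            continue
        tokens.append((naive_stem(word), position))
    return tokens
```

For example, `analyze("The Indexing of Documents")` yields `[("index", 1), ("document", 3)]`.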

Indexer

A highly optimized storage system that aims to preserve the intelligence of the analysis.
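The core data structure behind most such storage systems is an inverted index, which maps each term to the documents and positions where it occurs. The sketch below is an in-memory toy with a hypothetical `InvertedIndex` class; it shows the shape of the data only and has none of the on-disk optimization a real library provides.

```python
from collections import defaultdict


class InvertedIndex:
    """Map each term to {doc_id: [positions]} so positions survive indexing."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, terms):
        """Index one document given its already-analyzed list of terms."""
        self.doc_count += 1
        for position, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(position)

    def lookup(self, term):
        """Return the postings for a term, or an empty dict if unseen."""
        return self.postings.get(term, {})


index = InvertedIndex()
index.add("a.txt", ["foo", "bar", "foo"])
index.add("b.txt", ["bar"])
```

Here `index.lookup("foo")` returns `{"a.txt": [0, 2]}`, which is enough to answer term, phrase, and frequency questions without rereading the document.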

Searcher

Parse a user's query and retrieve matching documents from the index. Score and rank hits based on [your magic sauce here].
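One common choice of "magic sauce" is TF-IDF. The sketch below assumes postings shaped as `{term: {doc_id: [positions]}}`, treats the query as whitespace-separated terms combined with OR, and sums term frequency times inverse document frequency per document. It is an illustration of the idea, not the scoring formula used by KinoSearch or Xapian.

```python
import math
from collections import defaultdict


def search(postings, doc_count, query):
    """Rank documents by summed TF-IDF over the query terms (OR semantics)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        docs = postings.get(term, {})
        if not docs:
            continue  # unknown terms contribute nothing
        # Rarer terms get a larger weight.
        idf = math.log(1 + doc_count / len(docs))
        for doc_id, positions in docs.items():
            scores[doc_id] += len(positions) * idf
    # Highest score first; ties broken by doc_id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical two-document index used only for illustration.
postings = {
    "foo": {"a.txt": [0, 2]},
    "bar": {"a.txt": [1], "b.txt": [0]},
}
hits = search(postings, doc_count=2, query="foo bar")
```

In this toy index `a.txt` ranks first because it matches both terms and contains the rarer term `foo` twice.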

IR Libraries

KinoSearch

Xapian

Comparison

KinoSearch:
Xapian:

Naive Benchmarks

$ time perl xindex.pl ~/projects/search_bench/

real    1m26.950s
user    1m12.400s
sys     0m10.117s

$ time perl xsearch.pl foo
Searching xapian_index
Running query 'Xapian::Query(foo)'
1 results found
ID 7 100% [ /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt ]

real    0m0.064s
user    0m0.043s
sys     0m0.018s

$ time perl ksindex.pl ~/projects/search_bench/

real    0m54.725s
user    0m45.425s
sys     0m5.516s

$ time perl kssearch.pl foo
hits: 1
0.071 /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt

real    0m0.206s
user    0m0.158s
sys     0m0.043s

Addendum #1

Marvin Humphrey, the KinoSearch author, wrote after this presentation was given on 16 Feb 2008, noting the following:
FWIW, since your sample search app only does one iteration and it doesn't reuse the Searcher, 
it's not taking full advantage of KinoSearch's capabilities.  
KS is supposed to be "fast enough" for a scenario just like that one, 
and it seems to have performed acceptably, but searching is a *lot* faster when you cache the Searcher.

Check out the following stats courtesy of Benchmark::Stopwatch.

Regular CGI, at http://www.rectangular.com/cgi-bin/uscon_bench.cgi?q=congress&offset=0:

NAME                        TIME        CUMULATIVE      PERCENTAGE
 load modules                0.121       0.121           73.754%
 init searcher               0.004       0.125           2.626%
 process search              0.032       0.158           19.735%
 fetch hits                  0.006       0.164           3.877%
 _stop_                      0.000       0.164           0.008%

CGI::Fast, at http://www.rectangular.com/fcgi/uscon_search.cgi?q=congress&offset=0:

NAME                        TIME        CUMULATIVE      PERCENTAGE
 process search              0.002       0.002           24.213%
 fetch hits                  0.006       0.008           75.602%
 _stop_                      0.000       0.008           0.186% 
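To put the two tables side by side, the arithmetic can be sketched as follows. The figures are the cumulative times from the tables above; because they are rounded to the millisecond, the resulting ratios are only rough.

```python
# Cumulative totals in seconds, copied from the two tables above.
CGI_TOTAL = 0.164    # regular CGI, full request
CGI_SETUP = 0.125    # cumulative after "load modules" and "init searcher"
FAST_TOTAL = 0.008   # CGI::Fast, full request

CGI_SEARCH = 0.032   # "process search" under regular CGI
FAST_SEARCH = 0.002  # "process search" under CGI::Fast

# Share of a regular CGI request spent before any searching starts.
setup_share = CGI_SETUP / CGI_TOTAL

# How much faster the persistent process answers the same request.
overall_speedup = CGI_TOTAL / FAST_TOTAL

# Speedup of the search step alone once the Searcher is cached.
search_speedup = CGI_SEARCH / FAST_SEARCH

print(f"setup share of CGI request: {setup_share:.0%}")
print(f"overall speedup:            {overall_speedup:.1f}x")
print(f"search-step speedup:        {search_speedup:.1f}x")
```

Roughly three quarters of the regular CGI request goes to setup, which is the cost a persistent process pays only once.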

Addendum #2: Swish-e 2.4 benchmark

The current Swish-e release, 2.4.5, run against the same document corpus:
$ time swish-e -i ~/projects/search_bench

real    0m31.833s
user    0m16.493s
sys     0m7.499s

$ time swish-e -w foo

real    0m0.015s
user    0m0.007s
sys     0m0.008s
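For a side-by-side view of the three engines, the `real` times reported above can be converted to seconds with a small helper. This is only a convenience sketch; `to_seconds` is a hypothetical name, and the figures come from single runs, so they remain naive benchmarks.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '1m26.950s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# "real" times copied from the transcripts above.
REAL_TIMES = {
    "Xapian": {"index": "1m26.950s", "search": "0m0.064s"},
    "KinoSearch": {"index": "0m54.725s", "search": "0m0.206s"},
    "Swish-e": {"index": "0m31.833s", "search": "0m0.015s"},
}


def summarize(times):
    """Return the same nested mapping with every value in seconds."""
    return {
        name: {phase: to_seconds(value) for phase, value in runs.items()}
        for name, runs in times.items()
    }


for name, runs in summarize(REAL_TIMES).items():
    print(f"{name:<10} index {runs['index']:7.3f}s  search {runs['search']:.3f}s")
```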