Kinosearch - Due Diligence

hide


Introduction

Motivation

Socialtext has recently migrated search engines from Plucene to Kinosearch. We thought that others might be interested in Kinosearch, so Matthew O'Connor wrote up the results of the due diligence that led to the migration.

Socialtext had been using Plucene for its search solution for some time. However, Plucene has some serious issues that make it a difficult choice for enterprise environments. We looked for another Perl based search solution and in those shallow waters we found KinoSearch.

Out of the box, KinoSearch was advertized as doing much of what Plucene could do, and more. So it was a reasonable avenue to explore. The author of KinoSearch, Marvin Humphrey was known to some developers and so there was some assurance he might be able to help us if we got stuck.

Overview

KinoSearch looks to be a reasonable search solution for Socialtext's needs. KinoSearch has some quality issues that show its immaturity, but we have been able to work around most of them. Throughout our efforts Marvin Humphrey has been very responsive and friendly, he runs his project well.

We settled on KinoSearch 0.14. As of this writing it is last stable release. However, this version of KinoSearch has some problems with UTF-8 and locking the index. Marvin is aware of both of these issues and later versions will fix the problems, but for our immediate needs we had to come up with some workarounds.

KinoSearch seems like a good solution for simple to moderate search and indexing needs. More complex needs, such as Socialtext's, require more effort. During our efforts bugs were exposed and had to be dealt with, slowing down development time. With proper in-house developer resources, KinoSearch can probably be made to scale to arbitrarily complex needs. It is Open Source and the maintainer is available and active, however the documenation is pretty bad.

Indexing

Stringification

Like mose indexers, and other Lucene work-a-likes, KinoSearch only knows how to index plain text. Indexing other document formats requires marshalling efforts to convert those formats into plain text. We were able to make use of many open source programs to achieve decent stringification. We currently support the following document types using the indicated program/library for convertion:

  • Microsoft Word Documents
    • Convertor: wvText
  • PDF Documents
    • Convertor: pdftotext
  • Postscript Documents
    • Convertor: ps2ascii
  • Microsoft Excel Documents
    • Convertor: xsl2csv
  • Compressed Zip Files
    • Convertor: unzip
      • The contents of the Zip file are then converted to text.
  • MPEG-3 Audio Files
    • Convertor: MP3::Tag Perl library
      • ID3 tags which contain file metadata are extracted.
  • HTML Documents
    • Convertor: lynx
  • Plain Text Documents
    • Convertor: n/a
  • RTF Documents.
    • Convertor: unrtf
  • Generic XML Documents.
    • Convertor: XML::SAX Perl library
  • All Other Documents
    • Convertor: strings(1)

Dollar Amp

Once we had written a basic KinoSearch indexer and bootstrarpped into our code base we started noticing huge performance issues. It turns out this is because $& and family was used by some of the modules we dependend on (e.g. Text::Balanced 1.98).

It is well known among Perl programmers that $& causes a performance penalty on regular expressions. As a result the KinoSearch tokenizer, which uses regexes, became a huge bottleneck. We systematically removed $& from our code and our dependencies, and on large documents we saw several orders of magnitude improvement in index time.

Things are $& are just the reality of dealing with Perl, some of its warts just have to be dealt with. However, I note the issue here in case it comes up again. Vigilence is required, as new dependencies are added or existing ones upgraded.

Locking

KinoSearch has a legitimate bug with how it does locking. KinoSearch uses sysopen() with O_EXCL to create a file in a temporary directory. With O_EXCL, sysopen() will fail if the file already exists. Therefore KinoSearch has to delete the lock file before shutting down.

Managing the lock file is all automatic, you normally do not have to think about locking at all. However, the lock cleanup code is triggered when the index object is DESTROYed. Thus, if your program receives SIGINT (or some other untrapped and unblocked signal) Perl will exit without calling DESTROY on your object, or running any END blocks. The result is a stale lock file, and subsequent attempts to reopen the index will fail until you manually delete the lock file.

Our solution was to patch KinoSearch to add the pid to the lock file. The pid is checked for existance, and if it does not exist the lock file is removed. This happens right before the call to sysopen(), so if a stale lock file exists it will be removed. There is a non-zero chance that the lock file could be stale, but the pid inside the file belongs to some currently active (and likely unrelated) process, this would give a false negative saying the index is locked. This is deemed safe enough since we fail closed in this case, rather than open.

Our solution is current in the process of being pushed back to Marvin, and he will incorporate a version of it in later versions of KinoSearch. We would prefer a flock() based locking solution, but Marvin is shying away from flock() for a variety of reasons.

UTF-8

The version of KinoSearch we are using does not correctly support UTF-8. The gist is that when things enter the C API they lose their UTF-8 flag. To correctly support UTF-8 we had to copy some of the default Analyzers and modify them to call encode() and decode() in the appropriate places.

Internally KinoSearch just deals with bytes, but the analyzers are in Perl and need to operate with character semantics. The issue is actually pervasive and we got away with only rewriting the LCNormalizer, Stemmer, and Tokenizer. One would probably need to make modifications to other parts of KinoSearch if they were to make use of them.

Rewriting those analyzers was pretty simple, and amounted to only a few hundred lines of code with select modifications. The LCNormalize is brain dead simple and the Stemmer uses Lingua::Stem::Snowball for the heavy lifiting. The tokenizer required the most modification.

KinoSearch is byte oriented internally, and it stores its tokens by using offsets into a single string. So if the string is "foo bar" then KinoSearch would reference "foo" by storing the pair (0, 2) to indicate where that token starts and ends. The upshot is that when UTF-8 is being used you have to, manually, make sure multi-byte characters are correctly dealt with.

These UTF-8 issues will be cleared up with later versions of KinoSearch, but in the current release they are the reality.

Performance

Performance is a difficult thing to relate. KinoSearch is absolutely fast enough at indexing for Socialtext's needs. It can index about 22000 documents in 16 minutes, these documents make up about 2GB of raw data. However, given that most indexing happens asyncronously with our main product it turns out that it's fast enough for our modest needs.

To get an idea how well KinoSearch performed when indexing various sized documents I ran some tests. The test was to generate 10 documents for each of 8 different sizes: 1, 10, 100, 1k, 10k, 100k, 1m, 10m words. The documents are generated by selecting words randomly from a dictionary. I ended up with 80 documents.

The below table compares the time results for the various sized documents. All numbers are in seconds, except for # words.

  1. words
mean std dev median min max
1 1.28 0.00 1.28 1.28 1.28
10 1.32 0.01 1.32 1.30 1.33
100 3.00 2.38 1.34 1.30 6.37
1000 1.35 0.00 1.35 1.35 1.36
10000 1.98 0.07 1.96 1.90 2.08
100000 7.50 0.08 7.50 7.40 7.60
1000000 62.38 0.20 62.29 62.20 62.66
10000000 758.95 12.34 758.99 743.81 774.03

The below table compares the memory usage results for the various sized documents. All numbers are in megabytes, except for # words. Note: the memory sizes are the maximum size of the indexing process across its life, these sizes are not always sustained across the process's life.

  1. words
mean std dev median min max
1 154.77 0.05 154.73 154.73 154.84
10 154.84 0.00 154.84 154.84 154.84
100 160.91 8.67 154.84 154.73 173.17
1000 154.96 0.02 154.97 154.94 154.98
10000 159.49 0.14 159.40 159.40 159.69
100000 186.60 0.01 186.60 186.59 186.61
1000000 439.22 0.02 439.22 439.20 439.25
10000000 3077.59 6.67 3072.89 3072.86 3087.01

Searching

General Thoughts

At this time I do not have many thoughts on KinoSearch's ability to search. It just seems to work, and it makes me happy. I think that is the best one can ask for. At Socialtext we have not yet run our implementation past QA, so all quality impression so far are subjective.

My subjective impression is that it is better than Plucene. The search language works well enough, I've flexed complex boolean queries, phrase searching, and searching by fields. From the API point of view we've even made use of the various Query objects to construct queries by hand, and it works in the standard Lucene-ish way, which is to say well.

The only function that Plucene allows, that KinoSearch does not, is Wildcard search. The ability to have "foo*" match "foo", "foobar", and "food". Wildcard search is very slow in Lucene, and our implementation of it is slow as well. The Lucene implemntation and ours (as suggested by Marvin) works by scanning all the terms in the index and constructing a huge conditional query.

For example, "foo*" would become "(foo OR foobar OR food OR ...)". So you incur the cost of scaning the index, parsing that query, and then executing that query. A true triple threat.

Playing around I was able to get wild card search pretty snappy through some pervasive use of caching. The main limiting factor on performance is the size of the index. We have have 22000 documents in our largest index (so far) and that includes 556375 terms.

Performance

The performance of searching seems to be is primarily a function of the number of results returned. Secondarily it's a function of how complex the query is. In practice 99.9999% of our queries are one or two words without any explicit boolen connectors, these kinds of queries are dead simple. So, the majority of the time is spent harvesting the results.

On a dual core 3GHz Intel Xeon with 2 GB of memory, on a 32 MB index with 556375 terms I see the following numbers:

Time (s) Hits Memory (MB) Query
2.1674 3886 157 ross
1.8301 3249 157 linking
1.7922 3197 157 manager
1.6834 3000 157 Include
1.6816 3000 157 include
1.5936 2866 157 developer
1.2348 2160 157 file
1.2275 2155 157 release
1.2198 2160 157 "file:"
1.0356 1815 157 share
0.9715 1660 156 accountability
0.8997 1573 157 search
0.8945 1566 157 %s
0.7906 1393 156 comments
0.7016 1239 156 bugs
0.6323 1097 157 Feedback
0.6267 1098 156 ceo
0.5898 1009 157 generation
0.5573 961 157 publishing

Again, not numbers to write home about but they are enough for our modest needs and, comparatively, they crush Plucene.

Conclusion

KinoSearch is Good Enough for Socialtext's needs. It will require dedicated developers who are not afraid to read the source code. Indeed, the documentation for the API is pretty sparse. Also, developers will need to be able to roll with the punches and expect that KinoSearch has some warts.

Socialtext deployed KinoSearch in a production environment over New Years and is working effectively. Getting something basic going was pretty easy, you're up and running in minutes. Incorporating it into our code was another day or two of effort. Shaking it all out and working around warts was another couple of weeks of work, for about 1.5 developers.


Instead of hoping that added code doesn't use $& and friends, it should be easy enough to add tests to the testsuite based on Devel::SawAmpersand to catch misbehaving code (and help identify it too).

contributed by Scott McWhirter on Feb 11 6:10pm


We have some timing based tests that should catch the introduction of $& into the code or one of its dependencies. We weren't able to get Devel::SawAmpersand to work with latest versions of Perl. Considering the geometric penalty the timing tests have been reliable enough.

contributed by Matthew O'Connor on Feb 12 11:45am


Tags

Incoming Links

There are no pages that link to this page yet.

Attachments

Click this button to save this page to your computer for offline use. Created by Adina Levin on Jan 15 5:59pm. Updated by Matthew O'Connor on Feb 12 11:45am. (6 revisions, 539 views)