faster and better text search

important update (august 5th, 2009) - i realized that some of the results here (specifically, the java lucene ones) aren't accurate. as the lucene wiki mentions, the first search has to initialize lucene's caches, so timing that first search is misleading. for example, if i run a warm-up query, discard its results, and then run the real query, the times for the three classes of queries drop to 17, 8, and 4 ms respectively - comparable to (and sometimes better than) sphinx. i will probably need to re-run this benchmark to give the backend systems a level playing field.

update (jan 26th, 2009) - as mentioned in the comments, the mysql results aren't very accurate either, because i was probably not searching against the fulltext index properly.

i have a set of ~6000 quotes (verses, if you will), along with multiple translations of each verse. until now, i was searching across these verses using mysql. while this seemed to work, it was very limiting, so i began looking into alternatives.

so i did a bit of research and tried out lucene and sphinx. for lucene, i specifically used the zend implementation (i'll discuss standard java lucene towards the end of this post).

i’ll show the results first and then explain them.

![](http://chart.apis.google.com/chart?chs=350x100&cht=bhs&chd=t:8.030,7.542,8.324|8.304,7.898,11.131|106.254,106.561,108.747|285.759,116.222,224.381&chds=0,400&chco=ff0000,C6D9FD,4D89F9,123456&chxt=y&chxl=0:|query3|query2|query1&chdl=sphinx|sphinx alt|mysql|lucene)

the graph above shows a quick overview of the tests run: a set of 3 different queries was run against 4 different backends. the numbers were generated with apache bench (`ab -n 100 -c 1`, i.e. 100 requests at a concurrency of 1).

backends:

lucene: this was the first implementation. each verse was a “document,” and each translation was a field on that document, so the total number of documents was equal to the number of verses.
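
as a concrete illustration, here's a rough sketch of that document model against the 2.x-era java lucene api (the zend_search_lucene version is analogous). the index path, field names, and sample data below are placeholders, not my actual code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class VerseIndexer {
    public static void main(String[] args) throws Exception {
        // one index for the whole corpus; the path is a placeholder
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/verse-index"),
                new StandardAnalyzer(),
                true,                                   // create a fresh index
                IndexWriter.MaxFieldLength.UNLIMITED);

        // made-up sample data: { verse id, translation a, translation b }
        String[][] verses = {
            { "1", "first translation of verse one", "second translation of verse one" },
            { "2", "first translation of verse two", "second translation of verse two" },
        };

        // one lucene document per verse; each translation is a separate field
        for (String[] verse : verses) {
            Document doc = new Document();
            doc.add(new Field("verse_id", verse[0],
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("translation_a", verse[1],
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("translation_b", verse[2],
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}
```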

sphinx: this was chronologically the second sphinx implementation (see sphinx alt below for the first). it was done purely to make the data model match lucene’s: one document per verse. although this ended up being the fastest (by < 5 ms in the tests run), i prefer the sphinx alt implementation because it is closest to the database schema.

sphinx alt: although it is labeled “sphinx alt” in the graph above, this was really the initial sphinx implementation. in this model, each translation of each verse was its own document, so the total number of documents was (number of translations) * (number of verses). i like this one the most (even though it’s not the fastest) because it is the closest to the current database schema.
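
to make the two sphinx data models concrete, here is a rough sketch of what the sphinx alt source could look like in sphinx.conf. the table and column names are hypothetical, and most housekeeping directives (searchd section, charset settings, etc.) are omitted:

```
source verses_alt
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db   = quotes

    # one sphinx document per (verse, translation) row, so the total
    # document count is (number of verses) * (number of translations)
    sql_query = SELECT id, verse_id, text FROM translations

    # keep the verse id around as an attribute for grouping/filtering
    sql_attr_uint = verse_id
}

index verses_alt
{
    source = verses_alt
    path   = /var/data/sphinx/verses_alt
}
```

the lucene-style “sphinx” model would instead select one row per verse, with each translation as its own column (and thus its own full-text field).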

mysql: this is sort of the baseline, and, to be honest, it isn’t a fair one either. the query used here is along the lines of selecting rows where the text is like ‘%word1%word2%’; this returns far fewer (and less useful) results than either lucene or sphinx, since it only matches word1 followed by word2. one would need “where text like ‘%word1%word2%’ or text like ‘%word2%word1%’” to get a more accurate comparison, but for baseline purposes i simply ran the first query. note that the query cache size is 0 (i.e. the query cache is on but effectively off for this set of tests), and that the text field has a fulltext index on it.
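
for reference, the two flavors of mysql query look roughly like this (table and column names are hypothetical); the second form is the one that actually exercises the fulltext index, which - per the jan 26th update above - is presumably what i should have been measuring:

```sql
-- the baseline query used above: a LIKE scan that only matches rows
-- where word1 appears before word2 (table/column names are hypothetical)
SELECT id, text
FROM translations
WHERE text LIKE '%word1%word2%';

-- a query that actually uses the fulltext index on `text`
SELECT id, text
FROM translations
WHERE MATCH(text) AGAINST('+word1 +word2' IN BOOLEAN MODE);
```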

results: sphinx wins hands down. lucene appears to come in last, but that ranking isn’t really fair because of the type of mysql query being used. from my limited tests with a more complicated sql query, lucene and mysql have comparable performance, and lucene of course has the added benefit of more advanced query options.

the raw numbers for queries 1, 2, and 3 were:

| backend    | query 1    | query 2    | query 3    |
|------------|------------|------------|------------|
| sphinx     | 8.030 ms   | 7.542 ms   | 8.324 ms   |
| sphinx alt | 8.304 ms   | 7.898 ms   | 11.131 ms  |
| mysql      | 106.254 ms | 106.561 ms | 108.747 ms |
| lucene     | 285.759 ms | 116.222 ms | 224.381 ms |

query 1 contained three words (+term1 +term2 +term3), query 2 contained one word (+term4), and query 3 contained two words (+term5 +term6).

additional details: plain vanilla java lucene is usually faster than zend’s lucene implementation. the largest difference is in indexing time (a few seconds in java versus 15+ minutes in php). if i had to index frequently, i’d use java lucene or sphinx because they are insanely faster.

for example, averaged over 100 runs of each query:

| query   | java lucene | zend (php) lucene |
|---------|-------------|-------------------|
| query 1 | 179.84 ms   | 272.61 ms         |
| query 2 | 173.22 ms   | 103.30 ms         |
| query 3 | 178.98 ms   | 214.78 ms         |

php only won on the second query, which also happens to be the simplest one. two things to note: first, the times here don’t include jvm or php interpreter startup; i simply take the time immediately before and after the search call and report the difference (see the sketch below). second, unlike the first test, this was all run from the command line rather than via the web (i didn’t want to bother setting up tomcat or solr, etc.).
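
roughly speaking, the java measurement looks like the sketch below (again written against the 2.x-era lucene api; the index path, field name, and query terms are placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SearchTimer {
    public static void main(String[] args) throws Exception {
        // index path, field name, and query terms are placeholders
        IndexSearcher searcher = new IndexSearcher("/tmp/verse-index");
        QueryParser parser = new QueryParser("translation_a", new StandardAnalyzer());
        Query query = parser.parse("+term1 +term2 +term3");

        // time only the search call itself, not jvm startup or index opening
        long start = System.currentTimeMillis();
        TopDocs results = searcher.search(query, null, 100);
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(results.totalHits + " hits in " + elapsed + " ms");
        searcher.close();
    }
}
```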

just for fun, i implemented the “sphinx alt” data scheme in java lucene as well and re-ran the 3 tests 100 times each. the results were 178.54 ms, 160.20 ms, and 172.72 ms - very much in line with the java lucene numbers for the one-document-per-verse schema above.

the summary of this very long post in 2 words: sphinx rocks.
