Jan 25, 2009 · 5 minute read
code, technology
important update (august 5th, 2009) - so i realized that some of the results here (specifically, the java lucene ones) were incorrect. the reason is that, as mentioned on the lucene wiki, the first search has to initialize the caches, so timing that first search isn’t representative. this seems to be very true. for example, if i run a test query, discard the results, and then run the real query, results for the three classes of queries are now 17, 8, and 4 ms respectively, which is very comparable to (if not sometimes better than) that of sphinx. i will probably need to re-run this benchmark to do a better job of giving the backend systems a level playing field to test on.
update (jan 26th, 2009) - as mentioned in the comments, the mysql results aren’t very accurate either because i was probably not properly searching against the index.
i have a set of ~6000 quotes (verses, if you will), along with multiple translations for each of those verses. before, i was searching across these verses using mysql. while this seemed to work, it was very limiting, and i began looking into alternatives.
so i did a little bit of research and tried out lucene and sphinx. for lucene, i specifically used the zend version (i’ll discuss standard lucene (java) towards the end of this post.)
i’ll show the results first, and then explain them after.

the graph above shows a quick overview of the tests run. a set of 3 different queries was run against 4 different backends. the numbers were generated using apache bench (ab) using 100 requests with a concurrency of 1.
backends:
lucene: this was the first implementation. in it, each verse was a “document.” each translation was a property of the document. the total number of documents was thus equivalent to the number of verses.
sphinx: this was the second sphinx implementation (see sphinx alt below for the first implementation). it was done just to make the data model match that of lucene (one document per verse). although this ended up being the fastest (by < 5ms in the tests run), i prefer the sphinx alt implementation because it’s closest to the database schema.
sphinx alt: although it is named “sphinx alt” in the graph above, this is really the initial sphinx implementation. in this model, a translation of one verse was a document. consequently, the total number of documents was (number of translations) * (number of verses). i sort of like this one most (even though it’s not the fastest) because it is the closest to the current database schema.
mysql: this is sort of the baseline, and, to be honest, it’s not fair either. the query used here is something in the nature of getting the row where the text is like ‘%word1%word2%’; this returns far fewer (and less valuable) results than either lucene or sphinx, because it only matches rows in which word1 comes before word2. one would need to do “where text like ‘%word1%word2%’ or text like ‘%word2%word1%’” to get a more accurate estimate, but for baseline purposes, i simply ran the first query. note that the query cache size is 0 (ie query cache is on but effectively off for this set of tests). note also that the text field has a fulltext index on it (though, per the update above, a like query probably doesn’t search against it).
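to make the word order point concrete, here is a small sketch. it uses python's built-in sqlite3 instead of mysql (an assumption made only so that it runs anywhere), and the table, column, and rows are made up for illustration. the like patterns are the same two from the mysql description above.

```python
import sqlite3

# hypothetical in-memory table standing in for the real verses table
conn = sqlite3.connect(":memory:")
conn.execute("create table verses (id integer primary key, verse_text text)")
conn.executemany(
    "insert into verses (verse_text) values (?)",
    [
        ("word1 comes before word2 here",),
        ("word2 comes before word1 here",),
        ("only word1 appears here",),
    ],
)


def matching_ids(where_clause):
    """return the ids of the rows matching the given where clause, in id order."""
    sql = "select id from verses where " + where_clause + " order by id"
    return [row[0] for row in conn.execute(sql)]


# the single pattern only matches when word1 appears before word2
one_order = matching_ids("verse_text like '%word1%word2%'")

# or-ing the reversed pattern picks up the other ordering as well
both_orders = matching_ids(
    "verse_text like '%word1%word2%' or verse_text like '%word2%word1%'"
)

print(one_order)    # [1]
print(both_orders)  # [1, 2]
```

this says nothing about speed, only about why the single-pattern query returns fewer rows than a real search engine would.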
results:
sphinx wins hands down. however, although it seems that lucene comes in last, this is not really accurate because of the type of mysql query being used. from my limited tests (using a more complicated sql query), lucene and mysql have comparable performance, but lucene of course has the added benefit of more advanced query options, etc.
query 1 contained three words (+term1 +term2 +term3), query 2 contained one word (+term4), and query 3 contained two words (+term5 +term6). times for queries 1, 2, and 3, respectively:
sphinx: 8.030 ms, 7.542 ms, 8.324 ms
sphinx alt: 8.304 ms, 7.898 ms, 11.131 ms
lucene: 285.759 ms, 116.222 ms, 224.381 ms
mysql: 106.254 ms, 106.561 ms, 108.747 ms
additional details:
plain vanilla java lucene is usually faster than zend’s lucene implementation. the largest difference can be noted in indexing times (a few seconds for java versus 15+ minutes in php). if i had to index frequently, i’d use java lucene or sphinx because they are insanely faster.
for example, the first query takes 179.84 ms on average in java (over 100 queries) versus about 272.61 ms on average for php. the second query takes 173.22 ms on average in java versus about 103.30 ms in php. the third query takes 178.98 ms on average in java versus about 214.78 ms in php.
php only won at the second query, which also happens to be the simplest query. two things to note - first, the times here don’t include the jvm or php interpreter start times. these are times reported by taking the time before and after the search call and displaying them. second, unlike the first test, this was all run from the command line and not directly via web (didn’t want to bother setting up tomcat or solr, etc).
just for fun, i implemented the “sphinx alt” data scheme in java lucene as well and re-ran the 3 tests 100 times each. the results were 178.54 ms, 160.20 ms, and 172.72 ms - very much comparable to the java results with the original schema.
the summary of this very long post in 2 words: sphinx rocks.
Jul 1, 2008 · 1 minute read
random
it’s been a while since i last posted… i guess microblogging (twitter, brightkite, etc) has sort of gotten the majority of my posts, and i’ve neglected the blog.
the main reason i like the blog is that it’s interesting to look back and see what i was writing about n years ago. it just serves as good documentation. twitter/brightkite do the job as well (esp with uploading pictures to brightkite on the go and tagging location), but the blog allows me to be less concise.
so yeah, perhaps i will resume the posts shortly (tm).
Mar 18, 2008 · 1 minute read
website
so i went through and tried to back-tag all the posts on the site… more difficult than it sounds, especially since i was at a loss as to how to tag some of the things (especially in a consistent manner). i am not showing a tag cloud yet, but maybe soon (tm).
i’ve also added links to my twitter and flickr in the sidebar. i guess i should start looking into some wordpress plugins at some point in the near future for some of the stuff in the sidebar.
update - tried showing a tag cloud, but i don’t really like it… things i don’t care about surfacing get surfaced because they end up being “bucket” categories and things i do care about don’t :p
i like the tagging so i can have, for example, snippets as the tag for all the posts that have code snippets in them, and pictures as the tags for all posts that have pictures in them, and so on… but i guess kind of like del.icio.us, if i think about the correctness and consistency of my tags too much, it starts to bother me :p
Mar 16, 2008 · 2 minute read
code
generally speaking, my set of mp3s is very well tagged. for my personal mp3s, i used to exclusively use easytag to tag them, and now i use a combination of easytag and amarok (which is totally awesome by the way!)
but sometimes, i have to mass edit id3 tags for mp3s on the server, and i don’t have the luxury of using such gui tools for the editing. as such, i’ve been mainly using id3v2 within some perl scripts to tag mp3s. this turns out to work great, but i also wanted to be able to add album art to the mp3s from the command line.
i couldn’t figure out how to do it using id3v2 (perhaps there’s a way using the custom frames, but nothing extremely simple and obvious from what i was looking at). then i found the solution in the form of id3lib-ruby, a ruby wrapper for id3lib, the same library that id3v2 is based on.
with this, everything turns out to be extremely easy -
require 'id3lib'
tag = ID3Lib::Tag.new('myfile.mp3')
cover = {
  :id => :APIC,
  :mimetype => 'image/jpeg',
  :picturetype => 3,
  :data => File.read('cover.jpg')
}
tag << cover
tag.update!
and that’s it. nice and simple. by the way, a picturetype of 3 denotes a front cover and is the default value (just learned that from a quick search). oh and the output mp3 image cover shows up fine in both linux and on itunes. beautiful!
Mar 13, 2008 · 2 minute read
code
so i was working on some code in which i needed to know whether or not it was dst for a given country and/or timezone. luckily, with php5.2, some sparsely documented (yet very useful) classes were introduced - more thorough documentation can be found here.
so let’s say i want to know whether or not egypt is in dst right now… so first i need to know what zoneinfo file egypt uses (for egypt, it’s simple, but this trick is useful for more obscure places, like “isle of man,” for example):
cd /usr/share/zoneinfo
grep -i egypt iso*.tab # get the iso country code for egypt
# the above command returns 'EG' - so...
grep EG zone.tab
# returns 'Africa/Cairo'
a given country often has several timezones. in many cases, it’s obvious which file you need, but in some cases, it’s not. in those cases, i found it helpful to open the binary files and look at the very last line, in which some hint about the offset of the timezone is given.
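the same two lookups (all zones for a country code, and the last line of a zoneinfo file) can be sketched in python. this is only an illustration: the function names are made up, the sample zone.tab text is abbreviated, and the sample bytes are fake data shaped like a zoneinfo file. as far as i know, the trailing text line is only present in newer-format zoneinfo files, so treat that part as an assumption.

```python
def zones_for_country(zone_tab_text, country_code):
    """list the zone names that zone.tab-style text gives for a country code."""
    zones = []
    for line in zone_tab_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")  # code, coordinates, zone name, [comment]
        if len(fields) >= 3 and fields[0] == country_code:
            zones.append(fields[2])
    return zones


def last_line(zoneinfo_bytes):
    """return the last non-empty line of a binary zoneinfo file as text."""
    lines = [part for part in zoneinfo_bytes.split(b"\n") if part]
    return lines[-1].decode("ascii", errors="replace") if lines else ""


# abbreviated sample in the zone.tab layout (tab-separated columns)
sample_zone_tab = (
    "# country-code\tcoordinates\tTZ\tcomments\n"
    "EG\t+3003+03115\tAfrica/Cairo\n"
    "US\t+404251-0740023\tAmerica/New_York\tEastern (most areas)\n"
    "US\t+415100-0873900\tAmerica/Chicago\tCentral (most areas)\n"
)

# fake bytes standing in for a real file's contents
sample_zoneinfo = b"TZif2\x00\x00\x01\x02\nEST5EDT,M3.2.0,M11.1.0\n"

print(zones_for_country(sample_zone_tab, "EG"))  # ['Africa/Cairo']
print(zones_for_country(sample_zone_tab, "US"))  # ['America/New_York', 'America/Chicago']
print(last_line(sample_zoneinfo))                # EST5EDT,M3.2.0,M11.1.0
```

on a real system you would feed these `open('/usr/share/zoneinfo/zone.tab').read()` and `open('/usr/share/zoneinfo/Africa/Cairo', 'rb').read()` instead of the samples.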
anyway… once you have the zoneinfo file that you would use, it’s very easy to find whether or not you are in dst (well, assuming that you know what the standard, non-dst offset from utc is). for example:
$tz = new DateTimeZone('America/New_York');
$date = new DateTime();
$date->setTimezone($tz);
echo $date->format(DATE_RFC3339) . "\n";
echo $date->getOffset()/3600 . "\n";
running this returns the time in new york, and the offset (-4). since the standard est offset is -5 hours, -4 means we’re +1 which means we are currently on dst.
so if you don’t know the standard offset, another trick that you could do is pass some parameters to the new DateTime() constructor - so for example…
$tz = new DateTimeZone('America/New_York');
$date = new DateTime('2008-12-31');
$date->setTimezone($tz);
echo $date->getOffset()/3600 . "\n";
this returns -5, which is the non-dst offset. so if you don’t know the standard offset for a timezone, you can pass in 2 dates - something towards the middle of the year (july-ish) and something towards the end of the year (december-ish). if the offsets are different, the place probably has dst.
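the two-dates trick is not php specific. below is a rough python equivalent, assuming python 3.9+ (for the standard zoneinfo module) and a system timezone database. the function names are made up for this sketch.

```python
from datetime import datetime


def offset_hours(tz, year, month, day):
    """utc offset in hours of the given tzinfo at noon on the given date."""
    moment = datetime(year, month, day, 12, tzinfo=tz)
    return moment.utcoffset().total_seconds() / 3600


def probably_has_dst(tz, year):
    """compare a july-ish offset against a december-ish one."""
    return offset_hours(tz, year, 7, 1) != offset_hours(tz, year, 12, 31)


def zone_probably_has_dst(zone_name, year):
    """same check by zone name; needs python 3.9+ and tz data on the system."""
    from zoneinfo import ZoneInfo  # imported here so the rest works without it

    return probably_has_dst(ZoneInfo(zone_name), year)
```

for example, `zone_probably_has_dst('America/New_York', 2008)` compares the -4 and -5 offsets from above. like the php version, this only says that the offset changes during the year, not which half of the year is the dst half.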
also, do note that some places have things a little differently - so dst in windhoek, namibia, for example, ends in april and starts in september.
Mar 2, 2008 · 1 minute read
technology, website
thanks to the iwphone wordpress plugin, the blog now looks a lot better on the iphone!
speaking of the iphone, i am really disappointed at the notion that the long awaited sdk coming out on 3/6 will potentially be locked down. i guess we can’t know for sure until the announcement on thursday, but i personally have gone ahead and re-jailbroken my phone, courtesy of zibri’s ziphone.
Mar 2, 2008 · 1 minute read
technology
i always used to get upset when i sent a message or set a status in arabic on an english site, only to have the punctuation all messed up. well, thanks to two unicode control characters, \u200e and \u200f (for ltr and rtl, respectively), i can finally go from writing:
يحيى الإسلام!
to writing:
يحيى الإسلام!
much better :) thanks goes to this wikipedia article and adil allawi, whom i first heard about this from.
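in code, the fix amounts to appending the invisible mark after the trailing punctuation. a minimal python sketch (the helper name is made up, and whether a given site keeps the character is not something this checks):

```python
LRM = "\u200e"  # left-to-right mark
RLM = "\u200f"  # right-to-left mark


def with_trailing_mark(text, mark=RLM):
    """append a directional mark so trailing punctuation follows the text's direction."""
    return text if text.endswith(mark) else text + mark


status = with_trailing_mark("يحيى الإسلام!")

# the mark is invisible, so the only visible change is where the "!" is drawn
print(len(status) - len("يحيى الإسلام!"))  # 1
```

the idea is that the "!" now sits between right-to-left characters on both sides, so it gets laid out with the arabic text instead of with the surrounding english page.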
Feb 13, 2008 · 1 minute read
gaming
i just beat quake4 (played the whole game under linux). fun!
Feb 10, 2008 · 1 minute read
random
woot! egypt won the 2008 africa cup of nations! beautiful game, i only saw the last 20 minutes, but those were the 20 minutes that mattered. beautiful assist by zidane in a crazily intense moment, passing to abu trayka who scored :) incredible work of art!
i am not a big fan of sports, but this was one game i had to watch :) second cup win in a row, the last being against the ivory coast in 2006.
Feb 9, 2008 · 1 minute read
work
woot! word has it that monday, yahoo will write a “sorry” to microsoft. now let’s just hope they don’t give away our search advertising (and, therefore, search in general) to google… as i mentioned before, the best option here is to tough it out through this period of time, and things will come back up.
in lolcat terms, courtesy of the flickr anti-microsoft group: this picture is awesome :)