so, i am a night owl...

funny. courtesy of tweet o’clock.

ubiquity rocks!

today, i felt like playing some more with ubiquity, which i’ve had installed for a while now but hadn’t played around with enough. i decided to try to write a simple plugin that searches the quran for a particular set of words. to do this, i felt obliged to expose an api for the alpha version of quranicrealm first, which was good because i needed to do it eventually anyway.

and here’s the mandatory screenshot: ubiquity - quran search preview

it still needs a lot of work… things i still want to do if i get around to it:

  • add a favicon (for the site and for the plugin)

  • more options (ex, “search english,” or “search transliteration,” etc)

  • replace the current text with a link (or translation). this would be useful in im conversations or while writing blog posts.

  • a “get-ayah” command (to say, “get ayah 1 of sura fatiha in arabic,” for example).

anyway, i’ll post up the code when i’ve added some improvements insha’Allah. if you want it before then, post a comment.

thoughts on the g1

for some (crazy) reason, i decided to try out the g1 after reading gina’s article and finding a good deal on craigslist. the summary is - i think i am going to sell it and keep my iphone :)

thoughts so far:

  • it’s nice to have a keyboard - but something doesn’t feel right about it. i can’t quite put my finger on it yet.

  • no arabic fonts in the browser, and no arabization!! :( iphone doesn’t have it either, but third party solutions (iphone islam) exist that work very well.

  • touch screen isn’t multi touch. also, you have to press with your finger (not the finger tip) - pressing with the finger tip is useless.

  • gmail app totally rocks

  • integration with google (for gmail and calendar) rocks

  • integration with contacts is HORRIBLE. seriously. gmail, as you may know, makes a contact for every person you email. so if you sync your contacts as is with the phone, you’re looking at a ridiculous set of contacts. moreover, if you use the built in google syncing within address book, your address book gets sullied with all these random contacts, duplicates, etc. not very cool. i worked around this by using ab2csv exporter, exporting a csv, and importing it into google contacts under a specific group, then only syncing that group with the phone.

  • one odd caveat - you have to use the supplied usb cable to connect the phone to the pc and be able to mount the micro sd card. using any normal cable you may have (from a camera, for example) will charge the phone, but won’t work for mounting the micro sd card. took me a while to figure this out.

i upgraded the firmware, but haven’t played with google latitude yet (nor with the gps).

i may play with it some more, but for the time being, i am thinking of selling this and sticking with my iphone. there are definitely some nice things about the g1 that are missing from the iphone - gears, the fact it runs linux, development seems to be easier (java based), cut and paste, better camera, built in voice dialer, etc. but the iphone feels a lot more polished.

update - “compare everywhere” app rocks - iphone has an equivalent (snaptell, and the amazon app is good too), but it doesn’t scan barcodes. the english to arabic dictionary actually renders properly shaped arabic. apparently, some people have gotten arabic (the font and shaping) to work (though i am not sure if it’s throughout all the apps or not). they haven’t documented it all yet but should soon. i doubt it’ll be to the extent that arabization is done on the iphone, however. battery life is sub-par - went from 100% to 82% in a few minutes by installing and trying a handful of apps.

summary - iphone (even the first gen) still wins.

faster and better text search

important update (august 5th, 2009) - i realized that some of the results here (specifically, the java lucene ones) were incorrect. as mentioned on the lucene wiki, the first search has to initialize the caches, so timing that first search skews the numbers. this seems to be very true: if i run a test query, discard the results, and then run the real query, the times for the three classes of queries are 17, 8, and 4 ms respectively, which is comparable to (and sometimes better than) sphinx. i will probably need to re-run this benchmark and do a better job of giving the backend systems a level playing field.
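to make the warm-up point concrete, here is a minimal ruby sketch of timing a query only after one or more untimed warm-up runs. this is not the code used for the numbers in this post; `average_ms` and `fake_search` are hypothetical names made up for the illustration, and the fake search just stands in for a real backend call.

```ruby
require 'benchmark'

# average time, in milliseconds, of calling the given block `runs` times,
# after `warmup` untimed calls whose results are thrown away
def average_ms(runs: 100, warmup: 1)
  warmup.times { yield }                               # lets caches initialize; not timed
  total = Benchmark.realtime { runs.times { yield } }  # total seconds for all timed runs
  total * 1000.0 / runs
end

# stand-in for a real search call (hypothetical; swap in the actual backend query)
fake_search = ->(query) { query.split.map(&:downcase) }

avg = average_ms { fake_search.call('+term1 +term2 +term3') }
puts format('%.3f ms per query', avg)
```

the only design choice here is keeping the warm-up calls outside of the timed region, so that one-time costs like cache initialization don’t leak into the average.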

update (jan 26th, 2009) - as mentioned in the comments, the mysql results aren’t very accurate either because i was probably not properly searching against the index.

i have a set of ~6000 quotes (verses, if you will), along with a multiple set of translations for each of those verses. before, i was searching across these verses using mysql. while this seemed to work, it was very limiting, and i began looking into alternatives.

so i did a little bit of research and tried out lucene and sphinx. for lucene, i specifically used the zend version (i’ll discuss standard lucene (java) towards the end of this post.)

i’ll show the results first, and then explain them after.

the graph above shows a quick overview of the tests run. a set of 3 different queries were run against 4 different backends. the numbers were generated using apache bench (ab) using 100 requests with a concurrency of 1.

backends: lucene: this was the first implementation. in it, each verse was a “document.” each translation was a property of the document. the total number of documents was thus equivalent to the number of verses.

sphinx: this was the second sphinx implementation (see sphinx alt below for the first implementation). this implementation was just done to make the data model similar to that of lucene, which is exactly what it is. although this ended up being the fastest (by < 5ms in the tests run), i prefer the sphinx alt implementation because it’s closest to that of the database schema.

sphinx alt: although it is named “sphinx alt” in the graph above, this is really the initial sphinx implementation. in this model, a translation of one verse was a document. consequently, the total number of documents was (number of translations) * (number of verses). i sort of like this one most (even though it’s not the fastest) because it is the closest to the current database schema.
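to illustrate the difference between the two data models, here is a small ruby sketch. the sample data, field names, and method names are all made up for the illustration (they are not the real schema); the point is only how the document counts differ.

```ruby
# hypothetical sample data: verse id => { translation name => text }
VERSES = {
  1 => { 'translation_a' => 'verse one, translation a',
         'translation_b' => 'verse one, translation b' },
  2 => { 'translation_a' => 'verse two, translation a',
         'translation_b' => 'verse two, translation b' }
}

# "lucene" / "sphinx" model: one document per verse, each translation is a field
def docs_per_verse(verses)
  verses.map { |id, translations| { id: id }.merge(translations) }
end

# "sphinx alt" model: one document per (verse, translation) pair
def docs_per_translation(verses)
  verses.flat_map do |id, translations|
    translations.map { |name, text| { verse_id: id, translation: name, text: text } }
  end
end

puts docs_per_verse(VERSES).size        # number of verses
puts docs_per_translation(VERSES).size  # number of translations * number of verses
```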

mysql: this is sort of the baseline, and, to be honest, it’s not fair either. the query used here is something in the nature of getting the row where the text is like ‘%word1%word2%’; the number of results returned by this is far smaller (and the results less valuable) than what either lucene or sphinx returns, because it only matches word1 followed by word2. one would need to do “where text like ‘%word1%word2%’ or text like ‘%word2%word1%’” to get a more accurate estimate, but for baseline purposes, i simply ran the first query. note that the query cache size is 0 (ie query cache is on but effectively off for this set of tests), and that the text field has a fulltext index on it.
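here is a rough ruby sketch of why the single like pattern misses results. it emulates like matching with a regex instead of talking to mysql, so treat it as an approximation: it assumes case-insensitive matching, only handles the % and _ wildcards, and the method names are hypothetical.

```ruby
# translate a sql like pattern into an anchored ruby regex
# (% matches any run of characters, _ matches a single character)
def like_to_regex(pattern)
  body = pattern.each_char.map do |ch|
    case ch
    when '%' then '.*'
    when '_' then '.'
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE | Regexp::MULTILINE)
end

def like?(text, pattern)
  like_to_regex(pattern).match?(text)
end

# the "more accurate" version: accept the two words in either order
def like_any_order?(text, word1, word2)
  like?(text, "%#{word1}%#{word2}%") || like?(text, "%#{word2}%#{word1}%")
end

text = 'guidance and mercy'
puts like?(text, '%mercy%guidance%')               # false, the words are in the other order
puts like_any_order?(text, 'mercy', 'guidance')    # true
```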

results: sphinx wins hands down. however, although it seems that lucene comes in last, this is not really accurate because of the type of mysql query being used. from my limited tests (using a more complicated sql query), lucene and mysql have comparable performance, but lucene of course has the added benefit of more advanced query options, etc.

times for queries 1, 2, and 3, respectively:

  • sphinx: 8.030 ms, 7.542 ms, 8.324 ms

  • sphinx alt: 8.304 ms, 7.898 ms, 11.131 ms

  • lucene: 285.759 ms, 116.222 ms, 224.381 ms

  • mysql: 106.254 ms, 106.561 ms, 108.747 ms

query 1 contained three words (+term1 +term2 +term3), query 2 contained one word (+term4), and query 3 contained two words (+term5 +term6).

additional details: plain vanilla java lucene is usually faster than zend’s lucene implementation. the largest difference can be noted in indexing times (a few seconds for java versus 15+ minutes in php). if i had to index frequently, i’d use java lucene or sphinx because they are insanely faster.

for example, the first query takes 179.84 ms on average in java (over 100 queries) versus about 272.61 ms on average for php. the second query takes 173.22 ms on average in java versus about 103.30 ms in php. the third query takes 178.98 ms on average in java versus about 214.78 ms in php.

php only won at the second query, which also happens to be the simplest query. two things to note - first, the times here don’t include the jvm or php interpreter start times. these are times reported by taking the time before and after the search call and displaying them. second, unlike the first test, this was all run from the command line and not directly via web (didn’t want to bother setting up tomcat or solr, etc).

just for fun, i implemented the “sphinx alt” data scheme in java lucene as well and re-ran the 3 tests 100 times each. the results were 178.54 ms, 160.20 ms, and 172.72 ms - very much comparable to the results with the alternate schema.

the summary of this very long post in 2 words: sphinx rocks.

pseudo back from hiatus

it’s been a while since i last posted… i guess microblogging (twitter, brightkite, etc) has sort of gotten the majority of my posts, and i’ve neglected the blog.

the main reason i like the blog is it’s interesting to look back and see what i was writing about n years ago.  it just serves as a good documentation.  twitter/brightkite do the job as well (esp with uploading pictures to brightkite on the go and tagging location), but the blog allows me to be less concise.

so yeah, perhaps i will resume the posts shortly ™.

back-tagging the site

so i went through and tried to back-tag all the posts on the site… more difficult than it sounds, especially since i was at a loss as to how to tag some of the things (especially in a consistent manner). i am not showing a tag cloud yet, but maybe soon ™.

i’ve also added links to my twitter and flickr in the sidebar. i guess i should start looking into some wordpress plugins at some point in the near future for some of the stuff in the sidebar.

update - tried showing a tag cloud, but i don’t really like it… things i don’t care about surfacing get surfaced because they end up being “bucket” categories and things i do care about don’t :p

i like the tagging so i can have, for example, snippets as the tag for all the posts that have code snippets in them, pictures as the tag for all posts that have pictures in them, and so on… but if i think about the correctness and consistency of my tags too much, it starts to bother me :p

simple is beautiful - command line id3 tagging

generally speaking, my set of mp3s is very well tagged. for my personal mp3s, i used to exclusively use easytag to tag them, and now i use a combination of easytag and amarok (which is totally awesome by the way!)

but sometimes, i have to mass edit id3 tags for mp3s on the server, and i don’t have the luxury of using such gui tools for the editing. as such, i’ve been mainly using id3v2 within some perl scripts to tag mp3s. this turns out to work great, but i also wanted to be able to add album art to the mp3s from the command line.

i couldn’t figure out how to do it using id3v2 (perhaps using the custom frames, there’s a way, but nothing extremely simple and obvious from what i was looking at). then i found the solution in the form of id3lib-ruby, a ruby wrapper for id3lib, the same library that id3v2 is based on.

with this, everything turns out to be extremely easy -

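require 'id3lib'
# open the tag of the mp3 to edit
tag = ID3Lib::Tag.new('myfile.mp3')
# attached picture (APIC) frame holding the image data
cover = {
   :id => :APIC,
   :mimetype => 'image/jpeg',
   :picturetype => 3,
   :data => File.read('cover.jpg')
}
tag << cover
tag.update!   # write the changes back to the file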

and that’s it. nice and simple. by the way, a picturetype of 3 denotes a front cover and is the default value (just learned that from a quick search). oh and the output mp3 image cover shows up fine in both linux and on itunes. beautiful!

dealing with timezones in php

so i was working on some code in which i needed to know whether or not it was dst for a given country and/or timezone. luckily, with php 5.2, some sparsely documented (yet very useful) classes were introduced - more thorough documentation can be found here.

so let’s say i want to know whether or not egypt is in dst right now… first i need to know what zoneinfo file egypt uses (for egypt, it’s simple, but this trick is useful for more obscure places, like “isle of man,” for example):

cd /usr/share/zoneinfo
grep -i egypt iso*.tab        # get the iso country code for egypt

# the above command returns 'EG' - so...
grep EG zone.tab              # look up the zoneinfo file(s) for that country code
# returns 'Africa/Cairo'

many countries have more than one timezone. usually it’s obvious which file you need, but in some cases it’s not. in those cases, i found it helpful to open the binary zoneinfo file and look at the very last line, which gives a hint about the offset of the timezone.

anyway… once you have the zoneinfo file that you would use, it’s very easy to find whether or not you are in dst (well, assuming that you know what the standard, non-dst offset from utc is). for example:

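$tz = new DateTimeZone('America/New_York');
$date = new DateTime('now', $tz);        // pass the timezone in, otherwise the server default is used
echo $date->format(DATE_RFC3339) . "\n";
echo $date->getOffset() / 3600 . "\n";   // offset from utc, in hours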

running this returns the time in new york, and the offset (-4). since the standard est offset is -5 hours, -4 means we’re +1 which means we are currently on dst.

so if you don’t know the standard offset, another trick that you could do is pass some parameters to the new DateTime() constructor - so for example…

$tz = new DateTimeZone('America/New_York');
$date = new DateTime('2008-12-31', $tz);   // a date that is well outside of dst
echo $date->getOffset() / 3600 . "\n";

this returns -5, which is the standard (non-dst) offset. so if you don’t know the standard offset for a timezone, you can check 2 dates - one towards the middle of the year (july-ish) and one towards the end of the year (december-ish). if the offsets are different, the place probably has dst.

also, do note that some places have things a little differently - so dst in windhoek, namibia, for example, ends in april and starts in september.
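the july/december comparison isn’t tied to php. here is a rough ruby sketch of the same idea, using only the standard library. it assumes a unix-like system with zoneinfo files installed (it works by temporarily setting the TZ environment variable), and both method names are made up for the illustration. it also assumes dst moves the clock forward, so the smaller of the two offsets is treated as the standard one.

```ruby
# offset from utc, in hours, at local noon on the given date in zone_name.
# temporarily sets ENV['TZ'], which assumes a unix-like system with tzdata.
def utc_offset_hours(zone_name, year, month, day)
  saved = ENV['TZ']
  ENV['TZ'] = zone_name
  Time.local(year, month, day, 12).utc_offset / 3600.0
ensure
  ENV['TZ'] = saved   # restore whatever was set before
end

# given a mid-year and an end-of-year offset, guess whether the zone has dst
# and what its standard offset is (assumed to be the smaller of the two)
def dst_summary(july_offset, december_offset)
  {
    has_dst: july_offset != december_offset,
    standard_offset: [july_offset, december_offset].min
  }
end

july     = utc_offset_hours('America/New_York', 2008, 7, 1)
december = utc_offset_hours('America/New_York', 2008, 12, 31)
p dst_summary(july, december)
```

taking the minimum (rather than always using the december value) is what keeps this working for places where dst falls at the end of the year.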

whatstheplot: iphone edition (tm)

thanks to the iwphone wordpress plugin, the blog now looks a lot better on the iphone!

speaking of the iphone, i am really disappointed at the notion that the long awaited sdk coming out on 3/6 will potentially be locked down. i guess we can’t know for sure until the announcement on thursday, but i personally have gone ahead and re-jailbroken my phone, courtesy of zibri’s ziphone.

unicode control characters

i always used to get upset when i send a message or set a status on an english site in arabic, only to have the punctuation all messed up. well, thanks to two unicode control characters, \u200e and \u200f (for ltr and rtl, respectively), i can finally go from writing:

يحيى الإسلام!

to writing:

يحيى الإسلام!‏

much better :) thanks go to this wikipedia article and adil allawi, whom i first heard about this from.
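if you want to do this from code rather than pasting the character by hand, here is a minimal ruby sketch. the constant and method names are just ones i made up for the illustration; the only real ingredients are the two code points mentioned above.

```ruby
LRM = "\u200E" # left-to-right mark
RLM = "\u200F" # right-to-left mark

# append a right-to-left mark unless the text already ends with one
def with_rtl_mark(text)
  text.end_with?(RLM) ? text : text + RLM
end

status = with_rtl_mark('يحيى الإسلام!')
puts status.codepoints.last.to_s(16)   # prints 200f
```

the mark itself is invisible; it just tells the renderer to treat the trailing punctuation as part of the right-to-left run.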