Tuesday, April 12, 2011

Delving into the 7th level of Unicode Hell (Ruby program)

When the three brothers defeated the Titans and had the whole cosmos to divide among themselves, it was the oldest, ᾍδης (Hades), who got the Underworld. It is a kingdom I seem to be quickly descending into on my innocent trip into Ruby programming. I set out to write a little program that would reverse text like ᾍδης, which is particularly useful for Hebrew and Arabic. But then I found myself entering...

A Glance into the Unicode Underworld
Charon ferried me across while I was sleeping, because suddenly my little program was facing a huge issue: Ruby doesn't fully support Unicode. As I searched deeper into this Unicode world I found I was in bigger trouble than I originally thought. ANSI, the common encoding in Ruby, is an 8-bit encoding. So I thought, well, Unicode must be a 16-bit encoding, more bits, more letters -- problem solved! Just find a way to split off two 8-bit chunks at a time and stack them on a new line and voilà! We have the text reversed. I even found a way of doing that. If you have a string, you can pull characters out of it in order by feeding it a position number and how many characters you want: "string"[position_number, number_of_characters]. But alas...
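Here is a minimal sketch of that slicing call. The variable names are mine, and it assumes a current Ruby, where "string"[position, length] counts characters; byteslice is used to mimic slicing that counts 8-bit bytes, which is the behaviour the paragraph above was relying on.

```ruby
# Slicing by position and length, as described above.
text = "string"
first_two = text[0, 2]                 # two characters starting at position 0
next_two  = text[2, 2]                 # two characters starting at position 2

# byteslice always counts bytes, not characters.
greek = "\u03B1\u03B2\u03B3"           # alpha, beta, gamma: two bytes each in UTF-8
whole_char = greek.byteslice(0, 2)     # both bytes of alpha
half_char  = greek.byteslice(0, 1)     # only the first byte of alpha

puts first_two
puts next_two
puts whole_char
puts half_char.valid_encoding?         # the half character is not valid UTF-8
```

Cutting at the wrong byte gives a fragment that is not a character at all, which is where the trouble in the next section starts.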

Deeper into the Unicode Mists
If you look at your browser you'll invariably see under the 'View' menu an encoding option for UTF-8. This is the most common encoding for Unicode and has become the virtual default on the web. And this is the encoding I am using. And here is the rub: it's a multi-byte encoding. A character can be one to three bytes long, as my little test program shows (the encoding itself allows up to four). So I can't just grab 2-byte chunks, because that will invariably result in gibberish. The program needs to read each character and see whether it is a one-byte encoding like A, a two-byte encoding like Á, α (alpha), β (beta) or γ (gamma), or a three-byte encoding like ᾍ. Because this affects the positions of subsequent characters, it needs to read each byte, figure out whether it is part of the previous character, and split it off that way; otherwise it can start splitting midway through a character. And I thought this was going to be easy-peasie-Japanesie.
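To make the byte counting concrete, here is a small sketch, assuming a current Ruby with UTF-8 strings. It is not the original test program; the sample string and the helper names continuation_byte? and split_utf8 are made up for illustration. It relies on one fact about UTF-8: every byte after the first byte of a character has the bit pattern 10xxxxxx.

```ruby
# Sample: A, A-acute, alpha, and the capital alpha from the word for Hades.
sample = "A\u00C1\u03B1\u1F8D"

# How many bytes does each character take in UTF-8?
sizes = sample.each_char.map { |ch| [ch, ch.bytesize] }
sizes.each { |ch, n| puts "#{ch} takes #{n} byte(s)" }

# A continuation byte looks like 10xxxxxx in binary.
def continuation_byte?(byte)
  (byte & 0b1100_0000) == 0b1000_0000
end

# Walk the bytes and glue each continuation byte onto the previous character.
def split_utf8(str)
  groups = []
  str.each_byte do |b|
    if continuation_byte?(b) && !groups.empty?
      groups.last << b
    else
      groups << [b]
    end
  end
  groups.map { |bytes| bytes.pack("C*").force_encoding("UTF-8") }
end

puts split_utf8(sample).inspect
```

The hand-rolled splitter only illustrates the "is this byte part of the previous character" check; it does no validation of malformed input.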

Cheating Death?
My confusion really stems from the fact that before UTF-8 there was UCS-2, which was exclusively a 2-byte encoding. Unfortunately for me, UCS-2 was superseded by UTF-16, another variable-length encoding. However, the initial UCS-2 set is part of UTF-16, so as long as my file is UTF-16 encoded and contains no characters outside the UCS-2 subset, I should be able to cheat by telling the program to reverse it in two-byte chunks. This is a little tricky because the program I was using to create the text samples, Notepad++, doesn't encode in UTF-16.
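The cheat could look something like the sketch below. It assumes a current Ruby with String#encode, so the conversion to UTF-16 happens in memory rather than in the text editor, and it picks the little-endian flavour (UTF-16LE) arbitrarily. It only holds up while every character fits in a single two-byte unit.

```ruby
word = "\u1F8D\u03B4\u03B7\u03C2"        # the Greek word for Hades, 4 characters

utf16 = word.encode("UTF-16LE")
puts utf16.bytesize                      # two bytes per character here

# Unpack into 16-bit little-endian units, reverse them, and pack them back.
units = utf16.unpack("v*")
reversed = units.reverse.pack("v*").force_encoding("UTF-16LE").encode("UTF-8")
puts reversed

# Outside the UCS-2 subset a character needs two units (four bytes),
# so reversing unit by unit would tear it apart.
emoji_bytes = "\u{1F600}".encode("UTF-16LE").bytesize
puts emoji_bytes
```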

Send a Hero
I went to LA's Ruby Group to ask for advice on this, and I got a workaround that solved the issue. Using scan, which turns a string into an array at predetermined "chop" points, I can use Unicode characters as the chop points. That way I go from a string to an array, or list, of elements, each of which is a single Unicode character. This worked only with UTF-8 (not with UTF-16), but it worked flawlessly, allowing me to simply reverse the order of the array. At least for now I can be free of the Chthonic world of Unicode.
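The exact pattern from the group isn't reproduced here, so the following is only a sketch of the idea, assuming the usual "any single character" regular expression with the u (UTF-8) flag. The m flag is my addition so that newlines are kept as characters too.

```ruby
word = "\u1F8D\u03B4\u03B7\u03C2"        # the Greek word for Hades

# scan returns every match of the pattern; /./ matches one whole character,
# so the result is an array with one Unicode character per element.
chars = word.scan(/./mu)
puts chars.inspect

# Reverse the array and glue it back into a string.
reversed = chars.reverse.join
puts reversed
```

On Ruby 1.9 and later, strings carry their encoding, so for a UTF-8 string word.reverse should give the same result without the scan step.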
