
Delving into the 7th level of Unicode Hell (Ruby program)

When the three brothers defeated the Titans and had the whole cosmos to divide among themselves, it was the oldest, ᾍδης, who got the Underworld: a kingdom I seem to be quickly descending into on my innocent trip down Ruby programming. I set out to write a little program that would reverse text like ᾍδης, which is particularly useful for Hebrew and Arabic. But then I found myself entering...

A Glance into the Unicode Underworld
Charon ferried me across while I was sleeping, because suddenly my little program is facing a huge issue: Ruby doesn't fully support Unicode. As I searched deeper into this Unicode world I found I was in bigger trouble than I originally thought. ANSI, the common encoding of Ruby, is an 8-bit encoding. So I thought, well, Unicode must be a 16-bit encoding, more bits, more letters -- problem solved! Just find a way to split off two 8-bit chunks at a time and stack them on a new line and voilà! We have the text reversed. I even found a way of doing that. If you have a string, you can pull characters out of it in order by feeding it a position number and how many characters you want: "string"[position_number, number_of_characters]. But alas...
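As a minimal sketch of that slicing idea, here is how "string"[position, count] behaves on plain 8-bit text. The variable names are illustrative only and do not come from the original program, and the sketch assumes ASCII input, where one position equals one character.

```ruby
word = "string"

# Start at position 1 and take 3 characters.
slice = word[1, 3]
puts slice      # tri

# Walk the positions from last to first, taking one character each time,
# then glue the pieces back together.
reversed = (word.length - 1).downto(0).map { |pos| word[pos, 1] }.join
puts reversed   # gnirts
```

This is fine as long as every character occupies exactly one slot, which is exactly the assumption that Unicode text breaks.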

Deeper into the Unicode Mists
If you look at your browser you'll invariably see under the 'View' menu an encoding option for UTF-8. This is the most common encoding method for Unicode and has become the virtual default on the web. And this is the encoding I am using. And here is the rub: it's a multi-byte encoding. A character can be anywhere from one to four bytes long (my little test program turned up lengths of one to three). So I can't just grab 2-byte chunks, because that will invariably result in gibberish. The program needs to read each character and see if it is a one-byte encoding like a plain A, a two-byte encoding like Á, α (alpha), β (beta) or γ (gamma), or a three-byte encoding like the ᾍ in ᾍδης. Because this affects the positions of subsequent characters, it needs to read each byte, figure out if it is part of the previous character, and split it off that way; otherwise it can start splitting midway through a character. And I thought this was going to be easy-peasie-Japanesie.
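A small sketch of such a test, assuming Ruby 1.9 or later (where String#bytesize and String#valid_encoding? exist) and precomposed characters written as escapes so the byte counts are unambiguous. It is not the original test program, only an illustration of the same check.

```ruby
# A, A-acute, alpha, and the first letter of the Greek word for Hades.
samples = ["A", "\u00C1", "\u03B1", "\u1F8D"]
byte_counts = samples.map { |ch| [ch, ch.bytesize] }
byte_counts.each { |ch, n| puts "#{ch} takes #{n} byte(s) in UTF-8" }

# Naive reversal in fixed 2-byte chunks, applied to the 9-byte word for Hades.
hades = "\u1F8D\u03B4\u03B7\u03C2"
naive = hades.bytes.each_slice(2).to_a.reverse.flatten.pack("C*")
naive = naive.force_encoding("UTF-8")

# The chunks cut through the middle of characters, so the result is not valid UTF-8.
puts naive.valid_encoding?   # false
```

The first chunk boundary already falls inside the three-byte character, so every later chunk is misaligned as well.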

Cheating Death?
My confusion really stems from the fact that before UTF-8 there was UCS-2, which was exclusively a two-byte encoding. Unfortunately for me, UCS-2 was superseded by UTF-16, another variable-length encoding. However, the initial UCS-2 set is part of UTF-16, so as long as my file is UTF-16 encoded and contains no characters outside the UCS-2 subset, I should be able to cheat by telling the program to reverse the text in two-byte chunks. This is a little tricky because the program I was using to create the text samples, Notepad++, doesn't encode in UTF-16.
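The cheat could look something like the sketch below, which assumes Ruby 1.9 or later and converts the text in memory with String#encode instead of relying on the editor to save a UTF-16 file. The method name reverse_via_utf16 is hypothetical, and the trick only holds while every character fits in a single two-byte unit.

```ruby
# Reverse text by treating it as fixed two-byte units.
# Assumption: no characters outside the UCS-2 subset (no surrogate pairs).
def reverse_via_utf16(text)
  pairs = text.encode("UTF-16LE").bytes.each_slice(2).to_a
  swapped = pairs.reverse.flatten.pack("C*")
  swapped.force_encoding("UTF-16LE").encode("UTF-8")
end

hades = "\u1F8D\u03B4\u03B7\u03C2"
reversed = reverse_via_utf16(hades)
puts reversed
```

A character above the UCS-2 range would be stored as two units, and reversing them would swap the halves and corrupt it.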

Send a Hero
I went to LA's Ruby Group to ask for advice on this and I got a workaround that solved the issue. Using scan, which turns a string into an array at predetermined "chop" points, I can use Unicode codes as the chop points. That way I go from a string to an array, or list, in which each element is a single Unicode character. This worked only with UTF-8 (not with UTF-16), but it worked flawlessly, allowing me to simply reverse the order of the array. At least for now I can be free of the Chthonic world of Unicode.
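The exact pattern the group suggested isn't recorded here, so the following is only a sketch of the scan approach under the assumption of Ruby 1.9 or later, where a regular expression applied to a UTF-8 string matches whole characters. The method name reverse_characters is illustrative.

```ruby
# Chop the string into an array of single characters, reverse the array,
# and join it back into a string. /./m matches any one character,
# including a newline.
def reverse_characters(text)
  text.scan(/./m).reverse.join
end

hades  = "\u1F8D\u03B4\u03B7\u03C2"   # Greek sample
shalom = "\u05E9\u05DC\u05D5\u05DD"   # Hebrew sample

puts reverse_characters(hades)
puts reverse_characters(shalom)
```

This splits on code points, so an accent stored as a separate combining mark would end up detached from its base letter after the reversal.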

