Wednesday, April 6, 2011

Unicode -- Uphill both ways (Ruby Programming pt. 7)

I found this cool article on Unicode (it gets UTF-16 wrong, but that's ok). However, I'm running into a large wall dealing with Unicode in my program, so I'll put the problem out there in the hope that a solution presents itself.

So far my program checks each line of the file to see if it's ASCII-only text. If so, it reverses the line with Ruby's built-in reverse method.
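A minimal sketch of that ASCII branch, assuming Ruby 1.9 or later (where `String#ascii_only?` exists). The method name `reverse_if_ascii` is made up for illustration and is not the program's actual code. It keeps the line ending in place so the reversed line still ends with its newline.

```ruby
# Hypothetical helper: reverse a line only when it is pure ASCII.
# Returns nil for lines that need the Unicode-aware path instead.
def reverse_if_ascii(line)
  return nil unless line.ascii_only?

  body = line.chomp                 # text without the trailing newline
  ending = line[body.length..-1]    # whatever chomp removed ("" or "\n")
  body.reverse + ending
end
```

Returning nil for non-ASCII lines is just one way to signal "handle this some other way"; a boolean check followed by a separate reverse would work as well.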

If not, what I want to do is have it read each hex pair (or group of pairs) and decide how many bytes the character takes. If it is U+007F or below, treat it as plain ASCII and pass the single byte as one element to an array. If it is between U+0080 and U+07FF, take a two-byte chunk and pass it as one element. If it is between U+0800 and U+FFFF, take a three-byte chunk, and if it is between U+10000 and U+10FFFF, take a four-byte chunk (those are the widths UTF-8 uses). Then read the elements of the array first in, last out (FILO), remove the end-of-line marker (\n), and put the elements into another array. Join that array, add an end-of-line element, and write it to the file.
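Here is a rough sketch of that plan, assuming the file is UTF-8, where a character occupies one to four bytes and the first byte of each character tells you how many bytes follow. The names `utf8_width`, `split_utf8` and `reverse_utf8_line` are invented for this example, and it does no validation beyond rejecting an impossible lead byte.

```ruby
# Hypothetical helper: how many bytes the UTF-8 sequence starting
# with this lead byte occupies.
def utf8_width(lead_byte)
  if lead_byte < 0x80 then 1                         # U+0000..U+007F
  elsif lead_byte >= 0xC2 && lead_byte <= 0xDF then 2 # U+0080..U+07FF
  elsif lead_byte >= 0xE0 && lead_byte <= 0xEF then 3 # U+0800..U+FFFF
  elsif lead_byte >= 0xF0 && lead_byte <= 0xF4 then 4 # U+10000..U+10FFFF
  else raise ArgumentError, "not a UTF-8 lead byte: %02X" % lead_byte
  end
end

# Cut an array of byte values into one chunk (array) per character.
def split_utf8(bytes)
  chunks = []
  i = 0
  while i < bytes.length
    width = utf8_width(bytes[i])
    chunks << bytes[i, width]
    i += width
  end
  chunks
end

# Reverse the characters of one line and put a newline back on the end.
def reverse_utf8_line(line)
  chunks = split_utf8(line.chomp.bytes.to_a)
  reversed = chunks.reverse.flatten.pack("C*").force_encoding("UTF-8")
  reversed + "\n"
end
```

For what it's worth, on Ruby 1.9 and later a string that is already tagged as UTF-8 can be reversed per character with `line.chomp.reverse`, so the manual byte walk is mostly useful for seeing what is going on underneath.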

So the first thing I need to do is find a way of reading the hexadecimal values of the characters. After a lot of looking I found a hex editor plugin for Notepad++, and though it doesn't do exactly what I want, I figured something out. The last ASCII character, U+007F, shows up as 7F in the hex view of the file. Apparently Notepad++ hides the 00 of endian-ness. So that's the kind of byte I want to move as one element to another array. And at least for now I can assume that everything from 80 up is part of a two-byte element, until I figure out a way of reading the longer ones. It won't be perfect, but if it works it will be a step.
Now to try it out.
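
The hex values can also be read from inside Ruby instead of a hex editor. A small sketch, assuming Ruby 1.9 or later; `hex_bytes` is a made-up name for illustration.

```ruby
# Hypothetical helper: list the bytes of a string as two-digit hex values.
def hex_bytes(text)
  text.bytes.map { |b| format("%02X", b) }
end

# "A", then U+007F, then U+00E9 (e with acute accent) in a UTF-8 string
puts hex_bytes("A\u007F\u00E9").join(" ")   # prints: 41 7F C3 A9
```

The two ASCII characters take one byte each, and there is no 00 anywhere, while the accented letter takes two bytes that both sit at 80 or above.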
