21 December 2011

Unicode Support Better Now in Ruby

Unicode support in Ruby has historically been one of Ruby's major problems for me.

With the advent of Ruby 1.9 of course, Unicode support started being added to the language.  However, it's not as straightforward as Java, which supported some version of Unicode from the beginning.

Even though it's been a rough decade, things are finally looking up.  In fact, I actually like the way things got factored after all of the mess.

The thing I like is that the minimal amount of support is included in the standard library, and it's easy to compose things in non-standard ways for weird scenarios or data in improperly encoded formats.

The core library has support for strings of codepoints and bytes and a flexible set of encoding facilities.

In addition, there are two libraries of interest:
  • unicode_utils - includes implementations of word-break functionality, grapheme boundaries, etc.
  • jcode (if you're stuck on ruby 1.8.x)

This series of posts gives you a full understanding of the topic.  Highly recommended!

This post gives a high-level view of where things were at around 2006, much of which is valuable background.

This post has a good summary of unicode-related resources, as does this stackoverflow question.

No comments:

Post a Comment