Charlie Harvey

Perl character encoding and UTF-8 in brief

Character encoding. It's enough to make grown people break down and weep bitter tears of rage. Perl provides rather good UTF-8 support, and there are some excellent tutorials out there. So my contribution is more of a checklist and cookbook than a technical explanation.

UTF-8 in your source code

If you need to include UTF-8 characters in your actual code, you can just

    use utf8;

Simple enough, and it means that you can use crazy identifiers in your source. Like this:

    my $ᚠᛇᚻ = "На берегу пустынных волн";
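Here is a minimal sketch of what the pragma buys you, assuming the script file itself is saved as UTF-8. The variable `$char_count` is just an illustrative name. With `use utf8` in place, `length` counts characters in the literal rather than the bytes that happen to encode them.

```perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is UTF-8 encoded

# Both the runic identifier and the Cyrillic literal rely on "use utf8".
my $ᚠᛇᚻ = "На берегу пустынных волн";

# length() counts characters here, not the underlying UTF-8 bytes.
my $char_count = length $ᚠᛇᚻ;

# Encode explicitly on output so printing does not warn.
binmode STDOUT, ':encoding(UTF-8)';
print "$char_count characters: $ᚠᛇᚻ\n";
```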

"Wide character in print" WTF?

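The next issue is when you want to print your lovely runic scalar out and you get "Wide character in print" warnings or fscked looking text. You need to make sure that your STDOUT is set to the UTF-8 character encoding with a little

    binmode STDOUT, ':encoding(UTF-8)';

The shorter ':utf8' layer also works for output, but ':encoding(UTF-8)' is the stricter spelling and the safer habit.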
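To see what the layer actually does, here is a rough sketch that swaps STDOUT for an in-memory filehandle so the resulting bytes can be inspected. The names `$out`, `$text` and `$bytes` are made up for the example. The same `binmode` call applies to STDOUT.

```perl
use strict;
use warnings;
use utf8;

my $text = "волн";    # four characters, all outside Latin-1

# An in-memory filehandle stands in for STDOUT so we can look at the bytes.
my $bytes = '';
open my $out, '>', \$bytes or die "Can't open in-memory handle: $!";

# Without this layer, print would warn "Wide character in print".
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "Can't close in-memory handle: $!";

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters went out as %d bytes\n", length $text, length $bytes;
```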

Writing and reading filehandles

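You can of course use binmode on your filehandles, the same as with STDOUT above, but it's nicer to do it with your open statement, thus:

    open my $in, '<:encoding(UTF-8)', $file or die("Can't open that: $!");

The shorter '<:utf8' layer works too, but it does not check that the input really is valid UTF-8, whereas ':encoding(UTF-8)' does.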
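A small round-trip sketch, assuming a throwaway file from File::Temp (which ships with Perl). The variable names are illustrative. It writes a line through an encoding layer, reads it back through one, and compares the character count with the size on disk.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

# A temporary file that is removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: characters are encoded to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $file or die "Can't write $file: $!";
print {$out} "На берегу пустынных волн\n";
close $out or die "Can't close $file: $!";

# Read: bytes are decoded back into characters on the way in.
open my $in, '<:encoding(UTF-8)', $file or die "Can't read $file: $!";
my $line = <$in>;
close $in;
chomp $line;

# The file is bigger than the string is long: Cyrillic letters take two bytes each.
my $size_on_disk = -s $file;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters read back, %d bytes on disk\n", length $line, $size_on_disk;
```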

Input or output not UTF-8?

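You can use the Encode module, which ships with Perl, in a couple of helpful ways. Firstly, to decode a scalar that you know holds UTF-8 bytes, so that perl treats it as characters, like this:

    my $it_is_definitely_utf8 = decode_utf8($input);

Secondly, you can decode bytes in some other encoding into Perl characters like this:

    my $characters_now = Encode::decode( 'iso-8859-1', "\xA3999" );    # the Latin-1 bytes for £999

To go the other way, Encode::encode('UTF-8', $string) turns characters into UTF-8 bytes. You can switch a byte string between encodings in place with Encode's from_to() function. It's great.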
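The following sketch puts the Encode functions side by side. It assumes the input arrives as ISO-8859-1 bytes, and the variable names are invented for the example. Note that `from_to()` works on byte strings and modifies its first argument in place.

```perl
use strict;
use warnings;
use Encode qw(decode encode decode_utf8 from_to);

# Bytes as they might arrive from a Latin-1 source: 0xA3 is the pound sign.
my $latin1_bytes = "\xA3" . "999";

# decode: bytes in a known encoding -> Perl characters
my $chars = decode('iso-8859-1', $latin1_bytes);

# encode: Perl characters -> bytes in the target encoding
my $utf8_bytes = encode('UTF-8', $chars);

# decode_utf8: shorthand for decoding UTF-8 bytes back to characters
my $round_trip = decode_utf8($utf8_bytes);

# from_to: convert a byte string from one encoding to another, in place
my $converted = $latin1_bytes;
from_to($converted, 'iso-8859-1', 'UTF-8');

printf "Latin-1: %d bytes, UTF-8: %d bytes, characters: %d\n",
    length $latin1_bytes, length $utf8_bytes, length $chars;
```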

Convert to numeric or HTML entities

The right way to do it is probably to use HTML::Entities. But you can do it like this if you are brave:

    $ᚠᛇᚻ =~ s/([^[:ascii:]])/'&#' . ord($1) . ';'/ge;
    say $ᚠᛇᚻ;

Using chr() to convert back is left as an exercise for the reader.
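For the impatient, here is one possible sketch of the round trip, using ASCII variable names chosen for the example. The reverse step only handles decimal entities of the form `&#NNN;`, and it will also convert any such entities that were already in the original text, so treat it as an illustration rather than a replacement for HTML::Entities.

```perl
use strict;
use warnings;
use utf8;

my $text = "£999 волн";

# Forward: every non-ASCII character becomes a decimal numeric entity.
(my $escaped = $text) =~ s/([^[:ascii:]])/'&#' . ord($1) . ';'/ge;

# Back: decimal numeric entities become characters again via chr().
(my $restored = $escaped) =~ s/&#(\d+);/chr($1)/ge;

binmode STDOUT, ':encoding(UTF-8)';
print "$escaped\n$restored\n";
```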

Can't be arsed? Assume everything is UTF-8

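Chances are that you can get away with assuming that everything that goes into or out of your program is going to be UTF-8 anyway. There are a couple of ways to do that. Set the environment variable PERL_UNICODE to S, which covers STDIN, STDOUT and STDERR. Or call your perl with the -C switch:

    #!/usr/bin/perl -C

Job done. One catch: since perl 5.10.1, -C on the #! line only takes effect if it is also on the command line, which is what happens when you execute the script directly. Run it as perl script.pl and perl dies with a "Too late for -C" error unless you pass -C there too.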
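If you would rather keep that assumption inside the script than in the environment or the #! line, the `open` pragma is another option. This is a different technique from the two named above, sketched here with a File::Temp file and made-up variable names. It sets a default UTF-8 layer for the standard streams and for filehandles opened in its lexical scope.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

# Default to UTF-8 for STDIN/STDOUT/STDERR (:std) and for open() calls in this scope.
use open qw(:std :encoding(UTF-8));

my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# No explicit layer in either open: the pragma supplies it.
open my $out, '>', $file or die "Can't write $file: $!";
print {$out} "волн";
close $out or die "Can't close $file: $!";

open my $in, '<', $file or die "Can't read $file: $!";
my $line = <$in>;
close $in;

my $bytes_written = -s $file;
printf "%d characters, %d bytes on disk\n", length $line, $bytes_written;
```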

Final thoughts

There is much more to consider with character encoding, but these are the problems I hit again and again when I write Perl. In general I’ve found it less gnarly to fix things than in other languages, especially if you read the perlunifaq.

