Character encoding. It's enough to make grown people break down and weep bitter tears of rage. Perl provides rather good UTF-8 support, and there are some excellent tutorials out there. So my contribution is more of a checklist and cookbook than a technical explanation.
UTF-8 in your source code
If you need to include UTF-8 characters in your actual code, you can just
use utf8;
Simple enough and means that you can use crazy identifiers in your source. Like this
my $ᚠᛇᚻ = "На берегу пустынных волн";
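To see what the pragma actually changes, here is a minimal sketch. It assumes the file is saved as UTF-8, and the variable names are mine. Because use utf8 is lexically scoped, the same literal can be measured both ways in one script.

```perl
use strict;
use warnings;

# The utf8 pragma is lexically scoped, so the same literal can be
# measured with and without it (assumes this file is saved as UTF-8).
my $with_pragma    = do { use utf8; length "волна" };   # counts characters
my $without_pragma = do { no utf8;  length "волна" };   # counts raw bytes

print "with use utf8: $with_pragma, without: $without_pragma\n";
# prints "with use utf8: 5, without: 10"
```

Without the pragma perl sees ten bytes rather than five letters, which is where a lot of the later weirdness starts.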
"Wide character in print" WTF?
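The next issue is when you want to print your lovely runic scalar out and you get "Wide character in print" warnings or fscked-looking text. You need to make sure that your STDOUT is set to the UTF-8 character encoding with a little
binmode STDOUT, ':encoding(UTF-8)';
The layer name needs its leading colon. The shorter ':utf8' layer also works here, but ':encoding(UTF-8)' is stricter about malformed data.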
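A rough sketch of the fix in context. The bytes_written helper is hypothetical, something I made up for illustration: it pushes a string through the same kind of layer into an in-memory filehandle, so you can see how many bytes the layer produces for a given string.

```perl
use strict;
use warnings;
use utf8;

# Set the layers once, near the top of the program, before any output.
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';    # warnings and die messages too

my $text = "ᚠᛇᚻ";                      # three characters

# Hypothetical helper: write a string through an encoding layer into an
# in-memory filehandle and return the number of bytes produced.
sub bytes_written {
    my ($string) = @_;
    open my $fh, '>', \my $buffer or die "Can't open in-memory handle: $!";
    binmode $fh, ':encoding(UTF-8)';
    print {$fh} $string;
    close $fh;
    return length $buffer;
}

print "$text is ", length($text), " characters, ",
    bytes_written($text), " bytes\n";
```

Each rune takes three bytes in UTF-8, so the three-character string goes out as nine bytes. No warning this time.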
Writing and reading filehandles
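You can of course use binmode on your filehandles, the same as with STDOUT above, but it's nicer to do it with your open statement. For reading, prefer ':encoding(UTF-8)' over the shorter ':utf8', because it validates what comes in. Thus
open my $in, '<:encoding(UTF-8)', $file or die("Can't open that: $!");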
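Here is a small round-trip sketch: write a line out through an encoding layer, then read it back through one. The scratch file from File::Temp and the variable names are my own choices for the example, not anything the technique requires.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

my $line = "На берегу пустынных волн";

# Scratch file so the sketch does not touch real data.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: characters are encoded to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $file or die "Can't write $file: $!";
print {$out} "$line\n";
close $out or die "Can't close $file: $!";

# Read: bytes are decoded back to characters on the way in.
open my $in, '<:encoding(UTF-8)', $file or die "Can't read $file: $!";
my $read_back = <$in>;
close $in;
chomp $read_back;

print "round trip ", ($read_back eq $line ? "ok" : "FAILED"), "\n";
```

If you drop the layer from either open, the comparison fails, which makes this a handy smoke test when something upstream is mangling your text.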
Input or output not UTF-8?
You can use the Encode module (it ships with perl, so there is nothing to install from the CPAN) in a couple of helpful ways. Firstly, to decode a scalar that you know holds UTF-8 bytes into a proper character string, like this
my $it_is_definitely_utf8 = decode_utf8($input);
Secondly, you can decode bytes from some other encoding into a character string like this
my $utf_8_now = Encode::decode( 'iso-8859-1', "\x{A3}999" ); # 0xA3 is £ in ISO-8859-1
You can switch between encodings with Encode's from_to() function, which converts the bytes in place. It's great.
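Putting those together, a minimal sketch using a Latin-1 pound sign written as an explicit byte, so nothing depends on how this file itself is saved. Variable names are mine. Note that from_to() has to be imported by name, and that it works on bytes, not on decoded character strings.

```perl
use strict;
use warnings;
use Encode qw(decode encode decode_utf8 from_to);

# Raw bytes as they might arrive from a Latin-1 source: 0xA3 is the pound sign.
my $latin1_bytes = "\x{A3}999";

# Bytes in, characters out.
my $price = decode('iso-8859-1', $latin1_bytes);    # 4 characters

# Characters in, UTF-8 bytes out: the pound sign becomes 0xC2 0xA3.
my $utf8_bytes = encode('UTF-8', $price);           # 5 bytes

# decode_utf8 turns bytes known to be UTF-8 back into characters.
my $again = decode_utf8($utf8_bytes);

# from_to converts a byte string in place and returns the new byte length.
my $octets     = "\x{A3}999";
my $new_length = from_to($octets, 'iso-8859-1', 'UTF-8');

printf "%d characters, %d UTF-8 bytes, from_to gave %d bytes\n",
    length($price), length($utf8_bytes), $new_length;
# prints "4 characters, 5 UTF-8 bytes, from_to gave 5 bytes"
```

The rule of thumb I try to stick to: decode as soon as data comes in, work with characters in the middle, encode just before it goes out.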
Convert to numeric or HTML entities
The right way to do it is probably to use HTML::Entities. But you can do it like this if you are brave
$ᚠᛇᚻ =~ s/([^[:ascii:]])/'&#' . ord($1) . ';'/ge;
say $ᚠᛇᚻ;
(say needs use feature 'say' or use v5.10.) Using chr() to convert back is left as an exercise for the reader.
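For readers who would rather skip the exercise, here is one possible sketch of the round trip. The two sub names are made up for illustration, and the decoder only understands decimal numeric entities, not hex (&#x...;) or named ones, which is one more reason to reach for HTML::Entities in real code.

```perl
use strict;
use warnings;

# Replace every non-ASCII character with a decimal numeric entity.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Reverse it: turn decimal numeric entities back into characters.
sub from_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $runes   = "\x{16A0}\x{16C7}\x{16BB}";
my $encoded = to_numeric_entities($runes);
my $decoded = from_numeric_entities($encoded);

print "$encoded\n";    # prints "&#5792;&#5831;&#5819;"
```

The encoded form is pure ASCII, so it prints cleanly even on a filehandle with no encoding layer at all.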
Can't be arsed? Assume everything is UTF-8
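Chances are that you can get away with assuming that everything that goes into or out of your program is going to be UTF-8 anyway. There are a couple of ways to do that. Set the environment variable PERL_UNICODE to S, which covers STDIN, STDOUT and STDERR. Or call your perl with the -C switch (a bare -C means the same as -CSDL)
#!/usr/bin/perl -C
Job done. One catch: since perl 5.10.1, a -C on the #! line only works if perl is also started with -C on the command line, so running the script as perl script.pl dies with a 'Too late for "-C" option' error.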
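If you are not sure whether the switch or the environment variable actually took effect, perl exposes the result in the read-only ${^UNICODE} variable. A small sketch that turns its bits back into flag letters follows. The unicode_flags helper is hypothetical, the bit values are the ones listed in perlrun, and I have left out the rarely used a flag.

```perl
use strict;
use warnings;

# Bit values documented for -C / PERL_UNICODE in perlrun.
my %flag_for = (
    I => 1,     # STDIN is UTF-8
    O => 2,     # STDOUT is UTF-8
    E => 4,     # STDERR is UTF-8
    i => 8,     # UTF-8 is the default layer for input streams
    o => 16,    # UTF-8 is the default layer for output streams
    A => 32,    # @ARGV elements are UTF-8
    L => 64,    # apply the I/O flags only if the locale says UTF-8
);

# Hypothetical helper: list the flag letters set in a ${^UNICODE} value.
sub unicode_flags {
    my ($bits) = @_;
    return join '', grep { $bits & $flag_for{$_} } qw(I O E i o A L);
}

print "Active -C flags: ", unicode_flags(${^UNICODE}) || '(none)', "\n";
```

With PERL_UNICODE=S you should see IOE, because S is just shorthand for those three.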
There is much more to consider with character encoding, but these are the problems I hit again and again when I write Perl. In general I've found it less gnarly to fix things than in other languages, especially if you read the perlunifaq.