Why PERL_UNICODE makes me SAD

When I first got a bug report that Capture::Tiny was breaking under PERL_UNICODE=SAD, I thought it would be an easy thing to fix. I was so wrong… I had no idea what a rabbit hole I was in for.

What the heck is PERL_UNICODE?

Unless you’re American, you’ve probably heard of Unicode. Even if you’re American, hopefully by now you’ve realized that a lot of the world uses languages that require more than the ASCII character set. And if you use Perl, you might be aware that Perl has remarkably good Unicode support. (See the Unicode Support Shootout slides.)

The PERL_UNICODE environment variable provides a default for the -C command line argument to the Perl interpreter, which can set UTF-8 translation layers on various filehandles (and command line arguments).

Specifically, PERL_UNICODE=SAD means that Perl should add the :utf8 layer to the Standard I/O handles (STDIN, STDOUT, and STDERR), treat the Argument list (@ARGV) as UTF-8, and make :utf8 the Default layer for any other handles opened as well.
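To see what that setting actually does, here is a minimal sketch that runs a child perl with PERL_UNICODE=SAD and asks it to report the ${^UNICODE} flags and the layers on its STDOUT. The helper name report_under is made up for illustration; the bit values in the comments are the ones perlrun documents for -C.

```perl
use strict;
use warnings;

# Hypothetical helper: run a child perl with PERL_UNICODE set to $setting
# and return the ${^UNICODE} value and STDOUT layers that it reports.
sub report_under {
    my ($setting) = @_;
    local $ENV{PERL_UNICODE} = $setting;
    my $code = 'print ${^UNICODE}, " ", join(",", PerlIO::get_layers(*STDOUT))';
    open(my $child, '-|', $^X, '-e', $code)
        or die "Cannot run $^X: $!";
    my $report = <$child>;
    close($child) or die "Child perl failed";
    return split / /, $report, 2;
}

my ($flags, $layers) = report_under('SAD');

# -C bit values per perlrun: S = 7 (I=1, O=2, E=4), D = 24 (i=8, o=16), A = 32
printf "flags:  %d\n", $flags;
printf "S set:  %s\n", ($flags & 7)  == 7  ? 'yes' : 'no';
printf "A set:  %s\n", ($flags & 32) == 32 ? 'yes' : 'no';
printf "D set:  %s\n", ($flags & 24) == 24 ? 'yes' : 'no';
print  "STDOUT layers in child: $layers\n";
```

Running the child through a list-form pipe open keeps the shell out of the picture, so the only thing that differs from a normal run is the environment variable.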

Is PERL_UNICODE a good idea?

Maybe. On the one hand, if you work in a world that is exclusively ASCII or Unicode I/O, then you can make a lot of input and output “just work”.

That strength is also the weakness. PERL_UNICODE has a global effect!

Can you be sure that every module you use is ready to have :utf8 on any handles they open? Are you sure that any modules that reopen standard handles set them back correctly later? Turning on :utf8 globally is a huge bet, with odds that get worse the larger your dependency chain is.

[I can tell you from experience that almost no code on CPAN properly understands how to record the layers on a handle and reapply them to another. Capture::Tiny does, except when it’s actually impossible, since tied handles can’t report layers correctly.]
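For the curious, recording layers and reapplying them looks roughly like the sketch below. This is not Capture::Tiny's actual code, just a simplified illustration: copy_layers is a hypothetical helper, and it assumes ordinary file handles whose base layers are "unix" and "perlio".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: make $to carry the same layers as $from.
# Simplification: assumes plain file handles (not tied ones) whose
# base layers are "unix" and "perlio", which are left alone.
sub copy_layers {
    my ($from, $to) = @_;
    my @extra = grep { $_ ne 'unix' && $_ ne 'perlio' }
                PerlIO::get_layers($from);

    # Reset the target first so stale layers (e.g. a global :utf8) go away
    binmode($to, ':raw') or die "binmode :raw failed: $!";
    if (@extra) {
        binmode($to, ':' . join(':', @extra))
            or die "binmode failed: $!";
    }
    return @extra;
}

my ($src, $src_path) = tempfile(UNLINK => 1);
my ($dst, $dst_path) = tempfile(UNLINK => 1);
binmode($src, ':utf8');   # source handle has :utf8
binmode($dst, ':raw');    # target handle starts without it

copy_layers($src, $dst);
print "src: @{[ PerlIO::get_layers($src) ]}\n";
print "dst: @{[ PerlIO::get_layers($dst) ]}\n";
```

The reset to :raw before reapplying matters: without it, a handle that picked up :utf8 from a global default would keep it even when the source handle has none.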

Capture::Tiny and PERL_UNICODE walk into a bar…

The bug report I got for Capture::Tiny regarded a failure in one particular test file, when PERL_UNICODE=SAD was set globally in the environment. As I dug into the bug report, it became clear that the bug was being triggered only under these conditions:

  • Perl prior to v5.12
  • PERL_UNICODE=D
  • STDIN closed
  • Capture::Tiny trying to tee() output

The good news was that newer Perls were unaffected. The bad news was that I couldn’t figure out why it was happening.

Not only was it breaking under those conditions, it was weird.

Down the rabbit hole

One of the strange things happening was that a “no output” capture test was capturing the contents of the utf8.pm file in the Perl core. WTF? Something about PERL_UNICODE was loading utf8.pm, and with STDIN closed, the handle for that file wound up on file descriptor 0, confusing Capture::Tiny. Sticking require utf8; early in the test code “fixed” that problem.
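How does an unrelated file land on file descriptor 0? On POSIX systems a newly opened file gets the lowest free descriptor, so once STDIN is closed, the next open can take over fd 0. Here is a small sketch of that effect, using a temp file as a stand-in for utf8.pm (it says nothing about what the Perl core was doing internally):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create the temp file while STDIN still occupies fd 0
my ($tmp, $path) = tempfile(UNLINK => 1);
my $tmp_fd = fileno($tmp);

# Free up file descriptor 0
close(STDIN) or die "Cannot close STDIN: $!";

# The next open takes the lowest free descriptor
open(my $fh, '<', $path) or die "Cannot open $path: $!";
my $new_fd = fileno($fh);

print "handle opened before closing STDIN is on fd $tmp_fd\n";
print "handle opened after closing STDIN is on fd $new_fd\n";
```

Any code that later assumes fd 0 is still "standard input" will now be reading from that file instead.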

Even after that fix, it looked like the test was leaking a filehandle. Something else was grabbing file descriptor 0 in the middle of a tee() and not letting go.

Given that leak, it wasn’t just a matter of taking into account the global presence of :utf8 layers – something more fundamental was going wrong.

Knowing when to punt

Reading Perl release notes and grepping through Perl core commit logs wasn’t giving me any insight into what changed. Git bisection of the core turned into a huge headache. I quickly got to the point where I decided I was spending more time on this than the problem was worth.

Since the issue was a real corner case and only on very old Perls, I decided to document it as a known issue, bypass the failing tests under the triggering condition, and ship a new release to CPAN. Oh, well.

Lessons to learn from this

Be careful with global effects! A global setting might seem like an easy fix, but it puts your entire codebase at risk. It’s much smarter to fix your code locally where you do I/O. Even the open pragma is a better choice than PERL_UNICODE, since you can limit the scope of change to the parts of your code that are actually doing I/O.
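As a rough illustration of "fix it locally", the sketch below sets the encoding in two scoped ways: an explicit layer on a single open(), and the open pragma confined to one block. The sample text and variable names are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp) or die "Cannot close temp file: $!";

my $text = "caf\x{e9} \x{263A}\n";   # 7 characters, 9 bytes as UTF-8

# Option 1: name the layer explicitly on the handle that needs it
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close($out) or die "Cannot close $path: $!";

# Option 2: the open pragma, limited to this block
my $decoded;
{
    use open IN => ':encoding(UTF-8)';
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    $decoded = <$in>;
    close($in);
}

# For comparison, read the same file as raw bytes
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = <$raw>;
close($raw);

printf "decoded: %d characters\n", length $decoded;
printf "raw:     %d bytes\n",      length $bytes;
```

Either way, code outside that open() call or that block is untouched, which is exactly what PERL_UNICODE can’t promise.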

The real insight I got from this is how important it is to test under production conditions. If you do use PERL_UNICODE=SAD in production, it’s a very good idea to do your development and testing with that set as well. It will help you find modules that aren’t happy with it.

Finally, this is a great example of why upgrading Perl is a good idea. Hundreds (thousands?) of bugs have been fixed since 5.8 or 5.10. The longer you wait to upgrade, the longer you’ll have to suffer them.

Summary

  • PERL_UNICODE has a global effect, applying :utf8 layers to handles automatically
  • Global effects can have unexpected side effects
  • Avoid global effects if you can
  • If you must use global effects, test your dependencies under the same conditions
  • Upgrade your Perl