perl-unicode

Re: making utf8-clean CPAN distributions

2004-12-13 16:30:06
Thanks to both David E. Wheeler and David Graff for their answers.

I have now uploaded my first distro that takes advantage of them, Locale-KeyedText-1.01_1.tar.gz, which requires 5.008001.

All files in my distro are officially UTF-8, without a BOM, and with unix line breaks; this is identical to ASCII except where non-ASCII characters are used. All code files "use utf8" and all files having pod use "=encoding utf8".

I tested this distro and it passes its test suite fine, both under Perl 5.8.1RC3 and Perl 5.8.6. However, under both versions, the 'make' stage gives the following error when manifying the POD:

Manifying blib/man3/Locale::KeyedText.3
lib/Locale/KeyedText.pm:10: Unknown command paragraph "=encoding utf8"

Now, the line that is being flagged is identical to both the online perlpod documentation and the example given by David Wheeler. So what is the problem here?

Besides CPAN, the file is also available here:
http://darrenduncan.net/d/perl/Locale-KeyedText-1.01_1.tar.gz

Thanks for any feedback on the error or other matters.

-- Darren Duncan

At 11:07 PM -0800 12/12/04, David E. Wheeler wrote (on-list):
On Dec 12, 2004, at 10:06 PM, Darren Duncan wrote:

What I would like to do is create my CPAN module distributions such that all of the files in each distro, code and documentation and tests and logs alike, are properly UTF-8 encoded, and do this in such a way that no modern Perl distributions or the automated CPAN tools will break.

Short answer:

use utf8;

=pod

=encoding utf8

=cut

Regards,

David

At 1:20 PM -0500 12/13/04, David Graff wrote (off-list):
darren(_at_)DarrenDuncan(_dot_)net said:
 0. For my main question, is distribution as Unicode files a good idea at
 all currently, though few if any people do it?

It's a good idea if you need to include text/character data that fall
outside the ASCII range (e.g. pod in languages other than English, etc).
Otherwise, since ASCII is a proper subset of utf8, every ASCII-only distro
is, by definition, a utf8 distro.
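
To illustrate the subset relationship, here is a minimal sketch (not part of the original exchange) that classifies a string of raw octets as pure ASCII, valid UTF-8, or something else. The helper name classify_octets is hypothetical; it relies only on the core Encode module's decode function and FB_CROAK constant.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: classify raw octets read from a file.
sub classify_octets {
    my ($octets) = @_;

    # No octet above 0x7F: the data is ASCII, and therefore also valid UTF-8.
    return 'ascii' if $octets !~ /[^\x00-\x7F]/;

    # decode() with FB_CROAK dies on malformed input and may modify its
    # argument, so work on a copy.
    my $copy = $octets;
    my $ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    return $ok ? 'utf8' : 'other';
}

print classify_octets("plain text\n"), "\n";   # ascii
print classify_octets("caf\xC3\xA9\n"), "\n";  # utf8 (e-acute as two octets)
print classify_octets("caf\xE9\n"), "\n";      # other (Latin-1 e-acute)
```

A file that comes back as 'ascii' needs no conversion at all to qualify as a utf8 file.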

 1. BBEdit gives me an option to have a byte-order mark in UTF-8 files
 (that happens to be 3 octets long I think), with the recommendation
 being to use it; I also have the choice not to, which makes the file
 more similar to many other ASCII-like encodings.  So should I save the
 files with the BOM or without?

The BOM comes in very handy for UTF-16 data, and I suppose there may be
some apps that will check the top of a file for a BOM (as LE, BE, or the
three-byte utf8 pattern) in order to "predict" that the file contains the
corresponding sort of unicode data.  But Perl is not one of those apps, nor
are any of the tools that are normally used to install Perl modules. Since
you're not ever using UTF-16, and module recipients won't be either, the
BOM will just get in the way.  Leave it out.
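
For anyone whose editor has already written a BOM, the following is a minimal sketch (not part of the original exchange) of detecting and removing the three-octet UTF-8 BOM, EF BB BF, from raw file content. The helper name strip_utf8_bom is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM (octets EF BB BF).
# Returns the cleaned octets and a flag saying whether a BOM was present.
sub strip_utf8_bom {
    my ($octets) = @_;
    my $had_bom = ($octets =~ s/\A\xEF\xBB\xBF//) ? 1 : 0;
    return ($octets, $had_bom);
}

my ($clean, $had_bom) = strip_utf8_bom("\xEF\xBB\xBF#!/usr/bin/perl\n");
print $had_bom ? "BOM removed\n" : "no BOM\n";
print "file now starts with the shebang\n" if $clean =~ /\A#!/;
```

With the BOM gone, a leading "#!" is once again the first two octets of the file.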

 2. I am given a separate option to use either Unicode linebreaks or one
 of Unix/Mac/Win; all 4 are given as options to use with a Unicode
 encoding.  In my own tests, Perl 5.8 complained when the Unicode line
 break was used with UTF8, but not the Unix line break (I was not,
 however, using any special pragmas).  So should I use the Unicode
 linebreak or the Unix linebreak, assuming the former can be made to
 work?

 2.1 Will the addition of "use utf8" on the first line of a Perl file
 cause Perl to accept files with Unicode line breaks?

I can't imagine what a "unicode linebreak" in utf8 would be.  Does BBEdit
indicate a code point for this?  In any case, I'd stick with the unix line
breaks (simply \xA), because all the tools normally used to install modules
will recognize and handle this correctly.
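
One way to check what an editor actually wrote is to scan the raw octets for anything other than a bare \x0A. The sketch below is not part of the original exchange, and it assumes that a "Unicode line break" means U+2028 LINE SEPARATOR (or U+2029 PARAGRAPH SEPARATOR), which encode in UTF-8 as E2 80 A8 and E2 80 A9; whether that is what BBEdit means is a guess. The helper name find_non_unix_breaks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count line-break sequences other than a bare LF.
# Returns a hash reference mapping a label to the number of occurrences.
sub find_non_unix_breaks {
    my ($octets) = @_;
    my %found;
    $found{'CRLF'}++   while $octets =~ /\x0D\x0A/g;      # Windows
    $found{'CR'}++     while $octets =~ /\x0D(?!\x0A)/g;  # classic Mac
    $found{'U+2028'}++ while $octets =~ /\xE2\x80\xA8/g;  # assumed "Unicode" break
    $found{'U+2029'}++ while $octets =~ /\xE2\x80\xA9/g;
    return \%found;
}

my $report = find_non_unix_breaks("one\x0D\x0Atwo\x0Dthree\xE2\x80\xA8four\x0A");
print "$_: $report->{$_}\n" for sort keys %$report;
```

An empty result means the data uses only unix line breaks.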

 3. Can a "use utf8" be put anywhere besides the first line of a file?
 What if I customarily put POD on the first few lines and the package
 declaration beneath it?  Also, in a script file, which goes first, the
 #!perl or the use-utf8?

The shebang line goes first, always -- in fact, this is a good reason to
forget about including the BOM in your distro files.  For unix shells to
use the shebang line properly, the two characters "#!" must be the first
ones in the file.

 4. What about plain POD files?  Since they contain no Perl code, will POD
 extractors know what to do since I can't put the use-utf8 in them?

The interpretation of characters in pod will depend on the display
mechanism being used by the person who runs perldoc.  If the pod includes
utf8 text, and the person runs perldoc in a utf8-capable window (with the
appropriate font(s) available), everything should go just fine, maybe.
Best to just try it out and see what happens.  Results might depend on
environmental things like locale setting, etc.  Since there are tools that
will convert pod to html, etc, it would be worthwhile to see how these
work when the pod contains utf8.  Again, try it and see.

 5. Would the CPAN compare utility adapt to encoding changes, or would it
 consider an otherwise-identical file with different encodings to consist
 of one very large change?

Who said anything about having the same module posted on CPAN with
different encodings?  Why do that?  I wouldn't expect CPAN diff tools to
handle this sort of case by trying to factor out encoding differences --
if there ever is a good reason to post the same module with different
encodings, then it's likely the different versions should be treated as
different.

 6. In general, would anything on CPAN break?  What about the automated
 testers?

 7. Are there any other common issues that I should be aware of, and if
 so then what?

In general, CPAN is just a repository of tar files.  What's to break?
Try it out and see.

        David Graff
