darren(_at_)DarrenDuncan(_dot_)net said:
0. For my main question, is distribution as Unicode files a good idea at
all currently, though few if any people do it?
It's a good idea if you need to include text/character data that fall
outside the ASCII range (e.g. pod in languages other than English, etc).
Otherwise, since ASCII is a proper subset of utf8, every ASCII-only distro
is, by definition, a utf8 distro.
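
If you want to confirm that a given file really is ASCII-only (and so is already valid utf8 as it stands), a byte-level check is enough. Here is a minimal sketch; the helper names is_ascii_only and file_is_ascii_only are made up for illustration and are not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string holds only 7-bit ASCII,
# in which case the very same bytes are also valid UTF-8.
sub is_ascii_only {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/;
}

# Hypothetical helper: read a file as raw bytes and apply the check.
sub file_is_ascii_only {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp the whole file in one read
    my $bytes = <$fh>;
    close $fh;
    return is_ascii_only($bytes);
}
```

Running file_is_ascii_only over each file in a distro would tell you which files, if any, actually depend on the encoding choice.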
1. BBEdit gives me an option to have a byte-order mark in UTF-8 files
(that happens to be 3 octets long I think), with the recommendation
being to use it; I also have the choice not to, which makes the file
more similar to many other ASCII-like encodings. So should I save the
files with the BOM or without?
The BOM comes in very handy for UTF-16 data, and I suppose there may be
some apps that will check the top of a file for a BOM (as LE, BE, or the
three-byte utf8 pattern) in order to "predict" that the file contains the
corresponding sort of unicode data. But Perl is not one of those apps, nor
are any of the tools that are normally used to install Perl modules. Since
you're not ever using UTF-16, and module recipients won't be either, the
BOM will just get in the way. Leave it out.
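
To check whether an editor has slipped a BOM into a file, you can look at the first few bytes. This is only a sketch, and leading_bom is a hypothetical name; it covers the three patterns mentioned above and does not try to recognize UTF-32:

```perl
use strict;
use warnings;

# Hypothetical helper: report which BOM, if any, starts a byte string.
# Only the UTF-8 and UTF-16 patterns are checked; UTF-32 is ignored.
sub leading_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return '';        # no BOM found
}
```

An empty return value for every file in the distro means nothing sits in front of the first real character.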
2. I am given a separate option to use either Unicode linebreaks or one
of Unix/Mac/Win; all 4 are given as options to use with a Unicode
encoding. In my own tests, Perl 5.8 complained when the Unicode line
break was used with UTF8, but not the Unix line break (I was not,
however, using any special pragmas). So should I use the Unicode
linebreak or the Unix linebreak, assuming the former can be made to
work?
2.1 Will the addition of "use utf8" on the first line of a Perl file
cause Perl to accept files with Unicode line breaks?
I can't imagine what a "unicode linebreak" in utf8 would be. Does BBEdit
indicate a code point for this? In any case, I'd stick with the unix line
breaks (a single line feed, \x0A), because all the tools normally used to
install modules will recognize and handle this correctly.
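
To see which kind of line break a file actually contains, you can count the candidates at the byte level. The sketch below assumes, and this is only a guess, that the editor's "Unicode linebreak" is U+2028 LINE SEPARATOR, which is the three bytes E2 80 A8 in utf8. The name line_break_counts is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: count each style of line break in a byte string.
# ASSUMPTION: "Unicode linebreak" means U+2028, i.e. bytes E2 80 A8 in UTF-8.
sub line_break_counts {
    my ($bytes) = @_;
    my %count;
    $count{crlf}  = () = $bytes =~ /\x0D\x0A/g;        # Win style
    $count{cr}    = () = $bytes =~ /\x0D(?!\x0A)/g;    # old Mac style
    $count{lf}    = () = $bytes =~ /(?<!\x0D)\x0A/g;   # Unix style
    $count{u2028} = () = $bytes =~ /\xE2\x80\xA8/g;    # assumed Unicode style
    return \%count;
}
```

A file saved with unix line breaks should show a nonzero count only under lf.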
3. Can a "use utf8" be put anywhere besides the first line of a file?
What if I customarily put POD on the first few lines and the package
declaration beneath it? Also, in a script file, which goes first, the
#!perl or the use-utf8?
The shebang line goes first, always -- in fact, this is a good reason to
forget about including the BOM in your distro files. For unix shells to
use the shebang line properly, the two characters "#!" must be the first
ones in the file.
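
As for where "use utf8" can go: the pragma is lexically scoped, so it only has to appear before the first non-ASCII character in the code it should cover. A layout along these lines should work; this is a sketch that assumes the file is saved as utf8 without a BOM, and the script name and contents are made up for illustration:

```perl
#!/usr/bin/perl

=head1 NAME

hello_utf8 - hypothetical script, for illustration only

=cut

use strict;
use warnings;
use utf8;    # from here on, string literals in the source are read as UTF-8

# With "use utf8" in effect this literal is four characters, not five bytes.
my $word = "café";

binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output
print "$word has ", length($word), " characters\n";
```

The shebang stays on the very first line, the pod can sit above the pragma, and the pragma just needs to precede any literal that is not plain ASCII.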
4. What about plain POD files? Since they contain no Perl code, will POD
extractors know what to do since I can't put the use-utf8 in them?
The interpretation of characters in pod will depend on the display
mechanism being used by the person who runs perldoc. If the pod includes
utf8 text, and the person runs perldoc in a utf8-capable window (with the
appropriate font(s) available), everything should probably go just fine.
Best to just try it out and see what happens. Results might depend on
environmental things like the locale setting. Since there are tools that
will convert pod to html and other formats, it would be worthwhile to see
how these work when the pod contains utf8. Again, try it and see.
5. Would the CPAN compare utility adapt to encoding changes, or would it
consider an otherwise-identical file with different encodings to consist
of one very large change?
Who said anything about having the same module posted on CPAN with
different encodings? Why do that? I wouldn't expect CPAN diff tools to
handle this sort of case by trying to factor out encoding differences --
if there ever is a good reason to post the same module with different
encodings, then it's likely the different versions should be treated as
different.
6. In general, would anything on CPAN break? What about the automated
testers?
7. Are there any other common issues that I should be aware of, and if
so then what?
In general, CPAN is just a repository of tar files. What's to break?
Try it out and see.
David Graff