I'm looking for help from you guys on an important forward-looking question.
What I would like to do is create my CPAN module distributions such
that all of the files in each distro, code and documentation and
tests and logs alike, are properly UTF-8 encoded, and do this in such
a way that no modern Perl distributions or the automated CPAN tools
will break. Note that my modules in question are so bleeding edge
that I don't expect to have any legacy users yet to worry about
breaking compatibility with. It's a given that I'm using UTF-8,
which is supposed to be network safe and to have only a single
byte-order; I don't plan to touch UTF-16 or any other variations.
For tools, I have on my machine Mac OS X 10.3.6 (operating system),
and BBEdit 8.0.3 (text editor), both of which are fully Unicode
aware. I'm also using the latest Perl, version 5.8.6, compiled
myself with the OS-bundled dev tools such as GCC 3.3. I am
making the distributions formally require 5.008.
0. My main question: is distributing files as Unicode a good idea
at all currently, given that few if any people do it?
The following questions only apply if the answer to the above is "yes".
1. BBEdit gives me an option to have a byte-order mark in UTF-8 files
(that happens to be 3 octets long I think), with the recommendation
being to use it; I also have the choice not to, which makes the file
more similar to many other ASCII-like encodings. So should I save
the files with the BOM or without?
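To make question 1 concrete: as far as I know, the UTF-8 byte-order mark is the three octets EF BB BF (the UTF-8 encoding of U+FEFF). Below is a minimal sketch, not a definitive tool, of how I could check whether a given file was saved with it. The function names has_utf8_bom and file_has_utf8_bom are hypothetical ones I made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the octet string starts with the
# UTF-8 encoding of U+FEFF (octets EF BB BF), otherwise 0.
sub has_utf8_bom {
    my ($octets) = @_;
    return substr($octets, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Hypothetical helper: read the first three octets of a file in raw
# (undecoded) mode and test them for the BOM.
sub file_has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);
    close $fh;
    return (defined $got && $got == 3 && has_utf8_bom($head)) ? 1 : 0;
}
```

Running file_has_utf8_bom over every file in a distro before upload would at least tell me whether the editor setting was applied consistently.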
2. I am given a separate option to use either Unicode linebreaks or
one of Unix/Mac/Win; all 4 are given as options to use with a Unicode
encoding. In my own tests, Perl 5.8 complained when the Unicode line
break was used with UTF-8, but not with the Unix line break (I was not,
however, using any special pragmas). So should I use the Unicode
linebreak or the Unix linebreak, assuming the former can be made to
work?
2.1 Will the addition of "use utf8" on the first line of a Perl file
cause Perl to accept files with Unicode line breaks?
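In case it helps anyone reproduce my test for question 2, here is a rough sketch for counting which line-break styles a file's raw octets contain. I am assuming that the "Unicode line break" BBEdit offers is U+2028 LINE SEPARATOR, which in UTF-8 is the octets E2 80 A8; I have not confirmed that. The name count_line_breaks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count line-break styles in a string of raw octets.
# Assumes (unconfirmed) that the editor's "Unicode line break" is U+2028,
# encoded in UTF-8 as the octets E2 80 A8.
sub count_line_breaks {
    my ($octets) = @_;
    my %count;
    # The "= () =" idiom forces list context so we get the match count.
    $count{crlf}  = () = $octets =~ /\x0D\x0A/g;        # Windows
    $count{lf}    = () = $octets =~ /(?<!\x0D)\x0A/g;   # Unix (bare LF)
    $count{cr}    = () = $octets =~ /\x0D(?!\x0A)/g;    # classic Mac (bare CR)
    $count{u2028} = () = $octets =~ /\xE2\x80\xA8/g;    # assumed Unicode break
    return \%count;
}
```

The input must be the undecoded octets of the file (for example, read through a ':raw' layer), otherwise the E2 80 A8 pattern will not match.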
3. Can a "use utf8" be put anywhere besides the first line of a file?
What if I customarily put POD on the first few lines and the package
declaration beneath it? Also, in a script file, which goes first,
the "#!perl" line or the "use utf8"?
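To illustrate question 3, this is roughly the file layout I have in mind, saved as UTF-8. It is only a sketch: the package name My::Example and the greeting sub are hypothetical, and whether this placement of the pragma is acceptable (in particular for any non-ASCII text in the POD above it) is exactly what I am asking.

```perl
#!perl

=head1 NAME

My::Example - hypothetical module name, for illustration only

=cut

package My::Example;
use strict;
use warnings;
use utf8;    # placed after the POD and the package declaration

# A string literal containing one non-ASCII character (U+00E9).
# With "use utf8" in effect, Perl should treat it as 4 characters,
# not as 5 octets.
sub greeting { return "café" }

1;
```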
4. What about plain POD files? Since they contain no Perl code, will POD
extractors know what encoding to expect, given that I can't put a
"use utf8" in them?
5. Would the CPAN compare utility adapt to encoding changes, or would
it consider an otherwise-identical file with different encodings to
consist of one very large change?
6. In general, would anything on CPAN break? What about the automated testers?
7. Are there any other common issues that I should be aware of, and
if so then what?
AFAIK, Perl 6 is going to expect its code files to be Unicode by
default; is that right? I know Larry said that some Unicode characters
would be used by the language grammar.
Thanks for any input you can give.
-- Darren Duncan