perl-unicode

making utf8-clean CPAN distributions

2004-12-12 23:30:10
I'm looking for help from you guys on an important forward-looking question.

What I would like to do is create my CPAN module distributions such that all of the files in each distro, code and documentation and tests and logs alike, are properly UTF-8 encoded, and do this in such a way that no modern Perl distribution or automated CPAN tool will break. Note that my modules in question are so bleeding edge that I don't expect to have any legacy users yet to worry about breaking compatibility with. It's a given that I'm using UTF-8, which is supposed to be network safe and to have only a single byte order; I don't plan to touch UTF-16 or any other variations.

For tools, I have on my machine Mac OS X 10.3.6 (operating system), and BBEdit 8.0.3 (text editor), both of which are fully Unicode aware. I'm also using the latest Perl, version 5.8.6, compiled myself with the OS-bundled dev tools such as GCC 3.3 etcetera. I am making the distributions formally require 5.008.

0. For my main question: is distributing files as Unicode a good idea at all currently, given that few if any people do it?

The following questions only apply if the answer to the above is "yes".

1. BBEdit gives me an option to have a byte-order mark in UTF-8 files (that happens to be 3 octets long I think), with the recommendation being to use it; I also have the choice not to, which makes the file more similar to many other ASCII-like encodings. So should I save the files with the BOM or without?

2. I am given a separate option to use either Unicode linebreaks or one of Unix/Mac/Win; all 4 are offered for use with a Unicode encoding. In my own tests, Perl 5.8 complained when the Unicode linebreak was used with UTF-8, but not when the Unix linebreak was used (I was not, however, using any special pragmas). So should I use the Unicode linebreak or the Unix linebreak, assuming the former can be made to work?

2.1 Will the addition of "use utf8" on the first line of a Perl file cause Perl to accept files with Unicode line breaks?

3. Can a "use utf8" be put anywhere besides the first line of a file? What if I customarily put POD on the first few lines and the package declaration beneath it? Also, in a script file, which goes first, the #!perl or the use-utf8?

4. What about plain POD files? Since they contain no Perl code, I can't put the use-utf8 in them, so will POD extractors know what to do?

5. Would the CPAN compare utility adapt to encoding changes, or would it consider an otherwise-identical file with different encodings to consist of one very large change?

6. In general, would anything on CPAN break?  What about the automated testers?

7. Are there any other common issues that I should be aware of, and if so then what?
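
To make question 1 concrete: as far as I know, the UTF-8 byte-order mark is the encoding of U+FEFF, which is the three octets EF BB BF. Below is a minimal sketch of how one could check whether a given file was saved with it. The helper name has_utf8_bom is one I made up for illustration, and the script simply reports on whatever file names are passed on the command line.

```perl
use strict;
use warnings;

# The UTF-8 encoding of U+FEFF (the byte-order mark) is three octets.
my $UTF8_BOM = "\xEF\xBB\xBF";

# Hypothetical helper: returns true if the named file begins with the UTF-8 BOM.
sub has_utf8_bom {
    my ($path) = @_;
    # Read raw octets so no I/O layer decodes or strips anything.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $read = read $fh, my $head, 3;
    close $fh;
    return defined $read && $read == 3 && $head eq $UTF8_BOM;
}

# Report on each file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path, has_utf8_bom($path) ? 'BOM' : 'no BOM';
}
```

Running this over the files in a distro would at least show which ones the editor saved with the mark and which without.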

AFAIK, Perl 6 is going to expect its code files to be Unicode by default? I know Larry said that some Unicode characters would be used by the language grammar.

Thanks for any input you can give.

-- Darren Duncan
