encodings in Pod::Simple

(I'm CCing the perl-i18n list on this, as this will interest some on the list; and I may need to pick the brains of a few people on the list.)

So as I'm mulling over the implementation and tests for =encoding in Pod::Simple, I'm starting to settle on some hopefully sane assumptions that Pod::Simple can make, which I'd like to run past you all for comment:

* A Pod file is in one encoding. You can't have a file that's half UTF8 and half Shift-JIS.

* You can use only one =encoding directive per file. The exception is that "redundant =encoding" commands (i.e., ones that simply redeclare the encoding that we've already declared) are ignored. So if you had two "=encoding iso-8859-6" commands in a file, the second one would be silently forgiven and ignored. But if you have a "=encoding iso-8859-6" and later a "=encoding shiftjis", this makes the file invalid (and the Pod processor can probably do something drastic like abort parsing the file).

* If a Pod file is in UTF16, it /must/ flag this by having a BOM at the beginning of the file. There can be a redundant "=encoding utf16" command, but it will be ignored. No other =encoding directives are permitted in a UTF16 file. (In short, a BOM counts sort of like an =encoding directive, and so it uses up your allowance of one non-redundant =encoding per file.)

* Similarly, if a Pod file is in UTF8, it /can/ signal this with a UTF8 BOM, and/or a "=encoding utf8" directive. But it's forbidden to have a UTF8 BOM and to then have an "=encoding" line other than "=encoding utf8".


[end of proposed encoding assumptions]
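
To make the four assumptions above concrete, here is a rough sketch of how they might be checked. It is written in Python purely for illustration and is not Pod::Simple code. The names declared_encoding and normalise are hypothetical, and comparing encoding names by lowercasing them and stripping "-" and "_" is my own simplifying assumption about what counts as a "redundant" redeclaration.

```python
import codecs
import re

def normalise(name):
    # Assumption: "UTF-8", "utf8" and "utf_8" all count as the same name.
    return re.sub(r"[-_]", "", name.lower())

def declared_encoding(raw):
    """Return the one encoding a Pod file declares (normalised), or None.

    Raises ValueError if the declarations conflict.
    """
    # With no BOM, decode as Latin-1 so every byte maps to one character
    # and ASCII "=encoding" lines can still be found.
    declared, text = None, raw.decode("latin-1")
    if raw.startswith(codecs.BOM_UTF8):
        declared, text = "utf8", raw.decode("utf-8-sig")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        declared, text = "utf16", raw.decode("utf-16")
    from_bom = declared

    for name in re.findall(r"^=encoding[ \t]+(\S+)", text, re.MULTILINE):
        name = normalise(name)
        if declared is None:
            declared = name          # first declaration wins
        elif name != declared:
            raise ValueError(
                "conflicting =encoding %r (already declared %r)" % (name, declared)
            )
        # otherwise it is a redundant redeclaration: silently ignored

    if declared == "utf16" and from_bom != "utf16":
        raise ValueError("a UTF16 Pod file must start with a BOM")
    return declared
```

Under these assumptions, a file with two "=encoding iso-8859-6" lines is accepted, while "=encoding iso-8859-6" followed by "=encoding shiftjis" raises an error, as does a UTF8 BOM followed by any "=encoding" other than utf8.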

The reason I'm making UTF16 special, above, is that it's the only really-double-byte-character-set I know of -- i.e., a character encoding where ALL characters take two (or more) bytes to express. I know there are some Asian encodings where non-US-ASCII characters are encoded as multiple bytes, but the letter A is still expressed as a single byte value, a decimal-65 byte.

Or are there some Asian encodings I've forgotten about? Specifically, I'm wondering whether UTF16 is the only attention-worthy encoding where the nine characters "=encoding" take up 18 bytes to express, instead of being just the 9 bytes 61 101 110 99 111 100 105 110 103 (i.e., join ' ', map ord($_), split '', '=encoding').
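
The byte counts above are easy to check. Here is a small sketch, in Python for illustration, using Python's own codec names. The handful of multibyte encodings in the loop is just my pick of examples, so this says nothing about encodings not listed.

```python
text = "=encoding"

# The nine ASCII byte values quoted above.
ascii_bytes = list(text.encode("ascii"))
print(ascii_bytes)

# UTF16 without a BOM (big-endian here): two bytes per character.
utf16_bytes = text.encode("utf-16-be")
print(len(utf16_bytes))

# A few multibyte encodings that leave US-ASCII characters as single bytes,
# so "=encoding" is byte-for-byte the same as in plain ASCII.
ascii_compatible = ["shift_jis", "euc_jp", "big5", "gb2312", "euc_kr", "utf-8"]
for codec in ascii_compatible:
    assert text.encode(codec) == text.encode("ascii"), codec
    assert "A".encode(codec) == bytes([65]), codec
```

In each of the listed encodings the directive can be found by a plain byte search, whereas in UTF16 it cannot.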


--
Sean M. Burke    http://search.cpan.org/~sburke/