perl-i18n

encodings in Pod::Simple

2003-07-04 19:08:51


(I'm CCing the perl-i18n list on this, as this will interest some on the list; and I may need to pick the brains of a few people on the list.)

So as I'm mulling over the implementation and tests for =encoding in Pod::Simple, I'm starting to settle on some hopefully sane assumptions that Pod::Simple can make, which I'd like to run past you all for comment:

* A Pod file is in one encoding. You can't have a file that's half UTF8 and half Shift-JIS.

* You can use only one =encoding directive per file. The exception is that "redundant =encoding" commands (i.e., ones that simply redeclare the encoding that we've already declared) are ignored. So if you had two "=encoding iso-8859-6" commands in a file, the second one would be silently forgiven and ignored. But if you have an "=encoding iso-8859-6" and later an "=encoding shiftjis", this makes the file invalid (and the Pod processor can probably do something drastic like abort parsing the file).

* If a Pod file is in UTF16, it /must/ flag this by having a BOM at the beginning of the file. There can be a redundant "=encoding utf16" command, but it will be ignored. No other =encoding directives are permitted in a UTF16 file. (In short, a BOM counts sort of like an =encoding directive, and so it uses up your allowance of one non-redundant =encoding per file.)

* Similarly, if a Pod file is in UTF8, it /can/ signal this with a UTF8 BOM, and/or a "=encoding utf8" directive. But it's forbidden to have a UTF8 BOM and to then have an "=encoding" line other than "=encoding utf8".

[end of proposed encoding assumptions]
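
To make the rules above concrete, here is a rough sketch of how they might be checked against the raw bytes of a file. It is only an illustration under my own assumptions: declared_encoding() is a hypothetical helper, not anything in Pod::Simple, and the way it normalizes encoding names (lowercase, punctuation stripped) and peeks into UTF16 (by dropping NUL bytes) is a shortcut for the sketch, not a real decoder.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pod::Simple): applies the proposed rules
# to the raw bytes of a Pod file. Returns a normalized encoding name, undef
# if nothing is declared, or dies if the declarations are invalid.
sub declared_encoding {
    my ($bytes) = @_;

    # A BOM counts as an implicit =encoding declaration.
    my $declared = $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/ ? 'utf16'
                 : $bytes =~ /\A\xEF\xBB\xBF/          ? 'utf8'
                 :                                       undef;
    my $has_utf16_bom = defined $declared && $declared eq 'utf16';

    # Strip the BOM. For UTF16, crudely drop NUL bytes so that the ASCII
    # directive becomes visible (good enough for a sketch only).
    my $text = $bytes;
    $text =~ s/\A(?:\xFE\xFF|\xFF\xFE|\xEF\xBB\xBF)//;
    $text =~ tr/\0//d if $has_utf16_bom;

    while ($text =~ /^=encoding[ \t]+(\S+)/mg) {
        # Assumed normalization: "UTF-8" and "utf8" compare equal.
        (my $name = lc $1) =~ tr/a-z0-9//cd;

        # UTF16 must be flagged by a BOM, never by a directive alone.
        die "=encoding utf16 without a BOM\n"
            if $name eq 'utf16' && !$has_utf16_bom;

        if (!defined $declared) {
            $declared = $name;          # first (and only real) declaration
        }
        elsif ($declared ne $name) {
            die "conflicting encodings: $declared vs. $name\n";
        }
        # Otherwise it is a redundant redeclaration: silently ignored.
    }
    return $declared;
}
```

With this sketch, a file containing two "=encoding iso-8859-6" lines yields "iso88596", whereas a file with "=encoding iso-8859-6" followed by "=encoding shiftjis" dies, which stands in for the "something drastic" mentioned above.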

The reason I'm making UTF16 special, above, is that it's the only truly double-byte character set I know of -- i.e., a character encoding where ALL characters are expressed as two (or more) bytes. I know there are some Asian encodings where non-US-ASCII characters are encoded as multiple bytes, but the letter A is still expressed as a single byte, a decimal-65 byte.

Or are there some Asian encodings I've forgotten about? Specifically, I'm wondering whether UTF16 is the only attention-worthy encoding where the nine characters "=encoding" take up 18 bytes to express, instead of being just the 9 bytes 61 101 110 99 111 100 105 110 103 (i.e., join ' ', map ord($_), split '', '=encoding').
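
For anyone who wants to check the byte counts for themselves, here is a small sketch using the Encode module (core since Perl 5.8). The list of encodings is just my own sample, not an exhaustive survey, and %bytes_in is a name I made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $directive = '=encoding';

# Byte length of the nine-character directive in a sample of encodings.
my %bytes_in;
for my $enc (qw(ascii iso-8859-1 utf8 shiftjis euc-jp UTF-16BE UTF-16LE)) {
    my $copy = $directive;    # encode() gets a copy, so the original is never touched
    $bytes_in{$enc} = length encode($enc, $copy);
}

printf "%-10s %2d bytes\n", $_, $bytes_in{$_} for sort keys %bytes_in;
```

In this sample only the two UTF-16 variants come out at 18 bytes; the rest stay at 9, because they all encode US-ASCII characters as single bytes.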

--
Sean M. Burke    http://search.cpan.org/~sburke/

