perl-unicode

Automagical :text layer (was: My favorite bug to fix for 5.8.0)

2002-03-10 04:40:27
After a discussion with Sean M. Burke, we arrived at the following ideas
regarding a ':text' layer, which attempts to deduce both the line ending
and encoding of a given text stream.

- The Name

We chose ':text' because of its deliberate vagueness; the only thing we
assume about the file is that it's not a byte stream, otherwise the
encoding/line ending guessing would not make sense.

Other possible names are ':anytext', ':any', ':magic', and of course
':guess'. I would have been happy with ':guess', were it not for the
fact that it can give the misleading impression that it can DWIM
arbitrary binary data, which it probably can't.

- Encoding Guessing

According to perlpodspec, one can use the following heuristic:

 * Respect Byte Order Mark first: \xFE\xFF for UTF-16 (big endian),
   \xFF\xFE for UTF-16 (little endian), and \xEF\xBB\xBF for UTF-8.

 * Probe the first few thousand bytes of the file, looking for
   [\xCD-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.

 * Otherwise, treat it as if no :encoding info is set.
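To make the three steps concrete, here is a minimal sketch in plain Perl.
The name guess_encoding and the 4096-byte probe size are my own
assumptions for illustration; nothing like this exists in core, and the
byte ranges are simply the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any Perl release: guess the encoding
# of a raw byte buffer using the three steps listed above.
sub guess_encoding {
    my ($bytes) = @_;

    # Step 1: a Byte Order Mark wins outright.
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;

    # Step 2: probe only the head of the buffer (4096 is an assumed limit)
    # for a lead byte followed by a continuation byte.
    my $head = substr($bytes, 0, 4096);
    return 'UTF-8' if $head =~ /[\xCD-\xFD][\x80-\xBF]/;

    # Step 3: nothing recognised; the caller should behave as if no
    # :encoding was given.
    return;
}
```

A caller would read the first chunk of the file in raw mode, pass it to
guess_encoding, and only then decide which :encoding layer to push, if
any.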

We could arguably use locale information to detect encodings in that
locale (after the BOM test); each such locale/encoding pair would
have to supply a multi-byte test range for that. On the other
hand, if one is relying on such a feature, one should probably use
':locale' instead anyway. Hence I'd like to propose that locale
information should not be used at all.

- Line-End Guessing

Since we want to process MacOS data on Unix and vice versa, there
should probably be :cr and :lf layers (in addition to :crlf and :raw)
that translate CR and LF respectively to the logical \n internally
(they're no-op layers on their native platforms).

:text should probe for the first possible line-end sequence in the
first few thousand bytes, and assume :cr, :lf or :crlf accordingly.
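The probing step could look roughly like the sketch below. The function
name guess_line_end and the 4096-byte limit are assumptions of mine, and
the returned strings are just the proposed layer names, not layers that
necessarily exist today.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the layer suggested by the
# first line-end sequence in the buffer, or nothing if none is found.
sub guess_line_end {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 4096);    # assumed probe size

    # CRLF is listed before bare CR so that "\x0D\x0A" is not mistaken
    # for a MacOS line end followed by a Unix one.
    if ($head =~ /(\x0D\x0A|\x0D|\x0A)/) {
        return ':crlf' if $1 eq "\x0D\x0A";
        return ':cr'   if $1 eq "\x0D";
        return ':lf';
    }
    return;
}
```

The sketch uses explicit \x0D and \x0A rather than \r and \n, since the
meaning of \n is exactly what differs between platforms. One edge case
it ignores: a CR that happens to be the last byte of the probe buffer
may really be the first half of a CRLF.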

- Overriding

In the unlikely event that we want to force a certain encoding or
line ending, simply precede the :text with another layer, like
:crlf or :encoding(big5). Since :text will receive translated
data as logical \n and UTF-8 respectively, its probing will always
DWIM.

- Output

On input streams, :text can read in a buffer for probing first;
but output is another matter altogether. If I say

    open $fh, '>:text', "File.txt";
    print $fh v24799.24218.23493.21566.20197.38477.65108;

Sure, that's Unicode data. But what should the representation
on disk be? If I said ':locale' with LC_ALL set to zh_TW.Big5, then
it should be Big5. But ':text' or ':guess' possess no such ability,
so I think it should be a no-op here.
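To show why the on-disk representation is a real choice, here is a small
sketch that encodes the same seven characters two ways with the Encode
module. The variable names are mine, and picking UTF-8 and Big5 as the
two candidates is only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The seven characters from the v-string in the example above.
my $text = join '', map { chr($_) }
    (24799, 24218, 23493, 21566, 20197, 38477, 65108);

# Two possible on-disk representations of the same logical string.
my $as_utf8 = encode('UTF-8', $text);
my $as_big5 = encode('big5',  $text);

printf "UTF-8: %d bytes\n", length $as_utf8;
printf "Big5:  %d bytes\n", length $as_big5;
```

The two byte strings differ, and nothing in the data itself says which
one the user wanted; that information has to come from somewhere else,
such as an explicit layer or the locale.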

Alternatively, if it uses locale information for encoding guessing
on input streams, then it should probably just be an alias for
':locale' on output streams.

Does the above even make sense? :-)

Thanks,
/Autrijus/
