After a discussion with Sean M. Burke, we arrived at the following ideas
regarding a ':text' layer, which attempts to deduce both the line ending
and encoding of a given text stream.
- The Name
We chose ':text' because of its deliberate vagueness; the only thing we
assume about the file is that it's not a byte stream, otherwise the
encoding/line ending guessing would not make sense.
Other possible names are ':anytext', ':any', ':magic', and of course
':guess'. I would have been happy with ':guess', if not for the fact
that it can give the misleading impression that it can DWIM arbitrary
binary data, which it probably can't.
- Encoding Guessing
According to perlpodspec, one can use the following heuristic:
* Respect Byte Order Mark first: \xFE\xFF for UTF-16 (big endian),
\xFF\xFE for UTF-16 (little endian), and \xEF\xBB\xBF for UTF-8.
* Probe the first few thousand bytes of the file, looking for
[\xC2-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.
* Otherwise, treat it as if no :encoding info is set.
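The three steps above can be sketched in plain Perl, outside of any
PerlIO layer. This is only an illustration: guess_encoding() is a
made-up name, the 4096-byte probe window is an assumed reading of "the
first few thousand bytes", and the lead-byte range is written as
\xC2-\xFD on the assumption that \xC2 (the lowest byte that can start a
multi-byte UTF-8 sequence) is the intended lower bound.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any existing module: guess the
# encoding of a byte buffer using the BOM / high-bit heuristic.
sub guess_encoding {
    my ($bytes) = @_;

    # Step 1: a Byte Order Mark wins outright.
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;

    # Step 2: probe only the start of the buffer (4096 is an assumed
    # size) for a UTF-8 lead byte followed by a continuation byte.
    my $head = substr($bytes, 0, 4096);
    return 'UTF-8' if $head =~ /[\xC2-\xFD][\x80-\xBF]/;

    # Step 3: no evidence either way, so report no encoding at all.
    return;
}
```

A caller would read a chunk with the handle in :raw mode, pass it to
guess_encoding(), and push an :encoding(...) layer only when a name
comes back.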
We can arguably use locale information to detect encodings in that
locale (after the BOM test); each such locale/encoding pair would
have to supply a multi-byte test range for that. On the other
hand, if one is relying on such a feature, one should probably use
':locale' instead anyway. Hence I'd like to propose that locale
information should not be used at all.
- Line-End Guessing
Since we want to process MacOS data on Unix and vice versa, there
should probably be :cr and :lf layers (in addition to :crlf and :raw)
that translate CR and LF to the logical \n internally (each is a
no-op layer on its native platform).
:text should probe for the first possible line-end sequence in the
first few thousand bytes, and assume :cr, :lf or :crlf accordingly.
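The probing rule could look roughly like this in plain Perl. Again a
sketch under assumptions: guess_line_ending() is a made-up name, 4096
bytes stands in for "the first few thousand bytes", and the returned
strings are just the layer names used in this proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: name the layer suggested by the first line-end
# sequence found in the probe window.
sub guess_line_ending {
    my ($bytes) = @_;

    # 4096 is an assumed probe size.
    my $head = substr($bytes, 0, 4096);

    # CR LF is listed before a lone CR so that it is not misread as :cr.
    if ($head =~ /(\x0D\x0A|\x0D|\x0A)/) {
        return ':crlf' if $1 eq "\x0D\x0A";
        return ':cr'   if $1 eq "\x0D";
        return ':lf';
    }

    # No line ending seen in the probe window.
    return;
}
```

One corner case this sketch ignores: a CR that is the very last byte of
the probe window could be the first half of a CR LF pair, so a real
layer would want to peek one byte further before settling on :cr.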
- Overriding
In the unlikely event that we want to force a certain encoding or
line ending, simply precede the :text with another layer, like
:crlf or :encoding(big5). Since :text will receive translated
data as logical \n and UTF-8 respectively, its probing will always
DWIM.
- Output
On input streams, :text can read in a buffer for probing first;
but output is another matter altogether. If I say
open $fh, '>:text', "File.txt";
print $fh v24799.24218.23493.21566.20197.38477.65108;
Sure, that's Unicode data. But what should the representation
on disk be? If I said ':locale' with LC_ALL set to zh_TW.Big5, then
it should be Big5. But ':text' or ':guess' possess no such ability,
so I think it should be a no-op here.
Alternatively, if it uses locale information on Encoding Guessing
of input streams, then it should probably just alias to ':locale'
on output streams.
Does the above even make sense? :-)
Thanks,
/Autrijus/