perl-unicode

Re: Automagical :text layer (was: My favorite bug to fix for 5.8.0)

2002-03-11 13:14:07
Autrijus Tang writes:
: After a discussion with Sean M. Burke, we arrived at the following ideas
: regarding a ':text' layer, which attempts to deduce both the line ending
: and encoding of a given text stream.
: 
: - The Name
: 
: We chose ':text' because of its deliberate vagueness; the only thing we
: assume about that file is that it's not a byte stream, otherwise the
: encoding/line ending guessing would not make sense.
: 
: Other possible names are ':anytext', ':any', ':magic', and of course
: ':guess'. I would have been happy with ':guess', were it not that the
: name misleadingly gives the impression that it can DWIM arbitrary
: binary data, which it probably can't.

On top of which, Camel III talks about :text, so that's probably the
right way to go.

: - Encoding Guessing
: 
: According to perlpodspec, one can use the following heuristic:
: 
:  * Respect Byte Order Mark first: \xFE\xFF for UTF-16 (big endian),
:    \xFF\xFE for UTF-16 (little endian), and \xEF\xBB\xBF for UTF-8.
: 
:  * Probe the first few thousand bytes of the file, looking for
:    [\xCD-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.
: 
:  * Otherwise, treat it as if no :encoding info is set.
: 
: We could arguably use locale information to detect encodings in that
: locale (after the BOM test); each such locale/encoding pair would
: have to supply a multi-byte test range for that. On the other
: hand, if one is relying on such a feature, one should probably use
: ':locale' instead anyway. Hence I'd like to propose that locale
: information should not be used at all.

I'd tend to argue that a locale specifying UTF-8 is in a different
class from other locales, in that it's specifying the probably future
lingua franca of the world, and should have privileged treatment
as the desirable default case someday, if not now.
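
For concreteness, the BOM-then-probe heuristic quoted above might be
sketched roughly as follows. This is only an illustration: the name
guess_encoding and the 4096-byte probe size are assumptions made here,
not part of any existing layer, and the byte ranges are copied as quoted.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any PerlIO layer: guess an encoding
# name from the leading bytes of a stream, following the three quoted steps.
sub guess_encoding {
    my ($bytes) = @_;

    # Step 1: a byte order mark wins outright.
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;

    # Step 2: probe the first few thousand bytes (4096 is an assumed
    # size) for a lead byte followed by a continuation byte. The range
    # is used exactly as quoted; since it starts at \xCD, two-byte
    # sequences with lead bytes \xC2-\xCC are not matched by this sketch.
    my $probe = substr($bytes, 0, 4096);
    return 'UTF-8' if $probe =~ /[\xCD-\xFD][\x80-\xBF]/;

    # Step 3: no evidence; behave as if no :encoding was given.
    return;
}
```

A caller would read a buffer in binary mode, pass it to guess_encoding,
and fall back to its usual default when the result is undefined.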

: - Line-End Guessing
: 
: Since we want to process MacOS data on Unix and vice versa, there
: should probably be :cr and :lf layers (in addition to :crlf and :raw)
: that translate CR and LF to the logical \n internally (they're
: no-op layers on their native platforms).
: 
: :text should probe for the first possible line-end sequence in the
: first few thousand bytes, and assume :cr, :lf or :crlf accordingly.

My original thought was that :crlf would do line-end guessing, not
assume CRLF, but I can see where that would be confusing.  Should
probably be named :nl after (logical) newline.  But if that's all
subsumed under :text anyway, it probably doesn't matter unless you
want to force it somehow.
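
The line-end probe described in the quoted text could look something
like the sketch below. The name guess_line_end and the 4096-byte window
are assumptions for illustration only; the returned strings are just the
layer names proposed in this thread, not layers known to exist.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the layer suggested by the
# first line-end sequence found in the probe window.
sub guess_line_end {
    my ($bytes) = @_;
    my $probe = substr($bytes, 0, 4096);   # assumed probe size

    # Explicit byte values are used rather than \r and \n, so the check
    # does not depend on the platform's idea of a logical newline.
    # CRLF is listed first so that it is not mistaken for a lone CR.
    if ($probe =~ /(\x0D\x0A|\x0D|\x0A)/) {
        return ':crlf' if $1 eq "\x0D\x0A";
        return ':cr'   if $1 eq "\x0D";
        return ':lf';
    }
    return;   # no line end seen; nothing to assume
}
```

One caveat with any fixed window: a CR that happens to be the last byte
of the probe may really be the first half of a CRLF, so a real layer
would probably want to peek one byte further in that case.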

In any event, Perl should assume its input is :text of some kind or
another, as it has always done on MS-DOS et al.  That is, after all,
why binmode is called binmode.  If we truly want cross-platform
portability, then :text has to do the right thing everywhere, and
it has to be the default.

: - Overriding
: 
: In the unlikely event that we want to force a certain encoding or
: line ending, simply precede the :text with another layer, like
: :crlf or :encoding(big5). Since :text will receive translated
: data as logical \n and UTF-8 respectively, its probing will always
: DWIM.
: 
: - Output
: 
: On input streams, :text can read in a buffer for probing first;
: but output is another matter altogether. If I say
: 
:     open $fh, '>:text', "File.txt";
:     print $fh v24799.24218.23493.21566.20197.38477.65108;
: 
: Sure, that's Unicode data. But what should the representation
: on disk be? If I said ':locale' with LC_ALL set to zh_TW.Big5, then
: it should be Big5. But ':text' or ':guess' possess no such ability,
: so I think it should be a no-op here.
: 
: Alternatively, if it uses locale information for encoding guessing
: on input streams, then it should probably just alias to ':locale'
: on output streams.

In the absence of other information, I expect locale probably gives the
best indication (though certainly not the ideal indication) of what's
wanted on output.  I think it should at least be the default default,
with maybe a PERL_MUMBLE environment variable or two to override
locales that have undesirable properties.
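
As a rough sketch of that "default default", assuming the override
arrives as a plain encoding name: the helper default_output_layer is
hypothetical, and PERL_MUMBLE is only the placeholder name used above,
not a real variable. I18N::Langinfo and POSIX::setlocale are the
standard modules for asking the locale which codeset it wants.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: choose an output layer string. An explicit
# override (for example, the value of some PERL_MUMBLE-style variable)
# takes precedence; otherwise the locale's codeset is used.
sub default_output_layer {
    my ($override) = @_;
    return ":encoding($override)"
        if defined $override && length $override;

    # Adopt the locale from the environment. Note that this changes
    # process-wide state, which a real implementation might avoid.
    setlocale(LC_CTYPE, '');
    my $codeset = langinfo(CODESET);

    # With no usable codeset, return an empty string, meaning no extra layer.
    return (defined $codeset && length $codeset)
        ? ":encoding($codeset)"
        : '';
}
```

It might be called as default_output_layer($ENV{PERL_MUMBLE}), and the
result appended to the mode given to open. Whether the codeset name
reported by the locale is one that :encoding() recognizes is not
checked here.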

: Does the above even make sense? :-)

I think so.  But mileage always varies from klickage...

Larry