perl-unicode

Re: byte order mark

1999-10-06 12:51:24


From: John Dlugosz

I assume you mean UTF-8 there, not UTF-16.

Check.  Typo.

I suppose I should clarify that I wasn't attacking you particularly.
I have no problem with people making the best of a bad situation.  :-)

None taken.
I'm interested in finding out just what =should= be done re different text
encodings, and using that info in my own tools and in what I promote in my
magazine articles.

re ...in every string... as the natural progression:  I see your point.
My philosophy (with my C++ classes) is that a string is a string, in the
abstract sense.  How it's encoded (Unicode or ANSI code page) is an
implementation detail that's hidden from the user.  This grew out of
messy "thermocline" problems when mixing Unicode code with older 8-bit
code.  With my current C++ class, it doesn't matter.  Right now, I tag each
ustring (universal string object) with its structural representation (8 bit
or 16 bit chars) so that the individual characters can be identified
regardless of what the user passed in.  The next generation will also tag
each instance with encoding representation, so a particular string, for
example, is known to be 8859-1 (where I compiled it), not 8859-2 (the
machine at run time).  This is more of an issue with C++ because string
literals are taken as-is at compile time, so code that matches one against
a Unicode value at run time has no way of knowing which code page the
programmer was using when typing it.  Specifically, the value of __FILE__
has this problem.
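
As a rough illustration of why the encoding tag matters (in Python rather
than C++, purely for brevity; the class name TaggedString and its methods
are hypothetical and are not the ustring interface), the same byte means
different characters under 8859-1 and 8859-2, so two strings can only be
compared safely once each is decoded according to its own tag:

```python
class TaggedString:
    """Hypothetical stand-in for a string object that carries its encoding."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw            # bytes exactly as they were stored
        self.encoding = encoding  # e.g. "iso-8859-1", fixed when the string is created

    def text(self) -> str:
        # Decode according to the tag, never according to the current machine.
        return self.raw.decode(self.encoding)

    def same_text(self, other: "TaggedString") -> bool:
        # Compare abstract characters, not raw bytes.
        return self.text() == other.text()


# The same byte, tagged with the compile-time and the run-time code page.
compiled = TaggedString(b"\xb1", "iso-8859-1")   # PLUS-MINUS SIGN, U+00B1
runtime = TaggedString(b"\xb1", "iso-8859-2")    # LATIN SMALL LETTER A WITH OGONEK, U+0105
```

Here compiled.same_text(runtime) is false even though the raw bytes are
identical, which is the mismatch an untagged literal cannot detect.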

Re how far does Perl need to go to be useful in the environment I foresee,
where "text" files contain several possible encodings indiscriminately:
Per-file recognition.  When opening the file, stick in the proper filter.
What's read is always Perl's native format.  Vice-versa for writing.
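
A minimal sketch of per-file recognition, written in Python only to show
the shape of the idea (this is not how Perl does or would do it).  The
function names sniff_encoding and open_text are made up for the example,
the 8859-1 fallback is an assumption, and UTF-32 signatures are ignored:

```python
import codecs

# Signatures checked in order.  The codec names are Python's own; both
# "utf-8-sig" and "utf-16" consume the signature while decoding.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head: bytes, default: str = "iso-8859-1") -> str:
    """Pick a decoder from the first few bytes of a file."""
    for bom, name in _SIGNATURES:
        if head.startswith(bom):
            return name
    return default  # assumption: unsigned files are 8-bit text


def open_text(path, default="iso-8859-1"):
    """Open a file for reading with the matching input filter installed."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    # newline="" leaves line endings untouched; only the encoding is filtered.
    return open(path, "r", encoding=sniff_encoding(head, default), newline="")
```

Whatever the file held on disk, the caller of open_text only ever sees
the language's native string type.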

Re BOMs inside text:  no worse than non-printing non-spacing characters
today, or even strange spacing characters like TAB or hard to count
consecutive spaces.  Non-spacing characters (printing or not) will make the
visible location on the listing different from the substr() position.  Or
how about right-to-left passages embedded in a left-to-right string?  In
general, substr() sees the sequence of variable-length UTF-8 codes
(including those that UTF-16 treats as Surrogate Pairs), which is not
necessarily one code<=>one column of fixed width output.  Given that, an
embedded U+FEFF (Zero Width No-Break Space) is not a unique case -- it's
handled just like any other code that doesn't provide one position of
printed character information.
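
To make the position-versus-column point concrete, here is a small Python
sketch (the helper visible_columns is hypothetical and deliberately crude:
it ignores TABs, wide East Asian characters and bidirectional text).  It
treats combining marks and format characters such as U+FEFF alike, as
codes that occupy a string position but no printed column:

```python
import unicodedata


def visible_columns(s: str) -> int:
    """Rough column count: skip combining marks and format (Cf) characters."""
    return sum(
        1
        for ch in s
        if unicodedata.combining(ch) == 0 and unicodedata.category(ch) != "Cf"
    )


# "e" + COMBINING ACUTE ACCENT + ZERO WIDTH NO-BREAK SPACE + "x"
sample = "e\u0301\ufeffx"
```

The sample holds four codes but prints in two columns, and the embedded
U+FEFF needed no special case of its own.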

Embedded ZWNBSPs will occur if two files with BOMs are simply
concatenated.
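
The concatenation case is easy to reproduce.  In this Python sketch (an
illustration only, using Python's "utf-8-sig" codec, which strips just the
leading signature), the second file's BOM survives as an ordinary
character in the middle of the decoded text:

```python
import codecs

# Two UTF-8 "files", each starting with its own signature.
first = codecs.BOM_UTF8 + "abc".encode("utf-8")
second = codecs.BOM_UTF8 + "def".encode("utf-8")

# Naive byte-level concatenation, then a single decode of the result.
joined = (first + second).decode("utf-8-sig")
```

After this, joined is "abc", then U+FEFF, then "def": seven codes for six
visible letters.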

If every text file included a MIME header (every file, period, for that
matter), or if there was some implementation-dependent out-of-band data
mechanism that had a portable interface available on all platforms, that
would be great!

I'm doing the best I can, by including a
   // charset: ISO-8859-1-Windows-3.1-Latin-1
comment near the top of my C++ files.  A random user who downloads the code
will at least see this and know how to interpret the high-bit characters.
And because the pattern is exactly like internet headers, an editor, lister,
or whatever other tool might notice that automatically as part of its
guessing algorithm.

IOW, I'd like to put a "charset:" or "charset=" line in each file, by
whatever means allowed by the underlying tool that file is meant for.
Tools that deal with =any= text file (e.g. editor) might notice it; human
readers can notice it.
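
A tool's guessing step for such a line could be as simple as the
following Python sketch.  The function name find_charset_tag, the
ten-line window and the exact pattern are all assumptions made for the
example; the head of the file is presumed to have been read as 8-bit text
first, since the tag itself is plain ASCII:

```python
import re

# Matches "charset:" or "charset=" followed by a token, inside any comment style.
_CHARSET_RE = re.compile(r"charset\s*[:=]\s*([A-Za-z0-9._-]+)", re.IGNORECASE)


def find_charset_tag(head: str, max_lines: int = 10):
    """Return the declared charset name from the first lines, or None."""
    for line in head.splitlines()[:max_lines]:
        match = _CHARSET_RE.search(line)
        if match:
            return match.group(1)
    return None


cpp_head = "// demo.cpp\n   // charset: ISO-8859-1-Windows-3.1-Latin-1\nint main() {}\n"
```

Calling find_charset_tag(cpp_head) returns the name after "charset:";
mapping that name onto an actual decoder is a separate problem that the
sketch leaves alone.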

Just musing here.  Now back to the main discussion.

Though Perl itself still examines the #! line for switches.  I recognize
that my position on this is not entirely consistent.  Part of the reason
I hate people putting other metadata in the file is that I already think
I know which metadata I want in the file...

Perl itself can also know about non-spacing marks, or a BOM in particular.
Also, if more things can be specified directly, there will be less need for
Perl to find command line switches this way.  E.g. use warning  instead of
-w.

I don't think
BOMs are a well-defined concept, because they have no scope.

I see.  So, in generalizing the BOM to be "encoding signature", part of
using that effectively would be to clearly define the scope.  For example,
what I was implicitly thinking is that the setting is good for the life of
the opened file handle.

On the other hand, I don't see how output filters could be made to guess
whether they should install byte-order marks or not.  That would probably
have to be specified, just as we currently have to specify binmode()
on Windows to avoid "\n" to CRLF translation.  Unhidable complexity
again...

That works for me.
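
For what "specified, not guessed" could look like on the output side,
here is a small Python sketch.  The function encode_output and its
with_bom parameter are hypothetical, and only three encodings are
covered.  The caller states whether a signature is wanted, in the same
way binmode() is stated explicitly:

```python
import codecs

# Signatures for the encodings this sketch knows about.
_BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def encode_output(text: str, encoding: str = "utf-8", with_bom: bool = False) -> bytes:
    """Encode text for writing; a signature is added only on explicit request."""
    data = text.encode(encoding)
    if with_bom:
        # Raises KeyError for an encoding with no signature listed above.
        data = _BOMS[encoding] + data
    return data
```

The default is no signature, so existing byte-oriented consumers see no
change unless the writer opts in.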

--John


