perl-unicode

Re: byte order mark

1999-10-06 11:41:48
jdlugosz(_at_)kodak(_dot_)com writes:
: > BOMs are an abomination.
: 
: In my experiments and musings concerning the switch from 8-bit text systems
: to Unicode systems, I assumed that the presence of the BOM, which was
: already defined for the purpose of distinguishing little endian from big
: endian UCS-2 or UTF-16, would work to distinguish all the common and useful
: encoding schemes from each other, as shown by my chart.

I suppose I should clarify that I wasn't attacking you particularly.
I have no problem with people making the best of a bad situation.  :-)

: Since different tools want different formats (Perl wants UTF-16, XXX wants
: UCS-2) and apparently 8-bit text might be either UTF-16 or ISO 8859-1, I
: naturally started building tools to look at this signature to decide what
: the file type is.  A general purpose text editor, for example, can accept
: any of those formats.

I assume you mean UTF-8 there, not UTF-16.

: If they are an abomination, what else do we have?
: Can you enlighten me on how we =should= mark files on a system that
: contains a mixture of these text file formats?

Mixing metadata in with data is just plain bad news in a
text-processing language, especially when multiple standards are vying
for what has to be first in the file.  And it's worse than that,
because if we're not careful, we're gonna have to worry about what
comes first in every *string*, not just every file.  Should substr()
ignore byte-order marks?  Your poor user is going to look at some
text file with your fancy editor, and say "I can see with my own eyes
that this field starts at the fourth character on every line.  How come
Perl can't pull it out with a substr()?"

The alternative seems to be to make sure that Perl removes every
byte-order mark from its input.  But if every string or file has to
be treated as a special case, why didn't we just send the information
out-of-band in the first place?  Where does it stop?  There's just an
awful lot of type information that can be associated with strings--just
look at MIME.
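
To make the substr() complaint concrete, here is a minimal sketch, assuming a Perl that ships the Encode module. The helper name decode_without_bom is hypothetical, not anything Perl provides. It decodes UTF-8 bytes and drops a single leading U+FEFF, so that character offsets match what the user sees in an editor.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Perl): decode UTF-8 bytes and
# remove one leading U+FEFF if present.
sub decode_without_bom {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/\A\x{FEFF}//;    # only a mark at the very start is removed
    return $text;
}

my $raw = "\xEF\xBB\xBFabcdef";    # UTF-8 signature followed by six letters

# Without stripping, the mark counts as character 0 and shifts every field.
my $shifted = substr(decode('UTF-8', $raw), 3, 3);

# With stripping, the field at offset 3 is the one the user expects.
my $field = substr(decode_without_bom($raw), 3, 3);

print "unstripped: $shifted\n";
print "stripped:   $field\n";
```

Note that this only handles a mark at the start of one input. A mark in the middle of a string, say after two files are concatenated, is left alone, which is exactly the special-casing problem described above.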

: The BOM is elegant because it already means "ignore me" when found in a
: text stream, so any tool should pass them without trouble.  Put one at the
: beginning of a file, and other tools (like editor or transcoder) can
: unambiguously tell how the file is encoded, rather than guessing.

Assuming that the file is a text file, and assuming the file is all
encoded the same way, and assuming no enterprising Unix geek is going
to use ASCII-based tools to split a UTF-8 file into multiple files, etc.

Byte-order marks are elegant only if you sweep the inelegance under the
carpet of the folks who designed Unicode.  They should either have
mandated that all data be exchanged via a byte-encoding such as UTF-8,
or at least mandated that all 16-bit data exchange happen in
network-byte order.  Hey, it works for TCP/IP...

I hate byte-order marks.  In case you couldn't tell.  :-)

: Under Windows NT, the shebang can be programmed into the shell via
: configuration.  There is a list containing file offsets and masks, and it can
: look at signatures anywhere in the file as long as they are in a fixed
: location.  The two bytes #! at offset 0, or the four bytes of the Unicode
: equivalent, are simply a special case.  However, most file types on Win32
: systems are keyed by the extension, so most people ignore #! on Windows
: systems.

Though Perl itself still examines the #! line for switches.  I recognize
that my position on this is not entirely consistent.  Part of the reason
I hate people putting other metadata in the file is that I already think
I know which metadata I want in the file...

: Having Perl come up with a good mixed-text-encoding solution and lead the
: way would encourage similar support from other tools (e.g. BASH on Linux).

A goal of Perl has always been to hide the right amount of complexity
from the programmer.  It's not clear that this will continue to be
a defined concept, the way text processing is going.

: > No, we can do better than that.  We'd swap in a translator and the
: > lexer would never see anything but utf8.
: 
: Cool, so Perl will eventually support various encodings, not just UTF-8 ?

All Unicode is internally in UTF-8, since Perl is, as much as possible,
a Byte-Order Free Zone.  But we can do anything we want at the interface
of Perl to the real world, provided it's well defined.  I don't think
BOMs are a well-defined concept, because they have no scope.  Or rather,
the associated scope is *external* metadata, the size of a file, or a
string, or a TCP stream, or an XML document, or a UDP packet, or a
system call argument, or something.

The world would be a much cleaner place if we simply scrapped UCS-2.

: > You suppose the paragraph separator should make a
: > new line too, since you use it instead of a line separator?
: 
: The spec says "Its use allows the creation of plain text files, which can
: be laid out on a different line width at the receiving end."
: 
: That is, a tool that used U+2029 to separate paragraphs would treat U+2028
: as "soft", free to rearrange as needed.
: 
: However, I can imagine using these marks together in a "here document"
: passage or other multi-line quotation, and it would only make sense to have
: the source line numbers reported in errors match the appearance on the
: screen in the text editor.

What do we do if different text editors count the lines differently?
It doesn't do any good to tell the user to look for the error on line 42
if one editor thinks it's line 37 and another editor thinks it's line 582.

This is the sort of complexity that cannot be hidden, no matter how well
your language is designed.

: So yes, treat the Paragraph Separator as a line break, too, even though
: that is not kosher to separate lines of source code with.  People will end
: up doing it.

I hope so.

: I'm more worried about "smart" end-of-line within file reading.  If I say
: $line=<FH>;, and want U+2028 to be the $/, but say that legacy CR and LF
: are =not= significant?  And should setting U+2028 implicitly recognise
: U+2029 also?

That can be special-cased, I think, especially as we get the ability
to associate different input filters with a given stream at open time.
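
As a rough illustration of attaching a filter at open time, the sketch below assumes a Perl with PerlIO encoding layers and the Encode module. It opens an in-memory stream as UTF-8 and sets $/ to U+2028, so legacy LF is not treated as a line end. The sample text is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample data: one LF inside the first record, U+2028 between records.
my $bytes = encode('UTF-8', "one\nstill one\x{2028}two\x{2028}three");

# The decoding filter is chosen when the stream is opened.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open: $!";

my @lines;
{
    local $/ = "\x{2028}";    # LINE SEPARATOR is the only record terminator
    while (my $line = <$fh>) {
        chomp $line;          # chomp removes the current value of $/
        push @lines, $line;
    }
}
close $fh;

print scalar(@lines), " records\n";
```

Since $/ holds a single string, this does not also recognize U+2029. Treating both separators as line ends would need a different approach, such as reading larger chunks and splitting on a pattern.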

Input routines don't all have to have the same level of generality.
Some could be guessers, and some could be forcers.
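
A "guesser" could be as small as the following sketch. The function name guess_encoding_from_bom is hypothetical, and only the three signatures discussed in this thread are checked. It looks at the leading bytes and returns an encoding name, or nothing when there is no signature, in which case a "forcer" (the caller naming the encoding outright) has to take over.

```perl
use strict;
use warnings;

# Byte signatures for the encodings discussed in this thread.
my @SIGNATURES = (
    [ "\xEF\xBB\xBF", 'UTF-8'    ],
    [ "\xFE\xFF",     'UTF-16BE' ],
    [ "\xFF\xFE",     'UTF-16LE' ],
);

# Hypothetical guesser: returns an encoding name, or an empty
# list/undef when the data carries no recognized signature.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    for my $sig (@SIGNATURES) {
        my ($mark, $name) = @$sig;
        return $name if substr($bytes, 0, length $mark) eq $mark;
    }
    return;
}

for my $sample ("\xEF\xBB\xBFabc", "\xFF\xFEa\x00", "plain ASCII") {
    my $guess = guess_encoding_from_bom($sample);
    print defined $guess ? $guess : 'no signature', "\n";
}
```

The guess is only as good as the assumptions listed earlier: that the data is text, that it is encoded the same way throughout, and that nobody has split or concatenated it with byte-oriented tools.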

On the other hand, I don't see how output filters could be made to guess
whether they should install byte-order marks or not.  That would probably
have to be specified, just as we currently have to specify binmode()
on Windows to avoid "\n" to CRLF translation.  Unhidable complexity
again...
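
On the output side, a sketch of "it has to be specified" might look like this, again assuming the Encode module. The helper encode_for_output and its arguments are hypothetical. The caller states both the encoding and whether a mark is wanted, and nothing is guessed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the caller states the encoding and whether
# a byte-order mark should be written.
sub encode_for_output {
    my ($text, $encoding, $want_bom) = @_;
    $text = "\x{FEFF}" . $text if $want_bom;    # mark is ordinary data here
    return encode($encoding, $text);
}

my $with    = encode_for_output("hi", 'UTF-16LE', 1);
my $without = encode_for_output("hi", 'UTF-16BE', 0);

# Show the resulting bytes in hex.
for my $bytes ($with, $without) {
    print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
}
```

When such bytes are written to a real file on Windows, the handle still needs binmode() so that no "\n" to CRLF translation is applied on top.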

Larry
