Re: byte order mark



From: John Dlugosz





To:   gsar(_at_)activestate(_dot_)com (Gurusamy Sarathy)
cc:   perl5-porters(_at_)perl(_dot_)org, perl-unicode(_at_)perl(_dot_)org, 
gsar(_at_)activestate(_dot_)com,
      John Dlugosz <jdlugosz(_at_)kodak(_dot_)com> (bcc: John Dlugosz/KHIS/EKC)
Subject:  Re: byte order mark

BOMs are an abomination.


In my experiments and musings concerning the switch from 8-bit text systems
to Unicode systems, I assumed that the presence of the BOM, which was
already defined to distinguish little-endian from big-endian UCS-2 or
UTF-16, would work to distinguish all the common and useful encoding
schemes from each other, as shown by my chart.

Since different tools want different formats (Perl wants UTF-8, XXX wants
UCS-2) and apparently 8-bit text might be either UTF-8 or ISO 8859-1, I
naturally started building tools to look at this signature to decide what
the file type is.  A general purpose text editor, for example, can accept
any of those formats.
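
A minimal sketch of that kind of signature sniffing, in Python purely for
illustration. The function name and the ISO 8859-1 fallback for unmarked
files are assumptions made for the example; the chart mentioned above is
not reproduced here.

```python
import codecs

# Signatures for the encodings discussed in this thread.
BOM_TABLE = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_encoding(data, default="iso-8859-1"):
    """Return (encoding, signature_length) guessed from the leading bytes.

    `default` is an assumption: with no signature present, the data is
    treated as legacy 8-bit text.
    """
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0
```

An editor could call sniff_encoding() on the first few bytes of a file and
decode the rest from the returned offset.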

If they are an abomination, what else do we have?
Can you enlighten me on how we =should= mark files on a system that
contains a mixture of these text file formats?

The BOM is elegant because it already means "ignore me" when found in a
text stream, so any tool should pass it through without trouble.  Put one
at the beginning of a file, and other tools (like an editor or transcoder)
can unambiguously tell how the file is encoded, rather than guessing.

Certainly it's not our job to fix the kernel if the kernel doesn't
recognize byte-order-marks, but we can at least make Perl ignore them,
and possibly switch to utf8 mode automatically.


My thoughts exactly:  we're not here to discuss similar enhancements to the
UNIX shell programs.

Under Windows NT, the shebang can be programmed into the shell via
configuration.  There is a list containing file offsets and masks, so it
can look at signatures anywhere in the file as long as they are at a fixed
location.  The two bytes #! at offset 0, or the four bytes of the Unicode
equivalent, are simply a special case.  However, most file types on Win32
systems are keyed by the extension, so most people ignore #! on Windows
systems.
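
The offset-and-mask idea can be sketched as follows, in Python for
illustration.  The rule layout, the names, and the choice of little-endian
UCS-2 with no leading BOM for the four-byte form are all hypothetical;
this is not the actual Windows NT configuration format.

```python
# Hypothetical rule format: (offset, mask, expected bytes, file type).
# `expected` must already be masked, i.e. expected & mask == expected.
RULES = [
    (0, b"\xff\xff", b"#!", "script"),
    # "#!" as little-endian UCS-2 without a BOM (an assumption).
    (0, b"\xff\xff\xff\xff", b"#\x00!\x00", "script"),
]

def matches(data, offset, mask, expected):
    """True if the bytes at `offset`, ANDed with `mask`, equal `expected`."""
    chunk = data[offset:offset + len(mask)]
    if len(chunk) < len(mask):
        return False  # file too short to hold the signature
    return bytes(b & m for b, m in zip(chunk, mask)) == expected

def classify(data, rules=RULES):
    """Return the file type of the first matching rule, or None."""
    for offset, mask, expected, name in rules:
        if matches(data, offset, mask, expected):
            return name
    return None
```

Because each rule carries its own offset, the same table can describe a
signature at the start of the file or at any other fixed position.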

Having Perl come up with a good mixed-text-encoding solution and lead the
way would encourage similar support from other tools (e.g. BASH on Linux).

No, we can do better than that.  We'd swap in a translator and the
lexer would never see anything but utf8.


Cool, so Perl will eventually support various encodings, not just UTF-8?
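
A rough sketch of such a translating step, in Python for illustration
only (it says nothing about how Perl itself would implement it).  The
function name and the ISO 8859-1 fallback for unmarked input are
assumptions.

```python
import codecs

def to_utf8(raw, fallback="iso-8859-1"):
    """Transcode raw source bytes to UTF-8 with the signature removed."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte order mark and consumes it.
        text = raw.decode("utf-16")
    else:
        # No signature: assume legacy 8-bit text (an assumption).
        text = raw.decode(fallback)
    return text.encode("utf-8")
```

Whatever the input encoding, the consumer downstream only ever receives
UTF-8 bytes with no leading signature.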

You suppose the paragraph separator should make a
new line too, since you use it instead of a line separator?


The spec says "Its use allows the creation of plain text files, which can
be laid out on a different line width at the receiving end."

That is, a tool that used U+2029 to separate paragraphs would treat U+2028
as "soft", free to rearrange as needed.

However, I can imagine using these marks together in a "here document"
passage or other multi-line quotation, and it would only make sense to have
the source line numbers reported in errors match the appearance on the
screen in the text editor.

So yes, treat the Paragraph Separator as a line break, too, even though
it is not kosher to separate lines of source code with it.  People will
end up doing it.
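
For illustration, a small Python sketch of line numbering that counts both
Unicode separators alongside the legacy line endings.  The function name
and the exact set of breaks are assumptions for the example.

```python
import re

# CRLF is listed first so that it counts as one break, not two.
LINE_BREAK = re.compile("\r\n|\r|\n|\u2028|\u2029")

def line_of(source, offset):
    """Return the 1-based line number of the character at `offset`."""
    breaks_before = sum(
        1 for m in LINE_BREAK.finditer(source) if m.end() <= offset
    )
    return 1 + breaks_before
```

With this rule, an error reported inside a passage that uses U+2028 or
U+2029 gets the same line number an editor honouring those characters
would show.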

I'm more worried about "smart" end-of-line within file reading.  What if I
say $line=<FH>; and want U+2028 to be the $/, but want legacy CR and LF
to be =not= significant?  And should setting U+2028 implicitly recognise
U+2029 also?
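
One way to make that choice explicit, sketched in Python for illustration:
the caller names the separators, and nothing else is treated as a record
boundary.  The function and its default are hypothetical and do not
describe how Perl's $/ behaves.

```python
import re

def records(text, separators=("\u2028",)):
    """Split `text` on the given separators only.

    Any other control character, including CR and LF, stays in the
    record as ordinary data.
    """
    # Longest separators first, so multi-character ones win over prefixes.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return re.split(pattern, text)
```

Passing ("\u2028", "\u2029") opts in to the paragraph separator as well,
so neither behaviour has to be implicit.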

--John
(pulling arrows out of my back)