perl-unicode

Re: Strange characters when displaying html files saved in UTF-8 format

2001-12-11 14:24:36


Jarkko Hietaniemi wrote:

On Tue, Dec 11, 2001 at 11:34:09AM -0800, Brian Stell wrote:
Jalal,

Kindly reply via the mailing list so others can see the discussion.
That way others can benefit and/or help.

BOM is the Byte Order Mark, used in Unicode to indicate the byte
order (endianness) of the Unicode data stream.

Perhaps the Perl people can describe how to inhibit the BOM?

I don't think it's Perl putting the BOM in there.

I opened up Notepad in Win2000, wrote "foobar", and saved the file
as "ANSI", "UTF-8", "Unicode", and "Unicode big endian".  Then in UNIX
with this

  perl -e 'print "$ARGV[0]: "; print unpack "H*", <>; print "\n"' file.name

I get

foo.ansi: feff0066006f006f006200610072000d000a
foo.utf8: efbbbf666f6f6261720d0a
foo.unic: fffe66006f006f006200610072000d000a
foo.unib: feff0066006f006f006200610072000d000a

(copied by hand, so typos possible) which looks like big-endian
UTF-16, UTF-8, little-endian UTF-16, and (again) big-endian UTF-16
to me.  For example the "Unicode" is first the BOM, then the 0x66
aka "f", then two 0x6f:s, aka "o", then 0x62, aka "b", and so on.
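
To make that reading of the dump concrete, here is a minimal sketch that
splits a UTF-16 byte string into 16-bit code units. The helper name
utf16_units is made up for illustration. Note that the one-liner above
reads a single line, so the posted "Unicode" dump stops at the 0x0a byte
and lacks the final 0x00; the sketch pads it back.

```perl
use strict;
use warnings;

# Hypothetical helper: split raw bytes into 16-bit code units, shown as hex.
# Pass a true second argument to read the units as little-endian.
sub utf16_units {
    my ($bytes, $little_endian) = @_;
    my $template = $little_endian ? "v*" : "n*";
    return map { sprintf "%04x", $_ } unpack $template, $bytes;
}

# The foo.unic dump as posted, with the missing final 00 appended.
my $dump  = "fffe66006f006f006200610072000d000a" . "00";
my @units = utf16_units(pack("H*", $dump), 1);

# First unit is the BOM (feff), then f, o, o, b, a, r, CR, LF.
print "@units\n";
```

Read as little-endian units, the first value is U+FEFF and the rest are
the characters of "foobar" followed by CR and LF.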

No Perl was involved in creating these files, but the BOMs are there
(the UTF-8 0xEF 0xBB 0xBF is the BOM in disguise).
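
One way to check the "in disguise" part, as a rough sketch: encode U+FEFF
yourself and look at the bytes. This assumes a Perl that ships the Encode
module (5.8 or later); bom_hex is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of U+FEFF encoded in the given encoding.
sub bom_hex {
    my ($encoding) = @_;
    return unpack "H*", encode($encoding, "\x{FEFF}");
}

# Print the byte sequence of the BOM in each encoding.
printf "%-8s %s\n", $_, bom_hex($_) for qw(UTF-8 UTF-16BE UTF-16LE);
```

The UTF-8 line should show the same three bytes that start foo.utf8.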

Moreover, if the browser claims to do Unicode, it should recognize the
BOM, too, and ignore it in display (but of course use it to figure out
the right endianness).
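
What "recognize and ignore" could look like, as a minimal sketch and not
a description of what any browser actually does: look at the leading
bytes, pick the encoding, and drop the BOM before decoding. The names
sniff_bom and decode_with_bom are made up for illustration, UTF-32 is
deliberately ignored, and Encode (Perl 5.8 or later) is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the encoding from a leading BOM.
# Returns (encoding name, BOM length in bytes), or (undef, 0) if no BOM.
# UTF-32 BOMs are not handled in this sketch.
sub sniff_bom {
    my ($bytes) = @_;
    return ("UTF-8",    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ("UTF-16BE", 2) if $bytes =~ /\A\xFE\xFF/;
    return ("UTF-16LE", 2) if $bytes =~ /\A\xFF\xFE/;
    return (undef, 0);
}

# Hypothetical helper: strip the BOM and decode the rest accordingly.
# Returns undef when there is no BOM to go by.
sub decode_with_bom {
    my ($bytes) = @_;
    my ($encoding, $bom_length) = sniff_bom($bytes);
    return undef unless defined $encoding;
    my $body = substr $bytes, $bom_length;
    return decode($encoding, $body);
}

# The foo.utf8 dump from above: the BOM is used, then dropped.
my $text = decode_with_bom(pack "H*", "efbbbf666f6f6261720d0a");
print $text;
```

Without a BOM the sketch gives up, which is where a charset tag or an
autodetector would have to take over.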

The BOM is valid as the *first* character. I'm not sure what the
spec says about subsequent chars.

How did the browsers handle the foo.* files?

Of course you may need to manually set the encoding to get
proper results since these do not have a charset tag. I do believe
that the Netscape 6.2 universal autodetector should detect it 
automatically (when turned on).

-- 
Brian Stell