perl-unicode

Re: Strange characters when displaying html files saved in UTF-8 format

2001-12-11 13:35:35
On Tue, Dec 11, 2001 at 11:34:09AM -0800, Brian Stell wrote:
> Jalal,
>
> Kindly reply via the mailing list so others can see the discussion.
> That way others can benefit and/or help.
>
> BOM is the Byte Order Mark used in Unicode to indicate an
> important detail about the Unicode data stream.
>
> Perhaps the Perl people can describe how to inhibit the BOM?

I don't think it's Perl putting the BOM in there.

I opened up Notepad in Win2000, wrote "foobar", and saved the file
as "ANSI", "UTF-8", "Unicode", and "Unicode big endian".  Then in UNIX
with this 

  perl -e 'print "$ARGV[0]: "; print unpack "H*", <>; print "\n"' file.name

I get

foo.ansi: feff0066006f006f006200610072000d000a
foo.utf8: efbbbf666f6f6261720d0a
foo.unic: fffe66006f006f006200610072000d000a
foo.unib: feff0066006f006f006200610072000d000a

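(copied by hand, so typos possible) which looks like big-endian
UTF-16, UTF-8, little-endian UTF-16, and (again) big-endian UTF-16
to me.  (The "ANSI" line is probably one of those typos: a plain ANSI
save should have no BOM and one byte per character.)  For example the
"Unicode" is first the BOM (0xFF 0xFE, so little-endian), then the
0x66 aka "f", then two 0x6f:s, aka "o", then 0x62, aka "b", and so on.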

No Perl was involved in creating these files, but the BOMs are there
(the UTF-8 0xEF 0xBB 0xBF is the BOM in disguise).
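
A minimal sketch of how one might check for those BOMs from Perl, looking
only at the raw leading bytes.  The name sniff_bom is made up for this
example (it is not a core function), and the sketch assumes only the three
BOMs seen in the dumps above matter; it makes no attempt at UTF-32.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from any module): name the encoding suggested
# by a leading BOM, or return undef if the bytes do not start with one.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return;
}

# Report the BOM, if any, of every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "$file: $!";
    binmode $fh;                 # raw bytes, no CRLF translation
    read $fh, my $head, 3;       # the longest BOM checked here is 3 bytes
    close $fh;
    my $enc = sniff_bom(defined $head ? $head : '');
    printf "%s: %s\n", $file, defined $enc ? $enc : 'no BOM';
}
```

Saved as, say, bomcheck.pl (again a made-up name), it would be run as
"perl bomcheck.pl foo.utf8 foo.unic foo.unib".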

Moreover, if the browser claims to do Unicode, it should recognize the
BOM, too, and ignore it in display (but of course use it to figure out
the right endianness).
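
What "use it to figure out the right endianness, then ignore it" could
look like, as a rough sketch in plain Perl: strip the BOM and let it pick
the unpack template.  The name utf16_units is invented for this example,
falling back to big-endian when there is no BOM is an assumption, and
surrogate pairs are not combined.  The little-endian sample has a final 00
byte appended, on the assumption that a complete file would end that way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop a leading UTF-16 BOM and return the 16-bit
# code units as integers, using the BOM to choose the byte order.
sub utf16_units {
    my ($bytes) = @_;
    if    ($bytes =~ s/^\xFE\xFF//) { return unpack 'n*', $bytes }  # big-endian
    elsif ($bytes =~ s/^\xFF\xFE//) { return unpack 'v*', $bytes }  # little-endian
    return unpack 'n*', $bytes;     # assumption: no BOM means big-endian
}

# The two UTF-16 dumps from above, turned back into bytes.
my %sample = (
    'foo.unic' => pack('H*', 'fffe66006f006f006200610072000d000a00'),
    'foo.unib' => pack('H*', 'feff0066006f006f006200610072000d000a'),
);

# Both files should yield the same code units once the BOM is honoured.
for my $name (sort keys %sample) {
    my @units = utf16_units($sample{$name});
    printf "%s: %s\n", $name, join ' ', map { sprintf '%04x', $_ } @units;
}
```

With the BOM honoured, both files give the same sequence of code units,
starting 0066 006f 006f, and the BOM itself never reaches the output.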

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen