perl-unicode

Re: Strange characters when displaying html files saved in UTF-8 format

2001-12-11 14:44:27
Moreover, if the browser claims to do Unicode, it should recognize the
BOM, too, and ignore it in display (but of course use it to figure out
the right endianness).
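
For illustration, here is a minimal sketch of the kind of BOM check
described above, in Python. The names sniff_bom and decode_for_display,
and the Latin-1 fallback, are assumptions made for this example; this is
not how any particular browser does it.

```python
import codecs

# Only the encodings discussed in this thread (UTF-8 and UTF-16) are checked.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_for_display(data: bytes, fallback: str = "latin-1") -> str:
    """Use the BOM to pick the decoder, then drop the BOM from the text."""
    encoding, skip = sniff_bom(data)
    # The fallback encoding is an assumption for this sketch.
    return data[skip:].decode(encoding or fallback)
```

The BOM decides the endianness for UTF-16 and is then skipped, so it
never reaches the displayed text.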

The BOM is valid as the *first* character.

Yes, that is implicit in the definition of BOM.

I'm not sure what the spec says about subsequent chars.

My spec is at home but I think it's illegal in subsequent text.
(Blindly concatenating text for several files could of course
lead into such a situation.)
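
One way to avoid that situation, sketched in Python under the
assumption that every input is UTF-8: strip a leading BOM from each
piece before joining. The helper name concat_utf8 is made up for this
example.

```python
import codecs


def concat_utf8(parts):
    """Join UTF-8 byte strings, dropping a leading BOM from each part."""
    out = []
    for part in parts:
        if part.startswith(codecs.BOM_UTF8):
            part = part[len(codecs.BOM_UTF8):]
        out.append(part)
    return b"".join(out)
```

The result carries no BOM at all; a caller who wants one can prepend
codecs.BOM_UTF8 once to the joined output.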

How did the browsers handle the foo.* files?

Didn't try.  Hang on... okay, here are some Win2000 results,
browsers with pretty much stock default settings:

                  foo.ansi  foo.xxx  foo.unic  foo.unib

IE 6.0.2600.000   -no1-     OK       -no2-     -no2-
Opera 6.0         -no3-     -no3-    -no3-     -no3-
Mozilla 0.9.5     -no3-     -no3-    -no3-     -no3-

-no1-   Reports that the file is of an illegal type and won't display it.
-no2-   Asks for a display application but then displays nothing.
-no3-   Asks for a display application, but I'm supposed to
        be doing something else and can't be bothered now since
        any autodetection obviously isn't working.

"foo.xxx" is foo.utf8 renamed so that file extension gives no extra
hint.  In *all* the IE6 cases I tried choosing IE itself as the
program to open the file, and there at least the UTF-8 case was
autodetected.  Other than that, I would say that Unicode (be it
UTF-8 or UTF-16) autodetection is in a rather sad state for local
files.  (I don't have NS installed right now.)
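
As a rough idea of what BOM-less UTF-8 autodetection can look like,
here is a Python sketch that accepts data as UTF-8 only if it decodes
strictly and contains at least one non-ASCII byte. The function name
looks_like_utf8 is hypothetical, and this is a common heuristic, not a
description of what IE6 or any other browser actually implements.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8 and has non-ASCII bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but gives no evidence either way.
    return any(byte >= 0x80 for byte in data)
```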

Of course you may need to manually set the encoding to get
proper results, since these files do not have a charset tag. I do
believe that the Netscape 6.2 universal autodetector should detect it
automatically (when turned on).

-- 
Brian Stell

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen