Re: Unicode to UTF-8

On Sat, Sep 07, 2002 at 09:05:13PM -0400, Rick Dillon wrote:

Hello.

I am currently populating html pages with content from MS Excel. I am 
using a Java program that literally places the Excel content directly 
into the output code (which is saved as html). It appears that Excel 
is using Unicode characters, which is causing strange glyphs when the 
html is viewed in a browser. Is there a Perl Way to parse the output 
and replace the Unicode characters with ASCII or UTF-8 equivalents?


I don't know the answer to this for sure, but my guess from your
description is that Excel is using a 16 bit representation of Unicode, and
your browser expects an 8 bit encoding of some form.

If so, and Excel is only placing Unicode code points in the range 0-255
in your HTML page, then I think something as simple as s/\0(.)/$1/sg in
any perl 5 would work (the /s lets . match a newline too, and this assumes
the zero byte comes first in each pair, i.e. big-endian). But this is a
cheap hack, and likely to break.
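
To show why the hack is fragile, here is a minimal sketch of a stricter
version of the same idea in C. It assumes the whole input is big-endian
16 bit (not the mixed file you have now), and the function name is made up
for illustration. It refuses to continue as soon as it meets a code point
above 255 rather than silently mangling it.

```c
#include <stddef.h>

/* Hypothetical helper (not from perl or any library): copy the low byte of
 * each big-endian 16 bit unit in in[] to out[], which must have room for
 * in_len / 2 bytes. Returns the number of bytes written, or (size_t)-1 if
 * in_len is odd or any unit is above 0xFF. */
size_t narrow_ucs2be_to_latin1(const unsigned char *in, size_t in_len,
                               unsigned char *out)
{
    size_t i;

    if (in_len % 2 != 0)
        return (size_t)-1;
    for (i = 0; i < in_len; i += 2) {
        if (in[i] != 0)            /* high byte set: code point > 255 */
            return (size_t)-1;
        out[i / 2] = in[i + 1];    /* keep only the low byte */
    }
    return in_len / 2;
}
```

The bytes that come out are ISO-8859-1, not UTF-8, so the page would have
to be served with that charset.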

If your data from Excel really has Unicode code points >255, or may do in
the future, then really there's no reliable way to fix your HTML file once
it has a mix of 1 byte and 2 byte characters in it. Either your Java
program should do the conversion to 8 bit (the encoding to UTF-8 is not hard,
perl's utf8.h says:

/*

 The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF.
The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
it is technically possible to UTF-8-encode a single code point in different
ways, but that is explicitly forbidden, and the shortest possible encoding
should always be used (and that is what Perl does).

 */

and the relevant part of utf8.c for code points between 0x80 and 0x10000:

    if (uv < 0x800) {
        *d++ = (U8)(( uv >>  6)         | 0xc0);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }
    if (uv < 0x10000) {
        *d++ = (U8)(( uv >> 12)         | 0xe0);
        *d++ = (U8)(((uv >>  6) & 0x3f) | 0x80);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }

) or alternatively your Java program should output the HTML file entirely in
16 bit, and then use something else (e.g. perl) to convert that to UTF-8 or
whatever your browser likes. Converting the representation of Unicode from
16 bit UCS-2 to UTF-8 is just byte shuffling, so any perl can do it.
Offhand, I don't know if there are modules on CPAN already to do it, but
I'd be surprised if there were none - try http://search.cpan.org/
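
To make the "byte shuffling" concrete, here is a rough sketch in C that
follows the same branches as the utf8.c excerpt above. It is not perl's
code. The function name is made up, and it assumes the input is big-endian
UCS-2 with no byte order mark handling. Surrogates (U+D800..U+DFFF) are
rejected, as in the table.

```c
#include <stddef.h>

/* Hypothetical helper: convert big-endian UCS-2 in in[] (in_len bytes) to
 * UTF-8 in out[], which must have room for 3 * (in_len / 2) bytes.
 * Returns the number of bytes written, or (size_t)-1 if in_len is odd or
 * a surrogate code unit is found. */
size_t ucs2be_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out)
{
    unsigned char *d = out;
    size_t i;

    if (in_len % 2 != 0)
        return (size_t)-1;
    for (i = 0; i < in_len; i += 2) {
        /* Assemble one 16 bit code point, high byte first. */
        unsigned int uv = ((unsigned int)in[i] << 8) | in[i + 1];

        if (uv >= 0xD800 && uv <= 0xDFFF)
            return (size_t)-1;                 /* ill-formed in UTF-8 */
        if (uv < 0x80) {                       /* 1 byte: plain ASCII */
            *d++ = (unsigned char)uv;
        } else if (uv < 0x800) {               /* 2 bytes */
            *d++ = (unsigned char)((uv >> 6) | 0xC0);
            *d++ = (unsigned char)((uv & 0x3F) | 0x80);
        } else {                               /* 3 bytes */
            *d++ = (unsigned char)((uv >> 12) | 0xE0);
            *d++ = (unsigned char)(((uv >> 6) & 0x3F) | 0x80);
            *d++ = (unsigned char)((uv & 0x3F) | 0x80);
        }
    }
    return (size_t)(d - out);
}
```

For example, the two input bytes 20 AC (U+20AC) come out as the three
bytes E2 82 AC, and 00 E9 (U+00E9) comes out as C3 A9.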

And do I need to upgrade to perl 5.6 to do this?


If you are considering upgrading from something like 5.005, is there any
reason not to consider going straight to 5.8.0? The Unicode support
in 5.8.0 is much better than in 5.6.1, and it also fixes many of the bugs
still present in 5.6.1. (Nothing is perfect - a few new bugs have been
reported in 5.8.0, but generally it does seem stable and of good quality.)

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/