Re: Encode: "JavaUTF" encoding?

On Saturday, Dec 7, 2002, at 05:18 Asia/Tokyo, <jarkko(_dot_)hietaniemi(_at_)nokia(_dot_)com> wrote:

The "UTF" encoding used by Java's DataOutputStream and DataInputStream classes (methods writeUTF() and readUTF()) seems to be a modified form of UTF-8:
(1) The length of the sequence in bytes is written to the beginning of
a DataOutputStream as a big-endian 16-bit number: e.g. 0x00 0x10 for 16 bytes written.
    (Yes, that means that the maximum size is 64 k.)
(2) The \u0000 (Perl \x00) is encoded as 0xc0 0x80, not 0x00 as in true UTF-8. (The claimed goal being that no Java string has embedded 0x00 bytes.)
So, "abc\u0000\u0100" would be encoded as the bytes

0x00 0x07 0x61 0x62 0x63 0xc0 0x80 0xc4 0x80
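
The two rules above can be checked with a minimal Python sketch. The function name java_utf_encode is hypothetical, and the sketch covers only the two differences described here, so it deliberately rejects characters above U+FFFF rather than guess at how they should be handled.

```python
import struct


def java_utf_encode(text: str) -> bytes:
    """Encode text per the two rules described above (hypothetical helper).

    Assumption: only characters up to U+FFFF are handled; anything above
    that is rejected because the description above does not cover it.
    """
    body = bytearray()
    for ch in text:
        if ch == "\u0000":
            # Rule (2): NUL becomes the two-byte form 0xc0 0x80.
            body += b"\xc0\x80"
        elif ord(ch) > 0xFFFF:
            raise ValueError("characters above U+FFFF are outside this sketch")
        else:
            # Everything else is ordinary UTF-8.
            body += ch.encode("utf-8")
    if len(body) > 0xFFFF:
        # Rule (1): the length prefix is only 16 bits wide.
        raise ValueError("encoded body exceeds 65535 bytes")
    # Rule (1): big-endian 16-bit byte count, then the body.
    return struct.pack(">H", len(body)) + bytes(body)


if __name__ == "__main__":
    encoded = java_utf_encode("abc\u0000\u0100")
    print(" ".join(f"0x{b:02x}" for b in encoded))
```

Run on "abc\u0000\u0100", this prints the same nine bytes as listed above.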

Maybe there are enough Java users in the world to warrant a special
Encode encoding for this format...? Note that you probably can't use Perl's UTF-8 routines to read that 0xc0 0x80 without getting a warning, because that's
a forbidden non-shortest encoding form of 0x00.
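
The non-shortest-form problem is not specific to Perl. As an illustration only, Python's strict UTF-8 decoder also refuses 0xc0 0x80, so a reader for this format has to map that pair back to 0x00 before decoding. The helper name java_utf_decode is hypothetical, and the sketch assumes the input holds exactly one length-prefixed string with no characters above U+FFFF.

```python
import struct


def java_utf_decode(data: bytes) -> str:
    """Decode one length-prefixed string (hypothetical helper, a sketch)."""
    # Read the big-endian 16-bit byte count.
    (length,) = struct.unpack(">H", data[:2])
    body = data[2:2 + length]
    if len(body) != length:
        raise ValueError("input is shorter than its length prefix claims")
    # 0xc0 never occurs in well-formed UTF-8, so every 0xc0 0x80 pair
    # in the body can only be the special encoding of NUL.
    return body.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    try:
        b"\xc0\x80".decode("utf-8")
    except UnicodeDecodeError as err:
        print("strict UTF-8 decoder rejects it:", err.reason)

    raw = bytes([0x00, 0x07, 0x61, 0x62, 0x63, 0xc0, 0x80, 0xc4, 0x80])
    print(repr(java_utf_decode(raw)))
```

Doing the substitution on bytes before decoding keeps the strict decoder in charge of rejecting every other malformed sequence.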

I don't think JavaUTF should go into Encode main; Java has already screwed up a lot when it comes to encodings. Confusing Shift_JIS and cp932 was just one good example.


http://archive.develooper.com/perl-unicode(_at_)perl(_dot_)org/msg01030.html
http://developer.java.sun.com/developer/bugParade/bugs/4556882.html
http://www.ingrid.org/java/i18n/encoding/shift_jis.html

At the same time I do not object to the idea that someone releases Encode::JavaUTF to implement (en|de)code("JavaUTF", $scalar).

IMHO, this is yet another example of how screwed up Java is. After all these years we no longer trust libc. Will the same thing happen to Java? Write once and screw up everywhere, so keep writing until everyone is happy? Ugh!


Dan the Encode Maintainer