On Saturday, Dec 7, 2002, at 05:18 Asia/Tokyo,
<jarkko(_dot_)hietaniemi(_at_)nokia(_dot_)com> wrote:
The "UTF" encoding used by Java's DataOutputStream and DataInputStream
classes (methods writeUTF() and readUTF()) seems to be a modified form
of UTF-8:
(1) The length of the sequence in bytes is written ahead of the string
data in the DataOutputStream as a big-endian 16-bit number: e.g.
0x00 0x10 for 16 bytes written.
(Yes, that means that the maximum encoded size is 65535 bytes, just
under 64 KB.)
(2) \u0000 (Perl \x00) is encoded as 0xc0 0x80, not 0x00 as in true
UTF-8.
(The claimed goal being that no encoded Java string has embedded 0x00
bytes.)
So, "abc\u0000\u0100" would be encoded as the bytes
0x00 0x07 0x61 0x62 0x63 0xc0 0x80 0xc4 0x80
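The byte sequence above can be checked against Java itself. The
following is only a minimal sketch: the class name WriteUtfDemo and the
helpers writeUtfBytes() and hex() are made up for illustration, while
DataOutputStream.writeUTF() is the real method under discussion.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; only writeUTF() comes from the JDK API discussed above.
class WriteUtfDemo {
    // Returns the exact bytes DataOutputStream.writeUTF() emits for s.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as "0x00 0x07 ..." so they can be compared with the message.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Length prefix, then "abc", then the two-byte forms of U+0000 and U+0100.
        System.out.println(hex(writeUtfBytes("abc\u0000\u0100")));
    }
}
```

The first two bytes are the length prefix from point (1), and the
0xc0 0x80 pair is the non-standard encoding of \u0000 from point (2).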
Maybe there are enough Java users in the world to warrant a special
Encode encoding for this format...? Note that you probably can't use
Perl's UTF-8 routines to read that 0xc0 0x80 without getting a warning,
because that's a forbidden non-shortest encoding form of 0x00.
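The non-shortest-form problem is not specific to Perl. As an analogous
illustration (this does not show Perl's behaviour), Java's own strict
UTF-8 decoder also refuses 0xc0 0x80. This is a sketch only: the class
name StrictUtf8Check and the helper isStrictUtf8() are hypothetical,
and the decoder is configured to report errors instead of substituting
a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class StrictUtf8Check {
    // True if bytes decode as standard UTF-8 with no malformed sequences.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] javaNul = {(byte) 0xc0, (byte) 0x80}; // writeUTF() form of U+0000
        byte[] realNul = {0x00};                     // standard UTF-8 form of U+0000
        System.out.println(isStrictUtf8(javaNul));
        System.out.println(isStrictUtf8(realNul));
    }
}
```

A reader for the writeUTF() format therefore has to special-case
0xc0 0x80 itself instead of handing the payload to a strict decoder.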
I don't think JavaUTF should go into Encode main; Java has already
screwed up a lot when it comes to encodings. Confusing Shift_JIS and
cp932 was just one good example.
http://archive.develooper.com/perl-unicode(_at_)perl(_dot_)org/msg01030.html
http://developer.java.sun.com/developer/bugParade/bugs/4556882.html
http://www.ingrid.org/java/i18n/encoding/shift_jis.html
At the same time I do not object to the idea that someone releases
Encode::JavaUTF to implement (en|de)code("JavaUTF", $scalar).
IMHO, this is yet another example of how screwed up Java is. After all
these years we no longer trust libc. Will the same thing happen to
Java? Write once and screw up everywhere, so keep writing until
everyone is happy? Ugh!
Dan the Encode Maintainer