On Saturday, Dec 7, 2002, at 05:18 Asia/Tokyo,
The "UTF" encoding used by Java's DataOutputStream and DataInputStream
(methods writeUTF() and readUTF()) seems to be a modified form of UTF-8:
(1) The length of the sequence in bytes is written first as a big-endian
16-bit number: e.g. 0x00 0x10 for 16 bytes written.
(Yes, that means that the maximum size is 64 k.)
(2) The \u0000 (Perl \x00) is encoded as 0xc0 0x80, not 0x00 as in
standard UTF-8. (The claimed goal being that no Java string has
embedded 0x00 bytes.)
So, "abc\u0000\u0100" would be encoded as the bytes
0x00 0x07 0x61 0x62 0x63 0xc0 0x80 0xc4 0x80
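The framing described above can be sketched in a few lines of Python (a minimal illustration, not part of any real library; the function name java_write_utf is made up, and supplementary characters, which Java encodes as CESU-8 surrogate pairs, are ignored here):

```python
import struct

def java_write_utf(s: str) -> bytes:
    # Encode each character as UTF-8, except \u0000, which Java's
    # writeUTF() emits as the overlong pair 0xc0 0x80.
    body = b"".join(
        b"\xc0\x80" if ch == "\x00" else ch.encode("utf-8")
        for ch in s
    )
    # The length prefix is an unsigned big-endian 16-bit count of
    # bytes, hence the 64 KiB ceiling.
    if len(body) > 0xFFFF:
        raise ValueError("encoded length exceeds 64 KiB")
    return struct.pack(">H", len(body)) + body

print(java_write_utf("abc\x00\u0100").hex(" "))
# → 00 07 61 62 63 c0 80 c4 80
```

This reproduces the nine-byte sequence shown above for "abc\u0000\u0100".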
Maybe there are enough Java users in the world to warrant a special
Encode encoding for this format...? Note that you probably can't use
ordinary UTF-8 routines to read that 0xc0 0x80 without getting a
warning, since it is a forbidden non-shortest encoding form of 0x00.
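That last point is easy to confirm with any strict UTF-8 decoder; Python's, for instance, refuses the overlong pair outright (0xc0 can never start a valid shortest-form sequence):

```python
# A strict UTF-8 decoder rejects 0xc0 0x80: it is a non-shortest
# ("overlong") encoding of \u0000, forbidden by the UTF-8 spec.
try:
    b"\xc0\x80".decode("utf-8")
    print("accepted")  # never reached
except UnicodeDecodeError:
    print("rejected")
```

So a hypothetical Encode::JavaUTF could not simply delegate to the stock UTF-8 machinery; it would have to special-case 0xc0 0x80 itself.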
I don't think JavaUTF should go into Encode main; Java has already
screwed up a lot when it comes to encodings. Confusing Shift_JIS and
cp932 was just one good example.
At the same time I do not object to the idea that someone releases
Encode::JavaUTF to implement (en|de)code("JavaUTF", $scalar).
IMHO, this is yet another example of how screwed up Java is. After all
these years we no longer trust libc. Will the same thing happen to
Java? Write once and screw up everywhere, so keep writing until
everyone is happy? Ugh!
Dan the Encode Maintainer