perl-unicode

Re: Encode: "JavaUTF" encoding?

2002-12-07 03:30:04
On Saturday, Dec 7, 2002, at 05:18 Asia/Tokyo, <jarkko(_dot_)hietaniemi(_at_)nokia(_dot_)com> wrote:
The "UTF" encoding used by Java's DataOutputStream and DataInputStream classes (methods writeUTF() and readUTF()) seems to be a modified form of UTF-8:

(1) The length of the sequence in bytes is written to the DataOutputStream first, as a big-endian 16-bit number: e.g. 0x00 0x10 for 16 bytes written.
    (Yes, that means that the maximum size is 64 KB.)
(2) \u0000 (Perl \x00) is encoded as 0xc0 0x80, not as 0x00 as in true UTF-8. (The claimed goal being that no encoded Java string has embedded 0x00 bytes.)

So, "abc\u0000\u0100" would be encoded as the bytes

0x00 0x07 0x61 0x62 0x63 0xc0 0x80 0xc4 0x80
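
A minimal Python sketch of the two rules listed above, only to make the byte layout concrete. The helper name encode_java_utf is hypothetical, and the sketch deliberately refuses characters above U+FFFF, since the description above does not say how those are handled.

```python
import struct


def encode_java_utf(text: str) -> bytes:
    """Hypothetical helper: apply rules (1) and (2) as described above."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # Assumption: out of scope, the description above does not cover them.
        raise ValueError("characters above U+FFFF are outside this sketch")
    # Rule (2): U+0000 becomes the two bytes 0xC0 0x80 instead of 0x00.
    # In UTF-8 a 0x00 byte can only come from U+0000, so a byte-level
    # replace is safe here.
    body = text.encode("utf-8").replace(b"\x00", b"\xc0\x80")
    if len(body) > 0xFFFF:
        raise ValueError("encoded form does not fit the 16-bit length prefix")
    # Rule (1): big-endian unsigned 16-bit byte count, then the bytes.
    return struct.pack(">H", len(body)) + body


print(encode_java_utf("abc\u0000\u0100").hex(" "))
# prints: 00 07 61 62 63 c0 80 c4 80
```

The length prefix counts encoded bytes, not characters, which is why five characters give a prefix of 7 here.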

Maybe there are enough Java users in the world to warrant a special Encode encoding for this format...? Note that you probably can't use Perl's UTF-8 routines to read that 0xc0 0x80 without getting a warning, because it is a forbidden non-shortest encoding form of 0x00.
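
The same point can be sketched in Python, whose strict UTF-8 decoder is used here purely as a stand-in; this says nothing about what Perl's routines actually report. The helper names strict_utf8_rejects and decode_java_utf are hypothetical. The decoder strips the length prefix and maps 0xc0 0x80 back to 0x00 before handing the rest to an ordinary UTF-8 decoder, and it assumes input that follows the two rules above.

```python
import struct


def strict_utf8_rejects(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder refuses the given bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def decode_java_utf(data: bytes) -> str:
    """Hypothetical helper: undo the length prefix and the 0xC0 0x80 rule."""
    if len(data) < 2:
        raise ValueError("missing 16-bit length prefix")
    (length,) = struct.unpack(">H", data[:2])
    body = data[2:2 + length]
    if len(body) != length:
        raise ValueError("input is shorter than its length prefix claims")
    # 0xC0 is never valid in strict UTF-8, so in this format the pair
    # 0xC0 0x80 can only stand for U+0000.
    return body.replace(b"\xc0\x80", b"\x00").decode("utf-8")


print(strict_utf8_rejects(b"\xc0\x80"))
# prints: True
print(repr(decode_java_utf(bytes.fromhex("0007616263c080c480"))))
# prints: 'abc\x00Ā'
```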

I don't think JavaUTF should go into the main Encode distribution; Java has already screwed up a lot when it comes to encodings. Confusing Shift_JIS with cp932 is just one good example.

http://archive.develooper.com/perl-unicode(_at_)perl(_dot_)org/msg01030.html
http://developer.java.sun.com/developer/bugParade/bugs/4556882.html
http://www.ingrid.org/java/i18n/encoding/shift_jis.html

At the same time I do not object to the idea that someone releases Encode::JavaUTF to implement (en|de)code("JavaUTF", $scalar).

IMHO, this is yet another example of how screwed up Java is. After all these years we no longer trust libc. Will the same thing happen to Java? Write once and screw up everywhere, so keep writing until everyone is happy? Ugh!

Dan the Encode Maintainer

  • Re: Encode: "JavaUTF" encoding?, Dan Kogai <=