rra(_at_)stanford(_dot_)edu (Russ Allbery) wrote on 21.02.03 in
<ylvfzd34wp(_dot_)fsf(_at_)windlord(_dot_)stanford(_dot_)edu>:
Andrew Gierth <andrew(_at_)erlenstar(_dot_)demon(_dot_)co(_dot_)uk> writes:
there is a specific reason to avoid punycode in my proposal (proposal
A), which is that it's not easily generated in a simple script; the
easiest way to upgrade several existing news software packages to handle
moderated non-ASCII groups is to wrap the mail-to-moderator program in a
script to do the necessary header changes. (This works for INN, Diablo,
my server, and at least some versions of DNews on some platforms -
basically any server which uses the common approach of doing
mail-to-moderator as an external program call.)
Good point. That should have been mentioned in my summary.
I just had a look at the punycode draft. What a horrible encoding.
Some people were saying that we should use punycode to avoid using UTF-7,
which nobody wants to see more of.
I certainly dislike UTF-7.
But if I had to choose between UTF-7 and punycode, I would not hesitate a
second to use UTF-7. It is *much* less problematic.
As to speed, UTF-7 is essentially a slight variation of base64 for non-
ASCII characters. Punycode - as has been mentioned - does a lot of
shifting around during the encoding. I think there can be no argument that
UTF-7 is significantly faster than punycode.
UTF-7 has, however, one important problem: it uses + as the escape
character. My active file has over 200 groups with a + in their name.
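To illustrate the clash, here is a small sketch using Python's built-in "utf-7" codec (the choice of Python and of comp.lang.c++ as the sample name are assumptions for illustration, not part of the original post). A name that is pure ASCII still changes on the wire, because every literal + has to be escaped as +-:

```python
# Sample group name containing the UTF-7 escape character.
name = "comp.lang.c++"

# Standard UTF-7 must escape each literal "+" as "+-".
wire = name.encode("utf-7")
print(wire)  # b'comp.lang.c+-+-'

# Decoding restores the original name.
restored = wire.decode("utf-7")
print(restored == name)  # True
```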
I'm not particularly happy with the idea, but we _could_ define yet
another variant of UTF-7 which uses a different escape character - say,
the = that the UTF-7 RFC said would have been nicer, except that RFC 2047
wanted it for itself (for Q encoding, and why would you use UTF-7 with
B encoding?).
Frankly, I think punycode stinks. Its only excuse is that the legal DNS
hostname character set is *really* small (with - as the *only* non-
alphanumeric); I don't see that we have to be quite as restricted with
newsgroup names.
Ok, so let's see what else could be bad about UTF-7. Well, it allows
different encodings for the same string (you can but need not encode
specific characters). We'd obviously have to outlaw that. Also, it encodes
UTF-16.
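The "different encodings for the same string" problem can be seen with a short sketch, again assuming Python's built-in "utf-7" codec as a stand-in for any UTF-7 decoder. The letter a may be written directly or as a shifted base64 run over its UTF-16 value, and both byte strings decode to the same text:

```python
# "a" written directly.
plain = b"a".decode("utf-7")

# "a" (U+0061) written as a shifted base64 run over its UTF-16BE bytes 00 61.
shifted = b"+AGE-".decode("utf-7")

# Two different byte strings, one and the same decoded name.
print(plain == shifted)  # True
```

Two servers that pick different forms would end up with two distinct encoded names for one group, which is why a newsgroup encoding would have to permit exactly one form.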
Hmm.
Ok, so here's a strawman encoding proposal based on the ideas of UTF-7 but
not on its actual definition.
Looking at my active file again, all groups in there use the set
a-zA-Z0-9.+_- - that's 66 characters. For another reference, base64 uses
a-zA-Z0-9+/ with = as padding.
/ is bad for people who convert newsgroup names to filenames and don't
expect it.
So:
Anything in the set a-zA-Z0-9.+_- shall be unencoded.
Anything outside that set shall be encoded as follows:
* First, encode the character as UTF-32. Every such character is thus a
Unicode code point in the range 0-0x10ffff (actually some of those won't be
used, such as the unencoded chars above and values not legal in UTF-32).
* Then, encode each character in base64 as if it were 24 bits (where at
least the three most significant bits are zero), in network bit order
(most significant bit first). Thus, each character gives four base64
characters.
* Then replace any / character with a _ character.
* Finally, whenever switching from unencoded characters to encoded
characters or back, insert a = character.
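The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name encode_group_name and the constant UNENCODED are made up here, the input is taken to be an already decoded Unicode string, the start of the string is treated as unencoded, and no check is made that a code point is actually legal.

```python
import base64

# Characters the proposal leaves unencoded: a-zA-Z0-9 plus . + _ -
UNENCODED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.+_-"
)


def encode_group_name(name: str) -> str:
    """Encode a newsgroup name following the strawman's four steps."""
    out = []
    in_encoded = False  # assumption: the string starts in unencoded mode
    for ch in name:
        needs_encoding = ch not in UNENCODED
        if needs_encoding != in_encoded:
            out.append("=")  # mark every switch between the two modes
            in_encoded = needs_encoding
        if needs_encoding:
            # Code point as 24 bits, most significant byte first.
            raw = ord(ch).to_bytes(3, "big")
            # Three bytes give exactly four base64 characters, no padding.
            quad = base64.b64encode(raw).decode("ascii")
            out.append(quad.replace("/", "_"))
        else:
            out.append(ch)
    return "".join(out)


print(encode_group_name("dk.test.utf8-\u00e6\u00f8\u00e5"))
print(encode_group_name("alt.fan.claus-f\u00e4rber"))
```

Because three bytes always map to exactly four base64 characters, the = padding character never appears in the output and stays free to act as the switch marker.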
Assuming I made no mistake, that makes our beloved test group name into
dk.test.utf8-=AADmAAD4AADl
(only one = because it doesn't switch back).
As an alternate example, a hypothetical alt.fan.claus-färber would become
alt.fan.claus-f=AADk=rber
Note that this works just fine with *-only wildmat matching even inside a
component! And it's just as readable as a punycode-encoded version -
possibly more so because you can see where the encoded chars belong.
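A quick way to check the wildmat claim is sketched below. It uses Python's fnmatch.fnmatchcase as a stand-in for a news server's wildmat routine, which is an assumption: the two differ in details, but for patterns made only of literal characters and * they should agree. The patterns themselves are made up for illustration.

```python
from fnmatch import fnmatchcase

# Encoded form of the hypothetical group from the example above.
encoded = "alt.fan.claus-f=AADk=rber"

# Patterns using only "*", including one that cuts inside the last component.
patterns = ["alt.fan.*", "alt.fan.claus-*", "alt.fan.claus-f*rber", "*.claus-f*"]

matches = [fnmatchcase(encoded, pattern) for pattern in patterns]
print(matches)  # [True, True, True, True]
```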
Sorry, no non-Latin1-covered example.
Yes, it does look ugly. So does every other encoding.
Yes, it could be compressed. I don't believe it's worth it. I prefer
simple.
Oh, yes. We need one additional rule: treat the beginning of the string as
if it were preceded by an unencoded character; that is, start with a = if
the first character of the top-level hierarchy name isn't unencoded. (It's
bound to happen.)
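With that rule in place, decoding becomes a matter of splitting on = and alternating modes. The following Python sketch shows one way to do it; the function name decode_group_name is hypothetical, and rejecting runs whose length is not a multiple of four is an assumption about how strict a decoder should be.

```python
import base64


def decode_group_name(encoded: str) -> str:
    """Reverse the strawman encoding; the string starts in unencoded mode."""
    out = []
    # Each "=" toggles the mode, so even-numbered pieces are literal text
    # and odd-numbered pieces are runs of four-character base64 groups.
    for index, piece in enumerate(encoded.split("=")):
        if index % 2 == 0:
            out.append(piece)
            continue
        if len(piece) % 4 != 0:
            raise ValueError("encoded run is not a multiple of four characters")
        for i in range(0, len(piece), 4):
            # Undo the "/" -> "_" substitution before base64 decoding.
            quad = piece[i:i + 4].replace("_", "/")
            raw = base64.b64decode(quad, validate=True)
            # Three bytes, most significant first, give the code point.
            out.append(chr(int.from_bytes(raw, "big")))
    return "".join(out)


print(decode_group_name("dk.test.utf8-=AADmAAD4AADl") == "dk.test.utf8-\u00e6\u00f8\u00e5")
print(decode_group_name("=AADk=bc") == "\u00e4bc")
```

A name that begins with an encoded character splits into an empty first piece, so the leading = needs no special case in the decoder.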
MfG Kai