Re: Unicode newsgroup name options


rra(_at_)stanford(_dot_)edu (Russ Allbery)  wrote on 21.02.03 in 
<ylvfzd34wp(_dot_)fsf(_at_)windlord(_dot_)stanford(_dot_)edu>:

Andrew Gierth <andrew(_at_)erlenstar(_dot_)demon(_dot_)co(_dot_)uk> writes:

there is a specific reason to avoid punycode in my proposal (proposal
A), which is that it's not easily generated in a simple script; the
easiest way to upgrade several existing news software packages to handle
moderated non-ASCII groups is to wrap the mail-to-moderator program in a
script to do the necessary header changes. (This works for INN, Diablo,
my server, and at least some versions of DNews on some platforms -
basically any server which uses the common approach of doing
mail-to-moderator as an external program call.)


Good point.  That should have been mentioned in my summary.


I just had a look at the punycode draft. What a horrible encoding.

Some people were saying that we should use punycode to avoid using UTF-7,  
which nobody wants to see more of.

I certainly dislike UTF-7.

But if I had to choose between UTF-7 and punycode, I would not hesitate a  
second to use UTF-7. It is *much* less problematic.

As to speed, UTF-7 is essentially a slight variation of base64 for non- 
ASCII characters. Punycode - as has been mentioned - does a lot of  
shifting around during the encoding. I think there can be no argument that  
UTF-7 is significantly faster than punycode.

UTF-7 has, however, one important problem: it uses + as the escape  
character. My active file has over 200 groups with a + in their name.
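
To illustrate the + problem, here is a small sketch using Python's built-in "utf-7" codec (standard UTF-7, not any newsgroup-specific variant). The group name comp.lang.c++ is only an example of a name containing +:

```python
# Standard UTF-7 must escape a literal "+" as "+-", so an existing
# ASCII-only group name changes as soon as UTF-7 is applied to it.
name = "comp.lang.c++"
encoded = name.encode("utf-7")
print(encoded)                   # b'comp.lang.c+-+-'
print(encoded.decode("utf-7"))   # comp.lang.c++
```

So every existing group with a + in its name would no longer be identical to its own encoded form.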

I'm not particularly happy with the idea, but we _could_ define yet
another variant of UTF-7 which uses a different escape character - say,
the = that the UTF-7 RFC said would have been nicer, had RFC 2047 not
already claimed it for itself (for Q encoding, and why would you use
UTF-7 with B encoding?).

Frankly, I think punycode stinks. Its only excuse is that the legal DNS  
hostname character set is *really* small (with - as the *only* non- 
alphanumeric); I don't see that we have to be quite as restricted with  
newsgroup names.

Ok, so let's see what else could be bad about UTF-7. Well, it allows  
different encodings for the same string (you can but need not encode  
specific characters). We'd obviously have to outlaw that. Also, it encodes  
UTF-16.
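
Both points can be seen with Python's built-in "utf-7" codec, used here only as a convenient stand-in for a standard UTF-7 implementation:

```python
import base64

# Two different UTF-7 byte strings that decode to the same text:
# "a" written directly, and "a" written inside a base64 shift sequence.
direct = b"a"
shifted = b"+AGE-"
assert direct.decode("utf-7") == shifted.decode("utf-7") == "a"

# The base64 payload of a shift sequence is UTF-16 (big-endian) code units.
payload = "AOYA+ADl"             # from the UTF-7 form "+AOYA+ADl-"
units = base64.b64decode(payload)
print(units)                     # b'\x00\xe6\x00\xf8\x00\xe5'
print(units.decode("utf-16-be"))
```

A newsgroup-name encoding would need exactly one legal form per name, since names are compared as byte strings.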

Hmm.

Ok, so here's a strawman encoding proposal based on the ideas of UTF-7 but  
not on its actual definition.

Looking at my active file again, all groups in there use the set
a-zA-Z0-9.+_- - that's 66 characters. For another reference, base64 uses
a-zA-Z0-9+/ plus = for padding.

/ is bad for people who convert newsgroup names to filenames and don't  
expect it.

So:

Anything in the set a-zA-Z0-9.+_- shall be unencoded.
Anything outside that set shall be encoded as follows:

* First, encode the character as UTF-32. Every such character is thus a
Unicode code point in the range 0-0x10ffff (actually some of those won't be
used, such as the unencoded chars above and chars not legal in UTF-32).

* Then, encode each character in base64 as if it were 24 bits (where at
least the three most significant bits are zero), in network bit order
(most significant bit first). Thus, each character gives four base64
characters.

* Then replace any / character with a _ character.

* Finally, whenever switching from unencoded characters to encoded
characters or back, insert a = character.

Assuming I made no mistake, that makes our beloved test group name into

        dk.test.utf8-=AADmAAD4AADl

(only one = because it doesn't switch back).

As an alternate example, hypothetical alt.fan.claus-färber would become

        alt.fan.claus-f=AADk=rber
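
For anyone who wants to check the two examples, here is a minimal Python sketch of the strawman encoding as described in the steps above. The function name encode_group is made up for illustration, and the sketch assumes a name starts out in the unencoded state:

```python
import base64

# Characters that stay unencoded: a-zA-Z0-9.+_-
UNENCODED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.+_-"
)

def encode_group(name: str) -> str:
    """Encode a newsgroup name with the strawman scheme (illustrative only)."""
    out = []
    in_encoded = False  # assumption: the name starts in the unencoded state
    for ch in name:
        needs_encoding = ch not in UNENCODED
        if needs_encoding != in_encoded:
            out.append("=")  # mark every switch between the two states
            in_encoded = needs_encoding
        if needs_encoding:
            # 24-bit big-endian code point gives exactly four base64 chars
            quad = base64.b64encode(ord(ch).to_bytes(3, "big")).decode("ascii")
            out.append(quad.replace("/", "_"))
        else:
            out.append(ch)
    return "".join(out)

print(encode_group("dk.test.utf8-\u00e6\u00f8\u00e5"))  # dk.test.utf8-=AADmAAD4AADl
print(encode_group("alt.fan.claus-f\u00e4rber"))        # alt.fan.claus-f=AADk=rber
```

Because each encoded character is a fixed four characters, there is no padding and no partial-group state to track.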

Note that this works just fine with *-only wildmat matching even inside a  
component! And it's just as readable as a punycode-encoded version -  
possibly more so because you can see where the encoded chars belong.

Sorry, no non-Latin1-covered example.

Yes, it does look ugly. So does every other encoding.

Yes, it could be compressed. I don't believe it's worth it. I prefer  
simple.

Oh, yes. We need one additional rule: handle the beginning of the string as
if it were preceded by an unencoded character, that is, start with a = if the
first character of the top-level hierarchy isn't unencoded. (It's bound to
happen.)
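
With that rule, decoding gets simple: split on = and the pieces alternate between unencoded and encoded runs, always starting with a (possibly empty) unencoded one. A rough Python sketch, with decode_group as a made-up name and without any check that the input is in its one canonical form:

```python
import base64

def decode_group(encoded: str) -> str:
    """Decode a strawman-encoded newsgroup name (illustrative only)."""
    out = []
    for index, run in enumerate(encoded.split("=")):
        if index % 2 == 0:
            out.append(run)  # even pieces are unencoded text
            continue
        if not run or len(run) % 4 != 0:
            raise ValueError("encoded run must be a non-empty multiple of 4")
        for i in range(0, len(run), 4):
            # undo the / -> _ replacement, then read back the 24-bit value
            quad = run[i:i + 4].replace("_", "/")
            raw = base64.b64decode(quad, validate=True)
            out.append(chr(int.from_bytes(raw, "big")))
    return "".join(out)

print(decode_group("dk.test.utf8-=AADmAAD4AADl"))
print(decode_group("alt.fan.claus-f=AADk=rber"))
```

A real implementation would also have to reject non-canonical input, such as an encoded run that contains a character from the unencoded set.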

MfG Kai