Re: Non-latin email addresses dangerous?


On 1-feb-04, at 19:52, Martin Duerst wrote:

This is very dangerous: anyone having a non-latin email address can't expect to receive mail from people who don't master this character set. So someone who speaks a non-latin language as well as a latin one would always have to provide two representations of his/her email address in order to be reachable in both languages.

If you have ever looked at business cards of people in such regions,
that's exactly what they do. Not because they want it that way, but
because they
know that their customers appreciate it. The main piece of 'innovation'
that you need in order for this to work with email are some clever ways
for sending MUAs to figure out which sender address to use, so that they
don't use a non-latin one when they send you email. The rest of the
problem is mostly pure mechanics, although before we use it, it has
to be written up.

It's not that easy. What if A and B communicate in Japanese, but they also both speak English? Now B wants to give C A's email address. If A's software had determined that B only needs to know the Japanese-character email address, then C is going to have trouble, especially when there is another step that includes degradation of the encoding.

If we're going to allow non-latin email addresses, I think the only reasonable way to do it is by mandating that the latin equivalent must always accompany the non-latin address. Then the latin equivalent is unavailable only in cases where the address was typed from paper or some such.

I think we should also explore the possibilities of having numeric-only email addresses, as this nicely solves the whole mess, be it in a way that isn't all that pretty to look at.

A simple way to create numeric addresses would be to concatenate the numeric values of all characters in the script used in some way, but this has the problem that the addresses get very long very fast. (A 15 character email address would be something like 50 digits.) A better way would be to give each tld, domain and user a sequence number, although this has the disadvantage that mapping back and forth is very hard. (3 digits for tld, 6 - 9 for domain and ~ 3 - 6 for user would be 15 digits or less if domains with many users get the short domain numbers.) We probably want to include one or two checksum digits.
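To make the first option concrete, here is a rough sketch of the concatenation idea with one check digit appended. Everything specific in it is an assumption for illustration only: the use of Unicode code points as the "numeric values", the fixed width of 5 decimal digits per character, the choice of the Luhn algorithm for the check digit, and the function names. With this width a 15 character address comes out at 76 digits, which only underlines how fast such addresses grow; a per-script numbering with fewer digits per character would be shorter.

```python
# Sketch only: WIDTH, the use of code points and the Luhn check digit
# are illustrative assumptions, not part of any proposal or standard.
WIDTH = 5  # decimal digits per character; covers code points up to 99999


def luhn_digit(digits: str) -> str:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the right, doubling every second digit, starting
    # with the rightmost digit of the payload.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def to_numeric(address: str) -> str:
    """Concatenate zero-padded code points and append a check digit."""
    for c in address:
        if ord(c) >= 10 ** WIDTH:
            raise ValueError(f"character {c!r} does not fit in {WIDTH} digits")
    body = "".join(f"{ord(c):0{WIDTH}d}" for c in address)
    return body + luhn_digit(body)


def from_numeric(number: str) -> str:
    """Verify the check digit and map the digits back to characters."""
    body, check = number[:-1], number[-1:]
    if not body.isdigit() or len(body) % WIDTH or luhn_digit(body) != check:
        raise ValueError("malformed or mistyped numeric address")
    return "".join(
        chr(int(body[i:i + WIDTH])) for i in range(0, len(body), WIDTH)
    )
```

A single check digit of this kind catches any one mistyped digit, which is the main thing you'd want when someone copies such a number from paper. The sequence number scheme would need a registry lookup instead of arithmetic, so it isn't sketched here.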

Another way to handle this would be semantics-based character set translation, where a word in a language with character set A is replaced by a word with approximately the same meaning in a language with character set B. Note that actually retaining the meaning is only an additional benefit when this can be achieved; the real target is to create a new email address in the new character set that is relatively easy to enter. In order for this to work everyone would have to use large translation tables, but compared to what's needed to implement Unicode that doesn't seem like a huge problem.