mail-ng

Re: Non-latin email addresses dangerous?

2004-02-02 06:49:06

On 1-feb-04, at 19:52, Martin Duerst wrote:

This is very dangerous: anyone having a non-latin email address can't expect to receive mail from people who don't master this character set. So someone who speaks a non-latin language as well as a latin one would always have to provide two representations of his/her email address in order to be reachable in both languages.

If you have ever looked at business cards of people in such regions, that's exactly what they do. Not because they want it that way, but because they know that their customers appreciate it. The main piece of 'innovation' that you need in order for this to work with email is some clever way for sending MUAs to figure out which sender address to use, so that they don't use a non-latin one when they send you email. The rest of the problem is mostly pure mechanics, although before we use it, it has to be written up.

It's not that easy. What if A and B communicate in Japanese, but they also both speak English? Now B wants to give C A's email address. If A's software had determined that B only needs to know the Japanese-character email address, then C is going to have trouble, especially when there is another step that includes degradation of the encoding.

If we're going to allow non-latin email addresses, I think the only reasonable way to do it is by mandating that the latin equivalent must always accompany the non-latin address. Then the latin equivalent is unavailable only in cases where the address was typed from paper or some such.

I think we should also explore the possibility of having numeric-only email addresses, as this nicely solves the whole mess, albeit in a way that isn't all that pretty to look at.

A simple way to create numeric addresses would be to concatenate the numeric values of all characters in the script used in some way, but this has the problem that the addresses get very long very fast. (A 15-character email address would be something like 50 digits.) A better way would be to give each TLD, domain and user a sequence number, although this has the disadvantage that mapping back and forth is very hard. (3 digits for the TLD, 6 - 9 for the domain and ~ 3 - 6 for the user would be 15 digits or less if domains with many users get the short domain numbers.) We probably want to include one or two checksum digits.
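To make the two numbering schemes a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the function names, the fixed field widths (3 digits for the TLD, 6 for the domain, 5 for the user), the use of Unicode code points as the "numeric values", and the choice of the Luhn algorithm for a single check digit. None of this comes from any proposal.

```python
def concat_numeric(address: str, width: int = 3) -> str:
    """Naive scheme: write every character as a fixed-width decimal number.

    Assumption: the "numeric value" is the Unicode code point. A width of 3
    only covers ASCII; non-latin scripts need 5 or more digits per character.
    """
    parts = []
    for ch in address:
        code = ord(ch)
        if code >= 10 ** width:
            raise ValueError(f"{ch!r} does not fit in {width} digits")
        parts.append(f"{code:0{width}d}")
    return "".join(parts)


def luhn_check_digit(payload: str) -> int:
    """Compute a Luhn check digit for a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # double every second digit, starting at the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def sequence_address(tld: int, domain: int, user: int) -> str:
    """Sequence-number scheme: TLD + domain + user numbers + one check digit.

    The widths 3/6/5 are hypothetical; they give 15 digits in total.
    """
    if not (0 <= tld < 10**3 and 0 <= domain < 10**6 and 0 <= user < 10**5):
        raise ValueError("sequence number out of range for its field")
    payload = f"{tld:03d}{domain:06d}{user:05d}"
    return payload + str(luhn_check_digit(payload))


def is_valid(address: str) -> bool:
    """Check the length and the trailing check digit of a numeric address."""
    if len(address) != 15 or not address.isdigit():
        return False
    return luhn_check_digit(address[:-1]) == int(address[-1])
```

With a width of 3, a 15-character ASCII address already comes out at 45 digits, and wider scripts need more, which is the length problem described above. The sequence-number variant stays at 15 digits, but turning it back into a name requires a registry lookup that this sketch does not attempt.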

Another way to handle this would be semantics-based character set translation, where a word in a language with character set A is replaced by a word with approximately the same meaning in a language with character set B. Note that actually retaining the meaning is only an additional benefit when this can be achieved; the real target is to create a new email address in the new character set that is relatively easy to enter. In order for this to work everyone would have to use large translation tables, but compared to what's needed to implement Unicode that doesn't seem like a huge problem.
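As a very small illustration of the translation-table idea, the Python sketch below maps each dot-separated label of an address through a lookup table. The table entries, the function name and the rule that plain ASCII labels pass through unchanged are all made up for this example; a real table would be large and would have to be identical for everyone.

```python
# Hypothetical translation table: word in script A -> easy-to-type word in script B.
TABLE = {
    "山": "mountain",
    "川": "river",
    "東京": "tokyo",
}


def translate_label(label: str, table: dict) -> str:
    """Translate one dot-separated label; ASCII labels are kept as they are."""
    if label.isascii():
        return label
    if label not in table:
        raise ValueError(f"no translation for {label!r}")
    return table[label]


def translate_address(address: str, table: dict) -> str:
    """Translate the local part and the domain of an address label by label."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("address has no '@'")
    new_local = ".".join(translate_label(p, table) for p in local.split("."))
    new_domain = ".".join(translate_label(p, table) for p in domain.split("."))
    return f"{new_local}@{new_domain}"
```

Under these assumptions, translate_address("山.川@東京.example", TABLE) gives "mountain.river@tokyo.example". The mapping only works in reverse if the table is one-to-one, which is part of why the tables would have to be large and shared.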