Re: internationalization of mail


Tex Texin wrote:

Hi Kat!
It's been a while...

Hi, Tex,

Good to see you here.

Thanks for the mail, it was helpful.

Correct me if I am wrong, but in the Mozilla environment the mail is treated as
a bucket of bytes that just gets passed around.
So for example if one subject hdr is in ISO-2022-JP and another is in
Shift_JIS, you can change the display to show and interpret one or the other
correctly, but not both at the same time. For purposes of search and filtering,
I was thinking of converting to unicode, and then all subjects would display
correctly, as well as be searchable.

Not really. The received msgs are stored as is, but as each msg is sent to rendering, it is converted to UTF-8 before it gets there. So in Mozilla mail, one can display headers of all supported encodings at once in the header viewing pane. I had a demo I used at one of the Unicode conferences which showed mail headers in 15 or so languages. But the requirement for this is that each header must be marked with proper MIME encoding info. This means Mozilla can display Shift_JIS and ISO-2022-JP headers at the same time as long as they use MIME encoding. If a header of a msg does not use MIME encoding, then Mozilla will use the user's default msg display encoding.
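
To make the header handling concrete, here is a minimal Python sketch of the same idea: decode MIME encoded words per header, and fall back to a user default when a header carries no label. This is not Mozilla's code; header_to_unicode and encoded_word are hypothetical names invented for this illustration, and the sample subject text is made up.

```python
import base64
from email.header import decode_header


def header_to_unicode(raw: bytes, default_charset: str) -> str:
    """Decode one raw header value to a Unicode string."""
    # Map bytes 1:1 onto code points so decode_header can scan the text
    # for MIME encoded words without losing any 8-bit data.
    text = raw.decode("latin-1")
    parts = []
    for value, charset in decode_header(text):
        if isinstance(value, str):
            # No encoded words were found: the whole header is unlabeled.
            value = value.encode("latin-1")
        try:
            parts.append(value.decode(charset or default_charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to the user's default.
            parts.append(value.decode(default_charset, errors="replace"))
    return "".join(parts)


def encoded_word(text: str, charset: str) -> bytes:
    """Hypothetical helper: build a base64 MIME encoded word for the demo."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?=".encode("ascii")


subjects = [
    encoded_word("日本語のメール", "shift_jis"),    # labeled Shift_JIS
    encoded_word("日本語のメール", "iso-2022-jp"),  # labeled ISO-2022-JP
    "日本語のメール".encode("shift_jis"),           # unlabeled raw bytes
]
# Assumption: the user's default display encoding is Shift_JIS.
decoded = [header_to_unicode(s, default_charset="shift_jis") for s in subjects]
```

The two labeled subjects decode correctly whatever the default is; the unlabeled one only comes out right because the assumed default happens to match its real encoding.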

Sorting and filtering for Mozilla are all done in Unicode as well. For more info, you can look at our Unicode presentation (by Naoki Hotta and me) here:


http://wp.netscape.com/eng/intl/docs/iuc17/mail/iuc17mail.html

and in particular at this diagram showing the data flow for MIME header and body with associated MIME encoding.


http://wp.netscape.com/eng/intl/docs/iuc17/mail/slide05.html

If I may add my 2 cents, if one is designing a webmail system, why not use Unicode for all internal processing? (while keeping the original data as is in case you need to refer to them). Most search data are handled in Unicode anyway now and this will increase compatibility with the search functions you might want to offer in a webmail service. I would even go as far as to say that we may want to unify send encoding of webmail systems to just UTF-8 unless only ASCII data is being sent out. I think Gmail is currently doing just that. In the absence of country-specific standards, this is one way to improve the situation. Most popular mail programs can handle UTF-8. If some mailers cannot, then I think it is time to update the code! I think this standardization can happen first in webmail -- where simplification is acceptable -- ahead of standalone mailers, which may face harder hurdles in unifying to UTF-8.
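
A rough Python sketch of that webmail design, under stated assumptions: keep the original bytes untouched, do search on the Unicode form, and send UTF-8 unless the text is pure ASCII. StoredMessage, matches, choose_send_charset and build_outgoing are hypothetical names for this illustration, not part of any existing webmail system.

```python
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass
class StoredMessage:
    raw: bytes     # original message bytes, kept as received
    subject: str   # Unicode form used for all internal processing
    body: str


def matches(msg: StoredMessage, query: str) -> bool:
    """Case-insensitive search done entirely on the Unicode form."""
    haystack = (msg.subject + "\n" + msg.body).casefold()
    return query.casefold() in haystack


def choose_send_charset(text: str) -> str:
    """Send UTF-8 unless only ASCII data is being sent out."""
    return "us-ascii" if text.isascii() else "utf-8"


def build_outgoing(subject: str, body: str) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(body, charset=choose_send_charset(body))
    return msg


# Made-up sample data: the raw bytes stay in their original encoding.
stored = StoredMessage(
    raw="件名: テスト".encode("iso-2022-jp"),
    subject="テスト",
    body="Unicode で検索できます",
)
```

Because the raw bytes are kept, a wrongly guessed encoding can be corrected later by decoding the original again with a different charset.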


- Kat

But it may not be worthwhile if many of the mails are incorrectly identified so
transcoding to utf-8 generates muck, or errors out.
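
Both failure modes are easy to reproduce. The Python sketch below uses a hypothetical helper, transcode_to_utf8, that returns None when the labeled charset cannot decode the bytes, so the caller can fall back to the stored original; the sample text is made up.

```python
from typing import Optional


def transcode_to_utf8(raw: bytes, labeled_charset: str) -> Optional[bytes]:
    """Transcode to UTF-8, or return None if the label cannot be trusted."""
    try:
        return raw.decode(labeled_charset, errors="strict").encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # Either the bytes are invalid for the labeled charset,
        # or the label names a charset we do not know.
        return None


original = "日本語"

# Case 1 (errors out): Shift_JIS bytes mislabeled as UTF-8.
errored = transcode_to_utf8(original.encode("shift_jis"), "utf-8")

# Case 2 (muck): UTF-8 bytes mislabeled as ISO-8859-1. Every byte is valid
# ISO-8859-1, so no error is raised, but the result is mojibake.
muck = transcode_to_utf8(original.encode("utf-8"), "iso-8859-1")

# Correct label: the text survives the round trip.
good = transcode_to_utf8(original.encode("shift_jis"), "shift_jis")
```

The second case is the harder one, since nothing signals the problem; only the retained original bytes allow a later correction.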

tex

Katsuhiko Momoi wrote:

Tex Texin wrote:

Hi,

(snipped)

2) Although I know the character encodings used in many regions, I note that
mail clients sometimes prefer a different encoding than what other applications
traditionally use. For example, ISO-2022-JP is used more often for mail in
Japan than it is used percentage-wise for other kinds of Japanese software (I
believe).

What is the preferred encoding(s) for mail in each country market? (Although we
might prefer to recommend utf-8, I am looking to understand what the market
practice actually is.)

Tex,

It would have been great if we had such a list when we worked on the
Mozilla mail. We tried to restrict the number of outgoing msg encodings
to the ones we could recommend (ISO types mostly unless we heard loud user
complaints). The list was based on known RFCs and our knowledge of the
market since 1995. The results of this rather tortured attempt can be
found in the Mozilla menu:

Edit > Preferences > Mail & Newsgroups > Composition > Character Encoding

For reading msgs, we did not impose such a restriction and so the user
can choose to correct a msg encoding with all the encodings available
for browsing. For sending msgs, however, the user will have to customize
the above limited list by specifically opening the Customize .. menu --
and then the user can add any encoding he/she chooses.

As far as we knew at the time (a few years back), some RFCs defined
encodings for certain languages (e.g. Japanese) but most languages
had no declaration from any authoritative body on this topic. This meant
for us the use of the greatest common denominator encoding(s) for each
language/language group (mostly ISO types) unless users of some
language/language groups complained -- Hebrew and Russian come to mind.

If this situation has changed, I would love to know also.

- Kat

--
Katsuhiko Momoi
e-mail: katmomoi(_at_)pacbell(_dot_)net