ietf-822

Re: internationalization of mail

2004-08-27 02:46:25

Hi, Tex,

This issue has many dimensions. In this message, I want to address just one of the questions you asked.

Tex Texin wrote:

Kat,

thanks. I agree with you about using Unicode internally. Where the encoding of
the message is correctly labeled so it can be accurately transcoded it is a
good solution. The thrust of my questions is to find out what percentage of mail
messages is incorrectly labeled or not labeled with an encoding, and may
therefore be risky to transcode.

Secondarily there are many products that support Unicode but not necessarily
the full (or nearly full) character repertoire.
I need to understand whether moving to Unicode means some mail clients won't
support some languages.
As you know many products today don't fully support a variety of languages
properly, but users can still use them by using a native font and ignoring
encoding declarations. For example, displaying Thai with a Thai font even
though the product thinks it is using ISO 8859-1.

However, if the text is transcoded to UTF-8 based on an incorrect encoding,
then it won't be displayable by treating the utf-8 bytes as Thai characters.
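
To make the concern above concrete, here is a minimal Python sketch. The sample text and the choice of TIS-620 for the Thai bytes are assumptions made only for illustration. It shows that once mislabeled bytes have been transcoded to UTF-8, reading them as Thai no longer works, although a mislabel of ISO-8859-1 specifically can still be undone because that charset maps every byte value to a code point.

```python
# Hypothetical sample: Thai text sent as TIS-620 bytes but labeled ISO-8859-1.
thai_text = "สวัสดี"
raw = thai_text.encode("tis-620")  # the bytes as they travel in the message

# A client that trusts the wrong label transcodes to UTF-8 like this.
transcoded = raw.decode("iso-8859-1").encode("utf-8")

# Ignoring the label and treating the raw bytes as Thai works...
assert raw.decode("tis-620") == thai_text
# ...but the same trick fails on the transcoded bytes.
assert transcoded.decode("tis-620", errors="replace") != thai_text

# ISO-8859-1 is a 1:1 byte mapping, so this particular mislabel is reversible.
recovered = transcoded.decode("utf-8").encode("iso-8859-1").decode("tis-620")
assert recovered == thai_text
```

Keeping the original bytes alongside the transcoded copy avoids depending on this kind of recovery, which is not available for every wrong label.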

So, can I take it as true that for NS mail, the transcoding to Unicode didn't
introduce significant problems and most mail was rendered effectively?
As far as I know, we haven't heard too many complaints from straight Mail users, even from those using Simplified Chinese or Russian. Most complaints are from news readers. They complain that languages like Russian and Chinese are often mislabeled with incorrect charset names. This is presumably because in news, there are many small/personal programs for news reading. For this sort of case, we devised a per-folder charset override control. In Mozilla, you can set the default encoding for each mail folder separately as one of the folder properties. You can additionally select an option to override all MIME labeling with this default charset for that folder. I understand that this works well for this type of case in newsgroups.

In the latest version of Mozilla/Netscape7.2, you can also set the mail-send encoding to override the original encoding of the message you're replying to. For example, assuming that you have set both the folder override and the mail-send charset override:

1) You receive a msg in GB2312 but with an incorrect charset label of ISO-8859-1.
2) But the folder override is on and this means regardless of MIME charset info in the msg, this msg will be displayed as GB2312 in this particular folder. So far, so good.
3) Now you want to send a reply to this msg. Instead of honoring the original charset (ISO-8859-1), selecting an override option for the send charset I mentioned above will reply to this in GB2312 thereby correcting the incorrect encoding.

These 2 overriding mechanisms help a lot. I think this type of idea is extensible to other mail systems. A bit more complicated case is where news articles are sent as mail msg to mailboxes.
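
The two overrides described above can be sketched roughly as follows in Python. This is not Mozilla's code; the names Folder, display_charset and reply_charset are hypothetical and only model the decision logic of the GB2312 example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    """Hypothetical per-folder settings."""
    default_charset: str
    override_labels: bool = False  # when True, ignore MIME charset labels


def display_charset(label: Optional[str], folder: Folder) -> str:
    """Charset used to display a message in this folder."""
    if folder.override_labels or not label:
        return folder.default_charset
    return label


def reply_charset(label: Optional[str], send_charset: str, send_override: bool) -> str:
    """Charset used for a reply: the configured send charset if the override is on."""
    if send_override or not label:
        return send_charset
    return label


# Steps 1-3: GB2312 bytes arrive with a wrong ISO-8859-1 label.
body = "你好".encode("gb2312")
folder = Folder(default_charset="gb2312", override_labels=True)
shown = body.decode(display_charset("iso-8859-1", folder))
print(shown, reply_charset("iso-8859-1", "gb2312", send_override=True))
```

With both overrides off, the same functions simply honor the label on the message.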

- Kat

tex


Katsuhiko Momoi wrote:
Tex Texin wrote:

Hi Kat!
It's been a while...


Hi, Tex,

Good to see you here.

thanks for the mail, it was helpful.

Correct me if I am wrong, but in the Mozilla environment the mail is treated as
a bucket of bytes that just gets passed around.
So for example if one subject hdr is in iso 2022-jp and another is in
shift-jis, you can change the display to show and interpret one or the other
correctly, but not both at the same time. For purposes of search and filtering,
I was thinking of converting to unicode, and then all subjects would display
correctly, as well as be searchable.


Not really. The received msgs are stored as is but as each msg is sent
to rendering, it is converted to utf-8 before it gets to rendering. So
in Mozilla mail, one can display headers of all supported encodings at
once in the header viewing pane. I had a demo I used at one of the
Unicode conferences which showed mail headers in 15 or so languages. But
the requirement for this is that each header must be marked with
proper MIME encoding info. This means Mozilla can display shift_jis and
iso-2022-jp headers at the same time as long as they use MIME encoding. If
a header of a msg does not use MIME encoding, then Mozilla will use the
user's default msg display encoding.
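
As an illustration of the header handling just described (not Mozilla's actual implementation), here is a small Python sketch using the standard email.header.decode_header function. The helper names encoded_word and header_to_unicode are hypothetical. MIME-labeled parts are decoded with their own charset; anything unlabeled falls back to a default display charset.

```python
import base64
from email.header import decode_header


def encoded_word(text: str, charset: str) -> bytes:
    """Build a base64 MIME encoded-word (hypothetical helper for the demo)."""
    b64 = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{b64}?=".encode("ascii")


def header_to_unicode(raw: bytes, default_charset: str) -> str:
    """Decode one raw header value to a Unicode string."""
    # Latin-1 maps bytes 1:1 to code points, so no byte is lost here.
    text = raw.decode("latin-1")
    parts = []
    for chunk, charset in decode_header(text):
        if isinstance(chunk, str):  # header had no encoded-words at all
            chunk = chunk.encode("latin-1")
        # Labeled parts use their own charset; unlabeled ones use the default.
        parts.append(chunk.decode(charset or default_charset, errors="replace"))
    return "".join(parts)


subjects = [
    encoded_word("日本語", "shift_jis"),
    encoded_word("日本語", "iso-2022-jp"),
    "日本語".encode("shift_jis"),  # unlabeled 8-bit header
]
decoded = [header_to_unicode(s, default_charset="shift_jis") for s in subjects]
print(decoded)
```

Because every header ends up as a Unicode string, differently encoded subjects can be shown, sorted and searched together. A charset name unknown to Python would raise LookupError here; a real client would need a fallback for that.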

Sorting and filtering for Mozilla are all done in Unicode as well. For
more info, you can look at our Unicode presentation (by Naoki Hotta and
me) here:

http://wp.netscape.com/eng/intl/docs/iuc17/mail/iuc17mail.html

and in particular at this diagram showing the data flow for MIME header
and body with associated MIME encoding.

http://wp.netscape.com/eng/intl/docs/iuc17/mail/slide05.html

If I may add my 2 cents,  if one is designing a webmail system, why not
use Unicode for all internal processing? (while keeping the original
data as is in case you need to refer to them).  Most search data are
handled in Unicode anyway now and this will increase compatibility with
the search functions you might want to offer in a webmail service. I
would even go as far as to say that we may want to unify send encoding
of webmail systems to just UTF-8 unless only ASCII data is being sent
out. I think Gmail is currently doing just that. In the absence of
country specific standards, this is one way to improve the situation.
Most popular mail programs can handle UTF-8. If some mailers cannot,
then I think it is time to update the code! I think this standardization
can happen first in webmail -- where simplification is acceptable --
ahead of standalone mailers, which may face harder hurdles in unifying
to UTF-8.
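
The send rule suggested above (UTF-8 unless only ASCII is being sent) could look like this in Python, using the standard email.message.EmailMessage class. The function names pick_send_charset and build_message are hypothetical, and this is only a sketch of the policy, not a description of what any particular webmail service does.

```python
from email.message import EmailMessage


def pick_send_charset(body: str) -> str:
    """Use US-ASCII for pure ASCII text, UTF-8 for everything else."""
    return "us-ascii" if body.isascii() else "utf-8"


def build_message(subject: str, body: str) -> EmailMessage:
    """Build an outgoing message labeled with the chosen charset."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(body, charset=pick_send_charset(body))
    return msg


ascii_msg = build_message("Hello", "Plain ASCII text")
utf8_msg = build_message("Hello", "こんにちは")
print(ascii_msg.get_content_charset(), utf8_msg.get_content_charset())
```

The original received data can still be stored untouched; this only governs what goes out.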

- Kat

But it may not be worthwhile if many of the mails are incorrectly identified so
transcoding to utf-8 generates muck, or errors out.

tex

Katsuhiko Momoi wrote:


Tex Texin wrote:



Hi,

(snipped)

2) Although I know the character encodings used in many regions, I note that
mail clients sometimes prefer a different encoding than what other applications
traditionally use. For example, ISO 2022-jp is used more often for mail in
Japan than it is used percentage-wise for other kinds of Japanese software (I
believe).

What is the preferred encoding(s) for mail in each country market? (Although we
might prefer to recommend utf-8, I am looking to understand what the market
practice actually is.)




Tex,

It would have been great if we had such a list when we worked on the
Mozilla mail. We tried to restrict the number of outgoing msg encodings
to the ones we can recommend (ISO types mostly unless we heard loud user
complaints).  The list was based on known RFC's and our knowledge of the
market since 1995. The results of this rather tortured attempt can be
found in the Mozilla menu:

Edit > Preferences > Mail & Newsgroups > Composition > Character Encoding

For reading msgs, we did not impose such a restriction and so the user
can choose to correct a msg encoding with all the encodings available
for browsing. For sending msgs, however, the user will have to customize
the above limited list by specifically opening the Customize .. menu --
and then the user can add any encoding he/she chooses.

As far as we knew at the time (a few years back), some RFC's defined
encodings for certain languages (e.g. Japanese) but most languages
had no declaration from any authoritative body on this topic. This meant
for us the use of the greatest common denominator encoding(s) for each
language/language group (mostly ISO types) unless users of some
language/language groups complained -- Hebrew and Russian come to mind.

If this situation has changed, I would love to know also.

- Kat

--
Katsuhiko Momoi
e-mail: katmomoi(_at_)pacbell(_dot_)net




