Re: language detection

1999-07-07 15:32:54
[ADMIN NOTE]:  This message fell through the cracks and is being
               sent out a little late.  Carl is now subscribed to
               mhonarc-allow, so his future messages should be 
               a little more timely.  

                                                -- Chris

Being completely naive, I pulled up a few non-English emails and
looked for some line in the headers that identified the language. How
incredibly depressing.

Yes. There is a MIME header -- Content-Language, defined in RFC 1766 -- that's
been around for more than four years. But no popular client implements it. I
recently did an exhaustive search of all the E-Mail I had saved over the past
five years (~50,000 pieces) and found four messages that used Content-Language,
all originating from the same person on the IETF-Languages list. And the
language he used was English! Even the author of RFC 1766, Harald Alvestrand,
does not use a client that supports this header field.
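
For anyone who wants to repeat the check on a saved message, here is a minimal Python sketch using the standard email module. The sample message and the helper name content_languages are made up for illustration, and the sketch ignores RFC 822 comments inside the field.

```python
from email import message_from_string


def content_languages(raw_message):
    """Return the language tags found in any Content-Language header fields."""
    msg = message_from_string(raw_message)
    tags = []
    # get_all returns every occurrence of the field, or the default if absent.
    for value in msg.get_all("Content-Language", []):
        # The field may carry a comma-separated list of tags, e.g. "en, fr".
        tags.extend(t.strip().lower() for t in value.split(",") if t.strip())
    return tags


# Hypothetical message, for illustration only.
SAMPLE = (
    "From: someone@example.org\n"
    "Subject: test\n"
    "Content-Language: en, fr\n"
    "\n"
    "Hello.\n"
)

print(content_languages(SAMPLE))
```

An empty list means the sender's client did not emit the field, which, going by the search above, is the usual case.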

There are several efforts to specify the user's language preference for DSNs
and other automatic replies, and these seem to be gaining support. Netscape's
mail client emits an X-Accept-Language header field, for instance. But that's
not the same thing; I might write in Japanese when posting to a Japanese list,
but prefer to get my error replies in English.

The only relevant header I found was the
character set, which appears to be common to dozens of languages.

Correct. Especially Unicode. ;-)

What do people do for automatic language detection for email? Are they
stuck with scanning the body for common dictionary words?  Bleah!!

At present, that's all you've got.
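
A rough sketch of that kind of body scan, in Python. The word lists are tiny and hand-picked purely for illustration, and the name guess_language is hypothetical; anything usable would need much larger lists and some handling of quoted text and signatures.

```python
import re

# Tiny, hand-picked word lists; illustrative only, not real language profiles.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "that", "in", "it"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ich", "zu"},
    "fr": {"le", "la", "et", "est", "les", "que", "des", "je"},
}


def guess_language(body):
    """Return the language whose common words occur most often, or None."""
    # Split the body into runs of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", body.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No hits at all means there is no evidence either way.
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the garden and it is happy"))
print(guess_language("Ich weiß nicht, ob das die richtige Antwort ist"))
```

Short messages and mixed-language messages will defeat this easily, which is part of why it earns the "Bleah".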

 b) Are there any languages that are easily detected
    (perhaps by a unique character set?)

A handful, mostly Asian languages, have unique charsets. For all of these,
however, ASCII remains a pure subset, which makes language identification a
very hit-or-miss proposition. For example, a native Japanese speaker would use
a client that emits charset=ISO-2022-JP. But the client may emit that charset
tag whether the text was written in Japanese or entirely in English. Or the
user might have a couple of Japanese characters in his signature that force
the client to use ISO-2022-JP, even though the actual text is English.
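
To illustrate why the charset tag alone is weak evidence, here is a small Python sketch that decodes a body with its declared charset and measures how much of it falls outside ASCII. The function name non_ascii_ratio and the sample strings are made up for illustration; any threshold applied to the result would be a guess.

```python
def non_ascii_ratio(payload, charset):
    """Decode payload with the declared charset and return the fraction of
    non-whitespace characters that fall outside ASCII."""
    text = payload.decode(charset, errors="replace")
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) > 127 for c in chars) / len(chars)


# Three hypothetical bodies, all of which a client could label ISO-2022-JP.
english = "Hello, see you at the meeting.".encode("iso2022_jp")
japanese = "こんにちは".encode("iso2022_jp")
mixed = "Thanks for the patch. -- 山田".encode("iso2022_jp")

for name, body in [("english", english), ("japanese", japanese), ("mixed", mixed)]:
    print(name, round(non_ascii_ratio(body, "iso2022_jp"), 2))
```

A ratio near zero suggests the text is really ASCII despite the label, but that only says "not Japanese"; it does not say which ASCII-representable language was used.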

