Re: language detection

1999-06-25 18:21:28
On June 25, 1999 at 04:06, Jeff Breidenbach wrote:

I was thinking about automatic language detection. If mailing
list traffic was predominantly Icelandic, I would like to automatically
ask MHonArc to switch over to a resource file localized for Icelandic.

This is probably best done before invoking mhonarc.  Trying to
change resources on the fly has problems.  If resources were switched
according to the language of each message, what should happen when
messages in multiple languages are encountered?

Making resource choice is probably best done at the archive level.

Being completely naive, I pulled up a few non-English emails and
looked for some line in the headers that identified the language. How
incredibly depressing. The only relevant headers I found were the
character set, which appears common for dozens of languages. The only

RFC 2184 provides methods to encode language information in message
header text.  From my reading it is a little unclear if it can
be used for specifying language in message bodies.  For example:

        Content-Type: text/plain; charset="iso-8859-1'en"

You can also have the case where multiple languages exist in the
same message.

other header clue was the domain of the list server, which is hardly a
sure thing, given the pervasiveness of both the English language and
the .com domain name. What do people do for automatic language
detection for email? Are they stuck with scanning the body for common
dictionary words?  Bleah!!

So the question is:

 a) Am I missing something obvious

No.  Multi-language/charset support is never trivial, from my observations.

 b) Are there any languages that are easily detected
    (perhaps by a unique character set?) If so, are
    those languages supported by MHonArc? Oh, and what
    are they? <grin>

Many charsets support multiple languages.  There are some exceptions
where charsets were specifically developed for a given language.
Russian and Japanese are good examples where there are charsets specific
to these languages.
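
As a rough illustration of the charset point, here is a minimal Python
sketch.  The table and the guess_language name are assumptions made up for
this example, not anything MHonArc provides, and the table is deliberately
tiny.  It only looks at the top-level Content-Type, so multipart messages
and general-purpose charsets such as iso-8859-1 yield no answer.

```python
from email import message_from_string

# Illustrative, assumed table: charsets designed around a single language.
# General-purpose charsets (iso-8859-1, utf-8, ...) are left out on purpose.
CHARSET_TO_LANGUAGE = {
    "koi8-r": "ru",
    "iso-2022-jp": "ja",
    "euc-jp": "ja",
    "shift_jis": "ja",
}


def guess_language(raw_message):
    """Return a language code if the charset implies one, else None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the charset lowercased, or None if absent.
    charset = msg.get_content_charset()
    if charset is None:
        return None
    return CHARSET_TO_LANGUAGE.get(charset)
```

A None result just means "cannot tell from the charset", which is the
common case.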

I guess I'll have to scuttle the whole thing; if so that's too bad,
since I really think it would be great to automatically customize
to a particular language.

One approach is to have a mapping of mailing list to language.  Whatever
takes the mail and invokes mhonarc checks whether there is an entry in
the mapping for the mailing list, and invokes mhonarc with the proper
rcfile.

Or more simply, have a mapping that maps a mailing list name to
a specific rcfile.  If no mapping is given for the list, use a default
rcfile.  This way, you only have to define mappings for the exceptions.
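
The list-to-rcfile mapping could be sketched in Python along these lines.
The list names, rcfile names, and function names are hypothetical; only the
-rcfile, -add, and -outdir options are actual mhonarc options.  The sketch
builds the command line and leaves running it to the delivery script.

```python
# Hypothetical file names and list names, for illustration only.
DEFAULT_RCFILE = "default.mrc"
LIST_RCFILES = {
    "icelandic-chat": "icelandic.mrc",
}


def rcfile_for(list_name, mapping=LIST_RCFILES, default=DEFAULT_RCFILE):
    """Return the rcfile mapped to a list, or the default if unmapped."""
    return mapping.get(list_name.lower(), default)


def mhonarc_command(list_name, outdir):
    """Build the mhonarc argument list for adding a message to an archive."""
    return [
        "mhonarc",
        "-rcfile", rcfile_for(list_name),
        "-add",
        "-outdir", outdir,
    ]
```

The delivery script would then pass the resulting list to something like
subprocess.run() with the incoming message on standard input.  Only the
exceptions need an entry in LIST_RCFILES; every other list falls through
to the default rcfile.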

