language detection

1999-06-25 04:49:30

I was thinking about automatic language detection. If mailing
list traffic were predominantly Icelandic, I would like to automatically
ask MHonArc to switch over to a resource file localized for Icelandic.

Being completely naive, I pulled up a few non-English emails and
looked for some line in the headers that identified the language. How
incredibly depressing. The only relevant header I found was the
character set, which appears to be common to dozens of languages. The
only other clue in the headers was the domain of the list server, which
is hardly a sure thing, given the pervasiveness of both the English
language and the .com domain name. What do people do for automatic
language detection for email? Are they stuck with scanning the body for
common dictionary words?  Bleah!!
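
For what it's worth, the body-scanning approach could be sketched
roughly as below. This is only an illustration in Python, not anything
MHonArc provides: the word lists are tiny hypothetical samples, and the
names STOPWORDS, guess_language and min_hits are made up for this
example. A usable detector would need much larger lists and some
handling of quoted English text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists of very common words per language.
# A real detector would need far larger lists than these.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "for", "it", "with"},
    "icelandic": {"og", "að", "er", "sem", "ekki", "það", "við", "fyrir", "til", "en"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "mit", "ein", "zu", "für"},
}


def guess_language(body, min_hits=3):
    """Return the language whose common words occur most often in body.

    Returns None when even the best candidate has fewer than min_hits
    matches, so very short or unrecognized bodies are not mislabeled.
    """
    # Split into lowercase runs of letters (Unicode-aware, no digits).
    words = re.findall(r"[^\W\d_]+", body.lower())
    counts = Counter(words)
    scores = {
        lang: sum(counts[w] for w in common)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Running something like this over a sample of recent messages and
taking the most frequent answer would give a per-list guess, which
could then be used to pick the resource file.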

So the questions are:

 a) Am I missing something obvious?

 b) Are there any languages that are easily detected
    (perhaps by a unique character set)? If so, are
    those languages supported by MHonArc? Oh, and what
    are they? <grin>
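
To make question (b) concrete, here is a rough Python sketch of a
charset-to-language lookup. The table is my own assumption about which
character sets are used mostly for a single language; it is not
exhaustive and says nothing about what MHonArc supports. The names
CHARSET_HINTS and language_from_charset are hypothetical. Note that
iso-8859-1, as in the headers appended below, is deliberately absent
because it is shared by many Western European languages.

```python
from email import message_from_string

# Assumed mapping: character sets that are used mainly for one language.
# Shared sets such as iso-8859-1 are left out on purpose.
CHARSET_HINTS = {
    "iso-2022-jp": "japanese",
    "euc-jp": "japanese",
    "shift_jis": "japanese",
    "euc-kr": "korean",
    "koi8-r": "russian",
    "iso-8859-7": "greek",
    "iso-8859-8": "hebrew",
}


def language_from_charset(raw_message):
    """Return a language hint from the Content-Type charset, or None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the charset lowercased, or None
    # when the message declares no charset at all.
    charset = msg.get_content_charset()
    if charset is None:
        return None
    return CHARSET_HINTS.get(charset)
```

For a message declaring charset=iso-8859-1 this returns None, which is
exactly the problem described above: the header alone cannot tell
Icelandic from English.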

I guess I'll have to scuttle the whole thing; if so that's too bad,
since I really think it would be great to automatically customize
to a particular language.


PS Typical non-English language email headers appended.


Return-Path: mail(_at_)mars(_dot_)mmedia(_dot_)is
Delivery-Date: Tue May 25 07:50:36 1999
Return-Path: <mail(_at_)mars(_dot_)mmedia(_dot_)is>
Received: from ( [])
        by (8.8.7/8.8.7) with ESMTP id HAA28750
        for <archive(_at_)marmot(_dot_)jab(_dot_)org>; Tue, 25 May 1999 07:50:35 -0700
Received: from ( [])
        by (8.8.7/8.8.7) with ESMTP id KAA16373
        for <archive(_at_)jab(_dot_)org>; Tue, 25 May 1999 10:48:37 -0400
Received: (from mail(_at_)localhost)
        by (8.9.0/8.9.0-MMEDIA) id AAA03873
        for kde-isl-list; Tue, 25 May 1999 00:23:36 GMT
Received: from ( [])
        by (8.9.0/8.9.0-MMEDIA) with ESMTP id AAA03857
        for <kde-isl(_at_)mmedia(_dot_)is>; Tue, 25 May 1999 00:23:31 GMT
Received: from [] by (NTMail
        4.20.0009/NU2631.00.d894e447) with ESMTP id kgkacaaa for 
        <kde-isl(_at_)mmedia(_dot_)is>; Tue, 25 May 1999 15:43:15 +0000
Message-ID: <374AC4B8(_dot_)7741(_at_)isholf(_dot_)is>
Date: Tue, 25 May 1999 15:41:44 +0000
From: Jn Gumundsson <sr(_dot_)jong(_at_)isholf(_dot_)is>
Reply-To: sr(_dot_)jong(_at_)isholf(_dot_)is
X-Mailer: Mozilla 3.04 (Win95; I)
MIME-Version: 1.0
To: kde-isl(_at_)mmedia(_dot_)is
Subject: [kde-isl]: Forritun fyrir KDE Hvar eru grunnkarnir!!
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-kde-isl(_at_)mmedia(_dot_)is
Precedence: normal
Organization: Skgrkt R

Einarsson [...]
