procmail

Re: Filter for Japanese double-byte characters?

1999-10-06 22:58:35
Dick,

In a message dated Wed, 6 Oct 1999 16:39:38 -0700 (PDT), you wrote:
If there's something in the headers which tell you stuff is in
Japanese, that's easy enough:

    :0
    * ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another(_at_)else(_dot_)where(_dot_)jp

This is not in the headers.

Hmm... "Content-Type: text/plain; charset=ISO-2022-JP" should appear in
the message header if the message body's character encoding is
ISO-2022-JP (Kanji).  Most Japanese MUAs set it properly, so I think the
above example should work in most cases.
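
For anyone who wants to test the header check outside procmail, here is
a minimal Python sketch of the same idea.  The function name
declares_iso2022jp is made up for illustration; it only looks at the
charset parameter of Content-Type and does not inspect the body.

```python
from email import message_from_string


def declares_iso2022jp(raw_message: str) -> bool:
    """Return True if Content-Type declares charset ISO-2022-JP."""
    msg = message_from_string(raw_message)
    # get_content_charset() lowercases the value and returns None
    # when there is no Content-Type header or no charset parameter.
    return msg.get_content_charset() == "iso-2022-jp"


if __name__ == "__main__":
    sample = "Content-Type: text/plain; charset=ISO-2022-JP\n\nbody\n"
    print(declares_iso2022jp(sample))
```

Like the recipe, this says nothing about messages whose MUA omitted or
mislabeled the charset.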

Apparently all strings that are code for Japanese begin with "B".  "$"
frequently occurs, but not necessarily adjacent to "B".  "%" also
frequently occurs. 
8<- snip 8<-
What I think I really need is a regular expression to find strings
(words?) that begin with "B" and contain at least one non-alphabetic
character somewhere to the right of the "B". This would miss "BBQ", of
course, but strings of all alphabetic characters are rare. The code
string (beginning with "B") is often immediately preceded by
non-alpha-numeric characters such as quotation marks or ">", and also
of course the initial "B" is often the first character of the line.
Suggestions?

Nope.  In ISO-2022-JP encoding, the Kanji-IN sequence is [ESC]$B, and
the Kanji-OUT sequence is [ESC](J.  A double-byte portion is sandwiched
between the Kanji-IN and Kanji-OUT sequences.  Thus finding Kanji in the
message body can be done by finding the Kanji-IN pattern "[ESC]$B", as in
Era's posting.
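
To make the byte pattern concrete, here is a small Python sketch (not a
procmail recipe) that looks for Kanji-IN in a raw message body.  The
name has_kanji_in and the sample bytes are only for illustration.

```python
# Kanji-IN as described above: ESC (0x1B) followed by "$B".
KANJI_IN = b"\x1b$B"


def has_kanji_in(body: bytes) -> bool:
    """Return True if the raw body contains a Kanji-IN sequence."""
    return KANJI_IN in body


if __name__ == "__main__":
    # Double-byte text sandwiched between Kanji-IN and Kanji-OUT.
    japanese = b"Hello \x1b$BF|K\\8l\x1b(J\n"
    # Plain ASCII that happens to contain "B", "$" and "%".
    ascii_only = b"BBQ costs $5, 10% off\n"
    print(has_kanji_in(japanese), has_kanji_in(ascii_only))
```

Matching on the ESC byte is what separates real Kanji from ordinary
text that merely contains "B", "$" or "%".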

Please note that Kanji in the Subject: header is another story. It's
MIME-encoded and looks something like:
"Subject: =?ISO-2022-JP?B?GyRCRnxLXDhsJE4lNSVWJTglJyUvJUgbKEI=?="
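
As a rough illustration of why a plain "[ESC]$B" match does not work on
such a Subject:, the Python sketch below decodes the encoded word first
and then looks for Kanji-IN in the decoded bytes.  The function names
are hypothetical, and only the header value (without "Subject: ") is
passed in.

```python
from email.header import decode_header

KANJI_IN = b"\x1b$B"  # ESC $ B


def subject_has_kanji(value: str) -> bool:
    """Return True if any decoded part of the header contains Kanji-IN."""
    for payload, _charset in decode_header(value):
        # Encoded words come back as bytes; a header without any
        # encoded word comes back as a plain str and is skipped.
        if isinstance(payload, bytes) and KANJI_IN in payload:
            return True
    return False


def decode_subject(value: str) -> str:
    """Decode a MIME-encoded header value into readable text."""
    parts = []
    for payload, charset in decode_header(value):
        if isinstance(payload, bytes):
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(payload)
    return "".join(parts)


if __name__ == "__main__":
    subject = "=?ISO-2022-JP?B?GyRCRnxLXDhsJE4lNSVWJTglJyUvJUgbKEI=?="
    print(subject_has_kanji(subject))
    print(decode_subject(subject))
```

The raw header is pure ASCII, so the escape sequence only shows up
after the Base64 ("B") payload has been decoded.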

Hope this helps, too.
________________________________________________
Satoru "Sam" MANITA - Saitama JAPAN
aka <satoru(_at_)manita(_dot_)com>