procmail

Re: Filter for Japanese double-byte characters?

1999-10-05 17:22:20
On Tue, 5 Oct 1999 16:31:26 -0700 (PDT), Dick Moores <rdm(_at_)netcom(_dot_)com>
wrote:
> I subscribe to a couple of lists that sometimes have Japanese in
> their posts. I'd like to set up a recipe that would bounce all and
> only these posts to another address. Is this possible?

Probably :-)

If there's something in the headers which tells you the message is in
Japanese, that's easy enough:

    :0
    * ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another(_at_)else(_dot_)where(_dot_)jp

Barring that, you can easily set up a filter if -- predictably -- you
can come up with a regular expression or external program which
reliably detects byte sequences which are unique to the encoding in
question.

I don't know how to read ISO-2022-JP kanji, but as far as I recall,
the sequence (esc)$B is what switches the text into double-byte kanji
mode, so it occurs a lot in this encoding. Seeing as this sequence is
easy enough to detect, and unlikely to occur in other text, just look
in the body for this pattern (maybe require more than one occurrence
if a single match gives you false positives):

    :0B  # find an editor which lets you type in a literal esc character
    * ^[\$B
    ! another(_at_)else(_dot_)where(_dot_)jp

(I have used ^[ here as a marker -- you have to replace that with a
literal escape character. And of course, the dollar sign has a special
meaning in regular expressions, so it needs to be escaped with a
backslash in order to match a literal dollar sign.)
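
If you would rather go the external-program route, here is a minimal
sketch in Python of the same body test. It is only an illustration
under assumptions of mine: the function names are made up, and besides
(esc)$B it also accepts (esc)$@, which as far as I know is the older
kanji-mode escape in ISO-2022-JP.

```python
import re

# ESC $ B and ESC $ @ switch ISO-2022-JP text into double-byte kanji mode.
# Accepting ESC $ @ as well is an assumption on top of the recipe above.
KANJI_ESCAPE = re.compile(rb"\x1b\$[B@]")


def count_kanji_escapes(data: bytes) -> int:
    """Return how many kanji-mode escape sequences occur in data."""
    return len(KANJI_ESCAPE.findall(data))


def looks_like_iso2022jp(data: bytes, threshold: int = 1) -> bool:
    """True if data holds at least `threshold` kanji-mode escapes."""
    return count_kanji_escapes(data) >= threshold


# Small demonstration on bytes produced by Python's own iso2022_jp codec.
sample = "日本語のテキスト".encode("iso2022_jp")
print(looks_like_iso2022jp(sample))
print(looks_like_iso2022jp(b"plain ASCII text"))
```

To call something like this from procmail you would still need a few
lines which read the message from standard input and exit with status
0 on a match, and then presumably hook it in as an exit-code condition
(the "* ?" kind). Raising the threshold is one way to do the "more
than once" variant.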

Returning to the first example, if messages are properly MIME encoded
you probably also want to bounce MIME multipart messages that contain
body parts in iso-2022-jp.

    :0
    * ^Content-Type:\<*multipart/
    * B ?? ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another(_at_)else(_dot_)where(_dot_)jp

MIME is a flexible format so there are lots of interesting
complications you should probably be aware of, ranging from the simple
fact that you can optionally have quotes around the charset parameter:

    charset="?iso-2022-jp"?

through the problem of coping with various content-transfer-encodings
(if you need to process the body at all, perhaps you want to translate
everything to a canonical format such as content-transfer-encoding:
8bit before you even try), to the substantial undertaking of allowing
message bodies (such as the one of this message!) to contain various
key phrases without matching by accident. You don't want a match
unless the key phrases occur in the actual MIME headers of body parts.
But that is already outside the scope of what you are prepared to deal
with, if I'm allowed a wild guess :-)
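
For what it's worth, these complications are roughly what a real MIME
parser takes care of for you. The following is a rough sketch, not a
tested filter, using the email package from Python's standard library;
the function names are hypothetical. It looks only at the actual
headers of each body part, copes with quotes around the charset
parameter, and undoes the content-transfer-encoding before searching
for the escape sequence.

```python
import email
import re
from email.message import Message

# Kanji-mode escapes of ISO-2022-JP (ESC $ B and the older ESC $ @).
KANJI_ESCAPE = re.compile(rb"\x1b\$[B@]")


def part_is_japanese(part: Message) -> bool:
    """Check one non-multipart MIME part for ISO-2022-JP content."""
    if part.get_content_maintype() != "text":
        return False
    # get_content_charset() drops optional quotes and lowercases the value.
    if part.get_content_charset() == "iso-2022-jp":
        return True
    # decode=True undoes base64 and quoted-printable, so the escape
    # sequence is looked for in the decoded bytes of the part.
    payload = part.get_payload(decode=True) or b""
    return KANJI_ESCAPE.search(payload) is not None


def message_is_japanese(raw: bytes) -> bool:
    """True if any leaf part of the raw message looks like ISO-2022-JP."""
    msg = email.message_from_bytes(raw)
    return any(part_is_japanese(p) for p in msg.walk() if not p.is_multipart())
```

Because only parsed part headers are consulted, a message which merely
quotes "charset=iso-2022-jp" in its text (like this one) should not
match by accident.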

Hope this helps,

/* era */

-- 
 Too much to say to fit into this .signature anyway: <http://www.iki.fi/era/>
  Fight spam in Europe: <http://www.euro.cauce.org/> * Sign the EU petition