procmail

Re: Filter for Japanese double-byte characters?

1999-10-06 16:44:16
Era,

Thanks very much.  Please see below what I've done (or couldn't do) with
your suggestions.

On Wed, 6 Oct 1999, era eriksson wrote:

On Tue, 5 Oct 1999 16:31:26 -0700 (PDT), Dick Moores
<rdm(_at_)netcom(_dot_)com> wrote:
 > I subscribe to a couple of lists that sometimes have Japanese in
 > their posts. I'd like to set up a recipe that would bounce all and
 > only these posts to another address. Is this possible?

Probably :-)

If there's something in the headers which tells you stuff is in
Japanese, that's easy enough:

    :0
    * ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another@else.where.jp

This is not in the headers.


Barring that, you can easily set up a filter if -- predictably -- you
can come up with a regular expression or external program which
reliably detects byte sequences which are unique to the encoding in
question.

I don't know how to read ISO-2022-JP kanji but I have a vague
recollection that there's something like (esc)$B occurring a lot in
this encoding. Seeing as this sequence is easy enough to detect, and
unlikely to occur in other text, just look in the body for this
pattern (maybe more than once, or something):

Apparently all strings that are code for Japanese begin with "B".  "$"
frequently occurs, but not necessarily adjacent to "B".  "%" also
frequently occurs. 


    :0B  # find an editor which lets you type in a literal esc character
    * ^[\$B
    ! another@else.where.jp

So what I have come up with as a partial solution is

:0BHc
* ^TO(honyaku@|JAT-LIST)
* ([^0-9]\%|\$[^0-9])
! me@there.com

This works fairly well because longer (and many short) strings of
Japanese always seem to contain "$" or "%".  An example is
BAj<j$K%&%=$H;W$o$;$k$3$H$G!J$=$NAj<j$K!K0u>]$r;}$C$F$*$\$($F$b$i$($k!#

But here are some short strings that would be missed:
B!!93;@2=!!   (my guess is that "!!" is a space)
B95=|
B93
BBQ
B1*2s@_7W

What I think I really need is a regular expression to find strings
(words?) that begin with "B" and contain at least one non-alphabetic
character somewhere to the right of the "B". This would miss "BBQ", of
course, but strings of all alphabetic characters are rare. The code
string (beginning with "B") is often immediately preceded by
non-alphanumeric characters such as quotation marks or ">", and also
of course the initial "B" is often the first character of the line.
Suggestions?
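
A minimal sketch of the pattern described above, written in Python rather
than procmail's own regex syntax so it can be tried on sample lines. The
function name and the trailing-punctuation list are assumptions made for
illustration. It treats a "B" that is not preceded by a letter or digit as
the start of a candidate, then requires at least one non-alphabetic
character after it once trailing sentence punctuation is stripped, so an
ordinary word such as "But," does not count. English tokens like "B2B" or
"B-52" would still match, so this is a heuristic at best.

```python
import re

# A "B" not preceded by a letter or digit, followed by printable ASCII
# (no spaces), e.g. at the start of a line or right after ">" or a quote.
CANDIDATE = re.compile(r"(?<![A-Za-z0-9])B[!-~]*")

# Assumed list of punctuation that commonly trails an ordinary English word.
TRAILING_PUNCT = ".,;:!?\"')"


def looks_like_stripped_jis(line: str) -> bool:
    """Return True if the line has a B-initial token with a non-alphabetic tail."""
    for match in CANDIDATE.finditer(line):
        # Drop the leading "B" and any trailing sentence punctuation.
        tail = match.group(0)[1:].rstrip(TRAILING_PUNCT)
        if any(not ch.isalpha() for ch in tail):
            return True
    return False
```

As expected from the description, an all-alphabetic string such as "BBQ"
is not flagged by this check.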

I don't really want to search the headers for J code (J sometimes
appears in the Subject: header, but the email program I'm using can't
read this anyway -- just the body), so I should use a recipe that will
first look for mail from those two lists and then use a sub-recipe
(using brackets) to search for the J code. I think I can get from the
man pages how to do this.  Learned this once several years ago.

(I have used ^[ here as a marker -- you have to replace that with a
literal escape character. And of course, the dollar sign has a special
meaning in regular expressions, so it needs to be escaped with a
backslash in order to match a literal dollar sign.)

I use vi, and I believe Ctrl+V will do what you suggest, but escaping
with a simple "\" seems to do the job.
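
For the external-program route mentioned earlier, a rough Python sketch of
the byte-level check might look like the following. It assumes the raw
message body is available as bytes with the escape characters intact; the
function names and the threshold parameter are made up for illustration.
Under RFC 1468, ISO-2022-JP switches to JIS X 0208 with ESC $ B (or
ESC $ @ for the 1978 set) and back to ASCII with ESC ( B, so counting the
kanji-in sequences is a reasonable signal.

```python
ESC = b"\x1b"

# Escape sequences that switch an ISO-2022-JP stream into JIS X 0208
# (1983 and 1978 revisions respectively), per RFC 1468.
KANJI_IN = (ESC + b"$B", ESC + b"$@")


def count_kanji_shifts(raw: bytes) -> int:
    """Count how many times the body switches into a kanji character set."""
    return sum(raw.count(seq) for seq in KANJI_IN)


def is_probably_iso2022jp(raw: bytes, threshold: int = 1) -> bool:
    """Heuristic: treat the body as Japanese once enough shifts are seen."""
    return count_kanji_shifts(raw) >= threshold
```

Wrapped in a small script that reads the body from standard input and
exits 0 on a match, a check like this could be called from a recipe
through a "* ?" condition, which procmail treats as matched when the
program exits with status 0.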


Returning to the first example, if messages are properly MIME encoded
you probably also want to bounce messages which are MIME multiparts
and which contain body parts which are in iso-2022-jp.

I've searched on 2022 in all the headers of all posts so far, and
it turns up only in threads that _discuss_ iso-2022-jp.


    :0
    * ^Content-Type:\<*multipart/
    * B ?? ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another@else.where.jp

MIME is a flexible format so there are lots of interesting
complications you should probably be aware of, ranging from the simple
fact that you can optionally have quotes around the charset parameter:

    charset="?iso-2022-jp"?

and the problem of coping with various content-transfer-encodings --
perhaps you want to translate everything to a canonical format such as
content-transfer-encoding: 8bit before you even try to process it
further, if you need to be able to process the body somehow -- to the
substantial undertaking of allowing message bodies (such as the one of
this message!) to contain various key phrases without matching by
accident. You don't want a match unless the key phrases occur in the
actual MIME headers of body parts. But that is already outside the
scope of what you are prepared to deal with, if I'm allowed a wild
guess :-)

Right. :-)
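
To match only on the actual MIME headers of body parts, and not on key
phrases that happen to appear in the text, one option is to let a MIME
parser do the work. The sketch below uses Python's standard email package;
the function name is an assumption for illustration, and it deliberately
ignores the content-transfer-encoding issues mentioned above.

```python
from email import message_from_bytes


def has_iso2022jp_part(raw: bytes) -> bool:
    """Return True if the message or any MIME part declares charset iso-2022-jp."""
    msg = message_from_bytes(raw)
    # walk() visits the top-level message and every nested body part.
    for part in msg.walk():
        # get_content_charset() returns the charset parameter lowercased,
        # with any surrounding quotes removed, or None if it is absent.
        if part.get_content_charset() == "iso-2022-jp":
            return True
    return False
```

Because only parsed headers are inspected, a message that merely quotes a
Content-Type line in its text is not matched.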

Hope this helps,

Sure did.

/* era */

Dick Moores  rdm(_at_)netcom(_dot_)com