procmail

Re: Filter for Japanese double-byte characters?

1999-10-07 02:52:44
On Thu, 7 Oct 1999, Philip Guenther wrote:

Dick Moores <rdm(_at_)netcom(_dot_)com> writes:
On Wed, 6 Oct 1999, era eriksson wrote:

On Tue, 5 Oct 1999 16:31:26 -0700 (PDT), Dick Moores <rdm(_at_)netcom(_dot_)com> wrote:
 > I subscribe to a couple of lists that sometimes have Japanese in
 > their posts. I'd like to set up a recipe that would bounce all and
 > only these posts to another address. Is this possible?
...
Barring that, you can easily set up a filter if -- predictably -- you
can come up with a regular expression or external program which
reliably detects byte sequences which are unique to the encoding in
question.

I don't know how to read ISO-2022-JP kanji but I have a vague
recollection that there's something like (esc)$B occurring a lot in
this encoding. Seeing as this sequence is easy enough to detect, and
unlikely to occur in other text, just look in the body for this
pattern (maybe more than once, or something):

Apparently all strings that are code for Japanese begin with "B".  "$"
frequently occurs, but not necessarily adjacent to "B".  "%" also
frequently occurs. 
...
So what I have come up with as a partial solution is

:0BHc
* ^TO(honyaku@|JAT-LIST)
* ([^0-9]\%|\$[^0-9])
! me(_at_)there(_dot_)com

This works fairly well because longer (and many short) strings of
Japanese always seem to contain "$" or "%".  An example is
BAj<j$K%&%=$H;W$o$;$k$3$H$G!J$=$NAj<j$K!K0u>]$r;}$C$F$*$\$($F$b$i$($k!#

But here are some short strings that would be missed:
B!!93;@2=!!   (my guess is that "!!" is a space)
B95=|
B93
BBQ
B1*2s(_at_)_7W 

What I think I really need is a regular expression to find strings
(words?) that begin with "B" and contain at least one non-alphabetic
character somewhere to the right of the "B". This would miss "BBQ", of
course, but strings of all alphabetic characters are rare. The code
string (beginning with "B") is often immediately preceded by
non-alpha-numeric characters such as quotation marks or ">", and also
of course the initial "B" is often the first character of the line.
Suggestions?
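
For trying the idea out offline before committing it to a recipe, here is
a minimal Python sketch of the rule described above: a "word" that starts
with "B" (at the start of a line or right after a non-alphanumeric
character) and contains at least one non-alphabetic character. The names
B_WORD and find_b_words are made up for illustration, and treating a
"word" as a run of non-whitespace characters is an assumption. A rough
egrep counterpart would be
(^|[^A-Za-z0-9])B[^[:space:]]*[^A-Za-z[:space:]], though that has not
been checked against procmail's own regexp engine.

```python
import re

# Assumption: a "word" is a run of non-whitespace characters.
# (?<![A-Za-z0-9])  the "B" is at line start or follows a non-alphanumeric
# B\S*              the word starts with an upper-case "B"
# [^A-Za-z\s]       at least one non-alphabetic character inside the word
# \S*               the rest of the word
B_WORD = re.compile(r"(?<![A-Za-z0-9])B\S*[^A-Za-z\s]\S*")


def find_b_words(text):
    """Return every 'B' word in text that has a non-alphabetic character."""
    return B_WORD.findall(text)


if __name__ == "__main__":
    for sample in ["B95=|", "B93", "B!!93;@2=!!", "BBQ", "But, no"]:
        print(repr(sample), find_b_words(sample))
```

Note the last sample: ordinary English such as "But," also matches,
because the trailing comma counts as a non-alphabetic character. A rule
like this probably needs to be combined with a count or a score rather
than used as a single yes/no test.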

This sounds like a good opportunity to use procmail's scoring ability.
How about treating as iso-2022-jp encoded any message which has in its
body, say, more than 20 dollar signs and percent signs, and more than 5
'words' that begin with a 'B':

      :0 B
      * H ?? ^TO_(honyaku@|JAT-LIST)
      * -20^0
      * 1^1 [%$]
      {
          # D: match case-sensitively, so only an upper-case 'B' counts
          :0 BD
          * -5^0
          * 1^1 (^|[  ])B
          ! me(_at_)there(_dot_)com
      }
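
To see where the thresholds in a scoring recipe like this fall, the
arithmetic can be mimicked in a few lines of Python. This is only a
sketch of the scoring rule as I understand it ("* -N^0" contributes -N
once, "* 1^1 regexp" adds 1 per match, and the recipe fires when the
total is positive). The function names are hypothetical. Matching here is
case-sensitive, whereas procmail ignores case unless the D flag is given.

```python
import re


def procmail_style_score(body, penalty, pattern):
    """Mimic '* -N^0' followed by '* 1^1 regexp': -N plus 1 per match."""
    return -penalty + len(re.findall(pattern, body, re.MULTILINE))


def looks_like_iso_2022_jp(body):
    """Apply the two nested scoring tests to a message body."""
    # Outer recipe: needs more than 20 '$' or '%' characters.
    if procmail_style_score(body, 20, r"[%$]") <= 0:
        return False
    # Inner recipe: needs more than 5 'words' that start with 'B'.
    return procmail_style_score(body, 5, r"(?:^|[ \t])B") > 0


if __name__ == "__main__":
    print(looks_like_iso_2022_jp(" B$%$%" * 6))  # 24 signs, 6 'B' words
    print(looks_like_iso_2022_jp(" B$%$%" * 5))  # 20 signs: score is 0
```

A score of exactly zero does not count as a match, so 20 signs are not
enough and 21 are.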

What are the distinguishing characteristics of messages with the
iso-2022-jp encoding?  Are a certain percentage of the characters selected
from a limited set (such as $ and %)?
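
One distinguishing characteristic that can be stated with some
confidence: RFC 1468 defines ISO-2022-JP as switching character sets with
escape sequences, ESC $ B or ESC $ @ to enter two-byte JIS X 0208 text
and ESC ( B or ESC ( J to return to one-byte text. If the ESC byte
(0x1B) survives in the raw message, counting those sequences is more
reliable than counting "$" and "%". The sketch below assumes access to
the raw bytes of the body; count_kanji_shifts is a hypothetical name.

```python
import re

# ESC $ B (JIS X 0208-1983) or ESC $ @ (JIS X 0208-1978), per RFC 1468.
KANJI_SHIFT = re.compile(rb"\x1b\$[@B]")


def count_kanji_shifts(raw_body):
    """Count escape sequences that switch into two-byte JIS X 0208 text."""
    return len(KANJI_SHIFT.findall(raw_body))


if __name__ == "__main__":
    # "\u65e5\u672c\u8a9e" is the word "Nihongo" written in kanji.
    encoded = "abc \u65e5\u672c\u8a9e def".encode("iso2022_jp")
    print(encoded)
    print(count_kanji_shifts(encoded))
    # Plain ASCII with a stray "$B" and no ESC byte does not count.
    print(count_kanji_shifts(b"costs $B and 5%"))
```

If the mail program or gateway has already stripped the ESC bytes, this
test finds nothing and a heuristic on the visible characters is still
needed.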

I'd really like to try out something this sophisticated.  Several years
ago you gave me some great help with a matching problem.  But my problem
with J code is finding those posts with maybe only a word or two of
Japanese.  I gave a short list of examples previously.  Things like "B93",
"B1*2s(_at_)_7W", and "B95=|" (there are also many strings with an alphabetic
character following the initial B, but almost all strings will contain a
non-alphabetic character).  The longer the string is, the more
likely it is to have one, several, or many "$" and/or "%", by my
observation, so my first-try recipe, with "* ([^0-9]\%|\$[^0-9])" does
a pretty good job.  Isn't there an expression usable with egrep that
would do an almost perfect job?  One that finds all words that begin
with "B" and contain at least one non-alphabetic character? (Please
refer to my paragraph about this, quoted above.)  Or could a matching recipe
work with the short words?

Thanks,

Dick Moores  rdm(_at_)netcom(_dot_)com