procmail

Re: Filter for Japanese double-byte characters?

1999-10-07 01:52:02
On Wed, 6 Oct 1999 16:39:38 -0700 (PDT), Dick Moores <rdm(_at_)netcom(_dot_)com>
wrote:
>> Barring that, you can easily set up a filter if -- predictably --
>> you can come up with a regular expression or external program
>> which reliably detects byte sequences which are unique to the
>> encoding in question.
> Apparently all strings that are code for Japanese begin with "B".
> "$" frequently occurs, but not necessarily adjacent to "B". "%"
> also frequently occurs.

Probably you should figure out what exact encoding is used in these
messages, and based on the spec for that encoding, define a slightly
different regular expression.

I managed to find some promising information in RFC2237, but this
describes ISO-2022-JP-1, which as far as I can tell is a newer and
slightly different standard.

Anyhow, this one mentions the following strings:

    ESC ( B
    ESC $ @
    ESC $ B
    ESC ( J
    ESC $ ( D

This would translate into

    :0B
    * ^^[(\([BJ]|\$[@B]|\$\(D)
    ! you@there.com

assuming that the telltale code always occurs at the beginning of a line.
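
If you would rather test the pattern outside procmail first, the same
alternation can be sketched in Python. This is only an illustration:
the names ISO2022JP_ESCAPE and has_iso2022jp_escape are made up here,
\x1b stands for the literal escape character, and unlike the recipe
the sketch does not insist on the beginning of a line.

```python
import re

# The five escape sequences listed above, as a bytes pattern.
# \x1b is the literal ESC character (shown as ^[ in the recipe).
ISO2022JP_ESCAPE = re.compile(rb"\x1b(?:\([BJ]|\$[@B]|\$\(D)")


def has_iso2022jp_escape(body: bytes) -> bool:
    """Return True if body contains any of the listed escape sequences."""
    return ISO2022JP_ESCAPE.search(body) is not None


if __name__ == "__main__":
    # A body that still carries its escape sequences
    print(has_iso2022jp_escape(b"\x1b$B93\x1b(B"))
    # The same text with the escapes stripped or absent
    print(has_iso2022jp_escape(b"B93"))
```

A string such as "B93" with no escape character in it is not caught,
which is the same limitation the recipe has.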

RFC1468 describes regular ISO-2022-JP; you may want to look at that as
well. Of course, it's perfectly possible that neither describes the
encoding that you are trying to catch -- whatever it is, I would
recommend trying to find out what it really is and then writing more
recipes only when you know more.
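
One rough way to check a guess about the encoding is to save a sample
message and try decoding it with each candidate. A minimal sketch in
Python, assuming its bundled iso2022_jp and iso2022_jp_1 codecs are
acceptable stand-ins for RFC1468 and RFC2237; the function name and
the sample text are made up for illustration and do not come from the
messages in question.

```python
# Candidate codec names; both ship with Python's standard library.
CANDIDATES = ("iso2022_jp", "iso2022_jp_1")


def decodable_as(raw: bytes, encodings=CANDIDATES) -> list:
    """Return the candidate encodings that decode raw without errors."""
    ok = []
    for name in encodings:
        try:
            raw.decode(name)  # strict mode: raises on invalid sequences
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    # Made-up sample: three Japanese characters encoded as ISO-2022-JP
    sample = "日本語".encode("iso2022_jp")
    print(decodable_as(sample))
```

A successful decode is only a hint, since plain 7-bit text decodes
under almost anything; look at the decoded result as well.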

A good site about character encodings in general is Roman Czyborra's
site at <http://www.czyborra.com/> -- very comprehensive, and very
interesting. (Some people will recognize Roman as one of the people in
the "thanks" section in Procmail's README file, too.)

> But here are some short strings that would be missed:
> B!!93;@2=!!   (my guess is that "!!" is a space)
> B95=|
> B93
> BBQ
> B1*2s@_7W

Are these part of a longer sequence which does contain some sort of
telltale code, or actual complete (but obviously short) messages in
Japanese?

>> (I have used ^[ here as a marker -- you have to replace that with a
>> literal escape character. And of course, the dollar sign has a special
>> meaning in regular expressions, so it needs to be escaped with a
>> backslash in order to match a literal dollar sign.)
> I use vi, and I believe Ctrl+V will do what you suggest, but escaping
> with a simple "\" seems to do the job.

Huh? The backslash doesn't do anything before a literal esc; the
question is how to get a literal escape character into the .procmailrc
file in the first place. With vi, that would be ctrl-v esc, yes. What
I was trying to say was that the ^ followed by [ string which I had
used in place of a real escape character -- for readability etc --
was not to be typed literally, but replaced with a real esc character.
(Typed literally, it would be a syntax error, or a character class at
beginning of line if there was a subsequent ] in the same recipe.)
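
To make the distinction concrete: the two-character marker and the
escape character are different things. A small Python sketch (the
variable names are made up here) which turns the readable form of the
condition line into the real one:

```python
ESC = "\x1b"  # the real escape character, a single byte (decimal 27)

# Readable form as posted: a ^ anchor followed by the two-character marker ^[
readable = r"* ^^[(\([BJ]|\$[@B]|\$\(D)"

# Replace the marker with the real character; the leading ^ anchor stays.
actual = readable.replace("^[", ESC)

if __name__ == "__main__":
    print(repr(actual))
    # The marker was two characters, ESC is one
    print(len(readable) - len(actual))
```

The backslashes in front of the dollar signs and parentheses are
untouched; only the marker changes.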

The stuff about MIME was apparently largely irrelevant to you, as it
would seem the messages you are receiving are not valid MIME messages
anyway.

/* era */

-- 
 Too much to say to fit into this .signature anyway: <http://www.iki.fi/era/>
  Fight spam in Europe: <http://www.euro.cauce.org/> * Sign the EU petition