procmail

Re: Filter for Japanese double-byte characters?

1999-10-07 11:05:09
Dick Moores <rdm(_at_)netcom(_dot_)com> writes:
...
I'd really like to try out something this sophisticated.  Several years
ago you gave me some great help with a matching problem.  But my problem
with J code is finding those posts with maybe only a word or two of
Japanese.  I gave a short list of examples previously.  Things like "B93",
"B1*2s@_7W", and "B95=|" (there are also many strings with an alphabetic
character following the initial B, but almost all strings will contain a
non-alphabetic character).  The longer the string is, the more
likely it is to have one, several, or many "$" and/or "%", by my
observation, so my first-try recipe, with "* ([^0-9]\%|\$[^0-9])" does
a pretty good job.  Isn't there an expression usable with egrep that
would do an almost perfect job?  One that finds all words that begin
with "B" and contain at least one non-alphabetic character? (Please
refer to my paragraph about this, quoted above.)  Or could a matching recipe
work with the short words?

Okay, let's give a match if there are at least 2 words that begin with
"B" and contain at least one punctuation character (that is, let's allow
numbers as well as letters, so a digit alone doesn't count):

        # Each bracket expression below holds a space and a tab.
        :0 BD
        * -1^0
        * 1^1 (^|[      ])B[^   ]*[^    a-zA-Z0-9]
        ! me@else.where
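
The scoring works like this: "-1^0" starts the score at -1, each match
of the "1^1" condition adds 1, and the recipe only fires when the final
score is positive, so a single match (score 0) is not enough.  If you
want to try the pattern on sample text before putting it in your rc
file, here is a minimal Python sketch of that arithmetic.  The names
WORD, score and delivers are made up for illustration, and Python's
regexp engine is not procmail's, so treat it as an approximation (I
also keep each match on one line, which is an assumption).

```python
import re

# Approximation of the recipe's "1^1" condition.  The whitespace class is
# written as space, tab, newline; keeping newline out of a match is an
# assumption of this sketch, not something taken from procmail itself.
WORD = re.compile(r"(?:^|[ \t])B[^ \t\n]*[^ \t\na-zA-Z0-9]", re.MULTILINE)


def score(body: str) -> int:
    """Start at -1 (the "-1^0" line) and add 1 for every match."""
    return -1 + len(WORD.findall(body))


def delivers(body: str) -> bool:
    """A scored recipe counts as matched when the final score is positive."""
    return score(body) > 0


if __name__ == "__main__":
    for sample in ("see B1*2s@_7W and B95=| here",
                   "only B95=| once",
                   "B93 plain Bravo"):
        print(repr(sample), score(sample), delivers(sample))
```

Note that a bare "B93" adds nothing, since it has no punctuation after
the B, and an ordinary word like "Bob's" would add 1, which is why the
recipe asks for two hits rather than one.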


Given the discussion regarding the encoding where 'words' actually
start with "^[$" (ESC dollar) or "^[(" (ESC paren), you may want to
allow those pairs to occur before the 'B':

        :0 BD
        * -1^0
        * 1^1 (^|[      ])(