procmail

Re: Filter for Japanese double-byte characters?

1999-10-07 01:42:48
Satoru,

On Thu, 7 Oct 1999, Satoru Manita wrote:

Dick,

In a message dated Wed, 6 Oct 1999 16:39:38 -0700 (PDT), you wrote:
If there's something in the headers which tell you stuff is in
Japanese, that's easy enough:

    :0
    * ^Content-Type:\<*text/plain;\<*charset=iso-2022-jp\>
    ! another(_at_)else(_dot_)where(_dot_)jp

This is not in the headers.

Humm... "Content-Type: text/plain; charset=ISO-2022-JP" should appear in
the message header if the message body's character encoding is
ISO-2022-JP (Kanji).  Most Japanese MUAs do so properly.  I think the
above example should work in most cases.
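
For checking by hand whether a given post really declares that charset, here is a minimal Python sketch. It is not part of the recipe above and uses only the standard `email` module. The function name `declares_iso2022jp` and both sample messages are made up for illustration.

```python
import email

def declares_iso2022jp(raw_message: str) -> bool:
    """Return True if the top-level Content-Type header names ISO-2022-JP."""
    msg = email.message_from_string(raw_message)
    # get_content_charset() returns the charset parameter lowercased,
    # or None when no charset is declared.
    return msg.get_content_charset() == "iso-2022-jp"

# Hypothetical sample messages, one with the header and one without.
with_header = (
    "Subject: test\n"
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "\n"
    "body\n"
)
without_header = "Subject: test\n\nbody\n"

print(declares_iso2022jp(with_header))
print(declares_iso2022jp(without_header))
```

A post that carries Japanese text but no such header, as reported for the honyaku list, would fall into the second case, which is why a header-only recipe would miss it.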

OK, I just checked the full headers again for a bunch of posts from the
honyaku list, and "text/plain; charset=ISO-2022-JP" appears in none of
them. Possibly I should explain my situation and the nature of the posts
to these lists.  I live in Bellevue, WA, USA (not far from Seattle).
I am using a shell account at Netcom (recently bought by Mindspring),
with Pine as my mail program.  I also have dial-up PPP access with
another ISP, but prefer to use my shell account largely because I like
Procmail.  The two lists (Honyaku and JAT-LIST) are lists for E-to-J
and J-to-E translators, both Japanese and non-Japanese, many residing
outside of Japan.  The posts are mostly in English, but many contain
some Japanese text (a lot of questions and answers about "How do I
translate xxxx", where xxxx is in Japanese).  I want a recipe that will
bounce these posts with Japanese to an address where I can read them.
Currently I'm using Yahoo!'s email service, which works well with
Japanese (I use IE5 with the necessary Japanese add-ons).


Apparently all strings that are code for Japanese begin with "B".  "$"
frequently occurs, but not necessarily adjacent to "B".  "%" also
frequently occurs. 
8<- snip 8<-
What I think I really need is a regular expression to find strings
(words?) that begin with "B" and contain at least one non-alphabetic
character somewhere to the right of the "B". This would miss "BBQ", of
course, but strings of all alphabetic characters are rare. The code
string (beginning with "B") is often immediately preceded by
non-alpha-numeric characters such as quotation marks or ">", and also
of course the initial "B" is often the first character of the line.
Suggestions?

I'm still hoping for help in writing this regular expression.

Nope.  In ISO-2022-JP encoding, the Kanji-IN sequence is [ESC]$B, and
the Kanji-OUT sequence is [ESC](J.  A double-byte portion is sandwiched
between the Kanji-IN and Kanji-OUT sequences.  Thus finding Kanji in the
message body can be done by finding the Kanji-IN pattern "[ESC]$B" as in
Era's posting.
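
To make the byte-level idea concrete, here is a small Python sketch of searching a message body for the Kanji-IN sequence. It only illustrates the principle and is not a procmail recipe. The names `has_kanji_in` and `visible_form` and the sample body are invented for this example. The `visible_form` helper rests on the assumption that some viewers hide or drop the ESC byte, leaving only the printable characters.

```python
ESC = b"\x1b"
KANJI_IN = ESC + b"$B"   # Kanji-IN sequence as described above

def has_kanji_in(body: bytes) -> bool:
    """Return True if the raw body contains the Kanji-IN escape sequence."""
    return KANJI_IN in body

def visible_form(body: bytes) -> str:
    """Show the body as it might look if a viewer dropped the ESC bytes
    (an assumption about the viewer, not a documented behaviour)."""
    return body.replace(ESC, b"").decode("ascii")

# Hypothetical body: English text with one short double-byte portion.
sample = b"How do I translate \x1b$BF|K\\8l\x1b(J here?"

print(has_kanji_in(sample))
print(has_kanji_in(b"BBQ costs $B5"))  # no ESC byte, so no Kanji-IN
print(visible_form(sample))
```

The point of the second call is that "$B" alone is not enough: the match depends on the ESC byte in front of it.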

Well, I sure don't _see_ those Kanji-IN and Kanji-OUT sequences.  All
the Japanese code _appears_ to begin with "B", not "$B".  This "search"
line in my first-try recipe catches most, but not all of the posts
containing Japanese: "* ([^0-9]\%|\$[^0-9])", as I stated in my previous
post here.  I don't think I've seen a post with a string of J code
longer than 15 or 20 characters that didn't contain a "%" or a "$".  And
almost always the "%" is preceded by a non-numeric character (as
opposed to the usual use, e.g., "15%"; similarly for "$").
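
As a rough way to see what that heuristic does and does not catch, here is the same pattern rendered in Python. This is a sketch only: Python's `re` syntax is not identical to procmail's, and the function name `looks_japanese` and the sample lines are made up.

```python
import re

# Python rendering of the condition "([^0-9]\%|\$[^0-9])".
# "%" needs no escaping in Python's re syntax.
HEURISTIC = re.compile(r"[^0-9]%|\$[^0-9]")

def looks_japanese(line: str) -> bool:
    """Return True if the line trips the '%'/'$' heuristic."""
    return HEURISTIC.search(line) is not None

# Hypothetical sample lines.
ordinary = "Rates went up 15% and it cost $20."
jis_text = "How do I translate $BF|K\\8l(B here?"  # JIS text as seen without ESC
false_positive = "Is 100 % correct?"               # space before "%"

for line in (ordinary, jis_text, false_positive):
    print(looks_japanese(line), line)
```

The third line shows the weakness of the approach: ordinary English with a "%" or "$" next to a non-digit is flagged as well.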

I've just done some testing attempting to use the pattern "[ESC]$B"
in this recipe:

    VERBOSE=on
    :0 HBc:
    * ^TOrdm
    * ($B)
    1ESC
    VERBOSE=off

Using the vi editor I typed "[ESC]$B" as Ctrl+V, Esc, $B, and then
forwarded to myself some of those posts with Japanese.  It doesn't work,
but maybe I don't know what I'm doing.
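
One possible factor, offered as a guess rather than a diagnosis: in regular expressions an unescaped "$" is an end-of-line anchor, not a literal dollar sign, and procmail's regular expressions treat it that way too, so the pattern may need "\$". The Python sketch below shows the difference using Python's `re` module, whose syntax is not identical to procmail's. The sample body is invented.

```python
import re

# Hypothetical body containing a Kanji-IN sequence (ESC $ B).
body = b"How do I translate \x1b$BF|K\\8l\x1b(B here?\n"

# Unescaped "$" is read as an end-of-line anchor, so this cannot match
# a literal dollar sign followed by "B".
unescaped = re.search(b"\x1b$B", body)

# Escaping the "$" makes it a literal character.
escaped = re.search(rb"\x1b\$B", body)

print(unescaped is None)
print(escaped is not None)
```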


Please note that Kanji in Subject: is another story. It's MIME encoded
and looks something like:
"Subject: =?ISO-2022-JP?B?GyRCRnxLXDhsJE4lNSVWJTglJyUvJUgbKEI=?="

Yes, some posts have this kind of Subject: header, but only a few. And
even those may have only English in the body.
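
The Subject: header quoted above can be unpacked to see what is inside it. The following Python sketch uses the standard `email.header` module. The variable names are arbitrary and the only input is the header value from the quoted example.

```python
from email.header import decode_header

# Header value copied from the quoted example above.
subject = "=?ISO-2022-JP?B?GyRCRnxLXDhsJE4lNSVWJTglJyUvJUgbKEI=?="

# decode_header() returns a list of (bytes, charset) pairs; this header
# consists of a single encoded word.
raw, charset = decode_header(subject)[0]
text = raw.decode(charset)

print(charset)
print(raw[:3] == b"\x1b$B")   # the decoded bytes begin with ESC $ B
print(len(text))              # number of decoded characters
```

So the Kanji-IN sequence is present in such a Subject: line, but only after base64 decoding, which is why a plain pattern match on "[ESC]$B" would not see it in the header.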

Hope this helps, too.

Well, you sure got me thinking.

Dick Moores  rdm(_at_)netcom(_dot_)com