procmail

Re: Filter for Japanese double-byte characters?

1999-10-09 06:02:16
On Sat, 9 Oct 1999 05:30:11 -0700 (PDT), Dick Moores <rdm(_at_)netcom(_dot_)com>
wrote:
^At^AI^Ie^I^C^AA^N1/4^AI^AR^A}^A^Ah and so on
(the "1/4" is a single character)
I think I could find this stuff by searching on [Ctrl] plus, say
capital A with a circumflex accent (I believe this is character code
194, right?), or capital A with an acute accent (code 193?). How can I

If you can show those characters in their "raw" version, that would
help. 

Ctrl is just a "classification" thing: the character codes zero through
31 decimal are "control" characters because that's what they are used
for in the original ASCII encoding (on a teletype, ctrl-s would stop
the terminal from printing out output temporarily, ctrl-q would
resume, for example -- that sort of "control"). So when you type
ctrl-s you transmit a byte whose value is 19 and when you type A you
transmit a byte whose value is 65 and so forth, and the "controlness"
of the first is simply because its number is in the "control" range.
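
A quick way to see those numbers for yourself, as a minimal sketch assuming a POSIX shell with od, tr and sed available (the helper name byte_values is made up for this example):

```shell
# Print the decimal value of each byte on stdin, one per line.
byte_values() {
    od -An -tu1 | tr -s '[:space:]' '\n' | sed '/^$/d'
}

# \023 is octal for 19, the byte that ctrl-s sends; A is byte 65.
printf '\023A' | byte_values
# prints 19 and 65 on separate lines
```

Nothing in the byte itself says "control"; 19 simply falls in the range 0 through 31.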

With that explanation, "control-accented character" doesn't really
make sense at all (although in a wicked sort of way, it makes sense
for characters in the range 128-159 in various ISO-8859 character sets
such as Latin-1 aka ISO-8859-1).

In the Latin-1 character set, uppercase A with an acute accent is
character number 193, so you got that right, but it's not "control A
acute", it's just "A acute" and the character you are seeing in front
of it is probably a regular caret character (byte value 94).

There are programs such as viz(1) or cat -A or od(1) which let you
view the exact unambiguous byte values. Getting an od dump of the
string above would probably help a little bit.
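
To illustrate why a dump helps, here is a rough sketch assuming a POSIX shell, od, and a cat that supports -v (GNU and BSD cat do; the helper name hex_bytes is hypothetical). A real control-A byte and the two-character sequence caret plus A look identical in caret notation, but od tells them apart:

```shell
# Hex value of each byte on stdin, space separated on one line.
hex_bytes() {
    # Unquoted on purpose: word splitting collapses od's padding.
    echo $(od -An -tx1)
}

# Both of these display as ^A in caret notation ...
printf '\001' | cat -v; echo
printf '^A'   | cat -v; echo

# ... but the underlying bytes differ.
printf '\001' | hex_bytes    # 01
printf '^A'   | hex_bytes    # 5e 41
```

If the dump of your message shows 5e (decimal 94) in front of each letter, you are looking at literal carets rather than control characters.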

Just for example, here are the first few bytes of a binary file in od
-ch format:

 $ od -ch /vmunix | head -2
 0000000 203 001  \b  \0   ¦   ö 215   7 200   ×   Q  \0  \0  \0  \0  \0
         0183 0008 f6a6 378d d780 0051 0000 0000

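The first row is a character rendering where non-printable bytes are
shown as escape codes (with backslashes) or in octal notation, and
the second row is hexadecimal. The hexadecimal row groups the bytes
into 16-bit words, which is why each pair of bytes appears swapped
here (203 001 octal shows up as 0183). It is also possible to get
decimal and a number of other formats out of od -- see the manual
page for details.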
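
As a small sketch of those other formats, assuming an od that accepts the POSIX -t type strings (dump_as is a name invented for this example), here is A followed by byte 193 (A acute in Latin-1) dumped byte by byte in three bases:

```shell
# Dump stdin byte by byte in the given od type:
# u1 = decimal, o1 = octal, x1 = hexadecimal.
dump_as() {
    # Unquoted on purpose: word splitting collapses od's padding.
    echo $(od -An -t"$1")
}

printf 'A\301' | dump_as u1   # 65 193
printf 'A\301' | dump_as o1   # 101 301
printf 'A\301' | dump_as x1   # 41 c1
```

The one-byte types avoid the byte swapping that the two-byte hexadecimal words introduce.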


   * -1^0
   * 1^1 [Ctrl] plus A with a circumflex accent

   * -5^0
   * 1^1 [Ctrl]

Because the caret, like the dollar sign, has a special meaning to
Procmail, you have to backslash-escape it. Things get more complicated
still because Procmail's parser is sort of broken when the first thing
in a condition is a backslash, so we avoid putting a backslash as the
first character in the regular expression by putting it inside a set
of parentheses (which don't have any other meaning; parentheses are
used for grouping generally but it doesn't usually make sense to
"group" a single character):

    * -1^0
    *  1^1 (\^)A

    * -5^0
    *  1^1 (\^)

I have also replaced the circumflex A with a regular one here.
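
If you want to try the escaped expression outside Procmail first, something like the following sketch may help. It uses grep -E as a stand-in, on the assumption that it treats this particular pattern the same way; Procmail's regex engine differs from egrep in other details, and the -o option is an extension found in GNU and BSD grep. The helper name count_caret_a is made up.

```shell
# Count literal caret-plus-A pairs on stdin.
count_caret_a() {
    grep -Eo '(\^)A' | wc -l | tr -d ' '
}

# The start of the sample string from the original question.
printf '%s\n' '^At^AI^Ie' | count_caret_a    # 2
```

The count only gives a rough idea of how many hits a 1^1 weight would have to add up; Procmail's own log is the place to check the real score.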

Hope this helps,

/* era */

-- 
 Too much to say to fit into this .signature anyway: <http://www.iki.fi/era/>
  Fight spam in Europe: <http://www.euro.cauce.org/> * Sign the EU petition