Re: Who is the procmail maintainer? (revisited 2005)

Bart Schaefer:

Professional Software Engineering:

I'd really like to see procmail be able to parse multipart messages,
handle encodings (imagine being able to search for text and match it
even if it is encoded, just by indicating you want to search the
"decoded" target


These are features that I'd support, because they're more than just
sugar.  However, it's important to have a realistic understanding of
just how difficult this is to get right.  For example, what happens
when the text you've decoded is in a different character set than the
one in which your recipe file is written?


Unicode is the central encoding. As you might know, that brings in
problems of its own, but let's choose to ignore (most of) those.

Below I use ASCII (7 bits, 0-127) as the base subset of Unicode.
The same can be achieved by choosing ISO-8859-1 (8 bits, 0-255).

The Unicode layer (often UTF-8 or UCS-2 in memory) should only be
activated when necessary, that is, when the search patterns use codes
that need more than 7 bits (beyond ASCII). Perl 5.8 does that in a very
elegant way.

So when a message comes in with some base64-encoded Unicode text, that
doesn't mean that Unicode is needed at the search level, neither in the
data nor in the search commands. The encoded data can often be decoded
down, without functional loss, to the encoding that the search uses.
All of this elegance (not limited to time and space) can happen hidden
from the user.
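
To make the "decode down" idea concrete, here is a minimal sketch in
Python (not procmail code, and not how Perl 5.8 does it internally).
The function name decode_body and the search_encoding parameter are
made up for illustration. It returns bytes when the decoded text fits
the search encoding, and a Unicode string only when it does not.

```python
import base64

def decode_body(payload_b64, charset, search_encoding="ascii"):
    """Decode a base64 body; narrow it to the search encoding if lossless.

    Returns bytes (narrow mode) when every character is representable
    in search_encoding, otherwise a str (Unicode mode).
    """
    text = base64.b64decode(payload_b64).decode(charset)
    try:
        # Lossless only if every character exists in the target encoding.
        return text.encode(search_encoding)
    except UnicodeEncodeError:
        # At least one character cannot be narrowed: stay in Unicode.
        return text
```

A UTF-8 body that happens to contain only ASCII characters comes back
as bytes, so the search never has to touch the Unicode layer.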

When a search command contains 'intricate' non-ASCII, the data will
remain in (or be transformed to) Unicode format, and the search command
is transformed to its Unicode equivalent. So the user can force
Unicode mode by using things like '\x{0640}' (ARABIC TATWEEL) in a
pattern, or by using a local code that has no obvious representation in
ASCII (or ISO-8859-1, as the case may be).
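
The decision of whether a pattern forces Unicode mode could be
sketched as follows, again in Python and purely as an illustration.
The helper name needs_unicode_mode is hypothetical, and only the
'\x{HHHH}' escape form is handled; a real implementation would also
have to look at named escapes and character classes.

```python
import re

# Matches Perl-style \x{HHHH} code point escapes.
_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")

def needs_unicode_mode(pattern, search_encoding="ascii"):
    """Return True if the pattern uses a character outside search_encoding."""
    # Replace each \x{HHHH} escape with the character it names.
    expanded = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)
    try:
        expanded.encode(search_encoding)
    except UnicodeEncodeError:
        return True
    return False
```

With this sketch, a pattern containing '\x{0640}' reports True, while a
pattern made only of ASCII characters reports False.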

A '\x{0020}' (SPACE) will never force Unicode mode, since it has an
ASCII representation.
A '\x{20AC}' (EURO SIGN) might force Unicode mode (because it is not in
ASCII and not even in ISO-8859-1).
A more general notion like '[[:currency_symbol:]]' does not need to
force Unicode mode, because it can often be mapped down to '\x{0024}'
(DOLLAR SIGN) or to '\x{00A4}' (CURRENCY SIGN).
The aforementioned elegance can (if feasible and if not blocked by a
configuration option) translate a '\$' or '\N{EURO SIGN}' in the search
pattern to such a '[[:currency_symbol:]]'.
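
One way to picture the mapping down of a '[[:currency_symbol:]]' class
is sketched below in Python. It rests on assumptions: the class is
taken to mean the Unicode general category Sc (Symbol, currency), the
function names are invented, and Python's re module has no such named
class, so an explicit bracket expression is built instead.

```python
import re
import sys
import unicodedata

def currency_symbols(search_encoding="ascii"):
    """List the category-Sc characters representable in search_encoding."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if unicodedata.category(ch) != "Sc":
            continue
        try:
            ch.encode(search_encoding)
        except UnicodeEncodeError:
            continue  # not available in the narrow encoding
        found.append(ch)
    return found

def currency_class(search_encoding="ascii"):
    """Build a bracket expression matching the mapped-down symbols."""
    return "[" + "".join(re.escape(ch) for ch in currency_symbols(search_encoding)) + "]"
```

For ASCII the class collapses to the dollar sign alone; for ISO-8859-1
it also covers the cent, pound, currency and yen signs, but not the
euro sign.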

For more currency symbols:
http://www.unicode.org/charts/PDF/Unicode-3.2/U32-20A0.pdf

For the direction of PCRE, don't hesitate to read
http://www.perl.com/pub/a/2002/06/04/apo5.html

-- 
Ruud

