That's furrin', pronounced like yar an Amerikun who has never been near a
'puter or that enternet thang. Y'all may not evun have runnin' water.
Right up front let me point out that I personally won't bother responding
to any replies which unnecessarily quote this post in its entirety. Be a
good net.citizen and excerpt just what is necessary for reply context.
Pursuant to that ugly exchange here last week wherein someone dropped in
and advertised some other list, I went and checked to see whether there was
anything worth noting over there. I can't say that it appears that there's
much there (and certainly no existing rcfile base to refer to), though
anyone interested should really check for themselves rather than taking my
word for it. However, I couldn't help but spot the following posts (among
the ~5 messages there):
<http://www.unix.com/showthread.php?s=03e01900ee65b3b27885ce184e078cae&threadid=9707>
Specifically, the Subject: regexp is faulty (zero or more colons), and the
first of the character set tests includes:
euc-kr|3Deuc-kr|euc-kr
(the alert reader will note that a single, plain, 'euc-kr' will match for
all three, including the unnecessary duplicate). This same lineup exists
in the rewrite, as does the Subject: regexp problem.
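
To make the duplication concrete, here is a minimal sketch. Only the
alternation is taken from that thread; the "^Content-Type:.*" prefix and the
FURRIN variable are assumptions of mine, used purely for illustration.

```procmail
# As posted (alternation only; the header prefix is assumed):
:0
* ^Content-Type:.*(euc-kr|3Deuc-kr|euc-kr)
{ FURRIN=yes }

# Equivalent: ".*" already absorbs a leading "3D", so one alternative
# covers every case the three did.
:0
* ^Content-Type:.*euc-kr
{ FURRIN=yes }
```

Procmail matches case-insensitively unless the D flag is given, so case
variants need no separate alternatives either.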
'3D' appears prefixed to several of the character set declarations in other
conditions as well. That's the MIME quoted-printable hex character code
for '='. I've only seen that within HTML within a MIME quoted-printable
block, where the MIME headers themselves should properly have a character
encoding identifier, thus an HTML META tag within the MIME quoted-printable
body isn't generally very significant (at least, not in my experience).
The flags as used (or rather, not used) on those recipes mean
that their author is limiting the check to the headers, so checking for
quoted-printable content _within_ the Content-Type: header further confuses
matters, since that header _should_not_ be escaped. Again, perhaps his
experience has shown otherwise.
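
For reference, a sketch of how recipe flags set the search scope: with no
flags a condition is tested against the header only, B switches to the body,
and HB covers both. The FURRIN variable and the lone euc-kr charset are
placeholders of mine, not taken from the recipes under discussion.

```procmail
# No flags: header only (H is the default).  An unescaped
# charset=euc-kr is what a well-formed Content-Type: header carries.
:0
* ^Content-Type:.*charset="?euc-kr
{ FURRIN=header }

# B flag: body only.  This is where a quoted-printable HTML part could
# contain an escaped META tag, i.e. charset=3Deuc-kr.
:0 B
* charset=3Deuc-kr
{ FURRIN=body }
```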
Also overlooked by the recipe provided there is that From: often has
character set escaping, much like Subject: can. Though the chap does check
Subject:, he doesn't check it against the same list of encodings he's
checking the Content-Type: header for. Tossing them all into a variable and
then re-using them between the conditions would be a whole lot easier, as
most experienced users know: separately maintaining multiple hand-coded
copies of a regexp is error prone.
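
A minimal sketch of the single-variable approach, assuming a hypothetical and
heavily abbreviated KR_CHARSETS list and a placeholder FURRIN variable:

```procmail
# Hypothetical, heavily abbreviated list; define it once...
KR_CHARSETS="euc-kr|ks_c_5601-1987|iso-2022-kr"

# ...then expand it in every condition that needs it.  The leading "$"
# on a condition line asks procmail to perform variable substitution.
:0
* $ ^Content-Type:.*charset=(\")?(${KR_CHARSETS})
{ FURRIN=content-type }

:0
* $ ^(From|Subject):.*=\?(${KR_CHARSETS})\?[QB]\?
{ FURRIN=encoded-word }
```

Adding an encoding then means touching one assignment, and every condition
picks it up.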
So, it seems that perhaps I should share my furrin' charset recipes, so
ain't no wun waste no mo' time writun a charset recipee or using one with
more bugs than momma's fine grits. Yah hear? Ol' Jethro here don't read
chinese or tajiki, and he don't subscribe to non-english discussion lists
neither, so anything he be receiv'un that are in a different character set,
he don't want to waste his time wif.
(My apologies if the attempt at dirtwater country boy slang used to provide
some texture to this otherwise starchy message might confuse some of those
here for whom English is not a native language.)
This is free for use and discussion here on the official procmail
list. Note that I define quite a few character set encodings (and have
attempted to be rather complete in doing so, perhaps retentively so), which
are grouped by language or (lacking a desire to make a separate category
for each language, and not being sufficiently familiar with them all to
better categorise them) something akin to geopolitical origins. If anyone
has suggestions on better divisions, I'm all ears, but as-is, it works well
enough for me while still affording some meaningful groupings.
Please pardon the linewraps. The file can be downloaded from the URL
provided there in the header. There's no need to re-download it often
trying to stay on top of things - I don't update it frequently. The update
which prompted this posting was a change to the method by which I emit
logging data from these recipes (tacking it onto one variable instead of
emitting per event, which previously dumped the main spamrc script version
with each event).
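
Roughly, the accumulate-then-emit idea looks like the sketch below. It is an
illustration rather than an excerpt from the main script: the NL definition
and the single euc-kr test are assumptions, and only the SPAMNOTES name comes
from the rcfile.

```procmail
# NL holds a literal newline (the assignment spans two lines on purpose).
NL="
"

# Each check appends a line to one variable rather than writing to the log.
:0
* ^Subject:.*=\?euc-kr\?
{ SPAMNOTES="${SPAMNOTES}SPAM: euc-kr in Subject.${NL}" }

# Later, emit everything gathered so far with a single LOG assignment.
:0
* SPAMNOTES ?? .
{ LOG="${SPAMNOTES}" }
```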
Any miscategorization of a language/characterset is not intended as a
slight against those of that nationality, nor is the decision to categorize
that character set as unwanted. Seeing as I don't communicate in the
written word much outside of the language I use, many foreign character
sets - not just Chinese and Korean - are largely meaningless to me.
I couldn't find an encoding for Aramaic...
I should point out that a DNSBL which I employ (at the SMTP level)
presently blocks connections from the complete IP block assignments of
China (CN), Taiwan (TW), Korea (KR), Indonesia (ID), and India (IN), so a
fair number of the more "virulent" open relays and the furrin' spam they
propagate never reach my account to be subjected to charset
filtering. The SMTP bounces direct senders to a policy page, and those
senders (if legitimate) can utilize alternate means of contact (perhaps a
yahoo account).
# ==========================================================================
# File: furrin.rc
# Description: procmail script for foreign character sets
# Author: Sean B. Straw
# Source: <http://www.professional.org/procmail/furrin.rc>
# Copyright: Portions copyright (c) 2000-2003, Sean B. Straw
# Disclaimer: <http://www.professional.org/procmail/disclaimer.html>
# Licensing: Free for use by the procmail community.
# Support: Visit the official procmail discussion list to ask
# procmail questions. If you need custom procmail
# work performed (including modifications to this
# rcfile), the author is available for paid consulting.
#
#
# This is a procmail recipe file to handle rejecting messages identified as
# employing certain character sets. Although as used here, it identifies
# messages as "spam" for the author's own purposes, it should NOT be assumed
# that a message in a foreign character encoding is in fact spam.
#
# Users should keep in mind that several character sets are functional
# supersets of the Latin-1 (or similar) character set, and can therefore be
# used to communicate Western European languages in addition to their own
# intended language.
#
# DO NOT AUTO-SUBMIT MESSAGES TO SPAM DATABASES BASED SOLELY UPON THE
# RESULTS OF THIS SCRIPT.
#
# This script may rely upon macros defined outside of this file.
# Additionally, some variables set here are expected to be acted upon by
# subsequent recipes, rather than dealing with the "spam" right within this
# rcfile.
#
#
# Useful references (no particular order):
# <http://www.iana.org/assignments/character-sets>
# <http://www.unicode.org>
# <http://www.unicode.org/charts/>
# <http://www.iso.ch>
# <http://www.goof.com/pcg/data/marc/iso/locale.txt>
# <http://anubis.dkuug.dk/i18n/charmaps/>
# <http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html>
# <http://www.w3.org/International/O-charset-list.html>
# <http://www.microsoft.com/globaldev/reference/cphome.mspx>
# <http://msdn.microsoft.com/workshop/database/tdc/reference/charset.asp>
# <http://msdn.microsoft.com/workshop/Author/dhtml/reference/charsets/charset4.asp>
# <http://clisp.cons.org/impnotes/encoding.html>
# <http://www.cwi.nl/~aeb/linux/man2html/man7/charsets.7.html>
# <http://www.mozilla.org/quality/intl/chardetect.html>
# <http://www.mozilla.org/docs/l10n/l10nkits/client/windows/docs/nav40/xpencui.htm>
# <http://www.li18nux.org/docs/html/CodesetAliasTable-V10.html>
# <http://www.terena.nl/library/multiling/ml-docs/wincharsets.html>
# <http://www.chilkatsoft.com/ChilkatIConv.asp>
# <http://java.sun.com/j2se/1.4.1/docs/guide/intl/encoding.doc.html>
#
# See also various RFCs, including 1489 and 1557.
#
# ==========================================================================
# Let's start by defining the character sets, grouped by language.
# All of 'em we can lay our hands on, whether you receive them or not.
#
# Obviously, some character sets encompass more than one language set.
# It is advisable to group them according to the more common language,
# favouring the languages which you're likely to RETAIN.
#
# Out of necessity, some character sets (notably, Cyrillics) have been
# grouped by geopolitical origins as this author understands them. I'm not
# a linguist, nor do I have a deep understanding of some of these languages.
# If you have information pertaining to proper reassignment of some of these
# character sets, please contact the author (see website).
#
# Properly, the regexps these are used in are bounded on both sides, so
# "roman" and "romanian" should not collide.
#
CHARSET_JP="WINDOWS-932|EUC-JP|(cs-?)?ISO-?2022-?JP(-[12])?|ISO-2022-D|SHIFT[-_]JIS|JIS[-_]?X[-_]?02(08|01|12|13)|sjis|jis7|ms-kanji|(x-)?mac(-)?japanese|x-EBCDIC-Japanese(Katakana|AndUSCanada|AndJapaneseLatin|AndKana)"
CHARSET_CN="WINDOWS-(936|950)|EUC-CN|(hz-|x-euc-tw)?GB[-_]2312|(cn-)?(BIG5|gb)|ISO-2022-([EGHIJKLM]|cn|cn-ext)|ISO-IR-165|GB8565\.2(-1988)?|x-euc-tw|hz|iso-ir-58|gbk|big5-hkscs|gb18030|(x-)?mac(-)?chinese(trad|simp)|x-EBCDIC-(Traditional|Simplified)Chinese|x-Chinese-(CNS|eten)"
# non-standards compliant variations of chinese
CHARSET_CN_BOGUS="CHINESEBIG5|BIG-5"
CHARSET_KR="WINDOWS-949|EUC-KR|KS[-_ ]?C[-_ ]?5601([-_ ]?1987)?|ISO-2022-(C|kr)|KS[-_]?X[-_]?1001|ksc5636|iso-646-kr|uhc|johab|(x-)?mac(-)?korean|iso-ir-149|x-EBCDIC-(KoreanAnd)?KoreanExtended"
# some mailer actually sets this
CHARSET_BOGUS="X-UNKNOWN|USER-DEFINED"
# Not recommended to block these - they're all rather encompassing
CHARSET_UNICODE="UTF(-)?(7|8|16)|UCS(-)?(2|4)|UNICODE-1-1-UTF-7|ISO-10646-UCS-2|UNICODE-(16|32)((LITTLE|BIG)-ENDIAN)?|unicodeFFFE|JAVA|x-EBCDIC-International(-euro)?"
# If you're english, you probably don't want to block this one either.
CHARSET_ENG="US-ASCII|ASCII|iso-ir-6|iso646-us|x-EBCDIC-(cp-us|UK)(-euro)?"
# Western European (English, but also French and many others. Standard)
CHARSET_WESTEURO="WINDOWS-1252|ISO-?8859-(1|15)|iso-ir-100|(x-)?mac(-)?roman|latin-?(1|9)|macintosh|x-IA5(-German)?|x-ebcdic-(spain|italy|germany|france)(-euro)?|x-europa"
# Central/Eastern European (non-english)
CHARSET_SLAVIC="WINDOWS-1250|ISO-?8859-(2|16)|iso-ir-(87|102)|(x-)?mac(-)?(central-europe|ce|croatian)|latin-?2|CP870"
# uncommon stuff and/or generally obsoleted. Includes Maltese (eh, sorry
# if that's you)
CHARSET_FUNKYLATIN="ISO-?8859-[34]|iso-ir-109|latin-?3"
# Russian, et-al.
# KOI8-T is Tajiki (Tajikistan)
# armscii-8 is Armenian
CHARSET_CYRILLIC="WINDOWS-1251|ISO-?8859-5|KOI8(-(RU|[RTU]))?|ISO-IR-(101|111|144|147)|IBM866|(x-)?mac(-)?(romanian?|cyrillic|ukrain(e|ian))|nunacom-8|armscii-8|x-EBCDIC-Cyrillic(SerbianBulgarian|Russian)"
# Arabic
CHARSET_ARABIC="WINDOWS-1256|ISO-?8859-6|iso-ir-127|(x-)?mac(-)?arabic|asmo-708|x-EBCDIC-Arabic"
# Greek
CHARSET_GREEK="WINDOWS-1253|ISO-?8859-7|(x-)?mac(-)?greek|iso-ir-(126|150)|x-EBCDIC-Greek(Modern)?"
# Hebrew
CHARSET_HEBREW="WINDOWS-1255|ISO-?8859-8(-i)?|(x-)?mac(-)?hebrew|iso-ir-138|x-EBCDIC-Hebrew"
# Turkish
CHARSET_TURKISH="WINDOWS-1254|ISO-?8859-9|(x-)?mac(-)?turkish|iso-ir-(109|148)|latin-?5|x-EBCDIC-Turkish|CP1026"
# Icelandic/Nordic (i.e. Iceland, Greenland, Norway, Sweden...)
CHARSET_NORDIC="ISO-?8859-10|(x-)?mac(-)?iceland(ic)?|iso-ir-60|x-IA5-(Norwegian|Swedish)|x-EBCDIC-(FinlandSweden|DenmarkNorway|Icelandic)(-euro)?"
# Thai (ISO not _actually_ used, but draft standard is same)
CHARSET_THAI="WINDOWS-874|TIS[-_]?620|ISO-?8859-11|mulelao-1|ibm-cp1133|(x-)?mac(-)?thai|x-EBCDIC-Thai"
# ISO-8859-12 is bogus (was suggested to be vietnamese, but can't fit).
# However, I've seen this encoding specified in spam, and lacking an
# official designation, I'm hocking it here.
CHARSET_VIETNAM="WINDOWS-1258|ISO-?8859-12|viscii|tcvn5712|vps"
# Baltic Rim
CHARSET_BALTIC="WINDOWS-1257|ISO-?8859-13|iso-ir-110"
# Celtic (Irish and Welsh)
CHARSET_CELTIC="ISO-?8859-14"
# Other stuff which escapes categorization at this time
CHARSET_MISC="isiri-3342|x-iscii-(as|be|de|gu|ka|ma|or|pa|ta|te)"
# Include desired subsets (which are defined above) here. This defines
# the language encodings we do not want to receive.
# Make sure OR condition exists only between those which you employ (i.e.
# that there are not EMPTY OR condition sets)
# As provided, this particular set includes all the languages which
# this author does not correspond using.
# DO NOT simply utilize this configuration without first reviewing it.
CHARSETS="${CHARSET_CN}|${CHARSET_CN_BOGUS}|${CHARSET_KR}|${CHARSET_JP}|${CHARSET_BOGUS}|${CHARSET_SLAVIC}|${CHARSET_FUNKYLATIN}|${CHARSET_CYRILLIC}|${CHARSET_ARABIC}|${CHARSET_GREEK}|${CHARSET_HEBREW}|${CHARSET_TURKISH}|${CHARSET_THAI}|${CHARSET_VIETNAM}|${CHARSET_BALTIC}|${CHARSET_MISC}"
# Ok, that absolute DOOZIE of a regexp is now defined. Let's go use it...
# ==========================================================================
# Actual recipes using the defined regexp
# Messages identifying the character set in the From: or Subject:
:0
* $ ^(From|Subject):${wsstar}=\?\/(${CHARSETS})\?[QB]
{
# This scrubs the delimiters from the MATCH string,
# leaving us with just the text of the matched charset descriptor.
:0
* MATCH ?? ()\/[^?]+
{
SPAMNOTES="${SPAMNOTES}SPAM: Foreign character set encoding (${MATCH}) used in From or Subject.${NL}"
SPAMMISHNESS="${SPAMMISHNESS}+300"
}
}
# Messages identifying the character set in the Content-Type: *HEADER*
# (you can expand this to cover the body as well as headers, by adding
# "HB" flags)
:0
* $ ^Content-Type:.*charset=(\")?\/(${CHARSETS})(\")?\>
{
# This scrubs the delimiters from the MATCH string,
# leaving us with just the text of the matched charset descriptor.
:0
* MATCH ?? ()\/[^?";]+
{
SPAMNOTES="${SPAMNOTES}SPAM: Foreign character set encoding (${MATCH}) in body.${NL}"
SPAMMISHNESS="${SPAMMISHNESS}+300"
}
}
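# Check for hibit characters in the Subject:, From: and To: headers
# (character class contains the 0x80 - 0xff character range; the first
# character of the class must be the literal byte 0x80, which some
# archives and editors render as '?')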
:0
* ^(Subject|From|To):\/.*[?-ÿ]
{
:0
* -2^0
* 1^1 MATCH ?? [?-ÿ]
{
ISSPAM=1
SPAMNOTES="${SPAMNOTES}SPAM: raw 8-bit characters in the Subject/From/To${NL}"
}
}
# ==========================================================================
# The module which includes this one should take action based on variables
# which are set in the recipes above.
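
As a rough idea of what such an including module might do, here is a sketch.
The include path and the two folder names are assumptions, and it makes no
attempt to evaluate SPAMMISHNESS, whose handling lives outside this file.

```procmail
# Pull in the checks; the path is an assumption about your layout.
INCLUDERC=$HOME/.procmail/furrin.rc

# ISSPAM is set outright by the raw 8-bit check.
:0:
* ISSPAM ?? 1
spam

# The charset checks only add to SPAMNOTES/SPAMMISHNESS; here any note
# at all is treated as grounds for filing the message separately.
:0:
* SPAMNOTES ?? .
furrin
```

Filing into a separate folder rather than discarding keeps with the warning
above that a foreign character set alone does not make a message spam.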
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.