Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
Bruce> That would only work if it could be absolutely guaranteed that
Bruce> no untagged data in any other character set might appear. No
Bruce> such guarantee is possible, and in fact Usenet abounds with
Bruce> untagged charsets, of which, according to Andrew Gierth, "no
Bruce> significant amount" is utf-8.
and as I also pointed out on USEFOR, the untagged utf-8 that _does_
appear can be distinguished from the other untagged 8-bit charsets by
means of a trivial heuristic with an extremely low error rate (no
false negatives, very few false positives).
Two observations:
1. if only an insignificant amount of untagged utf-8 currently appears,
extrapolating error rates from that sample is quite risky; i.e. one might
find much higher error rates if more untagged utf-8 were used.
2. non-zero error rates are probably acceptable for non-critical
purposes (e.g. display), but are generally unacceptable for critical
use (as in transmission via gateways to/from domains where strict
transmission protocols are in effect).
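
For concreteness, here is a sketch of the kind of heuristic under
discussion. The message quoted above does not spell the heuristic out, so
this is an assumption about its general shape rather than the actual
proposal: treat untagged 8-bit data as utf-8 if and only if it contains at
least one non-ASCII octet and the whole octet sequence is well-formed utf-8.
The function name looks_like_utf8 is made up for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess whether untagged 8-bit data is utf-8 (illustrative sketch only)."""
    # Pure ASCII is identical in utf-8 and in the usual 8-bit charsets,
    # so there is nothing to distinguish.
    if data.isascii():
        return False
    # Strict decoding rejects stray continuation octets, truncated
    # sequences, overlong forms and encoded surrogates.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "utf-8 text": "café".encode("utf-8"),
        "latin-1 text": "café".encode("latin-1"),
        "latin-1 text that happens to be valid utf-8": "Ã©".encode("latin-1"),
    }
    for label, octets in samples.items():
        print(label, octets, looks_like_utf8(octets))
```

Under this sketch, well-formed utf-8 containing non-ASCII octets is always
accepted, which corresponds to the "no false negatives" claim. The third
sample shows how a false positive arises: the two Latin-1 characters "Ã©"
are the octets C3 A9, which are also a well-formed utf-8 encoding of "é".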