Re: RFC 2047 and gatewaying

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

and as I also pointed out on USEFOR, the untagged utf-8 that
_does_ appear can be distinguished from the other untagged 8-bit
charsets by means of a trivial heuristic with an extremely low
error rate (no false negatives, very few false positives).


 Bruce> Two observations:
 Bruce> 1. if only an insignificant amount (of untagged utf-8)
 Bruce> currently appears, extrapolating the error rate is
 Bruce> quite risky; i.e. one might find much higher error rates if
 Bruce> more untagged utf-8 were used.

The rate of false _positives_ (data assumed to be utf-8 which in fact
is not) does not depend on the amount of utf-8 data present. Precisely
the reverse: it depends on the amount of non-utf-8 unlabelled 8-bit
data, of which I had a large sample to work from.

The false-negative error rate would depend on the amount of utf-8 data
present if it were not for the fact that it is zero by definition (since
the heuristic is "assume the data is utf-8 if in fact it is valid utf-8").
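
For concreteness, here is a minimal sketch of that heuristic in Python. The function names, the pure-ASCII shortcut and the iso-8859-1 fallback are assumptions made for illustration; they are not taken from any gateway implementation discussed in this thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def guess_charset(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess the charset of untagged text.

    The fallback charset is a hypothetical choice; a real gateway would
    pick whatever legacy 8-bit charset is most likely for its traffic.
    """
    if data.isascii():
        return "us-ascii"  # no 8-bit bytes, nothing to decide
    # The heuristic: assume utf-8 if and only if the data is valid utf-8.
    return "utf-8" if is_valid_utf8(data) else fallback
```

Valid utf-8 is always accepted, so there are no false negatives. A false positive needs legacy 8-bit text that happens to form valid utf-8 sequences, for example the two iso-8859-1 bytes 0xC3 0xA9, which is why that rate depends only on the non-utf-8 data.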

-- 
Andrew.
