ietf-822

Re: RFC 2047 and gatewaying

2003-01-10 17:03:33

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

and as I also pointed out on USEFOR, the untagged utf-8 that
_does_ appear can be distinguished from the other untagged 8-bit
charsets by means of a trivial heuristic with an extremely low
error rate (no false negatives, very few false positives).

 Bruce> Two observations:
 Bruce> 1. if only an insignificant amount (of untagged utf-8)
 Bruce> currently appears, extrapolating the error rates is
 Bruce> quite risky; i.e. one might find much higher error rates if
 Bruce> more untagged utf-8 were used.

The rate of false _positives_ (data assumed to be utf-8 which in
fact is not) does not depend on the amount of utf-8 data present; quite
the reverse, it depends on the amount of non-utf-8 unlabelled 8-bit
data, of which I had a large sample to work from.

The false-negative error rate would depend on the amount of utf-8 data
present if it were not for the fact that it is zero by definition (since
the heuristic is "assume the data is utf-8 if in fact it is valid utf-8").
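
As a rough illustration of that heuristic (a minimal sketch only, not the
code I actually ran; the function name is_valid_utf8 is made up for this
example, and it assumes Python's strict decoder is an acceptable
definition of "valid utf-8"):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Genuine utf-8 always passes, so there are no false negatives.
print(is_valid_utf8("café".encode("utf-8")))

# Typical Latin-1 text fails: a lone 0xE9 is not a valid utf-8 sequence.
print(is_valid_utf8(b"caf\xe9"))

# A false positive needs non-utf-8 data that happens to form valid
# sequences, e.g. the Latin-1 pair 0xC3 0xA9.
print(is_valid_utf8(b"\xc3\xa9"))
```

Pure ASCII also passes the check, which is harmless since it reads the
same under either assumption.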

-- 
Andrew.
