"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
and as I also pointed out on USEFOR, the untagged utf-8 that
_does_ appear can be distinguished from the other untagged 8-bit
charsets by means of a trivial heuristic with an extremely low
error rate (no false negatives, very few false positives).
Bruce> Two observations:
Bruce> 1. if only an insignificant amount (of untagged utf-8)
Bruce> currently appears, extrapolating the error rates is
Bruce> quite risky; i.e. one might find much higher error rates if
Bruce> more untagged utf-8 were used.
The false-_positive_ error rate (data assumed to be utf-8 which in
fact is not) does not depend on the amount of utf-8 data present;
precisely the reverse: it depends on the amount of non-utf-8
unlabelled 8-bit data, of which I had a large sample to work from.
The false-negative error rate would depend on the amount of utf-8 data
present if it were not for the fact that it is zero by definition (since
the heuristic is "assume the data is utf-8 if in fact it is valid utf-8").
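
For concreteness, here is a minimal sketch of that heuristic in Python.
The function name guess_charset and the iso-8859-1 fallback are
illustrative assumptions, not part of any proposal; the only point is
that a strict utf-8 decode either succeeds or it doesn't.

```python
def guess_charset(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess the charset of untagged text.

    Hypothetical helper: the name and the fallback charset are
    assumptions made for illustration only.
    """
    if data.isascii():
        # Pure 7-bit data: there is nothing to guess.
        return "us-ascii"
    try:
        # Strict decoding rejects overlong forms, stray continuation
        # bytes, truncated sequences and surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid utf-8, so it must be some other 8-bit charset.
        return fallback
    # Valid utf-8 is assumed to be utf-8. Genuine utf-8 always lands
    # here, which is why there are no false negatives.
    return "utf-8"


print(guess_charset("café".encode("utf-8")))    # real utf-8
print(guess_charset("café".encode("latin-1")))  # lone 0xE9 is invalid utf-8
print(guess_charset(b"\xc3\xa9"))               # latin-1 "Ã©": a false positive
```

The last call shows the only way the heuristic can be wrong: non-utf-8
text whose 8-bit bytes happen to form valid utf-8 sequences.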
--
Andrew.