perl-unicode

Re: ISO 2022 versus UTF-8 autodetection heuristics

1999-11-03 05:40:00
Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
>> The ISO 2022 code for announcing UTF-8 is
>>
>>     ESC %G
>
> Hmm, this means that actual characters are used here.  The application must
> know about this, to avoid that they are interpreted as ordinary text
> characters.  That will make it more difficult for older programs, and can
> break some things.  Escape codes can have nasty side effects when sent to a
> terminal.

There exists a strict syntax for ESC codes specified in ECMA 35 and ECMA
48 (ISO 2022 and ISO 6429). This allows applications to reliably skip
over ESC sequences that they do not know. In a nutshell, an ESC sequence
starts with ESC and ends with a letter (see the standards for the
precise details). This is widely implemented in terminal emulators (at
least in the good ones where the authors read the standards ;-).
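
As a rough illustration of how an application might jump over such
sequences, here is a minimal Python sketch. The byte ranges are my
reading of the syntax (ESC, optional intermediate bytes 0x20-0x2F, one
final byte 0x30-0x7E, plus the CSI form ESC [ with parameter bytes
0x30-0x3F and a final byte 0x40-0x7E); the name strip_escape_sequences
is made up for this example, and string-type controls such as DCS or OSC
are deliberately not handled.

```python
import re

# Assumed syntax, see ECMA-35 / ECMA-48 for the authoritative rules:
#   CSI control sequence: ESC [  parameter bytes  intermediate bytes  final byte
#   plain ESC sequence:   ESC    intermediate bytes  final byte
# The CSI alternative comes first so that "ESC [" is not consumed on its own.
ESC_SEQUENCE = re.compile(
    rb"\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    rb"|\x1b[\x20-\x2f]*[\x30-\x7e]"
)


def strip_escape_sequences(data: bytes) -> bytes:
    """Return data with every recognised escape sequence removed."""
    return ESC_SEQUENCE.sub(b"", data)


# ESC %G (the UTF-8 announcement) is skipped like any other sequence.
print(strip_escape_sequences(b"\x1b%Ghello"))
print(strip_escape_sequences(b"\x1b[1;31mred\x1b[0m text"))
```

A program that does not understand ESC %G can still pass the remaining
text through unchanged, which is the point of having a fixed syntax.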

>> The technique that mined98 uses seems to be fairly reliable. In
>> practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences
>
> 98% isn't very reliable.  I would aim for 99.9% at least.

I said >98%, not =98%! It is very likely that it works for >99.99% of
all files. It will certainly detect for >>99.99% of all German
ISO 8859-1 files that they are obviously not in UTF-8.

I challenge you to send me an orthographically correct sentence in one of
the languages listed in the ISO 8859-1 standard, encoded in Latin-1,
that does not contain a malformed UTF-8 sequence, i.e. which could not
trivially be identified as not being UTF-8.

Hints: An ISO 8859-1 file that could be misdetected as a UTF-8 file by
the obvious heuristic must contain any non-ASCII characters *only* in
the following form:

 - A start character out of "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß"
   followed by one continuation character out of
   " ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿"

 - A start character out of "àáâãäåæçèéêëìíîï" followed by two
   continuation characters out of " ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿".

 - A start character out of "ðñòóôõö÷" followed by three continuation
   characters out of " ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿".

 - A start character out of "øùúû" followed by four continuation
   characters out of " ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿".
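
The structural test behind these hints can be sketched in a few lines of
Python. This is only my illustration of the obvious heuristic, not the
code that mined98 uses; the names LEAD_RANGES, continuation_count and
could_pass_as_utf8 are invented here. The lead-byte classes follow the
list above, and continuation bytes are taken as 0x80-0xBF, of which only
0xA0-0xBF are printable characters in ISO 8859-1.

```python
# Lead-byte ranges from the list above, each mapped to the number of
# continuation bytes that must follow (an assumption of this sketch:
# only the four classes named in the list are accepted).
LEAD_RANGES = [
    (0xC0, 0xDF, 1),
    (0xE0, 0xEF, 2),
    (0xF0, 0xF7, 3),
    (0xF8, 0xFB, 4),
]


def continuation_count(byte):
    """Return how many continuation bytes a lead byte needs, or None."""
    for low, high, count in LEAD_RANGES:
        if low <= byte <= high:
            return count
    return None


def could_pass_as_utf8(data: bytes) -> bool:
    """True if every non-ASCII byte sits in a well-formed multi-byte sequence."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            i += 1  # plain ASCII, nothing to check
            continue
        count = continuation_count(byte)
        if count is None:
            return False  # stray continuation byte or unknown lead byte
        tail = data[i + 1 : i + 1 + count]
        if len(tail) != count or any(not 0x80 <= b <= 0xBF for b in tail):
            return False  # sequence truncated or interrupted
        i += 1 + count
    return True


print(could_pass_as_utf8("Grüße aus München".encode("latin-1")))  # False
print(could_pass_as_utf8("Grüße aus München".encode("utf-8")))    # True
```

A single accented letter followed by an ordinary ASCII letter, as in
almost any real Latin-1 word, is already enough to make the check fail.
The sketch is looser than a strict UTF-8 decoder (it does not reject
overlong forms, for instance), which only makes the challenge easier.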

Good luck with deliberately constructing a realistic and convincing
counter-example file that you believe represents more than 0.01% of all
plaintext files.

(Note that several hundred times more than 0.01% of all MIME messages
come with incorrect character set headers.)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>