Re: ISO 2022 versus UTF-8 autodetection heuristics


Markus -

Bram Moolenaar wrote on 1999-11-03 12:38 UTC:

  The ISO 2022 code for announcing UTF-8 is

    ESC %G


Hmm, this means that actual characters are used here.  The application must
know about this, so that they are not interpreted as ordinary text
characters.  That will make it more difficult for older programs, and can
break some things.  Escape codes can have nasty side effects when sent to a
terminal.


There exists a strict syntax for ESC codes specified in ECMA 35 and ECMA
48 (ISO 2022 and ISO 6429). This allows applications to reliably jump
over ESC sequences that they do not know. In a nutshell, an ESC sequence
starts with ESC and ends with a letter (see the standards for the
precise details). This is widely implemented in terminal emulators (at
least in the good ones where the authors read the standards ;-).
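
The "jump over what you do not know" rule can be sketched as follows. This
is only an illustration of the ECMA 35 escape sequence shape (ESC, then any
number of intermediate bytes 0x20-0x2F, then one final byte 0x30-0x7E); the
function name strip_escape_sequences is made up for this example, and the
longer ECMA 48 control sequences (ESC [ plus parameters) are deliberately
not handled.

```python
ESC = 0x1B


def strip_escape_sequences(data: bytes) -> bytes:
    """Remove ECMA 35 style escape sequences from a byte string.

    Assumed shape: ESC, zero or more intermediate bytes (0x20-0x2F),
    one final byte (0x30-0x7E). Anything else is copied unchanged.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        if data[i] != ESC:
            out.append(data[i])
            i += 1
            continue
        # Scan past the intermediate bytes that follow ESC.
        j = i + 1
        while j < n and 0x20 <= data[j] <= 0x2F:
            j += 1
        if j < n and 0x30 <= data[j] <= 0x7E:
            # Well-formed sequence: skip it, including the final byte.
            i = j + 1
        else:
            # Truncated or malformed: keep the ESC byte and move on.
            out.append(ESC)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    # ESC % G is the UTF-8 announcement discussed in this thread.
    print(strip_escape_sequences(b"abc\x1b%Gdef"))
```

A program that does this never needs to understand ESC %G itself; it only
needs to recognise where the sequence ends.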


Read which standard?  This can't be the only one.  Why else would there be a
termcap/terminfo database with so many entries?

Anyway, I don't know a single application that ignores these escape sequences.
Try "grep %G" on the file that includes the ESC %G from above.
All programs I know just handle the escape sequences like normal text, they
are not ignored and not recognized.  Didn't try many programs, perhaps there
is an obvious one that does recognize them.

I would state that these escape sequences are not useful in a file.  They
could be useful when communicating with a terminal emulator though.  Is there
a termcap/terminfo entry that specifies that the terminal accepts these codes?

The technique that mined98 uses seems to be fairly reliable. In
practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences


98% isn't very reliable.  I would aim for 99.9% at least.


I said >98%, not =98%! It is very likely that it works for >99.99% of
all files. It will certainly detect for >>99.99% of all German
ISO 8859-1 files that they are obviously not in UTF-8.
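
The kind of check under discussion can be sketched like this. It is not
the code that mined98 uses, only a minimal illustration of the idea: if the
bytes do not decode as strict UTF-8, the file cannot be UTF-8. The name
is_valid_utf8 and the German sample word are assumptions made for the
example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (pure ASCII counts too)."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    word = "Grüße"
    # The Latin-1 bytes contain 0xFC, which can never start a UTF-8 sequence.
    print(is_valid_utf8(word.encode("latin-1")))
    print(is_valid_utf8(word.encode("utf-8")))
```

Note that the check only ever proves that a file is not UTF-8. A file that
passes may still be ISO 8859 text that happens to be well-formed.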


Well, why do you say >98% when you really mean >99.9%? :-)

I challenge you to send me an orthographically correct sentence in one of
the languages listed in the ISO 8859-1 standard, encoded in Latin-1,
that does not contain a malformed UTF-8 sequence, i.e. which could not
trivially be identified as not being UTF-8.


Ah, a challenge!  Well, here's one: OCÉ®  That's the name of the company I
used to work for with an (R) after it.  Almost any name can be followed by an
(R), thus this has quite a big chance of being found in files.  Also, ¹²³
are likely to be used to refer to a footnote, which can also appear after many
of the start characters.
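
To make the counterexample concrete, here is a small sketch that encodes a
few strings in Latin-1 and checks whether the bytes are well-formed UTF-8.
The helper name and the sample list are made up for this illustration. The
company name in capitals followed by the registered sign ends in the bytes
0xC9 0xAE, which form a valid two-byte UTF-8 sequence.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample strings, each stored as ISO 8859-1 (Latin-1) bytes.
samples = [
    "OC\u00c9\u00ae",  # capital E acute + registered sign: bytes C9 AE
    "\u00c9\u00b9",    # capital E acute + superscript one: bytes C9 B9
    "Oc\u00e9",        # small e acute at end of text: byte E9, then nothing
]

if __name__ == "__main__":
    for text in samples:
        raw = text.encode("latin-1")
        print(raw.hex(), is_valid_utf8(raw))
```

The first two samples pass the check even though they are Latin-1: a lead
byte in the range 0xC2-0xDF followed by one byte in 0x80-0xBF is always
accepted. Read as UTF-8, the first one comes out as "OC" followed by
U+026E. The third sample fails because 0xE9 needs two continuation bytes.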

Need I continue?  Anyway, I have no idea how often these character
combinations occur, but they do exist.  When using another character set than
ISO 8859-1 the chance would be different.  Perhaps there is a specific set
with a high probability?  Perhaps there is some often used Polish word that
happens to be a valid UTF-8 sequence.  Hopefully there is an invalid sequence
in the same file to detect that it's not UTF-8 then.

I would say that these sequences do appear, but we don't know how often.
It can still be annoying though.  For example, files I made for Océ would
contain plain text (English or Dutch) and that OCÉ® sequence in the footer on
every page.  If this is recognized as UTF-8 it causes a mess.

--
hundred-and-one symptoms of being an internet addict:
81. At social functions you introduce your husband as "my domain server."

--/-/---- Bram Moolenaar ---- Bram(_at_)moolenaar(_dot_)net ---- Bram(_at_)vim(_dot_)org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /