ietf-822

Re: internationalization of mail

2004-08-27 03:57:02

Laird Breyer writes, answering Tex Texin:
Have you considered using a statistical character set detector? A quick google search reveals http://trific.ath.cx/software/enca/, but if the license isn't acceptable, there are probably others out there.

There are. I've worked on two myself. The problem with all of them is that many character sets are so similar. The values that are lower-case in one are often lower-case in another, and the ones that are illegal in one are illegal in another.
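To put a rough number on that similarity, here is a minimal sketch that counts how many of the 96 byte values from 0xA0 to 0xFF decode to the same character in two encodings. It assumes Python's bundled iso8859_* codecs agree with the published mapping tables; the function names are made up for this illustration.

```python
def decode_high_byte(value, codec):
    """Return the character one byte decodes to, or None if it is undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def shared_high_bytes(codec_a, codec_b):
    """Count the byte values 0xA0..0xFF on which both codecs agree.

    Two undefined positions count as agreement, since a detector
    cannot tell the encodings apart from those bytes either.
    """
    return sum(
        decode_high_byte(v, codec_a) == decode_high_byte(v, codec_b)
        for v in range(0xA0, 0x100)
    )


if __name__ == "__main__":
    for other in ("iso8859_2", "iso8859_9", "iso8859_15"):
        n = shared_high_bytes("iso8859_1", other)
        print(f"iso8859_1 vs {other}: {n} of 96 high bytes agree")
```

The more positions two encodings share, the less evidence a detector has to work with in any given message.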

Look at this:

$ grep 0xC4 8859-{?,??}.TXT
8859-1.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-2.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-3.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-4.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-5.TXT:0xC4 0x0424  #       CYRILLIC CAPITAL LETTER EF
8859-6.TXT:0xC4 0x0624  #       ARABIC LETTER WAW WITH HAMZA ABOVE
8859-7.TXT:0xC4 0x0394  #       GREEK CAPITAL LETTER DELTA
8859-9.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-10.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-11.TXT:0xC4        0x0E24  #       THAI CHARACTER RU
8859-13.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-14.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-15.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-16.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS

(I picked C4 at random.)
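If the mapping files aren't at hand, roughly the same lookup can be sketched with Python's built-in codecs. This assumes those codecs match the 8859-*.TXT tables; names_for_byte is a hypothetical helper, not part of any library.

```python
import unicodedata

# ISO 8859 parts for which Python ships a codec (there is no part 12).
PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]


def names_for_byte(value):
    """Map each ISO 8859 part to the Unicode name of one byte value.

    The entry is None where the byte is undefined in that part.
    """
    result = {}
    for part in PARTS:
        try:
            ch = bytes([value]).decode(f"iso8859_{part}")
            result[part] = unicodedata.name(ch, None)
        except UnicodeDecodeError:
            result[part] = None
    return result


if __name__ == "__main__":
    for part, name in names_for_byte(0xC4).items():
        print(f"8859-{part}: 0xC4  {name or '(undefined)'}")
```

Unlike the grep, this also lists the parts where the byte is undefined, which is why 8859-8 shows up here.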

On one hand, it means that if you choose the wrong Latin-* encoding, the error often doesn't matter, because C4 is the same letter in all the Latin-* encodings. On the other, it means that if the text contains a character which differs among the sibling encodings, the text is subtly, almost undetectably, altered.

$ grep 0xF9 8859-{?,??}.TXT
8859-1.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-2.TXT:0xF9 0x016F  #       LATIN SMALL LETTER U WITH RING ABOVE
8859-3.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-4.TXT:0xF9 0x0173  #       LATIN SMALL LETTER U WITH OGONEK
8859-5.TXT:0xF9 0x0459  #       CYRILLIC SMALL LETTER LJE
8859-7.TXT:0xF9 0x03C9  #       GREEK SMALL LETTER OMEGA
8859-8.TXT:0xF9 0x05E9  #       HEBREW LETTER SHIN
8859-9.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-10.TXT:0xF9        0x0173  #       LATIN SMALL LETTER U WITH OGONEK
8859-11.TXT:0xF9        0x0E59  #       THAI DIGIT NINE
8859-13.TXT:0xF9        0x0142  #       LATIN SMALL LETTER L WITH STROKE
8859-14.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-15.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-16.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE

So for Latin text, u-grave can be silently substituted for u-ring or u-ogonek (or, in 8859-13, for l-stroke). Such changes have a good chance of passing a first examination undetected and being archived as correct. That rather complicates searching, of course.
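A minimal sketch of how quietly this happens, again assuming Python's iso8859_* codecs match the tables above. The misdecode helper and the sample word are made up for illustration.

```python
def misdecode(text, actual, assumed):
    """Encode text in its real charset, then decode it with a wrongly guessed one."""
    return text.encode(actual).decode(assumed)


if __name__ == "__main__":
    original = "o\u00f9"  # French "ou" with u-grave, sent as ISO 8859-1
    archived = misdecode(original, "iso8859_1", "iso8859_2")

    # No exception is raised and the length is unchanged; only byte 0xF9 differs.
    print(repr(original), "->", repr(archived))

    # A later search for the original spelling no longer matches.
    print("search still matches:", original in archived)

    # A byte shared by both encodings (0xC4) survives the wrong guess.
    print(misdecode("\u00c4", "iso8859_1", "iso8859_2") == "\u00c4")
```

Nothing in the decoding step signals a problem, so the altered text would be stored as if it were correct.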

In some cases, an error which is obvious at once is preferable to one which may sneak by undetected. I don't know whether this applies to Tex's problem.

This would permit high confidence decisions about when to convert an incoming message, although after conversion the problems you described would still exist, if the detection was incorrect.

Indeed. There might be fewer errors remaining (depending on this and that), but they would be harder to find and more difficult to handle.

Arnt