Re: internationalization of mail


Laird Breyer writes, answering Tex Texin:

Have you considered using a statistical character set detector? A quick google search reveals http://trific.ath.cx/software/enca/, but if the license isn't acceptable, there are probably others out there.

There are. I've worked on two myself. The problem with all of them is that many character sets are so similar. The values that are lowercase in one are often lowercase in another; the ones that are illegal in one are illegal in another.
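
To put a rough number on that similarity, here is a minimal Python sketch. It assumes the standard-library codecs iso8859_1, iso8859_2 and so on agree with the 8859-N.TXT tables; the function name differing_bytes is made up for this illustration.

```python
def differing_bytes(enc_a, enc_b):
    """Return the high-byte values that decode differently in two encodings."""
    diffs = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        # errors="replace" yields U+FFFD for an unassigned byte instead of raising.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    for other in ("iso8859_15", "iso8859_9", "iso8859_2"):
        diffs = differing_bytes("iso8859_1", other)
        print(f"iso8859_1 vs {other}: {len(diffs)} of 128 high bytes differ")
```

The fewer bytes that differ between two candidates, the less evidence a statistical detector has to tell them apart.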


Look at this:

$ grep 0xC4 8859-{?,??}.TXT
8859-1.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-2.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-3.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-4.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-5.TXT:0xC4 0x0424  #       CYRILLIC CAPITAL LETTER EF
8859-6.TXT:0xC4 0x0624  #       ARABIC LETTER WAW WITH HAMZA ABOVE
8859-7.TXT:0xC4 0x0394  #       GREEK CAPITAL LETTER DELTA
8859-9.TXT:0xC4 0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-10.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-11.TXT:0xC4        0x0E24  #       THAI CHARACTER RU
8859-13.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-14.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-15.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS
8859-16.TXT:0xC4        0x00C4  #       LATIN CAPITAL LETTER A WITH DIAERESIS

(I picked C4 at random.)

On one hand, this means that if you choose the wrong Latin-* encoding, the error often doesn't matter, because C4 is the same letter in all the Latin-* encodings. On the other, it means that if the text contains a character which differs among the sibling encodings, the text is subtly, almost undetectably, altered.
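
The same lookup can be done without the mapping files. This is only a sketch: it assumes Python's built-in iso8859_N codecs reflect the 8859-N.TXT tables, and decode_in_all_parts is a hypothetical helper name.

```python
import unicodedata

# Parts of ISO 8859 that exist as Python codecs (there is no part 12).
PARTS = [n for n in range(1, 17) if n != 12]


def decode_in_all_parts(value):
    """Map each ISO 8859 part number to the character it assigns to `value`.

    The entry is None where the byte is unassigned in that part.
    """
    result = {}
    for part in PARTS:
        try:
            result[part] = bytes([value]).decode(f"iso8859_{part}")
        except UnicodeDecodeError:
            result[part] = None
    return result


if __name__ == "__main__":
    for part, char in decode_in_all_parts(0xC4).items():
        name = unicodedata.name(char, "<no name>") if char else "<unassigned>"
        print(f"8859-{part}: {name}")
```

Calling decode_in_all_parts(0xF9) instead shows a byte on which the Latin-* siblings disagree.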


$ grep 0xF9 8859-{?,??}.TXT
8859-1.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-2.TXT:0xF9 0x016F  #       LATIN SMALL LETTER U WITH RING ABOVE
8859-3.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-4.TXT:0xF9 0x0173  #       LATIN SMALL LETTER U WITH OGONEK
8859-5.TXT:0xF9 0x0459  #       CYRILLIC SMALL LETTER LJE
8859-7.TXT:0xF9 0x03C9  #       GREEK SMALL LETTER OMEGA
8859-8.TXT:0xF9 0x05E9  #       HEBREW LETTER SHIN
8859-9.TXT:0xF9 0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-10.TXT:0xF9        0x0173  #       LATIN SMALL LETTER U WITH OGONEK
8859-11.TXT:0xF9        0x0E59  #       THAI DIGIT NINE
8859-13.TXT:0xF9        0x0142  #       LATIN SMALL LETTER L WITH STROKE
8859-14.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-15.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE
8859-16.TXT:0xF9        0x00F9  #       LATIN SMALL LETTER U WITH GRAVE

So for Latin text, u-grave can be substituted for u-ring or u-ogonek (or, in 8859-13, l-stroke). Such changes have a good chance of passing a first examination undetected and being archived as correct. Rather complicates searching, of course.
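
A small Python sketch of how such a substitution slips through and then breaks searching. The sample word and the find_word helper are invented for illustration; the codecs are the standard-library ones.

```python
# "dům" (Czech for "house"); in ISO 8859-2 the u-ring is the single byte 0xF9.
WORD = "d\u016fm"
raw = WORD.encode("iso8859_2")

right = raw.decode("iso8859_2")  # correct guess: u with ring above
wrong = raw.decode("iso8859_1")  # wrong sibling: u with grave, and no error raised


def find_word(word, archive):
    """Return the archived documents that contain `word` as a substring."""
    return [doc for doc in archive if word in doc]


if __name__ == "__main__":
    print(repr(right), repr(wrong))
    print("found in correctly decoded archive:", bool(find_word(WORD, [right])))
    print("found in wrongly decoded archive:  ", bool(find_word(WORD, [wrong])))
```

The wrong decode raises nothing and looks plausible at a glance, yet a later search for the original word misses the archived copy.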

In some cases, an error which is obvious at once is preferable to one which may sneak by undetected. I don't know whether this applies to Tex's problem.

This would permit high confidence decisions about when to convert an incoming message, although after conversion the problems you described would still exist, if the detection was incorrect.

Indeed. There might be fewer errors remaining (depending on this and that), but they would be harder to find and more difficult to handle.


Arnt