RFC 1345 mnemonics table not consistent with Unicode 3.2.0

2007-08-23 14:58:21
Howdy all,

In attempting to implement some of the RFC 1345 mnemonic sequences,
I've been trying to match up the information in that document with the
Unicode data for version 3.2.0. Unfortunately, at some point it seems
the two have become inconsistent with each other.

I searched the list archives but couldn't find any previous discussion
of this specific issue, so I put some time into collecting information
on the differences. I hope these findings can lead to an update to RFC
1345 that makes the mnemonics table consistent with the Unicode
standard.


The attached program 'check-rfc1345' (also uploaded to
<URL:http://pastebin.ca/668483>) runs under Python 2.4 or later, and
uses the standard library module 'unicodedata' to check the table of
mnemonics in RFC 1345. Its input is the table, stripped of blank lines
and any other data not part of the table, with exactly one complete
entry per line.

For each line of input, it will attempt to parse the space-delimited
fields of the described table, and then attempt to match up the
ordinal number (the hexadecimal character number) and character name
against the corresponding data in the Unicode data set. Where there is
a discrepancy it will output a message describing the problem.
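
The check described above can be sketched roughly as follows. This is
not the attached 'check-rfc1345' program, only a minimal illustration.
It assumes Python 3, where 'unicodedata.ucd_3_2_0' exposes the Unicode
3.2.0 database, and a table line of three whitespace-delimited fields
(mnemonic, ordinal, name). The function names 'parse_entry' and
'check_entry' are hypothetical.

```python
import unicodedata

# Unicode 3.2.0 snapshot shipped with the standard library.
ucd = unicodedata.ucd_3_2_0


def parse_entry(line):
    """Split one table line into (mnemonic, ordinal, name).

    Assumption: three whitespace-delimited fields, the name being
    everything after the second field.
    """
    fields = line.split(None, 2)
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    return tuple(field.strip() for field in fields)


def check_entry(ordinal_text, name):
    """Return a list of discrepancy messages; empty if the entry matches."""
    try:
        ordinal = int(ordinal_text, 16)
    except ValueError:
        return ["Not a valid ordinal"]
    if not 0 <= ordinal <= 0x10FFFF:
        return ["Not a valid ordinal"]

    actual_name = ucd.name(chr(ordinal), None)
    if actual_name == name:
        return []

    problems = []
    # Does the name exist anywhere in the Unicode 3.2.0 data?
    try:
        found = ucd.lookup(name)
    except KeyError:
        found = None
    if found is not None and len(found) == 1:
        problems.append("name %r has ordinal %04x" % (name, ord(found)))
    else:
        problems.append("no name %r" % name)

    # Does the ordinal have a (different) name?
    if actual_name is None:
        problems.append("no character ordinal %s" % ordinal_text)
    else:
        problems.append("ordinal %s has name %r" % (ordinal_text, actual_name))
    return problems
```

An entry that agrees with the Unicode data yields an empty list; any
other entry yields one message about the name and one about the
ordinal.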


The file 'rfc1345-mnemonics.txt' (uploaded to
<URL:http://pastebin.ca/668469>) is the table I've reconstructed from
'rfc1345.txt' as described above. Running this table through
'check-rfc1345' gives errors for 267 of the table entries. The output
I get is attached to this message as 'rfc1345-unicode-errors.txt'
(uploaded to <URL:http://pastebin.ca/668473>).

These errors can be grouped as follows, with one or more examples of
the program output and my comments on possible remedies:


Entries which are not conformant with the table definition
----------------------------------------------------------

<stdin>:1854: ordinal 'indicates' name 'unfinished (Mnemonic)'
        Unicode 3.2.0 has no name 'unfinished (Mnemonic)'
        Not a valid ordinal

There is only one entry in this category. It has only two fields
instead of the specified three.

To aid automatic parsing of the table, it would be helpful if this
entry were either removed (since there is no mnemonic listed), or
somehow made conformant with the specification for the table.

Entries with extraneous information in the name
-----------------------------------------------

<stdin>:248: ordinal '0138' name 'LATIN SMALL LETTER KRA (Greenlandic)'
        Unicode 3.2.0 has no name 'LATIN SMALL LETTER KRA (Greenlandic)'
        Unicode 3.2.0 ordinal 0138 has name 'LATIN SMALL LETTER KRA'

<stdin>:461: ordinal '0402' name 'CYRILLIC CAPITAL LETTER DJE (Serbocroatian)'
        Unicode 3.2.0 has no name 'CYRILLIC CAPITAL LETTER DJE (Serbocroatian)'
        Unicode 3.2.0 ordinal 0402 has name 'CYRILLIC CAPITAL LETTER DJE'

Entries of this sort have extraneous commentary in the "name" field,
but without this the name does match the corresponding Unicode
character.

To aid automatic parsing of the table, these entries should have only
the Unicode character name in the name field.
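
As a rough illustration of how entries in this category can be told
apart from genuinely different names, the sketch below strips one
trailing parenthesised comment and compares what is left with the
Unicode 3.2.0 name. It assumes Python 3 and that the commentary always
appears as a single parenthesised suffix; the names 'strip_annotation'
and 'matches_without_annotation' are hypothetical.

```python
import re
import unicodedata

# One trailing "(...)" comment, with any whitespace around it.
_ANNOTATION = re.compile(r"\s*\([^()]*\)\s*$")


def strip_annotation(name):
    """Remove a single trailing parenthesised comment from a name."""
    return _ANNOTATION.sub("", name)


def matches_without_annotation(ordinal_text, name):
    """True if the name only differs from Unicode 3.2.0 by its comment."""
    bare = strip_annotation(name)
    char = chr(int(ordinal_text, 16))
    actual = unicodedata.ucd_3_2_0.name(char, None)
    return bare != name and bare == actual
```

For the two examples above, 'matches_without_annotation' returns True,
whereas an entry such as 'LATIN SMALL LETTER I DOTLESS' has no comment
to strip and so returns False.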

Entries with different names for the same character
---------------------------------------------------

<stdin>:241: ordinal '0131' name 'LATIN SMALL LETTER I DOTLESS'
        Unicode 3.2.0 has no name 'LATIN SMALL LETTER I DOTLESS'
        Unicode 3.2.0 ordinal 0131 has name 'LATIN SMALL LETTER DOTLESS I'

<stdin>:974: ordinal '2103' name 'DEGREE CENTIGRADE'
        Unicode 3.2.0 has no name 'DEGREE CENTIGRADE'
        Unicode 3.2.0 ordinal 2103 has name 'DEGREE CELSIUS'

These entries have names that are, for some reason, different to the
current Unicode name for the character; however it is clear to a human
reader that they describe the same character.

To aid automatic parsing of the table, these entries should simply be
correctly named as per the Unicode data.

Entries with different names that may describe a different character
--------------------------------------------------------------------

<stdin>:380: ordinal '0388' name 'GREEK CAPITAL LETTER EPSILON WITH ACUTE'
        Unicode 3.2.0 has no name 'GREEK CAPITAL LETTER EPSILON WITH ACUTE'
        Unicode 3.2.0 ordinal 0388 has name 'GREEK CAPITAL LETTER EPSILON WITH TONOS'

<stdin>:1413: ordinal '3004' name 'IDEOGRAPHIC DITTO MARK'
        Unicode 3.2.0 has no name 'IDEOGRAPHIC DITTO MARK'
        Unicode 3.2.0 ordinal 3004 has name 'JAPANESE INDUSTRIAL STANDARD SYMBOL'

It isn't clear, without a thorough knowledge of orthography and
related issues, whether these differing names describe the same
character.

If these should be determined to be the same character, the mnemonic
table entry name should be made the same as the current Unicode name.

If not, a decision needs to be made as to whether the entry should be
removed, replaced with the corresponding Unicode character using the
same mnemonic, or added as a separate entry (with the existing
mnemonic) along with the current Unicode character (with a separate,
new mnemonic).

Entries with different names that clearly describe different characters
-----------------------------------------------------------------------

<stdin>:898: ordinal '1f01' name 'GREEK PSILI AND ACUTE ACCENT'
        Unicode 3.2.0 has no name 'GREEK PSILI AND ACUTE ACCENT'
        Unicode 3.2.0 ordinal 1f01 has name 'GREEK SMALL LETTER ALPHA WITH DASIA'

<stdin>:903: ordinal '1f06' name 'GREEK DIAERESIS AND VARIA'
        Unicode 3.2.0 has no name 'GREEK DIAERESIS AND VARIA'
        Unicode 3.2.0 ordinal 1f06 has name 'GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI'

These entries have a name that does not appear in the current Unicode
data, and the ordinal is clearly assigned to an altogether different
character from the one described by the table.

If the named character exists in the current Unicode standard, its
correct ordinal and name should be updated in the table.

If not, a decision needs to be made as to whether the entry should be
removed, replaced with the corresponding Unicode character using the
same mnemonic, or added as a separate entry (with the existing
mnemonic) along with the current Unicode character (with a separate,
new mnemonic).

Entries whose character appears at a different ordinal
------------------------------------------------------

<stdin>:902: ordinal '1f05' name 'GREEK PSILI AND PERISPOMENI'
        Unicode 3.2.0 ordinal 1f05 has name 'GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA'
        Unicode 3.2.0 name 'GREEK PSILI AND PERISPOMENI' has ordinal 1fcf

<stdin>:1680: ordinal 'fe90' name 'ARABIC LETTER BEH INITIAL FORM'
        Unicode 3.2.0 ordinal fe90 has name 'ARABIC LETTER BEH FINAL FORM'
        Unicode 3.2.0 name 'ARABIC LETTER BEH INITIAL FORM' has ordinal fe91

These entries name a Unicode character whose current ordinal is
different to that specified in the table.

The table should be updated with the correct ordinal number.

Entries that may not be Unicode characters
------------------------------------------

<stdin>:1819: ordinal '001e' name 'RECORD SEPARATOR (IS2)'
        Unicode 3.2.0 has no name 'RECORD SEPARATOR (IS2)'
        Unicode 3.2.0 has no character ordinal 001e

<stdin>:1888: ordinal 'e022' name 'ARABIC LETTER ALEF FINAL FORM COMPATIBILITY (IBM868 144)'
        Unicode 3.2.0 has no name 'ARABIC LETTER ALEF FINAL FORM COMPATIBILITY (IBM868 144)'
        Unicode 3.2.0 has no character ordinal e022

These entries give a character name not in the Unicode data, and the
ordinal has no named character in the Unicode data for the ISO 10646
character set specified in the RFC.

If characters from different character sets are to be specified in
this table, then perhaps each table entry should specify which
character set contains the character at that ordinal.

Alternatively, the decision could be made that characters not found in
ISO 10646 (as specified in the description of the table) should not
have an entry in the table.

It is also possible that these *are* Unicode characters, but neither
the name nor ordinal match the current data. In this case, the table
entry should be updated to contain the correct Unicode ordinal and
exact character name.
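
One way to narrow down which of these cases applies is to look at the
general category that the Unicode 3.2.0 data assigns to the ordinal,
since some code points exist in the standard without having a
character name. The sketch below assumes Python 3; the function name
'describe_unnamed' and its label strings are hypothetical.

```python
import unicodedata

ucd = unicodedata.ucd_3_2_0

# General categories whose code points carry no character name.
_LABELS = {
    "Cc": "control character",
    "Co": "private use",
    "Cs": "surrogate",
    "Cn": "unassigned",
}


def describe_unnamed(ordinal_text):
    """Return (category, label) for a hexadecimal ordinal."""
    char = chr(int(ordinal_text, 16))
    category = ucd.category(char)
    return category, _LABELS.get(category, "other")
```

For the two examples above this reports 001e as a control character
(category 'Cc') and e022 as a private use code point (category 'Co'),
which suggests the two may call for different remedies.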


Thanks for reading this far; I hope this analysis is useful, and look
forward to an update for RFC 1345 that addresses these issues.

-- 
 \    "Simplicity and elegance are unpopular because they require hard |
  `\                work and discipline to achieve and education to be |
_o__)                                appreciated."  -- Edsger Dijkstra |
Ben Finney

Attachment: check-rfc1345
Description: Python program to check RFC 1345 mnemonic table against Unicode data
