Re: Serious problem with nonspacing character mnemonics

1992-01-21 15:21:58
Keld writes:

Well, 10646 says they cover all ECMA registered character sets,
and I think also that UNICODE does the same.

Well, yes, Unicode says they "cover" all these characters, but you have to
define what you mean by "cover". It is certainly possible to produce any
character in this range using Unicode. However, there may not be a single
16-bit representation for all ECMA registered characters in Unicode, and I
don't think anyone claims that there is. (How do you represent a
forward-combining tilde accent in Unicode, for example?) And even if ECMA is
covered in this way there are character sets that certainly aren't (some of the
complex APL overstruck characters are a good example of this).

Given that there may not be a single unique representation of a character
in Unicode, you immediately have the problem of canonicalization of characters.
And to do this properly the exact semantics of each character must be very
clear and well-defined.
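
To make the canonicalization problem concrete, here is a minimal sketch using
the normalization forms that present-day Unicode defines, as exposed by
Python's standard unicodedata module. These forms are not something this
message could rely on; the helper name canonically_equal is made up for
illustration.

```python
import unicodedata

# Two spellings of "a with tilde": one precomposed code point, and a base
# letter followed by a combining (backward-combining) tilde.
PRECOMPOSED = "\u00e3"   # LATIN SMALL LETTER A WITH TILDE
DECOMPOSED = "a\u0303"   # LATIN SMALL LETTER A + COMBINING TILDE


def canonically_equal(left: str, right: str) -> bool:
    """Compare two strings after mapping both to one canonical form (NFC).

    Hypothetical helper: it only works because the semantics of each
    combining character are fixed by the standard.
    """
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


# A raw comparison sees two different strings ...
raw_equal = PRECOMPOSED == DECOMPOSED
# ... while the canonicalized comparison treats them as the same character.
canonical_equal = canonically_equal(PRECOMPOSED, DECOMPOSED)
```

The point of the sketch is only that such a comparison needs an agreed
canonical form, and that form in turn needs each combining character to have
one fixed meaning.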

This also means that when you set up something like mnemonic you cannot depend
on Unicode to have an exact way of identifying each individual character in
another character set. All you can depend on is that there's one or more
multi-character sequences in Unicode that can be used to represent each
permissible sequence in the original character set.
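
As a rough sketch of what mapping sequences rather than single characters
involves, the following converts a T.61-style byte string, where a non-spacing
diacritic precedes its letter, into a Unicode string where the combining mark
follows it. The table is a small illustrative subset with assumed byte values,
bytes below 0x80 are treated as plain ASCII (a simplification), and the
function name t61_to_unicode is hypothetical.

```python
# Illustrative subset only: assumed T.61 non-spacing diacritic bytes mapped
# to Unicode combining marks. Not a complete or authoritative table.
T61_DIACRITICS = {
    0xC2: "\u0301",  # acute  -> COMBINING ACUTE ACCENT
    0xC4: "\u0303",  # tilde  -> COMBINING TILDE
    0xCA: "\u030a",  # ring   -> COMBINING RING ABOVE
}


def t61_to_unicode(data: bytes) -> str:
    """Reorder 'diacritic, letter' pairs into 'letter, combining mark'."""
    out = []
    pending = None  # combining mark waiting for its base letter
    for byte in data:
        if byte in T61_DIACRITICS:
            if pending is not None:
                raise ValueError("two diacritics in a row are not handled here")
            pending = T61_DIACRITICS[byte]
        elif byte < 0x80:
            out.append(chr(byte))
            if pending is not None:
                out.append(pending)  # the mark now follows the letter
                pending = None
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch's table")
    if pending is not None:
        raise ValueError("diacritic with no following letter")
    return "".join(out)


# Tilde byte followed by 'a' becomes 'a' followed by COMBINING TILDE.
converted = t61_to_unicode(bytes([0xC4, 0x61]))
```

Note that the reordering is only correct because the table entry is known to
combine forward; the byte value alone does not say so.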

So what should be read into this?
My understanding is that the 10646 character meaning something
combining *after* the letter, and the T.61 non-spacing diacritic
coming *before* the letter is actually the same character,
it depends on the character set how to interpret them.

I suppose you could view this as a characteristic of the character set. But
this then means that each character is potentially unique modulo its character
set and no conversion between character sets is possible since you don't know
what the exact meaning of a character is without knowing the character set it
came from. This destroys all the advantage of mnemonic.

It also does not allow for character sets that contain some characters that
combine forward and some that combine backward. I don't know of such a 
character set but I'm far from sure there isn't such a thing in use somewhere.
In fact, it is easy to think of one possibility -- Unicode includes some
private use areas. It would make a lot of sense for someone who wants to encode
T.61 in Unicode to define some of these private areas as accents that combine
forwards. How would you represent a table for such a thing correctly in

I think you are putting too much meaning into the characters,
this would be like assigning the BACKSPACE character always the
meaning of ISO 646 where it can be used to make combined characters, or
to say that some control characters always mean something,
if they come in a special sequence.

I am assigning just enough meaning to characters to convert from one character
set to another. I wish to emphasize that I am doing nothing special and nothing
fancy. If I cannot do this reliably what is the point of RFC-CHAR?

RFC-CHAR does not go that far, currently.
RFC-CHAR only specifies what characters are at what codepoints.

Sorry, it does not specify what characters are at what codepoints, if the
combination direction is not clear from what appears in the table. A
forward-combining tilde and a backward-combining tilde are totally different

The other thing is that RFC-CHAR does not cover 10646 nor UNICODE,
partly due to their unfinished state.

While RFC-CHAR does not cover Unicode it depends implicitly on Unicode's
definition of what various characters mean.

You cannot have it both ways. You cannot depend on getting definitions from
Unicode and then say you don't deal with the fact that the definitions in
the document you cite don't agree with your usage of them.

This is what I was trying to get at with my earlier posting. You defended
the general lack of definitions for various things in RFC-CHAR by saying that
the definitions were in ancillary documents. Now you say that these definitions
are not covered in RFC-CHAR.

The third thing is that I plan to include a more
elaborate description of T.61 etc with the allowed combinations
that it has. T.61 only allows certain combinations of
floating diacritics and letters, a combination RING-ABOVE and
<i> is not valid, for instance.

This is indeed a problem and it would be very nice to have this information.
However, it does not matter nearly as much since the meaning of such illegal
sequences in T.61 is clear if they do appear, and in practice there are no
characters in the other character sets you table that don't have equivalent
two-character sequences in T.61. I realize that this is not something that
you can depend on in the future and that it should be fixed, but I view the
problem of what existing tabled information means as being more important.
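
For what it is worth, a table of allowed combinations could be as simple as
the sketch below. Only the exclusion of RING-ABOVE with <i> comes from this
thread; the byte value, the set of letters listed for the ring, and the name
is_valid_pair are assumptions made for illustration.

```python
RING_ABOVE = 0xCA  # assumed T.61 byte for the non-spacing ring above

# Illustrative table: diacritic byte -> letters it may combine with.
# The letter set here is an assumption, not taken from the T.61 text.
ALLOWED_COMBINATIONS = {
    RING_ABOVE: set("AaUu"),
}


def is_valid_pair(diacritic: int, letter: str) -> bool:
    """Return True if the table lists this diacritic/letter combination."""
    return letter in ALLOWED_COMBINATIONS.get(diacritic, set())


ring_on_i = is_valid_pair(RING_ABOVE, "i")  # the invalid case named above
ring_on_a = is_valid_pair(RING_ABOVE, "a")
```

A converter could consult such a table to flag illegal sequences, but as
argued above it can still give them a clear meaning without it.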

