pem-dev

Sorting Teletex strings.

1994-03-10 14:01:00
Forwarded by Ella Gardner:
-----------------------------

Hi, Ella! 

The short answer is that there IS no standard way of sorting on T61Strings!  
The '88 standards do not specify ordering rules, and as far as I can remember,
the same goes for the '93 ones.  There is a quite fundamental problem, which 
is that, even with respect to Latin characters, different countries sort in 
different ways.  For example, in Sweden, V and W are used interchangeably in 
the telephone Directory.  Things as cultural as this are not amenable to 
international standardisation. 

The implication of '93 Part 3 clause 7.9 b) (which is, I think, the only place
where character-by-character sorting is implied) leaves a lot to the 
implementation. 

Many implementations will have some kind of natural sorting (perhaps as basic 
as ordering on the character string treated as an OCTET STRING).  This works 
sort-of OK in the States, particularly if upper and lower case are taken into 
account, but (as pointed out in the message) it quickly runs into local 
anomalies (e.g. the French would like C-Cedilla to be in among or close to C).
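
As a rough illustration of the difference between octet ordering and an ordering that keeps C-Cedilla next to C, here is a minimal Python sketch. It is not taken from any standard: Latin-1 stands in for a single-byte encoding (it is not T.61), and the helper names octet_key and folded_key are hypothetical.

```python
import unicodedata

def octet_key(s: str) -> bytes:
    # Compare the encoded octets directly, as an OCTET STRING comparison would.
    # Latin-1 is only a convenient single-byte stand-in here, not T.61.
    return s.encode("latin-1")

def folded_key(s: str) -> tuple:
    # Decompose accented letters, drop the combining marks and ignore case.
    # The original string is kept as a tie-breaker so the order is deterministic.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), s)

words = ["cote", "Zebra", "ça", "apple", "cab"]

by_octets = sorted(words, key=octet_key)
by_folding = sorted(words, key=folded_key)

print(by_octets)   # upper case first, C-Cedilla after every unaccented letter
print(by_folding)  # C-Cedilla in among the C's, case ignored
```

Folding marks away is only one possible choice; it does not capture national rules such as the Swedish V/W convention mentioned above.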

All the best 

Anthony 

----------------------

Arrgh -- what a mess!

Since the Internet presumably intends to support French-speaking Canada,
and soon perhaps Mexico, not to mention the rest of the world, what should we do?

Maybe, hopefully, the sorted order of a list of entries obtained through a
browse won't matter much, so the DUAs used by each individual could present the
list in the "natural" order for that person's language. Problems would occur
if someone refers to the 7th item on a list, however.

This question arose originally when I was talking to an attorney in the office
of the Secretary of the Commonwealth of Mass., asking what kinds of names
were considered acceptable for corporations and organizations. He said that
they used to allow $ within names, but now they don't, so $50 Savings Club, Inc.
is no longer valid.

One of the reasons was that someone might want $ams $uper $aver grocery
store to appear under the S's, and it might not. I suppose in the UK,
someone might want to register Larry's Legal Loans, where the L's were
replaced with the pound sterling.

Bob


-------------

Ella,

Do you (or anyone else) happen to know how Teletex strings
are to be sorted? The only reference that I can find is
the caseIgnoreOrderingMatch attribute, which returns True
if the same collation order results after lower case
characters are replaced with upper case, but it doesn't
say what the collation order is.

If diacritical marks are considered nonspacing and
appear BEFORE the character they modify, sorting
the octets will lead to a very strange result --
all of the standard characters will appear first, followed by
all of the diacritical marks. That certainly isn't what one
would expect in a standard dictionary.

For example, my French-English dictionary lists the following 
words in order (I hope that all of the foreign characters survive
the mail process):

bbtard, bateau, bbtiment, bbtir, bbton, batterie, biatitude.

[They obviously didn't survive. Without the diacritical marks,
the words were batard, bateau, batiment, batir, baton, batterie, and beatitude.]
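
The octet-sorting concern above can be sketched in Python using this word list. The sketch assumes the T.61 convention that nonspacing diacritics occupy octets 0xC1 to 0xCF and precede the letter they modify (circumflex 0xC3 and acute 0xC2 are used here). The accented spellings are filled in from ordinary French, since they did not survive in the message, and the helper names is_nonspacing and base_letters are hypothetical.

```python
# Assumed T.61 nonspacing diacritic octets (not taken from the message).
CIRCUMFLEX = b"\xc3"
ACUTE = b"\xc2"

def is_nonspacing(octet: int) -> bool:
    # Assumption: nonspacing diacritics occupy 0xC1..0xCF.
    return 0xC1 <= octet <= 0xCF

def base_letters(encoded: bytes) -> bytes:
    # Drop the diacritic octets so only the base letters take part in sorting.
    return bytes(b for b in encoded if not is_nonspacing(b))

# Encoded form (diacritic BEFORE its letter) mapped to a readable spelling.
words = {
    b"b" + CIRCUMFLEX + b"atard": "bâtard",
    b"bateau": "bateau",
    b"b" + CIRCUMFLEX + b"atiment": "bâtiment",
    b"b" + CIRCUMFLEX + b"atir": "bâtir",
    b"b" + CIRCUMFLEX + b"aton": "bâton",
    b"batterie": "batterie",
    b"b" + ACUTE + b"eatitude": "béatitude",
}

# Plain octet comparison: unaccented words first, then grouped by diacritic.
octet_order = [words[k] for k in sorted(words)]

# Ignoring the diacritic octets reproduces the dictionary order quoted above.
dictionary_order = [words[k] for k in sorted(words, key=base_letters)]

print(octet_order)
print(dictionary_order)
```

Dropping the diacritic octets happens to match this particular list; it is not offered as a general French collation rule.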

My German-English dictionary lists Strapaze, Stra_ [Strasse], Strategie, so
the s-zet (looks like beta) character is sorted as though it were 
the double-s which it replaces.

In Spanish, canto is followed by caqon [canon], then caoba (mahogany),
so the n-tilde follows n.
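
The German and Spanish dictionary orders described above need different rules, which a per-language sort key can express. This is only a sketch of those two observations, not a complete collation for either language. The function names german_key and spanish_key are hypothetical, and the spellings Straße and cañón are assumed reconstructions of the characters that were lost in the mail.

```python
def german_key(word: str) -> str:
    # Treat the sharp s as the double s it replaces, ignoring case.
    return word.lower().replace("ß", "ss")

# Alphabet in which n-tilde is a separate letter placed just after n.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
# Accented vowels sort with their plain vowel in this sketch.
SPANISH_VOWELS = str.maketrans("áéíóúü", "aeiouu")

def spanish_key(word: str) -> list:
    folded = word.lower().translate(SPANISH_VOWELS)
    # Rank each letter by its position in the alphabet; other characters are skipped.
    return [SPANISH_ALPHABET.index(c) for c in folded if c in SPANISH_ALPHABET]

german = sorted(["Strategie", "Straße", "Strapaze"], key=german_key)
spanish = sorted(["caoba", "cañón", "canto"], key=spanish_key)
default = sorted(["caoba", "cañón", "canto"])  # plain code point order

print(german)
print(spanish)
print(default)
```

With plain code point ordering the n-tilde word falls after caoba, which is the kind of local anomaly the dictionary order avoids.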

I have no idea what the standard lexicographic ordering is for
punctuation signs, currency symbols, etc.

Surely this is defined somewhere? Glad I don't have to write the
code to support it!

Bob
