perl-unicode

RE: Correct use of UTF-8 under Unix

1999-11-04 04:46:18
Hi!

        Larry is right in that there are (already, also under Unix)
other ways of separating lines: namely form feed, but also vertical
tab. I must admit that I have never used vertical tab, and very
rarely form feed... Anyway C9x says: "\v (vertical tab) Moves the
active position to the initial position of the next vertical tab
position." And there is a similar statement about form feed. I
assume that is not too far off from what other standards might say.  

        So, the interoperable line (or 'stronger') separators in
"plain text" are:

        \X{2028}|\X{2029}|\r\n|\n|\r|\f|\v|\X{85}

(I'm probably mixing Perl and C (and flex) syntax here.) Some
of them are "stronger" in some senses than line separation,
but for the purposes of counting logical lines, and deciding
logical line begin and logical line end, there should be no
difference.  A single logical line may be *dynamically* wrapped 
into several displayed lines, but that is a different matter.
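
A minimal sketch of counting logical lines with the alternation above,
written in Python rather than the mixed Perl/C/flex notation of the
message. The names LINE_SEP and split_logical_lines are made up for this
illustration, and treating a trailing separator as a terminator (not as
the start of an extra empty line) is an assumption, not something the
message specifies.

```python
import re

# The alternation from the message in Python syntax. CRLF comes first so
# that a Windows newline is matched as one separator, not as CR then LF.
LINE_SEP = re.compile("\r\n|[\n\r\f\v\x85\u2028\u2029]")

def split_logical_lines(text):
    """Split text into logical lines at any of the listed separators."""
    lines = LINE_SEP.split(text)
    # Assumption: a separator at the very end terminates the last line
    # instead of opening a new, empty one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

sample = "a\r\nb\rc\u2028d\fe\x85f\n"
logical = split_logical_lines(sample)   # six logical lines
newline_only = sample.count("\n")       # a tool counting only \n sees two
```

Python's built-in str.splitlines() recognizes a similar set, plus the
control characters \x1c, \x1d and \x1e, so the explicit pattern is used
here to match exactly the list given in the message.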

        Note that there are some "legacy" encodings which do not
have any or all of \f|\v|\X{85}.

        (I still think that having two different kinds of "plain
text" is a bad idea.  I haven't heard anyone else entertain it
either.)

                Kind regards
                /Kent K


Larry Wall wrote:
...
The only problem I see offhand with allowing both styles in the same
file is that different tools might count lines differently.  If Perl
says there's a syntax error at line 582, it might mean it has seen 581
instances of /\012 | \015\012 | \015 | \X{2028} | \X{2029}/x before the
error.  (For folks listening in, that works out to Unix newline, Windows
newline, Mac newline (!), Unicode line separator and Unicode paragraph
separator.)  If your "normal plain text" editor then counts only \012
(Unix newline), the programmer isn't going to be able to find the error.

On the other hand, maybe Perl would just count newlines, and your
editor counts it the other way.  More likely, some editors count one
way, and other editors count another.  Maybe they count LS but not PS,
just as Perl currently counts \n but not \f as a line transition.
There are many possibilities.
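
The mismatch described above can be sketched in a few lines of Python.
This is only an illustration: the two conventions, the names
ALL_TRANSITIONS, LF_ONLY and line_number_at, and the sample text are all
invented here, and nothing below is taken from Perl's actual line
counting code.

```python
import re

# Two hypothetical counting conventions (names are illustrative only).
# The first counts the five forms listed in the quoted pattern; the
# second models an editor that counts only \012.
ALL_TRANSITIONS = re.compile("\r\n|[\n\r\u2028\u2029]")
LF_ONLY = re.compile("\n")

def line_number_at(text, offset, separator):
    """1-based line number of the character at `offset` under one convention."""
    # Count the separators that end at or before the offset.
    seen = sum(1 for m in separator.finditer(text) if m.end() <= offset)
    return seen + 1

# A file mixing Unicode LS, Windows, Mac and Unix line ends.
source = "line one\u2028line two\r\nline three\rbad token here\n"
offset = source.index("bad")

reported = line_number_at(source, offset, ALL_TRANSITIONS)  # what the tool reports
in_editor = line_number_at(source, offset, LF_ONLY)         # where the editor puts it
```

With this sample the first convention places the offending token on line
4 while the \012-only convention places it on line 2, which is the kind
of disagreement that sends a programmer to the wrong place.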

All I'm really arguing here is that it would be good to establish a
line counting convention.  But if that convention pretends there won't
be files mixing the two line delimitation styles, that will have other
ramifications, including possibly an adverse impact on portability.
Counting line numbers right is already pretty complicated when you have
NFS mounts from foreign systems.  Adding in Unicode will only make
things more complicated.  There will be some pressure to use Unicode
LS/PS in portable code, and I'm not sure you want to spend the rest of
your life resisting that pressure.  A lot of the "fixes" in Perl are
only there because we got tired of people asking the same questions
over and over.

I think assuming that files will only be one style or the other will
put us into that sort of a situation, and it would be nice to head it
off early, for some definition of early.  Just telling people by fiat
that they can't mix the two styles is not likely to work in the absence
of universal education.  Unfortunately, the education of the illegitimi
tends to result in carborundum.

Larry
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
