perl-unicode

Re: Correct use of UTF-8 under Unix

1999-11-03 18:34:31
Markus Kuhn writes:
: There is nothing wrong with having these note-pad style unformatted
: plain text files as well supported under Unix, but it is important to
: make clear that this is an entirely new file type with no relationship
: to the existing plaintext notion.
: 
: The distinction of the two file types is easy: If it contains at least
: one LF character, it is a normal plain text file, if it does not contain
: a single LF character (but zero or more PS and/or LS characters), then
: it is a new-style unformatted plaintext file. Either way, you'll find
: out soon enough when reading the file at the end of the first line
: (formatted) or paragraph (unformatted).

I have one quibble with your hard and fast distinction between the two
file types here.  And that is that Perl scripts themselves might want
to be both types simultaneously!  It's considered good style to put the
documentation into the same file as the code it documents, and while
the code certainly wants to be newline delimited, the documentation is
in POD format, and it would be perfectly fine to treat POD text paragraphs
as a word processor would.  In fact, POD was specifically designed
so that filled paragraphs could be distinguished from non-filled text on
the basis of the first character of the paragraph.
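
As a rough illustration only (written in Python for the sake of a
runnable sketch; the names classify and mixes_styles and the sample
string are made up here, not taken from any tool), the quoted rule and
the mixed case might look like this:

```python
LF = "\n"
LS = "\u2028"  # Unicode line separator
PS = "\u2029"  # Unicode paragraph separator

def classify(text):
    """Quoted rule: at least one LF means formatted, otherwise unformatted."""
    return "formatted" if LF in text else "unformatted"

def mixes_styles(text):
    """True when the text uses LF and also LS or PS."""
    return LF in text and (LS in text or PS in text)

# Hypothetical file: newline-delimited code around one filled paragraph.
sample = "code line\n" + "a filled paragraph" + PS + "more code\n"
print(classify(sample))      # formatted
print(mixes_styles(sample))  # True
```

Under the quoted rule such a file is simply "formatted", so the
paragraph separator inside it goes unnoticed.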

The only problem I see offhand with allowing both styles in the same
file is that different tools might count lines differently.  If Perl
says there's a syntax error at line 582, it might mean it has seen 581
instances of /\012 | \015\012 | \015 | \x{2028} | \x{2029}/x before the
error.  (For folks listening in, that works out to Unix newline, Windows
newline, Mac newline (!), Unicode line separator and Unicode paragraph
separator.)  If your "normal plain text" editor then counts only \012
(Unix newline), the programmer isn't going to be able to find the error.
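
To make the mismatch concrete, here is a minimal sketch (in Python,
purely for illustration; the function names and the sample text are
assumptions, and this is not how Perl itself counts lines) of two tools
numbering the same position differently:

```python
import re

# The five terminators from the pattern above. CRLF is listed before the
# bare CR so that a Windows newline counts once, not twice.
ANY_TERMINATOR = re.compile("\r\n|\n|\r|\u2028|\u2029")

def line_of_offset_all(text, offset):
    """Line number of text[offset], counting every terminator style."""
    return len(ANY_TERMINATOR.findall(text[:offset])) + 1

def line_of_offset_lf_only(text, offset):
    """Line number of text[offset], counting only LF."""
    return text.count("\n", 0, offset) + 1

# Hypothetical input mixing all five styles before the "error".
sample = "one\ntwo\u2028three\u2029four\r\nfive\rsix BAD"
pos = sample.index("BAD")
print(line_of_offset_all(sample, pos))      # 6
print(line_of_offset_lf_only(sample, pos))  # 3
```

The first counter plays the role of a tool that honors every
terminator, the second an editor that knows only \012; they disagree by
three lines on a six-line file.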

On the other hand, maybe Perl would just count newlines, and your
editor counts it the other way.  More likely, some editors count one
way, and other editors count another.  Maybe they count LS but not PS,
just as Perl currently counts \n but not \f as a line transition.
There are many possibilities.

All I'm really arguing here is that it would be good to establish a
line counting convention.  But if that convention pretends there won't
be files mixing the two line delimitation styles, that will have other
ramifications, including possibly an adverse impact on portability.
Counting line numbers right is already pretty complicated when you have
NFS mounts from foreign systems.  Adding in Unicode will only make
things more complicated.  There will be some pressure to use Unicode
LS/PS in portable code, and I'm not sure you want to spend the rest of
your life resisting that pressure.  A lot of the "fixes" in Perl are
only there because we got tired of people asking the same questions
over and over.

I think assuming that files will only be one style or the other will
put us into that sort of a situation, and it would be nice to head it
off early, for some definition of early.  Just telling people by fiat
that they can't mix the two styles is not likely to work in the absence
of universal education.  Unfortunately, the education of the illegitimi
tends to result in carborundum.

Larry