perl-unicode

RE: Correct use of UTF-8 under Unix

1999-11-02 08:37:25

There are many nice ideas written up in the Unicode standard and the
associated technical reports; however, they are not dogma.

In reference to UTR 13: that report is only a recommendation with
no normative status.  Its intent, however, is to increase
interoperability regarding the "NLF"s used on various systems.

Unix never had any newline ambiguity. It was always LF and
only LF.

Few, if any, Unix systems live in a splendidly pure Unix world
these days.  Which is why various interoperability recommendations
are needed.
 
If the outside world does something different (they always have, you
listed the three most popular other newline conventions CR, CRLF, and
NL, yourself), then we will continue to convert, either automatically or
manually, as appropriate.

These conversions are part of the problem.  One should not need
to do them manually, nor rely on some rather content-blind low-level
connection to do character encoding conversions of any kind.
The result is too likely to come out very wrong.  Those low-level
mechanisms (such as remote file system mounts) should not change
any of the contents of any file.
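To illustrate the point above, here is a minimal sketch (in Python, purely
as an illustration; the byte values and file contents are invented) of the
division of labour being argued for: the transfer layer preserves the byte
stream untouched, and it is the application that interprets whatever
newline conventions the bytes happen to contain.

```python
# Invented example bytes: one "file" mixing three newline conventions.
raw = b"line one\r\nline two\rline three\n"

# A content-blind transfer (copy, remote mount) must preserve bytes exactly:
transferred = bytes(raw)
assert transferred == raw  # nothing in the content was altered in transit

# Only at the application level are the bytes decoded and the mixed
# newline conventions interpreted:
text = transferred.decode("ascii")
lines = text.splitlines()
print(lines)  # ['line one', 'line two', 'line three']
```

The transfer step is deliberately a no-op on the content; all newline
awareness lives in the reading application.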

---

I find the suggestion to have two kinds of "plain text" to be
rather strange.  And what if LF and PS are mixed in one file?
I don't, however, find it strange to edit, say, a shell script
on a non-Unix system, save it as a file on a Unix system
(via a mounted file system, just as easy as I store it locally),
and then be able to directly run that shell script without further
ado whatever the NLF used.  It works nicely with C source files with
at least two different kinds of NLFs mixed.  Why should there be
any problem with any of the other tools or other NLFs?

Had everyone on the still surviving systems agreed on one "NLF"
there would have been no interoperability problems regarding it.
But that is unfortunately not the case.  Insisting on LF only
will lead to interoperability problems.  Needless interoperability
problems.  Maybe one day the intent of the PS and LS characters
will be fulfilled.  But then everything has to migrate towards a
common way of handling this.  And the (non-normative) recommendation,
via UTR 13, is to migrate towards using LS and PS in Unicode-based
plain text.  In the meantime, why not interoperate with the
existing NLFs without the need for error-prone conversions in
strange or awkward places where the file content is best left
alone?  Those conversions should be at the application (or library)
level, not at the file mount or similar level.
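An application- or library-level conversion of this kind can be sketched
very briefly. The helper below is hypothetical (not from any particular
library); it accepts any mixture of the NLFs discussed here, including
the Unicode LS and PS characters, and normalizes them to one chosen
convention.

```python
def normalize_newlines(text: str, nlf: str = "\n") -> str:
    """Normalize any mixture of Unicode NLFs to a single convention.

    str.splitlines() splits on all Unicode line boundaries, which
    covers LF, CR, CRLF, NEL (U+0085), LS (U+2028), and PS (U+2029),
    so mixed NLFs within one file are handled uniformly.
    """
    return nlf.join(text.splitlines())

# Invented sample text mixing CRLF, LS, NEL, CR, and LF:
mixed = "a\r\nb\u2028c\u0085d\re\n"
print(normalize_newlines(mixed))          # "a\nb\nc\nd\ne"
print(normalize_newlines(mixed, "\u2029"))  # the same lines joined by PS
```

Because the conversion is explicit and sits in application code, the file
content on disk or across a mount is never silently rewritten.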

True, you may need to convert the file names at the file system
mount level, depending on the encodings used for file names on the
various file systems, but not the file *content*, as a byte
stream, in any way at that level.
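The distinction can be made concrete with a small sketch (the encodings
and file name here are invented for illustration): a mount layer may
legitimately re-encode a file *name* between the conventions of two file
systems, while the file *content* passes through as an untouched byte
stream.

```python
# Hypothetical remote file system storing names in Latin-1:
name_on_remote = "r\xe9sum\xe9.txt".encode("latin-1")   # b'r\xe9sum\xe9.txt'

# The mount layer may re-encode the *name* for a UTF-8 file system:
name_local = name_on_remote.decode("latin-1").encode("utf-8")
print(name_local)   # b'r\xc3\xa9sum\xc3\xa9.txt'

# ...but the file *content* must cross the mount byte-for-byte:
content = b"\xff\xfeanything at all\r\n"
assert bytes(content) == content   # no conversion of any kind
```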

                Kind regards
                /kent k