Re: Handling LS/PS

Larry Wall wrote on 1999-12-18 21:01 UTC:

> So you weed \n out of filenames.  What will the shell do if it sees a LS
> or PS character in a filename instead?  If the shell interprets it as a
> newline, you've got the same problem as if you'd let \n through.


I strongly recommend that POSIX applications such as the shell NOT
change the traditional line-breaking semantics for UTF-8. Lines are
terminated by \n and by nothing else. Everything else (including PS and
LS) is just a character within a line. PS and LS should be treated
exactly like, e.g., the equally unknown character U+FFF0. We open an
endless can of ugly worms if we change the encoding of line terminators
for POSIX. We would essentially send the ASCII compatibility of UTF-8 to
hell, so we might as well switch directly to UTF-16, having missed the
whole point of what UTF-8 is about.
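
To illustrate the point, here is a minimal Python sketch (not taken from
any POSIX tool). The helper name posix_lines is hypothetical. It shows
that LS and PS encode to three-byte UTF-8 sequences that contain no 0x0A
byte, so a tool that splits on \n alone passes them through as ordinary
characters within a line.

```python
# Sketch only: LS and PS are ordinary multi-byte sequences in UTF-8.
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def posix_lines(data: bytes) -> list:
    """Split on the byte 0x0A and nothing else (hypothetical helper)."""
    lines = data.split(b"\n")
    # A trailing "\n" terminates the last line; it does not start a new one.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


text = "one" + LS + "still one\ntwo" + PS + "still two\n"
data = text.encode("utf-8")

print(LS.encode("utf-8").hex(), PS.encode("utf-8").hex())  # e280a8 e280a9
print(len(posix_lines(data)))  # 2: only \n terminates a line
print(len(text.splitlines()))  # 4: str.splitlines() also breaks at LS/PS
```

Note that Python's own str.splitlines() does treat LS and PS as line
boundaries, which is exactly the kind of behaviour I am arguing classic
POSIX tools should not adopt.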

Having the shell and other classic Unix plain-text I/O tools interpret
LS/PS as line breaks would be as severe a change as switching from ASCII
to UTF-16. Do not even think about it!

I hope the authors of the relevant POSIX document on using UTF-8 can
clarify this point in the next revision (Keld?), because this seems to
be a frequently recurring and dangerous question from people who have
read the Unicode standard instead of ISO 10646.

LS/PS and UTF-8 do not fit together, because UTF-8 is ASCII compatible
and LS/PS are not. The only place in POSIX systems where you have to
worry about LS/PS is when you import UTF-16 files from non-POSIX systems
(WinNT, non-plaintext word processing files, etc.), where you might have
to perform a lossy reencoding of the form LS -> \n and PS -> \n\n. But
once the data is in UTF-8, LS/PS really should be treated as just
ordinary Unicode characters without special meaning on POSIX systems.
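
The import step could be sketched roughly as follows, again in Python.
The function name import_utf16 is hypothetical, and the sketch assumes
the input is UTF-16 (with or without a BOM) and handles nothing but the
LS and PS mappings mentioned above.

```python
# Sketch only: lossy import of UTF-16 text into UTF-8 with \n line ends.
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def import_utf16(raw: bytes) -> bytes:
    """Re-encode UTF-16 as UTF-8, mapping LS -> \\n and PS -> \\n\\n.

    Hypothetical helper. The mapping is lossy: afterwards a \\n that
    came from LS cannot be told apart from one that was already there.
    """
    text = raw.decode("utf-16")  # uses the BOM, if present, for byte order
    text = text.replace(PS, "\n\n").replace(LS, "\n")
    return text.encode("utf-8")


sample = ("first line" + LS + "second line" + PS + "new paragraph").encode("utf-16")
print(import_utf16(sample))  # b'first line\nsecond line\n\nnew paragraph'
```

After this step the data contains only \n as a line terminator, and
everything downstream can keep its traditional semantics.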

While the POSIX world is in need of a new character encoding, it is
definitely not in need of new line terminator semantics. The two are
fully orthogonal issues, and the Unicode standard has nothing useful to
offer for POSIX on the line terminator issue.

The line/paragraph separator parts of Unicode 3.0 are another reason why
I generally prefer to refer to "ISO 10646 = UCS" and not "Unicode". The
UCS standard doesn't talk about control characters (or just does some
handwaving towards ISO 6429, as ISO 8859 and ISO 646 do), and it can
therefore be understood much more clearly as just a replacement encoding
for the graphical characters, with everything else staying the same.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>