perl-unicode

RE: Correct use of UTF-8 under Unix

1999-11-02 06:24:34
(Note: I don't subscribe to perl-unicode(_at_)perl(_dot_)org, only to
linux-utf8(_at_)nl(_dot_)linux(_dot_)org, and I don't have Markus's original
message that is quoted below.)


:   - This means that lines in UTF-8 plaintext files are terminated
:     in one and only one way: 0x0a = LF.

That is not true.  "lines" in UTF-8 text files may be terminated by 
LINE FEED, CARRIAGE RETURN, CARRIAGE RETURN+LINE FEED, NEXT LINE,
or end-of-file, or be separated by LINE SEPARATOR or PARAGRAPH SEPARATOR
(which is in some sense 'stronger' than line separator).

(I don't know what originally came before the "This means that" in
Markus's message.)


:     Neither U+2028 (line separator,
:     introduced for use inside *.doc-style word processing binary files)

That is not true.  LINE SEPARATOR and PARAGRAPH SEPARATOR were once
introduced in the hope that they would "clear up the line ending mess".
(Whether they are used in ".doc"-style documents is a separate issue.)
That hope has not come to fruition yet, and it will take time before
the "line ending mess" is overcome, whichever way is used to overcome it.
Unicode Technical Report 13, Unicode Newline Guidelines
(http://www.unicode.org/unicode/reports/tr13/), gives some guidelines
on how to increase the interoperability with regard to "new line
function" (NLF) and LS/PS handling.  Basically the recommendation is
to accept all commonly occurring NLFs: CR, CR+LF, LF, the
EBCDIC-originated NL (NEXT LINE; U+0085; admittedly rare), as well as
LS and PS (and allow EOF to 'terminate a line').  I think they
should be accepted in any mixture.
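
As a rough illustration of that recommendation (a sketch only, not code
from TR13; the name split_lines is made up here), a Python line splitter
that accepts CR, CR+LF, LF, NL, LS and PS in any mixture, and lets
end-of-file terminate the last line, might look like this:

```python
import re

# CR+LF is listed first so that it is matched as one line end, not two.
_LINE_END = re.compile("\r\n|[\n\r\u0085\u2028\u2029]")

def split_lines(text):
    """Split decoded text into lines, accepting any mixture of NLF, LS and PS."""
    lines = _LINE_END.split(text)
    # A trailing line end terminates the last line; it does not start
    # a new, empty one.  End-of-file terminates an unterminated last line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

The pattern is spelled out explicitly because Python's built-in
str.splitlines() also breaks on a few further characters (VT and FF,
for instance), which goes beyond the set discussed here.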

Most(?) C compilers already appear to handle at least both LF and CR+LF
(mixed) fairly well.  This makes it easier to handle C source files in a
"mixed environment".  Shell scripts, yacc/bison files, etc. are still
problematic since their lexers still expect only LF.


:     nor overly long UTF-8 sequences for LF such as 0xc0 0x8a must be
:     accepted

True, unduly long UTF-8 encodings in general should be considered malformed.
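
For what it is worth, a strict decoder already behaves this way.  The
Python sketch below (the helper name is invented for illustration)
checks the one-byte form of LF against the overlong two-byte form
0xC0 0x8A:

```python
def is_well_formed_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8(b"\x0a"))      # shortest form of LF: True
print(is_well_formed_utf8(b"\xc0\x8a"))  # overlong form of LF: False
```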


:     as line terminators, otherwise we would get into the horrible
:     scenario that programs start to disagree what exactly a line is
:     (with a whole load of new security risks associated). Programs
:     such as "wc -l" must work on UTF-8 files without any modification
:     whatsoever! There is no reason to change the Unix line semantics
:     when moving from ASCII to UTF-8. U+2028 is treated just like any other
:     character and has no special meaning in a Unix plaintext file.

U+2028 and U+2029 should be handled as just another way of
indicating line separation/end (as should end-of-file) for the
purposes of perl/C/lex/bison/Ada/etc. None of these needs to
distinguish between line and paragraph separation, and all of
these ways of terminating/separating lines should be treated
the same, for increased interoperability.

Of course, to be able to detect NL, LS, and PS one needs to know
the character encoding first, since they have different codes in
different encodings and cannot be represented at all in some.  But
the same goes for LF and CR too, really, if UTF-16 is allowed, which
it should be in at least some circumstances. (No, I don't like
little endianism nor "BOM".)
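
To make the encoding dependence concrete, here is a small Python sketch
(the names LINE_ENDS and encoded_forms are invented for this example)
that lists the byte form of each line end in UTF-8, big-endian UTF-16
and ISO 8859-1, marking the ones an encoding cannot represent:

```python
LINE_ENDS = {
    "LF": "\n",
    "CR": "\r",
    "NL": "\u0085",
    "LS": "\u2028",
    "PS": "\u2029",
}

def encoded_forms(encodings=("utf-8", "utf-16-be", "latin-1")):
    """Map each line-end name to its bytes per encoding (None if unrepresentable)."""
    table = {}
    for name, ch in LINE_ENDS.items():
        row = {}
        for enc in encodings:
            try:
                row[enc] = ch.encode(enc)
            except UnicodeEncodeError:
                row[enc] = None  # e.g. LS and PS do not exist in ISO 8859-1
        table[name] = row
    return table

for name, row in encoded_forms().items():
    print(name, {enc: (b.hex() if b is not None else "n/a")
                 for enc, b in row.items()})
```

A scanner that looks for the single byte 0x0A therefore finds LF in
UTF-8 text, but in UTF-16 it has to look for a two-byte code unit
instead.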

Note that several programming languages, e.g. Java, Ada, and C,
allow non-ASCII in identifiers, with identifier identity defined
via the UCS. But they don't require a particular character
encoding for the source files, so compilers for these programming
languages MUST 'know' the character encoding of an individual
source file (via a compiler flag, system/individual/folder default,
or similar) in order to compile the source code correctly anyway.
Similarly for XML and its tag and attribute names, but each XML
file should self-declare which character encoding it is in.

Which way of ending/terminating lines should be preferred on output?
That might depend on a preference setting, or an editing change (like
"turn all NLFs into LS").
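
Such an editing change could be sketched in Python roughly as follows
(normalize_nlf and its "preferred" parameter are hypothetical names;
the default of LS is only there to mirror the example above):

```python
import re

# The NLFs only: CR+LF (matched first, as one unit), LF, CR and NL.
# LS and PS are left alone, so paragraph separation survives.
_NLF = re.compile("\r\n|[\n\r\u0085]")

def normalize_nlf(text, preferred="\u2028"):
    """Replace every NLF in decoded text with the preferred line end."""
    # A function is used as the replacement so that "preferred" is
    # inserted literally, whatever characters it contains.
    return _NLF.sub(lambda match: preferred, text)
```

Calling normalize_nlf(text, "\n") instead would give the conventional
Unix form on output.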

                Kind regards
                /Kent Karlsson