perl-unicode

RE: Correct use of UTF-8 under Unix

1999-11-04 11:32:10
From: Larry Wall [mailto:larry(_at_)wall(_dot_)org]
Karlsson Kent - keka writes:
:     So, the interoperable line (or 'stronger') separators in
: "plain text" are:
: 
:     \X{2028}|\X{2029}|\r\n|\n|\r|\f|\v|\X{85}

This is slightly wrong if you broaden the picture to more than Unix.
You shouldn't really use \r\n to mean \015\012 because \n is 
(according to K&R) a logical newline, not \012.  On a Mac, the 
...

Sorry; put in hex codes for all of them, then. (Oh, why didn't I...)
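
A minimal sketch of the separator list with hex codes throughout,
written in Python rather than Perl. The names LINE_SEP and
split_lines are hypothetical, chosen only for illustration; the
alternatives are the ones listed in the quoted line above, with
CR LF placed first so that the pair is consumed as one separator.

```python
import re

# Every separator is spelled as a code point, so the pattern does not
# depend on what "\n" or "\r" mean on a given platform.
# CR LF comes first so the pair is matched as a single separator.
LINE_SEP = re.compile(r"\x0D\x0A|\x0A|\x0D|\x0C|\x0B|\x85|\u2028|\u2029")


def split_lines(text):
    """Split text on any of the separators listed in the thread."""
    return LINE_SEP.split(text)
```

For example, split_lines("a\r\nb\x85c") yields three pieces, since
CR LF and NL each count as one separator.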

:     Note that there are some "legacy" encodings which do not
: have any or all of \f|\v|\X{85}.

As I mentioned earlier, Perl doesn't count \f as a new line for line
counting purposes.  This seems to be how editors treat them (or don't
treat them, depending on how you look at it.)  It's also 
consistent with what wc thinks.
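
The difference between the two views of FORM FEED can be sketched
in Python (not Perl, so this only illustrates the idea). The helper
count_newlines is a hypothetical name; it counts LINE FEED
characters only, which is the wc-style count described above.
Python's built-in str.splitlines, by contrast, does treat FORM FEED
as a line boundary.

```python
def count_newlines(text):
    """Count lines the wc way: one per LINE FEED (0x0A), nothing else."""
    return text.count("\x0A")


# One FORM FEED in the middle, one LINE FEED at the end.
sample = "page one\x0Cpage two\x0A"

wc_style = count_newlines(sample)          # FORM FEED is not counted
broad_style = len(sample.splitlines())     # FORM FEED ends a line here
```

With this sample the two counts differ by one, which is exactly the
disagreement about \f discussed in this thread.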

Though FORM FEED is not part of Unicode, nor of 10646 (neither of
them formally contains any C0 or C1 control character, nor DELETE),
it still has a line breaking property (BK), the same as that of
LINE SEPARATOR. (This is a default only, since FORM FEED is not
formally there.) See Unicode Technical Report 14
(http://www.unicode.org/unicode/reports/tr14/tr14-5.pdf)
and the associated  data file
(ftp://ftp.unicode.org/Public/3.0-Update/LineBreak-5.txt).
That NL (NEXT LINE, 0085) does not have it is a (very small)
consistency bug. Note though that NL has BiDi property B, which is
block separating, the same as LF and CR have. (That is only a
default, since NL is formally not in Unicode as such.)
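
The BiDi claim can be checked with Python's standard unicodedata
module, as a rough sketch. Two caveats: the module reports the
Unicode version bundled with the interpreter, not the 3.0 data
files cited above, and it exposes the BiDi class but not the line
breaking property. CONTROLS and bidi_classes are hypothetical names.

```python
import unicodedata

# The control characters under discussion, by their usual short names.
CONTROLS = {
    "LF": "\x0A",
    "CR": "\x0D",
    "NL": "\x85",
    "FF": "\x0C",
    "VT": "\x0B",
}


def bidi_classes(chars=CONTROLS):
    """Map each short name to the BiDi class of its character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}
```

Calling bidi_classes() shows whether NL shares its class with LF
and CR in the interpreter's Unicode data.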

The other interesting thing is that we *removed* support for \v from
Perl some time ago, since nobody we were acquainted with had any idea
what it really meant, or if anyone actually used it for anything.
There have been no complaints.  Paint \v dead.

Though that is not plain text, \v appears to be used as a line
separator (as opposed to the paragraph separator LF+CR) in .doc
files.  It got into UTR13 too, since apparently it is (rarely)
used as a newline character in plain text on some systems
(according to the author).  It's probably dead in plain text, but
for some strange reason the '\v' "escape sequence" is still in C99.

As for \X{85}, I've never heard of it.  But then, I'm not of Latin
extraction.

Apparently some (not all) conversions from EBCDIC can generate
this.  Or so IBM says.

I agree that VT and NL are very rare, and could probably be
ignored, if one so desires.  FORM FEED is not that rare yet,
though I hope very few would use it in a source code file.

                Kind regards
                /kent k