ietf-openpgp

Re: Let's resolve the end-of-line and whitespace question

2004-02-20 20:36:48

Poking around at www.unicode.org, I found this character names list:

http://www.unicode.org/Public/UNIDATA/NamesList.txt

Grepping for "space", and cross-checking against the code charts at
http://www.unicode.org/charts/, produces the following list of Unicode
whitespace characters:

0020    SPACE
00A0    NO-BREAK SPACE
2002 - 200B     (various widths of spaces)
202F    NARROW NO-BREAK SPACE
205F    MEDIUM MATHEMATICAL SPACE
2060    WORD JOINER
3000    IDEOGRAPHIC SPACE
FEFF    ZERO WIDTH NO-BREAK SPACE

You'll notice that tab is not present, nor are carriage return, line
feed, form feed, etc.  These are considered "control" characters.
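
The distinction can be checked against the Unicode general categories.
Here is a minimal sketch using Python's standard unicodedata module; the
helper name and the two tables are my own labels for illustration, not
anything taken from the Unicode files.

```python
import unicodedata

def general_category(ch: str) -> str:
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)

# Labels are informal; control characters carry no formal Unicode name.
CONTROLS = {"TAB": "\t", "LINE FEED": "\n", "FORM FEED": "\x0c",
            "CARRIAGE RETURN": "\r"}
SPACES = {"SPACE": "\u0020", "NO-BREAK SPACE": "\u00a0",
          "IDEOGRAPHIC SPACE": "\u3000"}

if __name__ == "__main__":
    for label, ch in {**CONTROLS, **SPACES}.items():
        # "Cc" is the control category, "Zs" the space-separator category.
        print(f"U+{ord(ch):04X} {label}: {general_category(ch)}")
```

Tab, line feed, form feed and carriage return report "Cc" (control),
while SPACE, NO-BREAK SPACE and IDEOGRAPHIC SPACE report "Zs" (space
separator).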

Now, of these space characters, NO-BREAK SPACE should not occur at the
end of a line, because that is what a no-break space means, a space
between two words where no line break should occur.  Neither should
NARROW NO-BREAK SPACE, WORD JOINER (which is a no-break space), or ZERO
WIDTH NO-BREAK SPACE.  Therefore I think all of these should be hashed
even if they do occur at the end of a line.

The 2002-200B variable-width spaces include EN SPACE, EM SPACE, THIN
SPACE, HAIR SPACE, etc.  These are presumably used for typographic
purposes to precisely specify the layout.  If they are at the end
of a line, I think we can assume they are there for a purpose and
should be hashed.  Likewise with MEDIUM MATHEMATICAL SPACE, which is
four-eighteenths of an em.

The only one left is IDEOGRAPHIC SPACE, which I suspect is the default
space character in ideographic languages (although it's possible they use
ordinary SPACE).  I could imagine it being put at the end of a line by
accident, by a Chinese typist or poorly designed word processing program,
so I'd suggest that it should be stripped before hashing.

This is the only one I would suggest adding, along with SPACE.
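
To make the suggestion concrete, here is a rough sketch of the stripping
rule in Python.  It assumes the text has already been split into lines,
and the function names, the UTF-8 encoding, the CRLF join and the choice
of SHA-256 are all placeholders of mine rather than anything the draft
specifies.  Tab is left alone here only because this message does not
address it.

```python
import hashlib

# Only these two are stripped at end of line under this suggestion.
STRIPPED_AT_EOL = "\u0020\u3000"  # SPACE, IDEOGRAPHIC SPACE

def strip_trailing(line: str) -> str:
    """Drop trailing SPACE and IDEOGRAPHIC SPACE; keep everything else."""
    return line.rstrip(STRIPPED_AT_EOL)

def hash_lines(lines) -> str:
    """Hash lines after stripping (hypothetical canonical form).

    Assumptions: lines are joined with CRLF, encoded as UTF-8, and
    hashed with SHA-256 purely for illustration.
    """
    canonical = "\r\n".join(strip_trailing(line) for line in lines)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    # Trailing SPACE / IDEOGRAPHIC SPACE do not change the hash...
    print(hash_lines(["abc \u3000"]) == hash_lines(["abc"]))
    # ...but a trailing NO-BREAK SPACE does, since it is kept and hashed.
    print(hash_lines(["abc\u00a0"]) == hash_lines(["abc"]))
```

With this rule a line ending in SPACE or IDEOGRAPHIC SPACE hashes the
same as the bare line, while a trailing NO-BREAK SPACE or any of the
typographic spaces still changes the hash.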

As far as control characters go, there are many of them, but most
either should not be present in text documents at all or, if they are
present, are significant and should be hashed.

However, in addition to whitespace we also need to think about line
terminators, because that's where we want to strip whitespace.
What character is used in vertically laid out languages to mean
"go to the next column"?  Is it one of carriage return or line feed?
Or something else?

Hal Finney