Poking around at www.unicode.org, I found this character names list:
http://www.unicode.org/Public/UNIDATA/NamesList.txt
Grepping for "space", and cross-checking against the code charts at
http://www.unicode.org/charts/, produces the following list of Unicode
whitespace characters:
0020 SPACE
00A0 NO-BREAK SPACE
2002-200B (various widths of spaces)
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
2060 WORD JOINER
3000 IDEOGRAPHIC SPACE
FEFF ZERO WIDTH NO-BREAK SPACE
You'll notice that tab is not present, nor are carriage return, line
feed, form feed, etc. These are considered "control" characters.
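One way to check how these characters are classified is to ask a Unicode
database directly. The sketch below assumes Python and its standard
unicodedata module; the CANDIDATES list and the describe() helper are names
made up for this illustration. The categories reported depend on the Unicode
version bundled with the interpreter, and some of the zero-width entries may
be reported as format characters rather than space separators.

```python
import unicodedata

# Code points from the list above (one representative of the 2002-200B
# range plus 200B itself), followed by tab, LF, FF and CR for comparison.
CANDIDATES = [
    0x0020, 0x00A0, 0x2002, 0x200B, 0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF,
    0x0009, 0x000A, 0x000C, 0x000D,
]


def describe(cp):
    """Return 'HEX CATEGORY NAME' for one code point."""
    ch = chr(cp)
    # Control characters have no formal name, so supply a placeholder.
    name = unicodedata.name(ch, "<unnamed control>")
    return f"{cp:04X} {unicodedata.category(ch)} {name}"


for cp in CANDIDATES:
    print(describe(cp))
```

Space separators show up with category Zs, while tab, carriage return, line
feed and form feed show up as Cc (control).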
Now, of these space characters, NO-BREAK SPACE should not occur at the
end of a line, because that is what a no-break space means: a space
between two words where no line break should occur. Neither should
NARROW NO-BREAK SPACE, WORD JOINER (which is a zero-width no-break space),
or ZERO WIDTH NO-BREAK SPACE. Therefore I think all of these should be
hashed even if they do occur at the end of a line.
The 2002-200B variable-width spaces include "en" space, "em" space, "thin"
space, "hair" space, etc. These are presumably used for typographic
purposes to precisely specify the layout. If they are at the end
of a line, I think we can assume they are there for a purpose and
should be hashed. Likewise with MEDIUM MATHEMATICAL SPACE, which is
four-eighteenths of an em space.
The only one left is IDEOGRAPHIC SPACE, which I suspect is the default
space character in ideographic languages (although it's possible they use
ordinary SPACE). I could imagine it being put at the end of a line by
accident, by a Chinese typist or poorly designed word processing program,
so I'd suggest that it should be stripped before hashing.
It is the only character I would suggest stripping, along with SPACE.
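The rule suggested above could be sketched roughly as follows. This is only
an illustration, assuming Python: the names STRIPPABLE,
strip_trailing_whitespace() and hash_lines() are hypothetical, and the choice
of SHA-256, UTF-8 encoding and a CRLF separator between lines is an
assumption made for the example, not something settled here. The input is
assumed to be already split into lines with terminators removed.

```python
import hashlib

# Assumption: only SPACE and IDEOGRAPHIC SPACE are stripped at end of line.
# All other space characters are left in place and therefore hashed.
STRIPPABLE = "\u0020\u3000"


def strip_trailing_whitespace(line):
    """Remove trailing SPACE and IDEOGRAPHIC SPACE from a single line."""
    return line.rstrip(STRIPPABLE)


def hash_lines(lines):
    """Hash lines after stripping trailing whitespace from each one.

    SHA-256, UTF-8 and the CRLF separator are illustrative choices only.
    """
    h = hashlib.sha256()
    for line in lines:
        h.update(strip_trailing_whitespace(line).encode("utf-8"))
        h.update(b"\r\n")
    return h.hexdigest()
```

With this sketch, a line ending in SPACE or IDEOGRAPHIC SPACE hashes the same
as the line without it, while a line ending in NO-BREAK SPACE or THIN SPACE
hashes differently from its stripped form.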
As for control characters, there are many of them, but most of them
either should not be present in text documents or, if they are, are
significant and should be hashed.
However, in addition to whitespace we also need to think about line
terminators, because that's where we want to strip whitespace.
What character is used in vertically laid out languages to mean
"go to the next column"? Is it one of carriage return or line feed?
Or something else?
Hal Finney