Re: Let's resolve the end-of-line and whitespace question




Jon Callas wrote:

... It sounds like Unicode whitespace may be a huge can of worms.



From the little research I did, I couldn't find
any help in "defining Unicode whitespace," so,
yes, I agree it will be a huge can of worms.
(Unicode in any context is a mess, but it seems
clear that we need something, so UTF-8 it is.)

Alternatively, we could just say trim anything that's <= 0x20, which is a simple enough thing that solves some obvious attacks with backspacing and bare CRs to overstrike.



Good point, that would get my vote.  I'd prefer the
"chars <= 0x20" test as it is very clear, easy to
code, and it resolves what one does when some
spurious unprintable char we haven't thought about
comes into play.  I hadn't thought about backspaces...


(other mail:)
> The 2440 change in text signatures (adding in whitespace trimming) was
> one of a number of small things there that were debated as to what the
> right thing should be, rather than what went before. There are many good
> reasons for removing trailing whitespace at the end of anything that's
> text mode. It's the sort of thing that gets mangled easily and
> undetectably, as well as a covert channel. (I come from an era in which
> it was common practice for text editors to trim trailing whitespace when
> saving a file, and consider it a feature rather than a bug.)


It may be that the era is not dead.  There are
a number of scenarios that either add or change
whitespace on the ends of lines;  cut&paste
does it in circumstances which are too boring
to research.


> However, I'd be perfectly happy to settle it once and for all by saying
> only normalize line ends, even. We can just not worry about the
> whitespace. In short, if the consensus here is that however well-meaning
> that change was, it was a bad idea, it's easy to fix. I can see that
> Unicode issues might turn this into a swamp in ways that just trimming
> spaces and tabs isn't.


My vote would be to trim whitespace and normalise
line endings to CR/NL, where whitespace is <=0x20:

    Also, any trailing whitespace (characters <= 0x20) at the
    end of any line is ignored when the cleartext signature is
    calculated.
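
To show how little code that rule needs, here is a minimal sketch in Python.
It assumes the text arrives as bytes, that lines are delimited by LF, and
that the normalised line ending is CR LF. The name canonicalize_cleartext is
made up for illustration and is not taken from any implementation.

```python
def canonicalize_cleartext(data: bytes) -> bytes:
    """Trim trailing bytes <= 0x20 from each line, then join lines with CR LF.

    Assumption: lines are delimited by LF. A CR just before the LF is
    removed by the same "<= 0x20" test, so CR LF and bare LF inputs
    produce the same output.
    """
    trimmed = []
    for line in data.split(b"\n"):
        end = len(line)
        # Walk back over every trailing byte that is <= 0x20
        # (space, tab, CR, backspace, NUL, and so on).
        while end > 0 and line[end - 1] <= 0x20:
            end -= 1
        trimmed.append(line[:end])
    return b"\r\n".join(trimmed)
```

Note that this only touches the end of each line. A bare CR or backspace in
the middle of a line is left alone, and a bare CR used as a line ending is not
treated as a delimiter here; both would need their own ruling.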

I think there should be a comment in there that
indicates what to do with Unicode, just to show
we thought about it, and not waste people's time
asking the question when they are implementing.
Something like:


    Unicode whitespace, where defined, SHOULD NOT be ignored.

Or,

    No Unicode whitespace characters are defined.


Leaving open the possibility of defining them in
an update?
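
One property that may make the first wording cheap to honour: in UTF-8,
every byte of a non-ASCII character is >= 0x80, so a byte-wise "<= 0x20"
test never touches Unicode whitespace such as U+00A0 (no-break space) or
U+3000 (ideographic space). A small sketch, where trim_trailing is a
hypothetical helper and the sample strings are invented:

```python
def trim_trailing(line: bytes) -> bytes:
    """Drop trailing bytes <= 0x20 from a single line (hypothetical helper)."""
    end = len(line)
    while end > 0 and line[end - 1] <= 0x20:
        end -= 1
    return line[:end]


# U+00A0 encodes as C2 A0 and U+3000 as E3 80 80 in UTF-8,
# so neither contains a byte that the test would remove.
nbsp_line = "total\u00a0".encode("utf-8")
ideographic_line = "total\u3000 ".encode("utf-8")

kept_nbsp = trim_trailing(nbsp_line)
kept_ideographic = trim_trailing(ideographic_line)
```

Here the no-break space survives untouched, and only the ASCII space after
the ideographic space is removed, so such characters stay inside the signed
data.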

iang