ietf-openpgp

Re: Text canonicalization

2001-12-27 15:02:53

In computer science, there are some interesting unsolved problems. A number
of them revolve around semantic issues -- what does data mean -- as opposed
to syntactic issues.

One of these is the notion of text, as opposed to raw data. There are others (for example, the unsolved question of what a digital signature means). I'm going to blither about text, and its relevance to OpenPGP, in this missive.

In the dim and ancient past, everyone had their own character set.
Gradually, there came to be standard character sets. Arguably, ASCII was
the first standard character set. Then, over the years, it was enhanced to
ISO Latin-1 (ISO 8859-1), and then all the 8859 follow-ons for blocks of
extra characters. I find it charming, for example, that ISO 8859-5 is Cyrillic, while the name ISO Latin-5 actually belongs to ISO 8859-9. There are also lots of other quasi-standard character
sets that are enhancements on ASCII. Then there are all the interesting
ones for Asian languages, and so on and so forth, leading up to Unicode and
ISO 10646, which in theory should eventually be *the* standard character
set.

But in all of this, there's been no standardization of what either the
syntax or semantics of a "line end" is. I'm going to stay away from the
semantics issue, as fun as that could be, because it's really not relevant.
(That discussion would cover how a line end differs between languages that
go left-to-right and right-to-left, and then progress into how you deal
with vertical lines, too.)

There are at least four different ways to denote a line end. For the
purposes of the discussion, I'm going to assume that when a line ends, you
start writing at the left margin, and down one line. I'm doing this for no
other reason than that's what I'm using now. The four mechanisms I can think of are:

* Two characters, one that says to go all the way left, one that says go
down one line. The "standard" way to do that is with a CR and LF (0x0D,
0x0A). This is also the way that IETF protocols denote line ends.

* One character that does both things. People have used each of CR and LF
to do this, so this is two mechanisms under one bullet.

* Line ends are meta-data. Record-oriented (as opposed to stream-oriented) systems do this. For example, imagine storing a file as a stream of counted strings, each one denoting a line. There's no actual line end anywhere in such a file; the line breaks are inferred from the sizes of the lines. (A small sketch of this appears just after this list.)
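
As a rough illustration of that last, record-oriented mechanism, here's a minimal Python sketch. The two-byte length prefix and the UTF-8 encoding are my own assumptions for illustration, not any particular system's format:

    import struct

    def encode_records(lines):
        # Each line becomes a 2-byte big-endian length followed by its bytes.
        # No CR or LF appears anywhere; line breaks exist only as meta-data.
        out = bytearray()
        for line in lines:
            data = line.encode("utf-8")
            out += struct.pack(">H", len(data)) + data
        return bytes(out)

    def decode_records(blob):
        # Walk the counted records to recover the lines.
        lines, i = [], 0
        while i < len(blob):
            (n,) = struct.unpack_from(">H", blob, i)
            i += 2
            lines.append(blob[i:i + n].decode("utf-8"))
            i += n
        return lines

    # Round trip: the "file" never contains a line-end character.
    assert decode_records(encode_records(["first line", "second"])) == ["first line", "second"]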

It really doesn't matter how you denote any of this stuff (not even
character set) until you want to move a chunk of data from one system to
another and preserve its semantic content. There are a couple of ways to do
this:

* Declare an interchange format, and always translate your chunk into this
format when you transmit it, and out of it when you receive it.

* When you transmit your chunk, tell your partner what format it's in, and
let them translate it.

It is my belief that in OpenPGP, we do the former -- we have an interchange
format for text. That format consists of the ISO 10646 character set that
has been encoded into UTF-8. Additionally, we use CRLF line-ends, and
perhaps controversially, trim trailing white space from the end of each
line. Note that there is at least one more semantic ambiguity in this
interchange format -- tabs. Let's not go there right now, but arguably, we
should do something about that, too.
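
To make that concrete, here's a minimal sketch of such a canonicalizer in Python. It assumes the input has already been converted to UTF-8, and it trims spaces and tabs from line ends (that last detail is my own assumption; the exact whitespace rules are part of what's under discussion):

    def canonicalize_text(data: bytes) -> bytes:
        # Interchange format as described above: UTF-8, CRLF line ends,
        # trailing white space trimmed. Conversion from the local character
        # set into UTF-8 is assumed to have happened already.
        text = data.decode("utf-8")
        # Fold CRLF, bare CR, and bare LF into one logical line break.
        lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
        # Trim trailing spaces and tabs, then rejoin with CRLF.
        return "\r\n".join(line.rstrip(" \t") for line in lines).encode("utf-8")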

It is also my belief that when a chunk of data is marked as being "text"
that it is an assurance that the chunk of data has been translated into
that interchange format. This is what we mean by "text canonicalization."

An interesting side-effect of this is that when I verify a signature over a
chunk of data, it doesn't matter whether it's "text" or "binary." You can
just hash the chunk, compute the signature, and poof, you're done. Also,
interestingly, it means that minor errors in canonicalization (let's suppose that someone's canonicalizer occasionally leaves in a trailing space, or it sometimes encodes a blank line as CR, LF, LF instead of CR, LF, CR, LF -- I've seen both of these errors over the years) don't affect signature checks.

Text mode, as we call it, should only come into play when you display text,
or want to use that chunk as text (like writing it out into a file). If you
verify a signature over a text file (call it foo.c), and then want to write
it out on a unix system, you need to translate the CRLFs into LFs. You have
to deal with what it means to have a bare CR, what to do about the
character set (your user might be using Shift-JIS or Big5), and you'll have
to massage that chunk to preserve its semantic content.
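
Here's a minimal sketch of that last step, writing interchange-format text back out as unix text. How to treat a bare CR and which local encoding to use are choices I'm assuming for illustration; the standard doesn't dictate them:

    def canonical_to_local(data: bytes, local_encoding: str = "utf-8") -> bytes:
        # Interchange format in (UTF-8, CRLF), local unix text out (LF).
        text = data.decode("utf-8")
        # Treat a bare CR as a line break too; another implementation might
        # just as reasonably preserve it or reject it.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        # Re-encode for the local system; a Shift-JIS or Big5 user would
        # pass that encoding instead.
        return text.encode(local_encoding)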

I believe that when a blob is marked as text mode, it's an *assurance* that
you've transformed the blob into "text mode." It is *not* an advisory to
tell your partner that every time they see a bare LF they should drop an
extra CR into the hash. Should such an implementation exist, I believe it
is wrong. Text mode means you have translated the blob from your local
format into the interchange format and its binary content is that of the
interchange format.

The IETF meta-rule we keep coming back to is to be conservative in what you
generate, and liberal in what you accept. So -- you should only mark a
chunk of data as "text" if it's been run through a canonicalizer that
transforms it into our interchange format. On the other hand, it is reasonable, but not necessary, that you accept something that's not strictly right. For example, let's suppose that an OpenPGP implementation has a bug: after the canonicalizer has been run, the message is generated with the pre-canonicalized text. Oops. What we have in this case is unix text (LF line-ends, let's say, ignoring the character-set issues), but a signature that was computed over the same text with CRLF line ends. If an implementation were, after failing to verify the signature on the blob, to use a heuristic that tried to compensate for that encoding error, then I think we'd all applaud. That software would say to itself,
"Hmm, this signature didn't check. Let me look at the file. Well, well.
It's got 37 LFs and not a single CR. I'll bet that someone made a mistake.
Let's add in those CRs and check the signature again. Hey, look at that, it
works."
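
A sketch of that kind of forgiving verifier is below. The verify_signature parameter stands in for whatever your implementation's real signature check is; it's not a real API, and the heuristic is just the one described above:

    def verify_with_repair(data: bytes, signature, key, verify_signature) -> bool:
        # First, check the signature over the bytes exactly as found in the
        # data packet.
        if verify_signature(data, signature, key):
            return True
        # Only on failure, and only if the data looks like it lost its CRs
        # (LFs present, not a single CR), add them back and try once more.
        if b"\n" in data and b"\r" not in data:
            repaired = data.replace(b"\n", b"\r\n")
            return verify_signature(repaired, signature, key)
        return False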

But that's mostly a digression. The point I'm making is this: Whatever
problems text mode, canonicalization, and translations give you, there
should be zero problems in signature checking. A signature is a signature
is a signature is a signature. The signature should be over the actual data
as found in the data packet.

While I'm at it, I'll mention that as far as OpenPGP is concerned, there
are other things a canonicalizer can do. It can, for example, translate
tabs into spaces. Heck, it can translate ANSI escape sequences into HTML,
or HTML into RTF, for all OpenPGP cares. The end users may care, but it's
OpenPGP compliant. You can always do more than what the standard says. Your
users might not like it, but the standard doesn't care.

        Jon
