Re: (out of the blue) OCP header encoding issues



On Mon, 19 May 2003, Keith Moore wrote:

[note: I'm not on this list, and I probably don't care enough about
OPES to follow discussion on the list.  (I have read this thread in
the list archives, but I don't have the broader context).  I have
done some thinking about presentation encodings in protocols, and
someone who knew of that work forwarded this to me and suggested
that I follow up.]


Thank you for the detailed and thoughtful comments! I hope you will
continue to review our work from time to time. As you may know by now,
we have decided to take the first step towards a text-based protocol.
The first rough draft has been posted [1]. The current grammar, written
in the ABNF notation of RFC 2234, is quoted below.

        message = name [parameters] [payload] "." CRLF

        parameters = [anonym-parameters] [CRLF named-parameters]
        payload = data

        anonym-parameters = *anonym-parameter
        anonym-parameter = SP value
        named-parameters = *named-parameter
        named-parameter = name ":" SP value CRLF

        name = ALPHA *safe-OCTET
        value = bare-value / quoted-value
        bare-value = <1>*safe-OCTET
        quoted-value = DQUOTE data DQUOTE
        data = size ":" <n>OCTET                     ; n == size

        safe-OCTET = ALPHA / DIGIT / "-" / "_"
        size = %d0-2147483647

Here are 4 examples of short control messages (omitting CRLF at the end
of each message):

     i-am-here.
     data-pause 22 1.
     data-end 22 1 200.
     do-you-support "28:http://iana.org/opes/ocp/TLS".

Here is an example of a more complex message that carries application
data (omitting CRLF at the end of each line):

     data-have 1 3 0 8865
     modp: 75
     x-info: "26:twenty six octet extension"
     8865:<... 8865 bytes of data ...>.
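
To make the grammar above concrete, here is a minimal Python sketch of a
parser for one message. It is an illustration only, not code from the
draft: the function names (read_data, read_value, parse_message) and the
returned tuple layout are made up for this example, and the meaning of
individual parameters is not interpreted.

```python
import re

NAME = re.compile(rb"[A-Za-z][A-Za-z0-9_-]*")   # name = ALPHA *safe-OCTET
BARE = re.compile(rb"[A-Za-z0-9_-]+")           # bare-value
SIZE = re.compile(rb"(\d+):")                   # size ":"


def read_data(buf, pos):
    """Read 'size ":" <size>OCTET' at pos; return (octets, new_pos)."""
    m = SIZE.match(buf, pos)
    if not m:
        raise ValueError("expected size prefix at offset %d" % pos)
    start = m.end()
    end = start + int(m.group(1))
    if end > len(buf):
        raise ValueError("truncated data")
    return buf[start:end], end


def read_value(buf, pos):
    """Read a bare-value or a quoted-value; return (value, new_pos)."""
    if buf[pos:pos + 1] == b'"':
        data, pos = read_data(buf, pos + 1)
        if buf[pos:pos + 1] != b'"':
            raise ValueError("missing closing quote")
        return data, pos + 1
    m = BARE.match(buf, pos)
    if not m:
        raise ValueError("expected value at offset %d" % pos)
    return m.group(0), m.end()


def parse_message(buf):
    """Parse one message; return (name, anonymous, named, payload, consumed)."""
    m = NAME.match(buf)
    if not m:
        raise ValueError("expected message name")
    name, pos = m.group(0), m.end()
    anonymous, named, payload = [], {}, None
    # anonym-parameters: each one is SP value
    while buf[pos:pos + 1] == b" ":
        value, pos = read_value(buf, pos + 1)
        anonymous.append(value)
    # named-parameters: introduced by CRLF, each one is name ":" SP value CRLF
    if buf[pos:pos + 2] == b"\r\n":
        pos += 2
        while True:
            m = NAME.match(buf, pos)
            if not m or buf[m.end():m.end() + 2] != b": ":
                break
            value, pos = read_value(buf, m.end() + 2)
            if buf[pos:pos + 2] != b"\r\n":
                raise ValueError("expected CRLF after named parameter")
            named[m.group(0)] = value
            pos += 2
    # payload starts with a digit, so it cannot be confused with a name
    if buf[pos:pos + 1].isdigit():
        payload, pos = read_data(buf, pos)
    if buf[pos:pos + 3] != b".\r\n":
        raise ValueError("expected message terminator")
    return name, anonymous, named, payload, pos + 3
```

Note that the parser never scans inside a quoted value or a payload: it
reads the size and skips ahead, so a "." or CRLF inside the data is
harmless.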

I think we managed to avoid several of the pitfalls you describe below,
but more work remains. Please keep in mind that the current draft does
not necessarily represent any consensus of the working group.

One of the major differences between our protocol and protocols like
HTTP is that we have many very small "control" messages in addition to a
few possibly large messages that carry payloads. HTTP has, essentially,
one shot: a request message has to contain all information about client
desires and a response message has to contain everything about the
server reaction. With SMTP, there are a few control messages but they
are pretty much limited to initial negotiations. We have a bidirectional
pipeline of control and "data" messages.

...  Actually I'd say that how you delimit records is the
fundamental question, not whether you use text or binary.  You
basically have two choices: length counts or end-of-record
delimiters.


I agree. We currently use length counts for "unsafe" and possibly large
protocol "atoms" such as complicated parameter values ('quoted-value'
above) or application data ('payload' above). Everything else is
delimiter-based. Thus, we avoid expensive/awkward "octet stuffing" but
keep messages human-friendly.
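
A small Python sketch of that split between delimiter-based and
length-counted atoms, seen from the sending side. The helper names
(encode_data, encode_value) are hypothetical and only mirror the 'data',
'bare-value', and 'quoted-value' productions quoted above.

```python
import re

# A value made only of safe octets can be sent bare (delimiter-based).
SAFE_ONLY = re.compile(rb"[A-Za-z0-9_-]+")


def encode_data(octets: bytes) -> bytes:
    """Length-counted form: size ":" octets."""
    return str(len(octets)).encode("ascii") + b":" + octets


def encode_value(octets: bytes) -> bytes:
    """Send safe values bare; wrap everything else as a quoted-value."""
    if SAFE_ONLY.fullmatch(octets):
        return octets
    return b'"' + encode_data(octets) + b'"'
```

With these helpers, encode_value(b"22") stays b"22", while a URI such as
b"http://iana.org/opes/ocp/TLS" is sent with a 28-octet length count
inside double quotes.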

End-of-record delimiters are attractive in that you don't have to
know the length of a record in advance before you start writing it


True. However, as far as basic protocol elements are concerned, in my
experience, you always know the length in advance except when writing
numbers. If you do not know the length, something else is broken in
the design (e.g., the protocol lacks chunking support for raw data).

IMO, the primary practical feature (some would say advantage) of
delimiters is that they allow for human-friendly syntax. For example,

        GET / HTTP/1.0 CRLF

is much more friendly to a human than the equivalent

        3:GET1:/4:HTTP1:/3:1.02:CRLF

or something of that kind. Note that computer "preferences" are quite
the opposite -- the second example leaves fewer possibilities for errors
in a general context.

- but they do have some disadvantages: you don't know the length of
a record before you start reading it either,


This is usually not a problem for performance-sensitive protocols
because their implementations read into raw I/O buffers anyway and
parse records in place. If one cares about performance, one allocates
I/O buffers and not per-record header structures (where possible), so
the record length is not needed up front.

and if you're going to want the ability to transmit arbitrary octet
values within a record then you need some kind of quoting mechanism,
which introduces more complexity.  Once you have that quoting
mechanism you can't use ordinary printf statements (or whatever) to
emit protocol bits.


Yes, quoting (i.e., octet stuffing) is very inefficient because it
requires every agent to look at every octet of the supposedly opaque
data. This is why we are avoiding it in the protocol.
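
To illustrate the cost difference, here is a Python sketch that compares
a generic octet-stuffing scheme with a length count. The stuffing rules
(backslash as the escape octet, "." as the delimiter) are assumptions
chosen for this example; they are not taken from any particular
protocol.

```python
def stuff(data: bytes, delim: int = 0x2E, esc: int = 0x5C) -> bytes:
    """Escape every delimiter or escape octet; each octet is inspected."""
    out = bytearray()
    for octet in data:
        if octet in (delim, esc):
            out.append(esc)
        out.append(octet)
    return bytes(out)


def unstuff(data: bytes, esc: int = 0x5C) -> bytes:
    """Undo stuff(); again every octet has to be looked at."""
    out = bytearray()
    escaped = False
    for octet in data:
        if not escaped and octet == esc:
            escaped = True
            continue
        out.append(octet)
        escaped = False
    if escaped:
        raise ValueError("dangling escape octet")
    return bytes(out)


def length_prefix(data: bytes) -> bytes:
    """Length-counted alternative: the data itself is copied untouched."""
    return str(len(data)).encode("ascii") + b":" + data
```

Both stuff() and unstuff() loop over the whole body, and every
intermediary that forwards the data has to do the same. length_prefix()
only prepends a few octets, whatever the body contains.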

Length counts make transparency easy, but might be unattractive if
some records will be so large that you don't want to buffer the
whole record before transmitting any of it.


That's why chunking support (in some shape or form) is a must for any
modern protocol.
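
A minimal Python sketch of what chunking buys the sender: each chunk
carries its own length count, so the total size never has to be known or
buffered. The function name and the default chunk size are assumptions
for illustration. How the end of the data is signalled is left out here;
a separate control message would be one option.

```python
import io


def chunked(stream, chunk_size=4096):
    """Yield 'size:octets' chunks until the stream is exhausted."""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        yield str(len(block)).encode("ascii") + b":" + block


# Usage: a 10-octet body sent in chunks of at most 4 octets.
chunks = list(chunked(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

Here chunks holds three items, b"4:abcd", b"4:efgh", and b"2:ij", and
the first one can be written to the wire before the rest of the body has
even been produced.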

Typing

How many data types for protocol elements do you need?  Do you want to
coerce everything into "text", or do you want to allow binary integers
also?  Do you need multiple sizes of integers?   Unsigned and signed?
Floating point?  Special types for things like dates?


We do not have these problems yet (we only have two or three simple
"types" that cover all current needs), but I suspect we may have to
add more types and suffer the consequences.

(in the case of 822 messages, the vast majority of the attributes
are strings, so expecting everything to be text on the wire is not a
huge problem.  that might or might not be the case for your
protocol.)


Since we decided to start with a text-based approach, we use text for
everything but payload. This is a [performance] problem, but the
current consensus is that readability and ease of ICAP migration are
more important.

Regularity

It's really useful if the decoder (encoder) doesn't need to have specific
knowledge of the particular protocol elements they're reading (writing).


Yes, this is essential for being able to support extensions. I think we
are OK with the current syntax.

Extensibility

Sometimes it's really useful if you can add additional protocol
elements to a record (say to extend a protocol) without resulting in
an incompatible record structure.  (822 headers are extensible in
that you can add new fields without changing the meaning of existing
fields; however, it's hard to add new data elements within a field.)


On a syntax level, our protocol has a similar property: it is easy to
add new fields ('named-parameter' above), but not new elements within
a known field. The design assumption is that each field represents an
atomic "thing" that should not need more data elements. However, I am
sure there will be cases when what was perceived as a complete atom
becomes a collection of smaller particles that need more elements for
completeness.

Opacity

If some of your protocol engines need to pass data from one peer to
another without examining it themselves, it's useful if the protocol
can treat that chunk of data as "opaque" - merely copying it from
one peer to the other without decoding and re-encoding it (and
potentially changing its representation).  Also, if an inner
protocol element is malformed, it's useful if this doesn't break
parsing of the outer protocol element.


I think our current NetString-like approach for data and metadata
passing works well here.

Similarly, if you have protocol elements that are going to be
subjected to digital signatures and/or integrity checks, it's useful
if the application can treat those protocol elements as 'opaque' for
the purpose of signing/verification and not always have to deal with
them in decoded form.  (this has been difficult in 822, since
there's no clear distinction between things that are changeable in
transit and things that are not)


Good point! Signing payloads should be OK. I think we do not have any
variability in the header syntax, except that a value can be quoted even
if it does not need to be. We will have to decide whether that makes
signing headers difficult _if_ we need to sign them. Added to the to-do
list.
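
One possible way to remove that variability before signing, sketched in
Python. The canonical form chosen here (bare whenever the value consists
only of safe octets, quoted otherwise, size written without leading
zeros) is purely an assumption for illustration; nothing like it has
been decided, and canonical_value is a hypothetical name.

```python
import re

BARE = re.compile(rb"[A-Za-z0-9_-]+")
QUOTED = re.compile(rb'"(\d+):(.*)"', re.DOTALL)


def canonical_value(raw: bytes) -> bytes:
    """Map equivalent encodings of a value to a single byte string."""
    m = QUOTED.fullmatch(raw)
    if m:
        size, octets = int(m.group(1)), m.group(2)
        if size != len(octets):
            raise ValueError("size does not match data length")
        if BARE.fullmatch(octets):
            return octets                      # quoting was not needed
        return b'"' + str(size).encode("ascii") + b":" + octets + b'"'
    if BARE.fullmatch(raw):
        return raw                             # already bare
    raise ValueError("not a valid value")
```

Under this rule both 22 and "2:22" canonicalize to the same octets, so a
signature computed over canonical values would not depend on which form
the sender happened to use.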

Mapping between internal and external representation

It is useful if there is a good impedance match between internal
(in-memory) and external (on the wire) representation of data
elements.  For instance, if the presentation layer supports
arbitrary-length integers, this is not easily handled by programming
languages that assume fixed-length integers.  Or if the programming
language insists that character strings be in unicode (so that
comparisons with string and character constants work) but the
presentation layer doesn't specify a charset.

Also, any time there is a need to map complex data structures from
(to) a format where variable-length data elements are located by
sequential scanning (e.g. 822 headers, XDR, BER, etc.) to (from) a
format where variable-length data elements are located by following
pointers (typical in-memory representation), there can be a number
of efficiency losses.

There is also what I would call "reblocking inefficiency" - if you
have to copy or transform data from one layer in order to use it in
another layer, that slows things down.  An example would be having
to copy multiple lines of an 822 header into a string representing a
single field, then you had to parse that field into individual
sub-fields, then to decode individual sub-fields (like an
encoded-word or a domain name encoded per IDN), etc.


While our situation is better than MIME (e.g., we do not have implied
LWS and octet stuffing), we still suffer inefficiency since all
numbers in the protocol are text-based and must be converted to binary
integers before a program can compute with them. Not sure we can do
much better here unless we switch to a binary representation.

Familiarity and mindshare

Any new bit of technology imposes a learning curve, and many people
naturally prefer immediately starting work with familiar tools, to
learning new tools.  (I'm certainly guilty: I still do much of my
programming in C; the computational linear algebra people I work with
still do lots of theirs in FORTRAN.)

822/MIME/HTTP headers are familiar, but they are also fairly
irregular.  I have written a lot of C code to handle them:
routines to parse dates, address lists (with comments), content-type
fields, content-disposition fields, encoded-words, addresses, etc.
IMHO, their apparent simplicity is somewhat of an illusion.
Another problem with having 822 headers appear so simple is that
syntax errors are fairly common.


Our current syntax is very strict, but message headers "look like"
canonical MIME. It remains to be seen whether we struck the right
balance.

From reading recent messages on the list there seems to be a bit of
support for using 822-style headers, presumably due to familiarity
and mindshare considerations.  If the WG does decide to go this
route I encourage it to define a single syntax which is shared by
all fields, and which provides adequate nesting, etc. for your
protocol's needs, while leaving some room for extensibility.
(Offhand I'd recommend something resembling LISP expressions.)


I think we did what you are suggesting, except there is no support for
"nesting". Extensions are supported by adding more 'named-parameters' to
a message. Do you know of any text-based protocol that is not XML-based
but supports nesting?

However you may find that when you actually think about the whole
protocol that the degree of familiarity and mindshare benefit isn't
as much as you previously thought.

And if you want to consider a reasonably-complete non-text alternative,
you might take a look at BLOB:
http://www.cs.utk.edu/~moore/draft-moore-rescap-blob-02.txt


Thanks a lot for the pointer (the URL you really meant was [2])! BLOB is
certainly an interesting animal.  If nothing else, it looks simpler and
more straightforward than the XDR approach, and may become a candidate
if we decide to switch to a binary encoding. Are there any
production-quality protocols built on top of BLOB?

Thank you,

Alex.

[1] http://www.measurement-factory.com/tmp/opes/
[2] http://www.cs.utk.edu/~moore/blob/draft-moore-rescap-blob-02.txt