I've been reading the archives of ietf-822 on the matter of text and
soft-wrapping, and it seems that efforts to deal with the problem are
sacrificing completeness for the sake of readability and even
accommodation of USENET quasi-markup conventions.
To understand completeness, here's an attempt to formalise the problem:
I define _codepoint-stream_ as a sequence of zero or more UCS-4
codepoints excluding 'CR' and 'LF' codepoints. Codepoint-streams
represent text, and line-separators and paragraph-separators are
represented by the unambiguous codepoints 'LINE SEPARATOR' (0x2028) and
'PARAGRAPH SEPARATOR' (0x2029). I'm using this representation solely to
define a domain for the purposes of formalisation: I'm quite aware that
these two codepoints are rarely used in practice.
I define _UCS-2 codepoint-stream_ as a codepoint-stream where every
codepoint is in UCS-2.
I define _octet-stream_ as a sequence of zero or more octets.
I define _7bit octet-stream_ as an octet-stream conforming to the rules
of '7bit' CTE. Likewise _8bit octet-stream_.
I define _protocol_ as a 'source' domain, a 'code' domain, an 'encode'
map from source to code, and a 'decode' map from a subset of code to
source, such that for every member x of source, decode(encode(x)) = x.
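
As a rough illustration of this definition, here is a minimal Python
sketch. The names `Protocol`, `round_trips` and `utf8` are made up for
the example, the domains are left implicit in the types, and the
round-trip law can only be spot-checked on a finite sample, not proved.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

S = TypeVar("S")  # members of the 'source' domain
C = TypeVar("C")  # members of the 'code' domain


@dataclass(frozen=True)
class Protocol(Generic[S, C]):
    """An encode/decode pair; the domains themselves are left implicit."""
    encode: Callable[[S], C]
    decode: Callable[[C], S]  # may raise on codes outside its subset

    def round_trips(self, samples: Iterable[S]) -> bool:
        """Spot-check decode(encode(x)) == x on a finite sample of source."""
        return all(self.decode(self.encode(x)) == x for x in samples)


# Illustrative instance: UTF-8 viewed as a protocol from str to bytes.
utf8 = Protocol(encode=lambda s: s.encode("utf-8"),
                decode=lambda o: o.decode("utf-8"))

# A codepoint-stream using LINE SEPARATOR and PARAGRAPH SEPARATOR.
sample = "first line\u2028second line\u2029next paragraph"
print(utf8.round_trips(["", sample]))
```

Note that decode is only required to be defined on a subset of code:
here `utf8.decode` raises on octet sequences that are not valid UTF-8.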
I define _identity protocol_ as a protocol P where
1. P.source is a subset of P.code;
2. For every x member of P.source, P.encode(x) = x.
A protocol P _represents_ A if A is a subset of P.source.
A protocol P _represents in/as_ A if P.code is a subset of A.
I define _content-transfer-encoding_ as a protocol for representing a
subset of octet-streams as octet-streams.
A content-transfer-encoding is _complete_ if it represents octet-streams.
A content-transfer-encoding is _7bit-safe_ if it represents as 7bit
octet-streams.
I define _charset_ as a protocol for representing a subset of
codepoint-streams as octet-streams.
For instance, I claim that quoted-printable and base64 are each complete
7bit-safe content-transfer-encodings, and that UTF-8 is a charset.
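
The claim about the two encodings can be spot-checked with a small
Python sketch. Everything here is illustrative: `is_7bit` only
approximates the '7bit' rules, and `qp_encode` is a deliberately
conservative quoted-printable variant (it quotes spaces, CR and LF too,
which the encoding permits) written for the example, not taken from any
particular UA.

```python
import base64
import re


def is_7bit(data: bytes) -> bool:
    """Approximate '7bit' rules: no NUL, no octet above 127, CR and LF
    only as CRLF pairs, and at most 998 octets per line."""
    if any(b == 0 or b > 127 for b in data):
        return False
    stripped = data.replace(b"\r\n", b"")
    if b"\r" in stripped or b"\n" in stripped:
        return False
    return all(len(line) <= 998 for line in data.split(b"\r\n"))


def b64_encode(data: bytes) -> bytes:
    raw = base64.b64encode(data)
    # Wrap at 76 characters with CRLF so long inputs stay within line limits.
    return b"\r\n".join(raw[i:i + 76] for i in range(0, len(raw), 76))


def b64_decode(code: bytes) -> bytes:
    # By default b64decode discards non-alphabet octets such as CR and LF.
    return base64.b64decode(code)


def qp_encode(data: bytes) -> bytes:
    """Quote every octet except '!'..'~' (and always quote '=')."""
    out, line = [], b""
    for b in data:
        token = bytes([b]) if 33 <= b <= 126 and b != 61 else b"=%02X" % b
        if len(line) + len(token) > 75:  # leave room for the soft-break '='
            out.append(line + b"=")
            line = b""
        line += token
    out.append(line)
    return b"\r\n".join(out)


def qp_decode(code: bytes) -> bytes:
    body = code.replace(b"=\r\n", b"")  # drop soft line breaks
    return re.sub(rb"=([0-9A-F]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), body)


every_octet = bytes(range(256)) * 3
for enc, dec in ((b64_encode, b64_decode), (qp_encode, qp_decode)):
    code = enc(every_octet)
    assert dec(code) == every_octet  # round trip holds on this sample
    assert is_7bit(code)             # 7bit-safe on this sample
```

A passing check on one sample is of course not a proof of completeness,
but since each octet is encoded independently of its neighbours in
`qp_encode`, covering all 256 octet values is fairly persuasive.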
I define _subprotocol of P_ as a protocol S where:
1. S.code = P.code;
2. S.source is a subset of P.source;
3. For every x member of S.source, S.encode(x) = P.encode(x).
I define _A over B_ for protocols A and B as the protocol C defined by:
1. C.source = A.source
2. C.code = B.code
3. C.encode(x) = B.encode(A.encode(x))
4. C.decode(x) = A.decode(B.decode(x))
This is only meaningful if A.code is a subset of B.source.
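
The 'over' construction is ordinary function composition, which a short
Python sketch can show. The names `Protocol`, `over`, `utf8` and `b64`
are illustrative, and the subset condition is not checked by the code.

```python
import base64
from typing import Callable, NamedTuple


class Protocol(NamedTuple):
    encode: Callable
    decode: Callable


def over(a: Protocol, b: Protocol) -> Protocol:
    """'A over B': encode with A then B; decode with B then A.
    Only meaningful if A.code is a subset of B.source (not checked)."""
    return Protocol(encode=lambda x: b.encode(a.encode(x)),
                    decode=lambda y: a.decode(b.decode(y)))


# A charset (codepoint-streams to octets) and a CTE (octets to octets).
utf8 = Protocol(lambda s: s.encode("utf-8"), lambda o: o.decode("utf-8"))
b64 = Protocol(base64.b64encode, base64.b64decode)

utf8_over_b64 = over(utf8, b64)

text = "na\u00efve paragraph\u2029"
code = utf8_over_b64.encode(text)
assert all(octet < 128 for octet in code)   # the composite's code is 7-bit
assert utf8_over_b64.decode(code) == text   # and the round trip holds
```

Here UTF-8's code domain (octet-streams) is exactly base64's source
domain, so the composite is well-defined.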
What protocols are commonly used by UAs? Bear in mind that, as far as I
know, no UA actually converts internally to codepoint-streams as I've
defined them. But if we use codepoint-streams to represent text that is
understood and correctly displayed, most UAs effectively implement
subprotocols of the protocols of various charsets over various
content-transfer-encodings.
Now the problem, as I understand it, is to come up with methods of
representing text in RFC822 mail messages in a way that's as readable as
possible. These informal 'methods' each involve formal protocols for
representing some subset of codepoint-streams as 7bit octet-streams
labelled with MIME-headers. In particular, I don't believe quasi-markup
such as _italic_, *bold*, >blockquote etc. should get any special
treatment. I think it's enough for MUAs to insert whatever represents
hard line-separators between lines when performing >-quoting.
Also, it's sufficient for UAs to effectively implement reasonable
subprotocols of specified protocols, since UAs are not always going to be
able to correctly display every defined codepoint of UCS-4.
Desirable fuzzy attributes for such methods (with protocol P) include, in
no particular order:
- 'completeness': P.source is the full set of codepoint-streams;
- 'backward-compatible handling' (by existing UAs): for instance, while
it's my opinion that UAs should display unrecognised text/* as text/plain
in preference to saving it to some file, I don't believe all existing UAs
actually do this;
- 'backward-compatible readability': specifically, for a typical text
string x, and for any 'text/plain' protocol T compatible with US-ASCII,
T.decode(P.encode(x)) is reasonably readable, given that very few UAs
understand LINE SEPARATOR and PARAGRAPH SEPARATOR;
- 'applicability': certain existing commonly-used protocols are
subprotocols of P, most obviously those used by plain-text
proportional-font soft-wrapping text-editors (e.g. SimpleText);
- 'router-stability': mail-routers may filter the mail they carry in
various ways. Ideally, we want to be sure that for every x member of
P.source, P.decode(filter(P.encode(x))) = x. At the very least,
P.decode(filter(P.encode(x))) should reasonably resemble x;
- 'simplicity' of design.
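
To make the router-stability condition concrete, here is a minimal
Python sketch. The filter `lossy_router` is hypothetical: it stands in
for a router that clears the high bit of every octet and strips trailing
spaces and tabs from each line. Real routers may do other things.

```python
import base64


def lossy_router(message: bytes) -> bytes:
    """Hypothetical filter: clear the high bit of each octet, then strip
    trailing spaces and tabs from every CRLF-delimited line."""
    lines = bytes(b & 0x7F for b in message).split(b"\r\n")
    return b"\r\n".join(line.rstrip(b" \t") for line in lines)


def survives(encode, decode, x: str) -> bool:
    """True if decode(filter(encode(x))) == x for this particular x."""
    try:
        return decode(lossy_router(encode(x))) == x
    except ValueError:  # covers UnicodeDecodeError and base64 errors
        return False


def raw_encode(s: str) -> bytes:
    return s.encode("utf-8")           # UTF-8 sent as '8bit', unprotected


def raw_decode(o: bytes) -> str:
    return o.decode("utf-8")


def b64_encode(s: str) -> bytes:
    return base64.b64encode(s.encode("utf-8"))


def b64_decode(o: bytes) -> str:
    return base64.b64decode(o).decode("utf-8")


x = "caf\u00e9 \r\nbar"                # non-ASCII plus a trailing space
print(survives(raw_encode, raw_decode, x))
print(survives(b64_encode, b64_decode, x))
```

Under this particular filter the unprotected form is damaged while the
base64 form passes through unchanged, at the obvious cost of
backward-compatible readability.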
Now not all MIME UAs are MUAs: MIME is used not only by mail and news,
but also by HTTP and even file-systems (e.g. BeFS). And MIME types and
options can be used in a number of different scenarios.
For general email with no explicit prior content-negotiation, I'd order
the desirability of attributes this way:
1. router-stability
2. backward-compatible handling
3. backward-compatible readability
4. completeness
5. applicability
6. simplicity
But for use in typed file-systems and anything involving
content-negotiation, I see a desire for a method with priorities more
like this:
1. completeness
2. applicability
3. backward-compatible readability
4. simplicity
5. router-stability
6. backward-compatible handling
Now I've looked at both draft-newman-mime-textpara-00 and
draft-gellens-format-00, which I assume are the latest versions.
draft-newman-mime-textpara defines a MIME type 'text/paragraph' which can
be used over suitable CTEs just like text/plain. I consider it a way of
varying charsets such that CRLF bytes are interpreted as
paragraph-separators rather than line-separators. In my opinion it's
ideal for use in typed file-systems, as well as any communication
featuring content-negotiation:
1. completeness: yes
2. applicability: yes
3. backward-compatible readability: as per qu=6Fted-printable
4. simplicity: yes
5. router-stability: yes, with quoted-printable
6. backward-compatible handling: probably not
...and I think it was premature for Chris Newman to withdraw it.
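
The difference in interpretation described above can be sketched in
Python. This models only the reading given here (CRLF as
paragraph-separator rather than line-separator), not the text of the
draft itself. The function names are made up, UTF-8 is assumed as the
underlying charset, and the handling of LINE SEPARATOR in
`encode_text_paragraph` is an assumption for the example.

```python
LS, PS = "\u2028", "\u2029"  # LINE SEPARATOR, PARAGRAPH SEPARATOR


def decode_text_plain(octets: bytes) -> str:
    """CRLF read as a line-separator (the usual text/plain reading)."""
    return octets.decode("utf-8").replace("\r\n", LS)


def decode_text_paragraph(octets: bytes) -> str:
    """CRLF read as a paragraph-separator instead."""
    return octets.decode("utf-8").replace("\r\n", PS)


def encode_text_paragraph(stream: str) -> bytes:
    """Assumed inverse: PARAGRAPH SEPARATOR becomes CRLF, while LINE
    SEPARATOR stays a literal U+2028, which UTF-8 can carry directly."""
    if "\r" in stream or "\n" in stream:
        raise ValueError("codepoint-streams exclude CR and LF")
    return stream.replace(PS, "\r\n").encode("utf-8")


stream = ("A long paragraph that the viewer may soft-wrap." + PS +
          "An address:" + LS + "Seattle WA")
octets = encode_text_paragraph(stream)

assert decode_text_paragraph(octets) == stream  # round trip holds
assert decode_text_plain(octets) != stream      # same octets, other reading
```

Under these assumptions every codepoint-stream has an encoding, which
matches the "completeness: yes" entry above; a 7bit-safe
content-transfer-encoding would still be needed underneath for mail.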
As for draft-gellens-format, I believe its attempts to handle >-quoted
text compromise it, mainly because of the various ways people quote text.
I think if we go that route, we should specify a complete markup language
of *bold*, _italic_, - bullets, etc., with strict semantics for what any
given sequence of octets means.
But if >-quote handling were stripped out, draft-gellens-format might
work for general email:
1. router-stability: not perfect
2. backward-compatible handling: yes
3. backward-compatible readability: yes
4. completeness: not as far as I can tell
5. applicability: no
6. simplicity: fair
--
Ashley Yakeley, Seattle WA