Text

I've been reading the archives of ietf-822 on the matter of text and 
soft-wrapping, and I see efforts to deal with the problem seem to be 
sacrificing completeness for the sake of readability and even 
accommodation of USENET quasi-markup conventions.

To understand completeness, here's an attempt to formalise the problem:

I define _codepoint-stream_ as a sequence of zero or more UCS-4 
codepoints excluding 'CR' and 'LF' codepoints. Codepoint-streams 
represent text, and line-separators and paragraph-separators are 
represented by the unambiguous codepoints 'LINE SEPARATOR' (0x2028) and 
'PARAGRAPH SEPARATOR' (0x2029). I'm using this representation solely to 
define a domain for the purposes of formalisation: I'm quite aware that 
these two codepoints are rarely used in practice.

I define _UCS-2 codepoint-stream_ as a codepoint-stream where every 
codepoint is in UCS-2.

I define _octet-stream_ as a sequence of zero or more octets.

I define _7bit octet-stream_ as an octet-stream conforming to the rules 
of '7bit' CTE. Likewise _8bit octet-stream_.

I define _protocol_ as a 'source' domain, a 'code' domain, an 'encode' 
map from source to code, and a 'decode' map from a subset of code to 
source, such that for every member x of source, decode(encode(x)) = x.
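
As a minimal sketch of this definition in Python (not part of the
formalisation itself): the names Protocol and round_trips are
hypothetical, the domains are left implicit rather than modelled as
sets, and the round-trip law is only checked on a finite sample.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

S = TypeVar("S")  # members of the 'source' domain
C = TypeVar("C")  # members of the 'code' domain


@dataclass(frozen=True)
class Protocol(Generic[S, C]):
    """Hypothetical model of a protocol: an encode map and a decode map."""
    encode: Callable[[S], C]
    decode: Callable[[C], S]


def round_trips(p: Protocol, samples: Iterable) -> bool:
    """Check decode(encode(x)) == x for each sample drawn from the source."""
    return all(p.decode(p.encode(x)) == x for x in samples)


# UTF-8 viewed as a protocol from codepoint-streams (str) to octets (bytes).
utf8 = Protocol(encode=lambda s: s.encode("utf-8"),
                decode=lambda b: b.decode("utf-8"))

samples = ["", "plain text", "line one\u2028line two\u2029next paragraph"]
print(round_trips(utf8, samples))
```

A sampled check like this can refute the law but never prove it; it is
only meant to make the shape of the definition concrete.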

I define _identity protocol_ as a protocol P where 
 1. P.source is a subset of P.code;
 2. For every x member of P.source, P.encode(x) = x.

A protocol P _represents_ A if A is a subset of P.source.

A protocol P _represents in/as_ A if P.code is a subset of A.

I define _content-transfer-encoding_ as a protocol for representing a 
subset of octet-streams as octet-streams.

A content-transfer-encoding is _complete_ if it represents octet-streams.

A content-transfer-encoding is _7bit-safe_ if it represents as 7bit 
octet-streams.

I define _charset_ as a protocol for representing a subset of 
codepoint-streams as octet-streams.

For instance, I claim that quoted-printable and base64 are each complete 
7bit-safe content-transfer-encodings, and that UTF-8 is a charset.
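
The claim about quoted-printable and base64 can be spot-checked with
Python's standard library. This is a sketch under assumptions:
looks_7bit is a hypothetical helper that tests only part of the '7bit'
rules, and the sample octet-streams are arbitrary. Quoted-printable is
run in binary mode (istext=False) so that CR and LF octets are encoded
too.

```python
import base64
import binascii


def looks_7bit(octets: bytes) -> bool:
    """Partial check of the '7bit' rules: no NUL, no octet above 127,
    and no line longer than 998 octets. CR/LF pairing is not checked,
    because the standard library may emit bare LF rather than CRLF."""
    if any(o == 0 or o > 127 for o in octets):
        return False
    return all(len(line) <= 998 for line in octets.split(b"\n"))


# Arbitrary octet-streams: every octet value, mixed line endings,
# and whitespace at the end of a line.
samples = [b"", bytes(range(256)) * 4, b"a\r\nb\nc\r", b"trailing space \n"]

for x in samples:
    qp = binascii.b2a_qp(x, istext=False)  # binary mode: CR and LF are encoded
    b64 = base64.encodebytes(x)
    assert binascii.a2b_qp(qp) == x and looks_7bit(qp)
    assert base64.decodebytes(b64) == x and looks_7bit(b64)

print("all samples round-trip through both encodings")
```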

I define _subprotocol of P_ as a protocol S where:
 1. S.code = P.code;
 2. S.source is a subset of P.source;
 3. For every x member of S.source, S.encode(x) = P.encode(x).
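
For example, US-ASCII can be viewed as a subprotocol of UTF-8, provided
both are taken to have the full set of octet-streams as their code
domain (an assumption made here for illustration). The sketch below
checks condition 3 on a few samples; agrees_on is a hypothetical helper.

```python
def agrees_on(sub_encode, encode, sub_source_samples) -> bool:
    """Condition 3 on a finite sample: both encode maps give the same result."""
    return all(sub_encode(x) == encode(x) for x in sub_source_samples)


def ascii_encode(s: str) -> bytes:
    return s.encode("ascii")  # raises UnicodeEncodeError outside US-ASCII


def utf8_encode(s: str) -> bytes:
    return s.encode("utf-8")


# Samples drawn from the smaller source domain (US-ASCII text only).
ascii_samples = ["", "hello", "tab\tand ~tilde"]
print(agrees_on(ascii_encode, utf8_encode, ascii_samples))
```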

I define _A over B_ for protocols A and B as the protocol C defined by:
 1. C.source = A.source
 2. C.code = B.code
 3. C.encode(x) = B.encode(A.encode(x))
 4. C.decode(x) = A.decode(B.decode(x))
This is only meaningful if A.code is a subset of B.source.
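
A minimal Python sketch of 'A over B', using UTF-8 over base64 as the
example. The names Protocol and over are hypothetical, and
base64.b64encode is used for brevity even though it emits no line
breaks, so it is a simplification of the MIME base64 CTE.

```python
import base64
from typing import Callable, NamedTuple


class Protocol(NamedTuple):
    encode: Callable
    decode: Callable


def over(a: Protocol, b: Protocol) -> Protocol:
    """Return 'a over b': encode with a then b; decode with b then a."""
    return Protocol(
        encode=lambda x: b.encode(a.encode(x)),
        decode=lambda y: a.decode(b.decode(y)),
    )


utf8 = Protocol(lambda s: s.encode("utf-8"), lambda o: o.decode("utf-8"))
b64 = Protocol(base64.b64encode, base64.b64decode)

utf8_over_b64 = over(utf8, b64)
text = "caf\u00e9\u2029second paragraph"
wire = utf8_over_b64.encode(text)
print(wire)
print(utf8_over_b64.decode(wire) == text)
```

The composition is well-defined here because UTF-8's code domain
(octet-streams) is exactly base64's source domain.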

What protocols are commonly used by UAs? Bear in mind that as far as I 
know no UA actually converts internally to codepoint-streams as I've 
defined them. But if we use codepoint-streams to represent text 
understood and correctly displayed, most UAs effectively implement 
subprotocols of the protocols of various charsets over various 
content-transfer-encodings.

Now the problem, as I understand it, is to come up with methods of 
representing text in RFC822 mail messages in a way that's as readable as 
possible. These informal 'methods' each involve formal protocols for 
representing some subset of codepoint-streams as 7bit octet-streams 
labelled with MIME-headers: specifically, I don't believe quasi-markup 
such as _italic_, *bold*, >blockquote etc. should get any special 
treatment. I think it's enough for MUAs to insert whatever represents 
hard line-separators between lines when performing >-quoting.

Also, it's sufficient for UAs to effectively implement reasonable 
subprotocols of specified protocols, since UAs are not always going to be 
able to correctly display every defined codepoint of UCS-4.

Desirable fuzzy attributes for such methods (with protocol P) include, in 
no particular order:

- 'Completeness': P.source is the full set of codepoint-streams;

- 'backward-compatible handling' (by existing UAs): for instance, while 
it's my opinion that UAs should display unrecognised text/* as text/plain 
in preference to saving it to some file, I don't believe all existing UAs 
actually do this;

- 'backward-compatible readability': specifically, for a typical text 
string x, and for any 'text/plain' protocol T compatible with US-ASCII, 
T.decode(P.encode(x)) is reasonably readable, given that very few UAs 
understand LINE SEPARATOR and PARAGRAPH SEPARATOR;

- 'applicability': certain existing commonly-used protocols are 
subprotocols of P, most obviously those used by plain-text 
proportional-font soft-wrapping text-editors (e.g. SimpleText);

- 'router-stability': mail-routers are not guaranteed not to filter their 
mail in various ways -- ideally, we want to be sure that for every x 
member of P.source, P.decode(filter(P.encode(x))) = x. At the very least, 
P.decode(filter(P.encode(x))) should reasonably resemble x.

- 'simplicity' of design.
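
The router-stability condition above can be sketched in Python. The
filter here is hypothetical (one that strips whitespace from the end of
each line); it stands in for whatever a real mail-router might do and
is not taken from any particular implementation.

```python
import quopri


def strip_trailing_whitespace(octets: bytes) -> bytes:
    """Hypothetical router filter: drop spaces and tabs at the end of each line."""
    return b"\n".join(line.rstrip(b" \t") for line in octets.split(b"\n"))


def survives(encode, decode, route, x: bytes) -> bool:
    """True if decode(route(encode(x))) == x."""
    return decode(route(encode(x))) == x


text = b"soft-wrapped line \ncontinues here  \nlast line"

# Identity protocol: the octets go out exactly as they are.
identity_ok = survives(lambda o: o, lambda o: o, strip_trailing_whitespace, text)

# Quoted-printable protects whitespace at the end of a line as =20.
qp_ok = survives(quopri.encodestring, quopri.decodestring,
                 strip_trailing_whitespace, text)

print("identity survives:", identity_ok)
print("quoted-printable survives:", qp_ok)
```

Under this one filter the identity protocol loses the trailing spaces
while quoted-printable keeps them; other filters would need their own
checks.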

Now not all MIME UAs are MUAs: MIME is used not only by mail and news, 
but also by HTTP and even file-systems (e.g. BeFS). And MIME types and 
options can be used in a number of different scenarios.

For general email with no explicit prior content-negotiation, I'd order 
the desirability of attributes this way:
1. router-stability
2. backward-compatible handling
3. backward-compatible readability
4. completeness
5. applicability
6. simplicity

But for use in typed file-systems and anything involving 
content-negotiation, I see a desire for a method with priorities more 
like this:
1. completeness
2. applicability
3. backward-compatible readability
4. simplicity
5. router-stability
6. backward-compatible handling

Now I've looked at both draft-newman-mime-textpara-00 and 
draft-gellens-format-00, which I assume are the latest versions. 
draft-newman-mime-textpara defines a MIME type 'text/paragraph' which can 
be used over suitable CTEs just like text/plain. I consider it a way of 
varying charsets such that CRLF bytes are interpreted as 
paragraph-separators rather than line-separators. In my opinion it's 
ideal for use in typed file-systems, as well as any communication 
featuring content-negotiation:

1. completeness: yes
2. applicability: yes
3. backward-compatible readability: as per qu=6Fted-printable
4. simplicity: yes
5. router-stability: yes, with quoted-printable
6. backward-compatible handling: probably not

...and I think it was premature for Chris Newman to withdraw it.
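
The reading of text/paragraph given above (CRLF as paragraph-separator
rather than line-separator) can be sketched as follows. This is an
illustration of that characterisation only, not of the draft's actual
rules: decode_as and encode_as are hypothetical names, and the sketch
does not say how the other separator would be carried.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def decode_as(octets: bytes, crlf_means: str, charset: str = "utf-8") -> str:
    """Decode octets to a codepoint-stream, replacing each CRLF with the
    given separator, so that no CRLF pair remains in the result."""
    return octets.decode(charset).replace("\r\n", crlf_means)


def encode_as(text: str, crlf_means: str, charset: str = "utf-8") -> bytes:
    """Inverse of decode_as for streams that use only that one separator."""
    return text.replace(crlf_means, "\r\n").encode(charset)


wire = b"First paragraph, as long as it likes.\r\nSecond paragraph."

as_plain = decode_as(wire, LINE_SEPARATOR)           # text/plain-style reading
as_paragraph = decode_as(wire, PARAGRAPH_SEPARATOR)  # text/paragraph-style reading

print(as_plain.count(LINE_SEPARATOR), as_paragraph.count(PARAGRAPH_SEPARATOR))
print(encode_as(as_paragraph, PARAGRAPH_SEPARATOR) == wire)
```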

As for draft-gellens-format, I believe its attempts to handle >-quoted 
text compromise it, mainly because of the various ways people quote text. 
I think if we go that route, we should specify a complete markup language 
of *bold*, _italic_, - bullets, etc., with strict semantics for what any 
given sequence of octets means.

But if >-quote handling were stripped out, draft-gellens-format might 
work for general email:

1. router-stability: not perfect
2. backward-compatible handling: yes
3. backward-compatible readability: yes
4. completeness: not as far as I can tell.
5. applicability: no
6. simplicity: fair


-- 
Ashley Yakeley, Seattle WA