ietf-822

Re: application/postscript newlines

1994-05-17 15:48:19
Application/postscript is, by definition, not text.  On the other hand, it
clearly IS pretty much text encoded on most systems.

On most systems perhaps, but not on all. PostScript contains facilities that
may require pure binary material in PostScript files. These facilities are not
often used because (you guessed it) they sometimes get messed up by translation
from one system to the next. This problem is not restricted to email -- all too
often binary PostScript is corrupted before it's even passed to the user agent.

This led to a minor question about newlines.  Different platforms use
different conventions for newline,

There are two cases, one where the PostScript consists of text-like material
and the other where the PostScript contains pure binary. In the former case the
PostScript language defines CR, LF, and CR LF as being equivalent newline
sequences (Red/White book, page 27). With this level of flexibility local
newline conventions are generally not a problem, since almost all systems use
CR, LF, CR LF, or some local convention that can easily be converted into one
of these.
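
Since the language treats CR, LF, and CR LF as equivalent, a text-only
(Clean7Bit/Clean8Bit) PostScript file can be normalized to any one local
convention. A minimal sketch of that idea (not from this post; the
function name is mine):

```python
import re

# CR LF must match before bare CR so it isn't counted as two newlines.
_NEWLINE = re.compile(rb"\r\n|\r|\n")

def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Rewrite the three equivalent PostScript newline sequences
    (CR, LF, CR LF) as a single chosen convention. Only safe for
    documents that contain no binary sections."""
    return _NEWLINE.sub(newline, data)
```

The alternation order in the regular expression is the whole trick:
matching CR LF as one unit is what keeps a two-character newline from
doubling.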

so how is the lowly app/post-generating UA to know what to do?
Must they be a full-fledged ps parser?

Assuming you even bother to support binary PostScript (most systems and
Document Managers don't), you aren't supposed to need a full-fledged parser to
do it. In point of fact, a full-fledged parser wouldn't help, since determining
whether or not binary material is present can easily be shown to be equivalent
to the halting problem.

The Document Structuring Conventions (DSC) take care of this for you.
Specifically, there's a header comment that's intended to declare what
sort of material is contained in a given document:

  %%DocumentData: Clean7Bit | Clean8Bit | Binary

(Red/White book, page 642.) Documents containing binary are at a minimum
supposed to be labelled as such, since if they aren't PostScript Document
Managers won't know how to process them properly. (Aside: If you're thinking
that MIME's content-transfer-encoding model is similar, well, guess where we
got the idea?) In particular, if a document is labelled as one of the clean
types of data, all CRs and LFs can be treated as generic PostScript newlines.
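
Reading that declaration really is just header-comment scanning, on a
par with parsing RFC822 headers. A sketch (names and the Clean7Bit
fallback for unlabelled documents are my reading of the DSC, not text
from this post):

```python
def document_data(ps: bytes) -> str:
    """Return the %%DocumentData value declared in the DSC header
    comments: Clean7Bit, Clean8Bit, or Binary. Header comments end at
    %%EndComments, so any binary material comes later and the scan
    below never reaches it."""
    for line in ps.splitlines():
        if line.startswith(b"%%DocumentData:"):
            return line.split(b":", 1)[1].strip().decode("ascii")
        if line.startswith(b"%%EndComments"):
            break
    # Assumed default when the comment is absent: Clean7Bit.
    return "Clean7Bit"
```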

While it is acceptable in most cases to treat a document containing binary as
entirely binary, mixed mode processing may be necessary in some cases. In other
words, there needs to be a way to tell what parts of the document are binary
data and what parts aren't. This is done in the obvious way with the
%%BeginData and %%EndData header comments.
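
Mixed-mode processing is workable because, as I recall the DSC, the
%%BeginData comment carries an explicit count (e.g. "%%BeginData:
135174 Binary Bytes"), so a reader can skip the payload without
interpreting it. A sketch of splitting on that basis (names mine;
it handles only the byte-counted form, not counts given in Lines):

```python
def split_sections(ps: bytes):
    """Yield ("text", chunk) and ("binary", chunk) pairs, treating
    everything outside %%BeginData payloads as text. The %%EndData
    line itself falls into the following text chunk."""
    pos = 0
    while True:
        start = ps.find(b"%%BeginData:", pos)
        if start < 0:
            yield ("text", ps[pos:])
            return
        eol = ps.index(b"\n", start) + 1       # end of %%BeginData line
        yield ("text", ps[pos:eol])
        count = int(ps[start:eol].split()[1])  # the declared byte count
        yield ("binary", ps[eol:eol + count])
        pos = eol + count
```

A newline-normalizing agent would then rewrite only the "text" chunks
and pass "binary" chunks through untouched.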

Proper use of these DSC comments makes it possible to Do The Right Thing without
to know anything about the PostScript language itself. Parsing header comments
is roughly comparable to parsing RFC822 headers. In other words, no rocket
science is involved.

Do we require conversion to a canonical newline?

No. The safest course is to always encode in base64 or in quoted-printable
without ever using quoted-printable's canonical newlines. This isn't especially
friendly when sending to non-MIME systems, however, so many sites opt to treat
PostScript as text in all cases. Still others may opt to base their decision on
DSC information, or make it a user option, or whatever.
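
One of the DSC-based policies just mentioned might look like this
sketch (the mapping is an illustrative assumption, not a
recommendation from this post):

```python
def choose_cte(document_data: str) -> str:
    """Pick a MIME content-transfer-encoding from the %%DocumentData
    declaration. Anything unrecognized gets the safest treatment."""
    return {
        "Clean7Bit": "7bit",
        "Clean8Bit": "quoted-printable",
        "Binary": "base64",
    }.get(document_data, "base64")
```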

I believe this directly affects interoperability of app/ps

Actually, it really isn't a big deal at all. There are many problems with
PostScript handling that occur far more often and cause a lot more trouble than
this does. (I'm speaking as someone who supports a popular product that
processes lots of emailed PostScript.)

Here's a good example of a serious gotcha. Large PostScript documents exist
that contain few if any line breaks at all anywhere. You end up with a single
line of text that may be many megabytes long. This is perfectly legal
PostScript, but it causes problems with all sorts of spooling software.
(Paradoxically, VMS, with its 65K max record limits, usually handles this a lot
better than other systems with no inherent record length limitations. Go
figure!) 

The obvious way to handle this sort of stuff is to turn some of the spaces in
the material into newlines. (We provide such a program as part of our software,
as do many other vendors.) This is possible because in addition to defining CR,
LF, and CR LF all to be equivalent newline sequences, PostScript also has the
notion of linear whitespace. Newlines are generally equivalent to linear
whitespace, but (and this is the kicker) not always. Specifically, there are
certain operators in PostScript that are sensitive to the use of newlines as
opposed to other sorts of linear whitespace. This means that a simple lexer
cannot determine what whitespace is safe and what isn't. And unfortunately, the
DSC doesn't provide quite the level of detail necessary to figure this stuff
out.
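
For illustration only, here is a deliberately naive sketch of the
space-to-newline transformation (name and length limit are mine). Real
implementations need the heuristics discussed above, precisely because
a few PostScript operators distinguish newlines from other linear
whitespace and a blind lexical pass can break them:

```python
def fold_long_lines(line: bytes, limit: int = 255) -> list[bytes]:
    """Split an overlong line by converting the last space before the
    limit into a line break. NOT safe in general: it ignores strings,
    comments, and newline-sensitive operators."""
    out = []
    while len(line) > limit:
        cut = line.rfind(b" ", 0, limit)
        if cut <= 0:
            break  # no space available; leave the stretch unbroken
        out.append(line[:cut])
        line = line[cut + 1:]
    out.append(line)
    return out
```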

The result is that programs which implement these sorts of transformations end
up scanning the PostScript and using complex heuristic rules. And this usually
works. But not always. Again, this is the halting problem in disguise, so there
simply isn't any completely general solution to this real-world problem.

                                Ned