ietf

Re: More liberal draft formatting standards required

2009-06-30 04:48:45
On 29 jun 2009, at 23:32, Andrew Sullivan wrote:

> On Mon, Jun 29, 2009 at 01:37:31PM -0700, David Morris wrote:

> > 1000 years from now, it will certainly be easier to recover content from
> > an ascii 'file' than an html, xml, or pdf 'file' created now. It is
> > probably an unjustified assumption that 'software' available 1000 years
> > from now will be able to render today's html, xml, or pdf.

> I am not sure I agree with this assertion.  In 1000 years, I have
> every hope that some versions of PDF will be widely usable; but the
> currently prescribed format of electronic versions of RFCs I think is
> already obsolete, and will be unreadable in 1000 years.

My original message was about problems _creating_ a certain format. But this is of course related to _reading_ a format.

Don't underestimate how quickly formats go away. Has anyone here tried to open a WordPerfect document recently?

Assuming that in 1000 years people still understand English and can still read Latin script, it's trivial to decode a plain ASCII file. HTML is only slightly more difficult: just remove everything between < and > and you have something that's mostly readable. With XML you should be able to recover most of the text, but I'm pretty sure in 1000 years nobody is going to understand what <rfc ipr="trust200902" category="exp"> means. I'm not exactly sure what the insides of a PDF file look like, but I'll go out on a limb and say that it won't be possible to get anything useful out of a PDF file without software that understands PDF. I don't think that will be around in 1000 years. However, because PDF unambiguously maps to an image, it should be possible to convert from PDF to other image formats without losing any content. (And then a decade or two later run OCR on that to retrieve the original text...)
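
To make the "remove everything between < and >" idea concrete, here is a minimal Python sketch. The function name strip_tags and the sample strings are made up for illustration. It is deliberately naive: it does not decode entities such as &amp;, and it would be confused by a literal > inside an attribute value, so treat it as a rough recovery tool rather than a parser.

```python
import re

# Matches '<', then any run of characters that are not '>', then '>'.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(markup: str) -> str:
    """Drop everything between '<' and '>' and return what is left."""
    return TAG_PATTERN.sub("", markup)


# Hypothetical sample input, not taken from any real RFC.
html_sample = "<h1>Abstract</h1><p>This memo is <em>experimental</em>.</p>"
print(strip_tags(html_sample))  # prints "AbstractThis memo is experimental."

# The same trick on the XML example: the text survives, but the meaning
# carried in the attributes (ipr, category) is thrown away with the tag.
xml_sample = '<rfc ipr="trust200902" category="exp">Some text</rfc>'
print(strip_tags(xml_sample))  # prints "Some text"
```

Note that words on either side of a removed tag run together ("AbstractThis"), which is the "mostly readable" part: the text is recovered, the structure is not.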

So I'd say that if we want to change our archival format, a carefully documented subset of HTML would probably be a good choice. It is easy to display on a variety of screen sizes, prints reasonably without effort, and can be made to print very well with additional tools. It has a lot more structure than flat text, so scraping tools could potentially be more effective than today's, especially considering that old RFCs weren't formatted as rigorously as recent ones.
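
A documented subset is only useful if documents can be checked against it. The following Python sketch shows one way such a check might look, using the standard library's html.parser module. The ALLOWED_TAGS set is purely hypothetical; which elements belong in the subset is exactly what would have to be decided and written down.

```python
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; not a proposal for the actual subset.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "p", "pre",
                "ul", "ol", "li", "a"}


class SubsetChecker(HTMLParser):
    """Collects the names of start tags that fall outside ALLOWED_TAGS."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag not in ALLOWED_TAGS:
            self.violations.append(tag)


def find_disallowed_tags(document: str) -> list:
    """Return the disallowed tag names in the order they appear."""
    checker = SubsetChecker()
    checker.feed(document)
    checker.close()
    return checker.violations


print(find_disallowed_tags("<p>Fine.</p>"))                        # prints []
print(find_disallowed_tags('<p><font size="2">Messy</font></p>'))  # prints ['font']
```

A real check would presumably also restrict attributes and nesting, but even a tag allowlist like this would reject much of what word processors emit.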

PDF would be a disaster because it's not compatible with text-only displays, not compatible with any scraping tools, can't be viewed without non-trivial software and doesn't scale to display size.

> ASCII, on the other hand, doesn't meet any of the librarians'
> criteria, and never did.  It is too restrictive even to deal with
> non-American titles in the library catalogue (e.g. books priced in
> pounds sterling), never mind to deal with non-English titles.

Last time I checked, RFCs were free and in English...

Consider this: even if we could use non-Latin scripts for author names in RFCs, would that be a good idea?

Back to my original problem: although there are tons of modern tools that create HTML, they usually create completely unstructured and very messy HTML that would be unusable for archiving or pretty much anything else. With a modern word processor you can basically create an unstructured and unformatted ASCII file without even line and page breaks, or create something highly structured that requires conversion tools to create something that looks like draft format.

We've really painted ourselves into a corner here.