ietf-xml-mime

Re: Requesting a revision of RFC3023

2003-09-19 12:22:11

Bjoern Hoehrmann scripsit:

> Impractical. File systems commonly do not support encoding such
> information

In fact most file systems support extended attributes today.

> [A]nd even if they did, this would cause interoperability
> problems with file systems and protocols which do not provide such
> means. If you transfer the document using FTP to your web server the
> information is lost and the document will break.

No worse than today's situation, and FTP could be enhanced or abandoned
in favor of HTTP PUT.

> Further, file system
> information is typically almost invisible to authors and would thus
> have the same problem as the charset parameter. If I edit a document
> in an XML unaware text editor, change the encoding declaration and
> some text nodes and save the file, file system and encoding declaration
> are likely to contradict each other and the document would break.

No worse than today's situation.
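The in-band mechanism in question is the XML encoding declaration; for example, a file that begins

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
```

and is then re-saved by an encoding-unaware editor in, say, UTF-8 carries a declaration that no longer matches its bytes -- exactly the breakage described, with or without file-system metadata.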

> You are basically suggesting to change all file systems and software
> that interacts with it and expect everyone to upgrade the software and
> the file system information of all documents.

*You* are suggesting that every text file format that has ever existed --
innumerable assembly languages, C, C++, Java, Fortran, Lisp, Scheme, Prolog,
Perl, Python, Smalltalk, awk, sed, ... sh, csh, bash, zsh, ... mail archives,
news archives, ... TeX, LaTeX, nroff/troff, ... -- be revised to find someplace
news archives, ... Tex, LaTex, nroff/troff, ... -- be revised to find someplace
to stuff a charset indication, and then that every one of the billions of
documents in each of those formats be changed to carry that information.

> If an applicable solution
> may go this far, you should rather suggest to outlaw all non-Unicode
> encodings, much simpler, more consistent and more interoperable. This
> would also work if the text is not stored in the file system but rather
> generated by software, something your solution does not consider.

Indeed, which is why Plan 9 sensibly makes everything UTF-8 and Windows NT/2K/XP
makes most things UTF-16, at least under the covers.

>> Otherwise, generic text-processing tools become impossible,

> They are impossible today.

The impossible does not happen, but I usefully use generic text processing
tools every hour of every working day.

> They are not trying to read the format, they are trying to read byte
> streams as character streams. If they are trying to read the format,
> they have to support that format anyway, including mechanisms to
> determine the character encoding.

Not so.  If I want to process a Fortran 77 program as text (to find the
identifiers which occur only once, e.g.) then I can use generic tools
(tr, sort, uniq) and supply the character encoding out of band.  This is
annoying, but it works.  If the tools had to understand where backpatched
Fortran 77 text hides its in-band character encoding declaration, the
results would be as I describe: huge amounts of useless hair.
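The sort of pipeline I mean, sketched with a toy input (file name and contents are illustrative; a real file in a non-ASCII encoding would first be converted out of band, e.g. with iconv):

```shell
# A toy source fragment standing in for a Fortran 77 program:
printf 'X = Y + X\n      Z = X\n' > prog.f

tr -cs 'A-Za-z0-9_' '\n' < prog.f |  # split bytes into identifier-ish tokens
  sort |                             # bring duplicate tokens together
  uniq -u                            # keep only tokens that occur once
# prints: Y
#         Z
```

The tools never parse Fortran; they see only bytes, which is exactly why they stay generic.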

> If you consider HTTP a file system, it
> already implements your solution; all text is identified using text/*
> types and either the file system provides encoding information (charset
> parameter) or text processors are required to treat the document as
> ISO-8859-1 encoded. Text processors would actually only get character
> streams from the HTTP implementation and would not have to worry about
> character encodings and stuff. Does it work? No.

It does not work because HTTP is layered over file systems which don't bother
to support the notion of encoding declarations persistently.
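For concreteness, the charset parameter at issue rides on the Content-Type header; a made-up exchange:

```http
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8

<?xml version="1.0"?>
<doc>...</doc>
```

When the parameter is absent, RFC 3023 has recipients of text/xml fall back to US-ASCII (HTTP's general text/* fallback being ISO-8859-1), regardless of what the document's own encoding declaration says.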

-- 
"We are lost, lost.  No name, no business, no Precious, nothing.  Only empty.
Only hungry: yes, we are hungry.  A few little fishes, nassty bony little
fishes, for a poor creature, and they say death.  So wise they are; so just,
so very just."  --Gollum        jcowan@reutershealth.com
www.ccil.org/~cowan