ietf-822

Re: Charset mandatory in unix/linux

2006-03-12 12:14:23

(cc'ing the ietf-types list since this doesn't seem like an appropriate topic
for ietf-822)

The charset parameter is mandatory in the MIME content-type
attribute.

Actually, I don't know of a single case where this is true. All media type
parameters are either type or subtype specific, so there is no general rule
that applies to all charset parameters. Nevertheless, the charset parameters
that attach to the text top-level type are optional, as is the charset
parameter on application/xml. And making the parameter optional doesn't even
imply that there's a default. For example, in the case of XML the allowed
charsets for unlabelled material are intentionally limited so they can be
determined by inspection.
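
To make "determined by inspection" concrete, here is a minimal Python sketch of
how the encoding of an XML entity with no external label might be guessed from
its first bytes. It is a simplification of the autodetection guidance in the
XML 1.0 specification: it ignores UCS-4 and EBCDIC entirely, and the function
name sniff_xml_encoding is made up for this illustration.

```python
import re

# Matches an encoding declaration inside a leading XML declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an externally unlabelled XML entity.

    Simplified sketch: UCS-4 and EBCDIC detection are deliberately omitted.
    """
    # UTF-16 entities are required to begin with a byte order mark.
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        return "utf-16"
    # A UTF-8 byte order mark is optional but unambiguous when present.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # Any other encoding has to announce itself in the XML declaration,
    # which is readable here as long as the encoding is ASCII-compatible.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declaration: the entity has to be UTF-8.
    return "utf-8"
```

The point of the sketch is that the fallback at the end is not a guess: without
a label, a byte order mark, or a declaration, UTF-8 is the only option left.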

However, such a parameter is not mandatory in
Unix or Linux.

I could say the same thing about media types. File extensions or type codes are
commonly used to determine the media type. This is a huge problem that has led
to serious security glitches as well as poor user experiences.

This is causing more and more problems, when
people have a mixture of files with different charsets,
which you easily get when you download files from the
Internet or receive them via e-mail.

The reality is that it is causing fewer and fewer problems as things gradually
shift towards Unicode-based charsets and away from the vast array of less
capable charsets. The security issues caused by non-use or misuse of media type
labels are a far bigger problem, and worse, one that doesn't appear to be going
away.

Would it be possible to get the people responsible for the
file systems in Unix and Linux to add a mandatory charset
attribute to all text files?

Knowing the charset buys you very little without also knowing the media type.
You seem to be focused on plain text here and hence you're ignoring the larger
media type issue. Lots of media types have parameters and even when the media
type can be determined - it frequently cannot be done reliably - it is often
done in a way that doesn't allow additional parameters to be attached.

Additionally, usage is shifting away from plain text and towards other, more
sophisticated, formats - so much so that support for plain text has become a
bit problematic in some places. This is one of the reasons given for moving
away from plain text RFCs to some other format.

Best is probably to add a
generalized property list to files, so that also other
properties than charset can be added in the future.

The ability to attach metadata to files is indeed a very useful feature, one
that has been around for decades on some platforms at least. (I'm not going to
bother with the history here.) And it is already available on Linux - at a
minimum the ext2, ext3, and XFS file systems support it. (There are probably
others but I'm too lazy to go look them up.)

So in the sense of getting the filesystem to support this sort of tagging, your
problem is already solved in many cases. But this is the easy part. You now
have to get applications to agree on a specific use of metadata tags for
charsets or media types or whatever. Good luck on getting that to happen.
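
As a rough illustration of how easy the filesystem side is, here is a minimal
Python sketch that stores and reads back a charset label using Linux extended
attributes. The attribute name "user.charset" and both function names are
assumptions made for this example, not an agreed convention, which is exactly
the coordination problem described above. The os.setxattr and os.getxattr calls
exist only on Linux, and the filesystem may still refuse user attributes, so
the sketch reports failure instead of raising.

```python
import os
from typing import Optional

# Hypothetical attribute name chosen for this sketch; nothing obliges other
# applications to look for it or to interpret it the same way.
CHARSET_ATTR = "user.charset"

def tag_charset(path: str, charset: str) -> bool:
    """Try to record a charset label on a file; return False if unsupported."""
    if not hasattr(os, "setxattr"):  # os.setxattr is only available on Linux
        return False
    try:
        os.setxattr(path, CHARSET_ATTR, charset.encode("ascii"))
        return True
    except OSError:
        # The filesystem may lack user attributes or have them disabled.
        return False

def read_charset(path: str) -> Optional[str]:
    """Return the stored charset label, or None if absent or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, CHARSET_ATTR).decode("ascii")
    except OSError:
        # Covers a missing attribute, a missing file, and lack of support.
        return None
```

A label written this way is typically lost as soon as the file passes through
a tool or protocol that does not know about extended attributes, so the sketch
only covers the easy part.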

The advantage would be that programs which transport files
across the Internet, such as e-mail, ftp and http, would
more often use the correct charset and not munge the files
by giving them an incorrect charset. The commonly occurring
problem with incorrect charset would be reduced. Also local
problems such as text editors would benefit from knowing
the charset of a file.

First of all, email and http do not "transfer files" per se. They transfer data
objects and each protocol defines the metadata it considers appropriate to
attach to data objects. The matchup of this metadata to file system metadata is
imperfect at best, and there are plenty of issues in this space. (It is far
from clear that adding additional type information will solve these problems,
however.)

Ftp does transfer files, but ftp is kinda old and doesn't understand file
metadata too well if at all.

This means that where retention of file metadata is important, some sort of
additional container has to be used. A vast number of
such container formats have been defined - tar files, zip files,
AppleSingle/AppleDouble, etc.

(Mac OS earlier had a very good feature: you could add to
every file a property list called the "resource fork".
This still works in Mac OS X, but is less and less often
used, since Unix, on which Mac OS X is based, does not have
this facility. In Mac OS X the resource fork is stored in a
separate file whose file name starts with ".", in the same
directory as the file described).

Sigh. There are so many things wrong here it is difficult to know where to
begin. I guess I'll start with resource forks.

The HFS+ and HFSX file systems used by Mac OS X support resource forks
natively. They are accessible through the POSIX APIs not by prepending a dot to
the file name (given how often adjunct files with this naming convention are
created by UNIX facilities, such a convention would cause all sorts of havoc)
but rather by appending "/rsrc" - it's a sort of file-in-a-pseudo-subdirectory.
And while it used to be true that most utilities on Mac OS X were unaware of
resource forks, this has changed in recent versions - most of the UNIX-derived
OS utilities now handle resource forks correctly.

However, resource forks are used to store file data, not metadata, so none of
this is especially relevant to the metadata issue. Now, Apple apparently had
some notion of extending the fork model to allow arbitrary named forks. Such a
facility could in theory be used to store metadata, although using such a
heavyweight mechanism to store metadata never seemed like a very good idea to
me. (A fork is in effect a separate file as far as the on-disk structures are
concerned, so even if you hide this from applications the overhead is still
there.) Regardless, Apple has backed away from actually completing the
implementation of this facility. My understanding is if you try to create forks
with other names with the provided APIs you get an error.

What Apple has done instead is implement full support for file metadata
separate from forks. This IMO is the right approach. Apparently there's a
separate B-tree for this somewhere in HFS+ - you don't even need to be
running HFSX for it to work. It took Apple a while to get this done, but API
routines to access this stuff now exist, and even better, they are apparently
the same as the ones used on Linux.

This support also works on other filesystems that lack built in metadata
storage - the metadata is instead stored in a separate file with "._" prepended
to the name. (I know there are ext2 and ext3 drivers for Mac OS X but I don't
know if or how they handle the metadata issue - hopefully they store it in a
way that's compatible with usage on Linux.)

There's lots more to know about metadata on Mac OS X. This article
describes the landscape pretty well:

   http://arstechnica.com/reviews/os/macosx-10.4.ars/6

So what's the bottom line? The bottom line is that you appear to be focusing on
the wrong problem along several dimensions. First, charset information
specifically isn't as interesting or essential as you claim, and the degree to
which it is interesting is dropping for a variety of reasons. Second, you
appear to have missed the larger and much more important problem of not having
correct parameterized type information available. (And we haven't even
discussed the many other sorts of metadata, like say language information, that
are also useful to have.) Third, your focus on getting metadata support into
filesystems is mostly misplaced - this is a solved problem in a lot of cases.
And fourth, you don't seem to appreciate the difficulty of getting everyone to
agree to actually use filesystem metadata to solve any of these problems. This
last is a complete showstopper and I despair of there ever being significant
progress in this area because of it.

                                Ned