ietf-822

Re: RFC 2047 and gatewaying

2003-01-06 10:13:46

In <20030104033518(_dot_)GA16177(_at_)ussenterprise(_dot_)ufp(_dot_)org> Leo Bicknell <bicknell(_at_)ufp(_dot_)org> writes:

> While I know little of Usenet's problems, or of NSI's motivations,
> I can speak as a sysadmin who has found simple harmony to the fact
> that all the internet protocols use US-ASCII, or ISO-8859-1.  I
> also see a future developing where if I want to code up a perl
> script to do something I'll have to understand UTF-8 to read a
> Netnews article, some weirdo NSI encoding to DNS resolve names from
> that article, and then know how to base64 encode things for e-mail.
> And that's just for three services I have been following.

The way the world seems to be moving is that UTF-16 will be used internally
by OSes, and UTF-8 will be used for external communications, and maybe for
file storage. OSes will be able to convert between the two (often with the
user being unaware, as when dragging or pasting).

UTF-16 as an internal code is not perfect (it will be a pain for the
Private Use planes, musical notes, etc., which lie outside the Basic
Multilingual Plane and so need surrogate pairs, but it is fine for almost
everything else). UTF-8 is fine for internet use 'on the wire' because the
octets for CR, LF and NUL never appear in it except as those characters
themselves, and it has no byte-order (-endian) problems.
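
To make the two claims above concrete, here is a rough Python sketch (the
sample string and variable names are mine, chosen only for illustration). It
encodes the same text both ways and checks which encodings produce stray
CR/LF/NUL octets, whether byte order matters, and how many 16-bit code
units a musical symbol needs.

```python
# Illustrative sketch only; the sample text is an assumption, not from any spec.
text = "F\u00e4rber \U0001D11E"   # "Färber" plus U+1D11E MUSICAL SYMBOL G CLEF

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8: every octet of a multi-octet sequence is >= 0x80, so CR, LF and NUL
# can only appear when the text really contains those characters.
utf8_has_stray_controls = any(b in (0x00, 0x0A, 0x0D) for b in utf8)

# UTF-16: every ASCII-range character carries a zero octet ...
utf16_has_nul = 0x00 in utf16_le
# ... and an unrelated character can contain an LF octet:
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is 01 0A in big-endian order.
c_dot_above_be = "\u010a".encode("utf-16-be")

# UTF-16 depends on byte order; UTF-8 has only one serialisation.
byte_order_matters = utf16_le != utf16_be

# Characters outside the BMP take two 16-bit code units (a surrogate pair).
clef_code_units = len("\U0001D11E".encode("utf-16-le")) // 2

print(utf8_has_stray_controls, utf16_has_nul, byte_order_matters, clef_code_units)
```

Any code that indexes UTF-16 by 16-bit unit therefore has to treat the clef
as a pair, which is the "pain" referred to above.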

The Bad News is that this combination will not satisfy the Chinese, though
it will satisfy almost everybody else. And some OSes will try to use UTF-8
internally (including some or all UNICES).

> This is already bad, but knowing that OS vendors are dealing with
> the UTF-7/UTF-8/UTF-16 problem, and other protocol groups are
> dealing with the same issues (indeed, IMAP uses UTF-7) seems to
> show a very fragmented future.

I think it is agreed that UTF-7 was an experiment that failed. Its usage
will die out (and that includes in IMAP where it was not obligatory,
anyway).

> What's my point?  Today I can "grep" a newsgroup article (ok,
> depends on server software and format), pass that to "dig" (ok,
> maybe with some sed and other work) and find out DNS information,
> and then use "sendmail" (ok, with a wrapper to generate a real
> e-mail) to mail someone.  I hope we all agree this is a good thing.
> I see a future developing where a custom filter will be needed
> between each of those steps to preserve "international" characters.
> I hope we can all see why that is bad.

Today you can grep, using UTF-8, in a file that is written in ISO-8859-1,
or UTF-16 or whatever. What you can't do is grep in a file that contains
stuff encoded in RFC 2047. So if you want to grep in your news spool for
articles From: "Claus Färber" you might find it if the headers were in
UTF-8 (as Usefor prefers). But it would be hopeless even if you grepped
for the corresponding RFC 2047 string, because there are many ways that
"Claus Färber" can be encoded according to RFC 2047.
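
As a rough illustration of that last point, the Python sketch below lists a
few RFC 2047 forms of the same name and shows that a literal search misses
all of them, while a search after decoding finds every one. The list of
variants and the helper decode_rfc2047() are my own for illustration; only
decode_header() and make_header() come from Python's standard email.header
module. The list is nowhere near exhaustive.

```python
from email.header import decode_header, make_header

name = "Claus F\u00e4rber"  # "Claus Färber"

# A few of the many legal RFC 2047 encodings of the same display name
# (illustrative selection, not exhaustive).
variants = [
    "=?ISO-8859-1?Q?Claus_F=E4rber?=",                  # Latin-1, Q encoding
    "=?iso-8859-1?q?Claus?= =?iso-8859-1?q?_F=E4rber?=",  # split into two words
    "=?UTF-8?Q?Claus_F=C3=A4rber?=",                    # UTF-8, Q encoding
    "=?UTF-8?B?Q2xhdXMgRsOkcmJlcg==?=",                 # UTF-8, B (base64)
]

def decode_rfc2047(value: str) -> str:
    """Hypothetical helper: decode encoded-words to a plain Unicode string."""
    return str(make_header(decode_header(value)))

decoded = [decode_rfc2047(v) for v in variants]

# What a plain grep for the name sees: no match in any raw header value.
literal_hits = [v for v in variants if name in v]

# What a decode-then-search filter sees: a match in every one.
decoded_hits = [v for v in variants if name in decode_rfc2047(v)]

print(len(literal_hits), len(decoded_hits))
```

Searching for any single encoded string would match at most one of these
forms, so in practice the decoding step has to sit in front of grep.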

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
