ietf-822

Re: What is a charset?

1993-03-08 05:27:15
    The term "character set" is used in this document to refer to a method
    used with one or more tables to convert encoded text to a series of
    octets.  This definition is intended to allow various kinds of text
    encodings, from simple single-table mappings such as ASCII, to complex
    table switching methods such as those that use ISO 2022's techniques.

We should get it right.

The first thing to note is that we have invented our own term "charset"

Actually, we don't use "charset" as a "term" in the document.  It's
just the name of a parameter.


If we mean something different or a special subset of the
family of ISO "coded character sets" we should say so in the document.

Yup.  That's why I mention 2022 as an example of a "complex" method.


What [mime2] got right (IMHO) is the direction of the mapping - from the
bits to the characters.

Actually, you need to go in both directions.  Maybe we should say
"convert between A and B" rather than "convert A to B".

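To make the two directions concrete, here is a minimal sketch in Python.
Python and its built-in codec names are only my choice for illustration;
nothing in the MIME drafts depends on them.  It takes one single-table
charset and one ISO 2022 style charset and converts characters to octets
and back again:

```python
# Illustration only: Python's built-in codecs stand in for "charset" here.
samples = [
    ("plain text", "ascii"),               # simple single-table mapping
    ("\u65e5\u672c\u8a9e", "iso2022_jp"),  # table switching via ISO 2022 escapes
]

round_trips = []
for text, charset in samples:
    octets = text.encode(charset)     # characters -> octets (sending side)
    decoded = octets.decode(charset)  # octets -> characters (receiving side)
    round_trips.append((charset, octets, decoded == text))

for charset, octets, ok in round_trips:
    print(charset, octets, ok)
```

Both directions use the same charset name, which is the point of saying
"convert between A and B".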

the EvdP definition does not relate the concept to ISO terms.

I don't think we should use the ISO definitions since the MIME
documents use the term "character" in a pretty informal, though
acceptable, way.  (Sometimes MIME says "character" when it means
"octet", but the meaning is clear from context.)

Instead of trying to modify the document to make formal use of the
terms "character" and "octet", it would be easier to leave it as it is
(i.e. a relatively informal IETF document, as opposed to a less
readable, formal ISO document).


I will question the use of "octet".
There are 7-bit communication lines in existence today that are
perfectly capable of doing MIME mail.

In MIME's case we must use the term "octet" since the CTEs Base64 and
Q-P operate on octets and nothing else.
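
As a rough illustration of why the CTEs need octets, here is a sketch
using Python's standard base64 and quopri modules (an assumption for the
sake of example, not a reference implementation of the MIME CTEs).  The
charset step has to run first, and the Base64 encoder refuses input that
is still characters:

```python
import base64
import quopri

# Charset step first: characters -> octets.
octets = "caf\u00e9".encode("iso-8859-1")

# CTE step: both encoders take octets and produce 7-bit-safe output.
b64 = base64.b64encode(octets)
qp = quopri.encodestring(octets)

# Handing the Base64 encoder characters instead of octets is rejected.
try:
    base64.b64encode("caf\u00e9")
    accepted_text = True
except TypeError:
    accepted_text = False

print(b64, qp, accepted_text)
```

Note that the output of both CTEs fits in 7 bits, so a 7-bit line can
carry it even though the input is defined in terms of octets.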


The term "encoded text" is undefined, and to me it seems circular.

I agree that this could be confusing.  I was trying to get the "order
of encoding" stuff in there somehow (i.e. text -> subtype -> charset
-> CTE).  But since this stuff is going to be explained elsewhere in
the document, we can probably just say "text" instead of "encoded
text".  Or perhaps we should say "sequence of characters"?
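
The "order of encoding" can be sketched as a pair of functions, one for
each direction.  This is only an illustration under some assumptions:
the subtype step is treated as a no-op (as for text/plain), Base64 is
picked arbitrarily as the CTE, and the names to_wire and from_wire are
made up for this example:

```python
import base64


def to_wire(text, charset):
    """Sending side: text -> charset -> CTE."""
    octets = text.encode(charset)      # charset: characters -> octets
    return base64.encodebytes(octets)  # CTE: octets -> 7-bit-safe lines


def from_wire(body, charset):
    """Receiving side: the same steps in reverse order."""
    octets = base64.decodebytes(body)  # undo the CTE first
    return octets.decode(charset)      # then map octets back to characters


body = to_wire("na\u00efve text", "iso-8859-1")
print(body)
print(from_wire(body, "iso-8859-1"))
```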


    A charset
    specification must include all information for a bit stream to 
    be interpreted as the correct characters.

Actually, we need a way to specify the version of a charset so that
extension becomes possible without regressing to "splatting the raw
bytes on the screen" when we extend the charset.  We can specify that
the version info be included inside the charset parameter, or we can
have a separate parameter or whatever.

  e.g.  Content-Type: text/plain; charset=iso-2022-jp/2

  or    Content-Type: text/plain; charset=iso-2022-jp; cs-vers=2

The 2nd example would allow "old" iso-2022-jp implementations to
display some of the characters correctly.  I'm pretty sure that we
need to be able to indicate the charset version, but I'm not sure
where we should put that info.
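
To show how a reader might cope with either placement, here is a sketch
using Python's standard email package.  Both the "/2" suffix and the
cs-vers parameter are only the proposals from this message, not anything
that exists in MIME; the second form is written with the usual ";"
between parameters, and the function name and the default version of 1
are my own assumptions:

```python
from email.message import Message


def charset_and_version(content_type):
    """Return (charset, version) for either proposed syntax."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset", failobj="us-ascii")
    version = msg.get_param("cs-vers")  # hypothetical separate parameter
    if version is None and "/" in charset:
        # hypothetical "name/version" form inside the charset value
        charset, _, version = charset.partition("/")
    # Assumption: a charset with no version info is version 1.
    return charset.lower(), int(version) if version is not None else 1


print(charset_and_version("text/plain; charset=iso-2022-jp/2"))
print(charset_and_version("text/plain; charset=iso-2022-jp; cs-vers=2"))
print(charset_and_version("text/plain; charset=iso-2022-jp"))
```

With the separate parameter, an implementation that has never heard of
cs-vers still sees the plain charset name "iso-2022-jp"; with the suffix
form it sees an unknown name "iso-2022-jp/2".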


Erik

