ietf-822

TEXT version of Draft RFC

1991-04-22 10:20:50
Network Working Group -- Request for Comments: XXXX

                        A Multipart Content-Type 
                     and Content-Encoding Mechanism 
                          for RFC 822 Messages

                     Nathaniel Borenstein, Bellcore
                           Ned Freed, Innosoft

                               April 1991
                                                                   
Status of This Memo

This RFC suggests extensions to the RFC 822 message representation
protocol to allow multi-part textual and non-textual messages to be
represented and exchanged without loss of information. Discussion and
suggestions for improvements are welcome.  This memo does not specify an
Internet standard.  Distribution of this memo is unlimited.

If this RFC becomes a standard, it would affect the following other RFCs:

Would Obsolete:  RFC 934, RFC 1049, RFC 1154
Would Update:    RFC 822
Would Affect:    RFC 1148

Table of Contents

    Introduction
    The Content-Type Header Field
    The Content-Encoding Header Field
    Quoted-Printable Content-Encoding
    Hexadecimal Content-Encoding
    Base64 Content-Encoding
    The "Multipart" Content-Type
    A Complex Multipart Example
    The Encoded-Variable Header Field
    Cross-References Between Encapsulated Parts 
    Optional Content-Size Header Field
    Summary
    Acknowledgements
    References
    Appendix A:  The Character Set for the MAILASCII Content-Type


1       Introduction

One of the limitations of RFC 821/822 based mail systems is the fact
that they limit the contents of electronic mail messages to relatively
short lines of seven-bit ASCII.  This forces a user to convert any
non-textual data that she may wish to send into a seven-bit ASCII
representation before invoking her local mail UA (User Agent program). 
Examples of encodings currently used in the Internet include pure
hexadecimal, uuencode, the 3-in-4 base 64 scheme specified in RFC 1113,
the Andrew Toolkit Representation [REF-ATK], and many others.

This limitation becomes even more apparent as gateways are designed to
allow for the exchange of mail messages between RFC 822 hosts and X.400
hosts.  X.400 [REF-X400] specifies mechanisms for the inclusion of
non-textual body parts within electronic mail messages.  The current
standards for the mapping of X.400 messages to RFC 822 messages specify
that either X.400 non-textual body parts should be converted to (not
encoded in) an ASCII format, or that they should be discarded, notifying
the RFC 822 user that discarding has occurred.  This is clearly
undesirable, as information that a user may wish to receive is lost. 
Even though a user's UA may not have the capability of dealing with the
non-textual body part, the user might have some mechanism external to
the UA that can extract useful information from the body part. 
Moreover, it does not allow for the fact that the message may eventually
be gatewayed back into an X.400 MHS, where the non-textual information
would definitely become useful again.

In devising an encapsulation scheme, two things must be considered: how
to convert the non-textual data to a representation which may be
transmitted over a seven-bit SMTP connection without loss of data, and
how to preserve information about the structure of the data itself. 
This "structural" information must include, at a minimum, the type of
data involved. This type information may be something recognized by many
systems or it may be some type of data specific to a single operating
system. 

This memo describes several mechanisms that combine to solve these
problems.  In particular, it describes an encapsulation mechanism that
may be used to describe multiple part ("multipart") messages. The parts
themselves may contain textual or nontextual data; non-textual data is
encoded in a form that can survive mailers unaware of this
specification.  This memo also defines two RFC 822 header fields to be
used to indicate the inclusion of non-textual information in a mail
message: Content-Type and Content-Encoding.   Additionally, this memo
proposes an Encoded-Variable header field for including non-textual or
international text information in certain parts of the message header
area.  Finally, this memo defines an optional header field,
Content-Size, which may be used within multipart messages.

2      The Content-Type Header Field

The Content-Type header field was previously defined in RFC 1049, and is
reaffirmed here.  The remainder of this section is derived from RFC
1049, and, where different, is intended to supersede it.

The Content-type: header field consists of up to four parameter values.
The first, or type, parameter names the type, format, or structuring
technique; the second, optional, parameter is a version number, ver-num,
which indicates a particular version or revision of the standardized
format.  The third parameter is a resource reference, resource-ref,
which may indicate a standard database of information to be used in
interpreting the information.  The last parameter is a comment.

In the Extended BNF notation of RFC-822, we have:

Content-Type:= type [";" ver-num [";" 1#resource-ref]] 
                [comment]

ver-num:=      local-part

resource-ref:=  local-part

type   := "POSTSCRIPT" /
          "SCRIBE" /
          "SGML" /
          "TeX" /
          "TROFF" /
          "DVI" /
          "ODA" /
          "MULTIPART" /
          "MAILASCII" /
          iso-charset-type /
          "U-LAW" /
          "A-LAW" /
          "PBM" /
          "PGM" /
          "PPM" /
          "DES-MESSAGE" /
          x400-type /
          x400-1984-type /
          x400-1988-type /
          "X-"atom

iso-charset-type := "ISO-IR-" 1*DIGIT

x400-type := "IA5-Text" /            ; [0] IA5Text, IA5TextBodyPart 
          "Voice" /               ; [2] Voice, VoiceBodyPart 
          "G3-Fax" /              ; [3] G3Fax, G3FacsimileBodyPart 
          "Teletex" /             ; [5] TTX, TeletexBodyPart
          "Videotex" /            ; [6] Videotex, VideotexBodyPart
          "Nationally-Defined" /  ; [7] NationallyDefined,
                                  ;     NationallyDefinedBodyPart 
          "Encrypted" /           ; [8] Encrypted, EncryptedBodypart
          "Message"               ; [9] ForwardedIPMessageMessage,
                                  ;     MessageBodyPart

x400-1984-type := "Telex" /       ; [1] TLX
                  "TIF0" /        ; [4] TIF0
                  "SFD" /         ; [10] SFD
                  "TIF1"          ; [11] TIF1

x400-1988-type := "G4-Class1" /           ; [4] G4Class1BodyPart 
                  "Mixed-Mode" /          ; [11] MixedMode
                  "Bilaterally-Defined" / ; [14] BilaterallyDefined
                  "Externally-Defined"    ; [15] ExternallyDefined


These values are not case sensitive.  POSTSCRIPT, Postscript, and
POStscriPT are all equivalent.  Additional "standard" Content-type
values may be registered with the Internet Assigned Numbers Coordinator at
USC-ISI.  Those wishing to register such values should contact:

                            Joyce K. Reynolds
                   USC Information Sciences Institute
                           4676 Admiralty Way
                     Marina del Rey, CA  90292-6695

                   213-822-1511    JKReynolds@ISI.EDU

The specific predefined "type" fields are explained below:

"X-"atom -- Any type value beginning with the characters "X-" is a
private value, to be used by consenting mail systems by mutual
agreement.  Any format without a rigorous and public definition should
be named with an "X-" prefix.

POSTSCRIPT -- Indicates the enclosed document consists of information
encoded using the Postscript Page Definition Language developed by Adobe
Systems, Inc. [REF-PS].  For type "postscript" the valid ver-num fields
are "1.0", "2.0", and "null", and the valid resource-ref fields include,
but are not limited to, "laserprep2.9", "laserprep3.0", "laserprep3.1",
and "laserprep4.0".

SCRIBE -- Indicates the document contains embedded formatting
information according to the syntax used by the Scribe document
formatting language distributed by the Unilogic Corporation.
[REF-SCRIBE].  For type "scribe" the valid ver-num fields are "null",
"3", "4", "5", etc.

SGML -- Indicates the document contains structuring information
according to the rules specified for the Standard Generalized Markup
Language, IS 8879, as published by the International Organization for
Standardization. [REF-SGML] Documents structured according to the ISO
DIS 8613--Office Document Architecture and Interchange Format--may also
be encoded using SGML syntax.  For type "sgml" the valid ver-num fields
are "IS.8879.1986" and "null".

TeX -- Indicates the document contains embedded formatting information
according to the syntax of the TeX document production language.
[REF-TEX]

TROFF -- Indicates the document contains embedded formatting information
according to the syntax specified for the TROFF formatting package
developed by AT&T Bell Laboratories. [REF-TROFF].  For type "troff" the
valid resource-ref fields include, but are not limited to, "eqn", "tbl",
"me", and the names of other troff macro packages.

ODA -- Indicates that the body is an ODA document, containing formatted
information encoded according to the Office Document Architecture
[REF-ODA].   If needed, a document application profile is to be included
as part of the message body.

DVI -- Indicates the document contains information according to the
device independent file format produced by TROFF or TeX.

MULTIPART -- Indicates the document contains multiple encapsulated
messages, each of which may be of a different content-type.  The precise
syntax of a "multipart" message is defined later in this RFC, as are the
possible values for its ver-num and resource-ref fields.

U-LAW or A-LAW --   Indicates that the document contains audio data in
U-law [REF-ULAW] or A-law [REF-ALAW], respectively.  U-law and A-law are
the American and European audio telephony standards.  If one of these
content-types is used, the ver-num field can be used to give a sampling
rate in Hertz, optionally followed by the letters "HZ".  Although audio
header formats are not yet standardized, the resource-ref field can be
used to specify an audio header format.  Thus an appropriate
content-type header for audio might be something like "Content-type: 
u-law; 8000 HZ; X-Next".

PBM or PGM or PPM -- Indicates the document contains image data encoded
in the Portable Bitmap format [REF-PBM] for black and white, grey scale,
or color images.

DES-MESSAGE -- Indicates that the body is an encapsulated message
encrypted with DES encryption [REF-DES].  An encrypted message is
specified, rather than simply encrypted text, because this permits the
encrypted object to contain a Content-type header and thus to contain
encrypted data of any type.  If all that is desired is encrypted text,
the header area of the encapsulated message can be blank (i.e. once
decrypted, it begins with CRLF).

ISO-CHARSET-TYPE -- Indicates the document contains text in an ISO
standard character set, identified by its International Registration
number.  Each ISO character set defines a new standard mail content
type, given by the string "ISO-IR-" followed by the numeric value of the
character set.  Thus, for example, a content-type of "ISO-IR-6"
specifies a character set that is extremely similar, and perhaps
identical, to MAILASCII.  However, it should be noted that even when
the Content-type is an ISO-IR- character set type, certain control
characters will always be construed according to the guidelines of RFC
821 and RFC 822.  In particular, character positions 13, 10, and 32
will always be interpreted as CR, LF, and SPACE, respectively.

X400-TYPE -- Indicates the document contains an ASN.1 representation of
an X.400 bodypart. The type field may be either "1984", indicating that
the representation is defined in [REF-CCITT84c], or "1988", indicating
that the encoding is defined in [REF-CCITT/ISO88b].

X400-1984-TYPE -- Indicates that the document contains an ASN.1
representation of an X.400 bodypart specific to the 1984 version of the
standard [REF-CCITT84c]. The type field must be "1984" if specified.

X400-1988-TYPE -- Indicates that the document contains an ASN.1
representation of an X.400 bodypart specific to the 1988 version of the
standard [REF-CCITT/ISO88b]. The type field must be "1988" if specified.

MAILASCII -- Indicates the document contains only unencoded 7 bit US
ASCII text, the default content-type for RFC 822 mail.  This
content-type has been the subject of some confusion and ambiguity in the
past.  Its definition is spelled out in Appendix A.

If no Content-type header field is present, "MAILASCII" is assumed.  
That is, the name "MAILASCII" is intended to refer to the default
message body type as defined by RFC 822. 

It should be noted that the list of Content-type values given above is
expected to be augmented in time, and that such additions will be
registered at the address given above.  We have simply attempted, in
this RFC, to give as many standard Content-type definitions as was
possible given the current state of our knowledge.  The Content-type
values defined above are a superset of the values defined by RFC 1049.

Those wishing to transmit FAX by Internet mail should note that G3-FAX is
one of the Content-types defined for X.400 support.  It is thus
appropriate to use "Content-type: G3-FAX" for such data.
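
As an illustration only, the field syntax given in this section (a
type, an optional ver-num, and an optional comma-separated list of
resource-refs, separated by semicolons) might be split apart as in the
following Python sketch.  The function name parse_content_type is
hypothetical, and the sketch assumes an already-unfolded field value
with at most one trailing, non-nested comment; it does not validate
the type against the registered list.

```python
def parse_content_type(value):
    """Split a Content-Type field value into (type, ver_num, resource_refs).

    Hypothetical helper: assumes the value is unfolded and carries at most
    one trailing, non-nested parenthesised comment.
    """
    value = value.strip()
    # Drop a single trailing comment such as "(some remark)".
    if value.endswith(")") and "(" in value:
        value = value[: value.rindex("(")].rstrip()

    # type [";" ver-num [";" 1#resource-ref]]
    fields = [field.strip() for field in value.split(";", 2)]

    ctype = fields[0].upper()  # type values are not case sensitive
    ver_num = fields[1] if len(fields) > 1 and fields[1] else None
    refs = []
    if len(fields) > 2:
        # "1#resource-ref" is a comma-separated list in RFC 822 notation.
        refs = [ref.strip() for ref in fields[2].split(",") if ref.strip()]
    return ctype, ver_num, refs
```

With the audio example given earlier, parse_content_type("u-law; 8000
HZ; X-Next") yields the type "U-LAW", the ver-num "8000 HZ", and the
single resource-ref "X-Next".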

3       The Content-Encoding Header Field

Many content-types are represented, in their natural format, as 8-bit or
binary data.  Such data can not be transmitted over existing Internet
mail mechanisms because both RFC 821 and RFC 822 restrict mail messages
to 7 bit data with reasonably short lines.  It is necessary, therefore,
to define a standard mechanism for encoding such data in an acceptable
manner.

This RFC specifies that this encoding will be done by a new
"Content-Encoding" header field.  The Content-Encoding field is used to
indicate the type of transformation that has been used to represent the
message body in an acceptable manner.  Unlike Content-types, which are
expected to proliferate, it is expected that there will never be more
than a few different Content-Encoding values, both because there is less
need for variation and because the effect of variation in
Content-Encoding would be more problematic.  

However, establishing only a single Content-Encoding mechanism does not
seem possible.  In particular, there is a tradeoff between the desire
for a compact and efficient encoding of binary data and the desire for a
readable encoding of data that is mostly, but not entirely, MAILASCII
text.  For this reason, at least two encoding mechanisms are necessary,
a "readable" encoding and a "dense" encoding.  This RFC also specifies a
third encoding which is neither readable nor dense, but is the simplest
to encode and decode.  A fourth encoding, for compressed
("super-dense") data, might reasonably be defined at a later date.

The Content-Encoding field is designed to specify a two-way mapping
between the "native" representation of a type of data and a
representation that can be readily exchanged using 7 bit mail transport
protocols as defined by RFC 821 (SMTP). This field has not been defined
by any previous RFC. The field's value is a single atom specifying the
type of encoding, as enumerated below.  Formally:

Content-Encoding:=      "BASE64"/
                        "HEXADECIMAL"/
                        "QUOTED-PRINTABLE"/
                        "8BIT"/"BINARY"/
                        "7BIT"/"X-"atom

These values are not case sensitive.  That is, Hexadecimal and
HEXADECIMAL and heXadeCimAl are all equivalent.  An encoding type of
7BIT implies that the message is already in a seven-bit ASCII
representation. This value is assumed if the Content-Encoding header
field is not present.  If the message is stored or transported via a
mechanism that permits 8-bit data, a Content-Encoding of "8bit" should
nonetheless be used.  If the message is stored or transported via a
mechanism that permits arbitrary binary data, a Content-Encoding of
"binary" should nonetheless be used.  (DISCUSSION:  The distinction
between the Content-Encoding values of "binary," "8bit," and "7bit" may
seem unimportant in an 8-bit binary environment, but clear labeling will
be of enormous value to gateways between 8-bit and 7-bit systems.  The
difference between "8bit" and "binary" is that "8bit" implies adherence
to SMTP limits on line length and CR/LF semantics, whereas "binary" does
not.)

Implementors may define new content encoding values, but should prefix
them with "x-" to indicate their non-standard status, e.g.
"Content-Encoding:  x-my-new-encoding".   However, unlike Content-types,
the creation of new Content-Encoding values is explicitly discouraged,
as it seems likely to hinder inter-operability with little potential
benefit.
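
The rules above (case-insensitive values, a default of 7BIT when the
field is absent, and an "x-" prefix for private values) can be sketched
in Python as follows.  The names STANDARD_ENCODINGS and
classify_encoding are hypothetical, and the "unknown" category for
values that are neither standard nor "x-" prefixed is an assumption of
this sketch; the memo does not say how such values should be handled.

```python
# The six standard values from the Content-Encoding grammar.
STANDARD_ENCODINGS = {
    "BASE64", "HEXADECIMAL", "QUOTED-PRINTABLE", "8BIT", "BINARY", "7BIT",
}


def classify_encoding(value=None):
    """Return (normalized_name, kind) for a Content-Encoding field value.

    kind is "standard", "private" (an "X-" value), or "unknown".
    Pass None when the header field is absent.
    """
    if value is None:
        return "7BIT", "standard"  # assumed when the field is not present
    name = value.strip().upper()   # values are not case sensitive
    if name in STANDARD_ENCODINGS:
        return name, "standard"
    if name.startswith("X-") and len(name) > 2:
        return name, "private"
    return name, "unknown"         # assumption: caller decides what to do
```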

If a Content-Encoding header field appears as part of a message header,
it applies to the entire message body, whether or not that body is of
type "multipart."  If it is of type multipart, the encoding applies
recursively to all of the encapsulated parts, including their
encapsulated headers.  If a Content-Encoding header field appears as
part of an encapsulation's headers, it applies only to the body of the
encapsulated part.  If the encapsulated part is itself of type
"multipart", the encoding applies recursively to all of the encapsulated
parts within that encapsulated part.

The following sections will define the standard encoding mechanisms.

3.1     Quoted-Printable Content-Encoding

The Quoted-Printable encoding is intended to represent data that is
largely, but not entirely, 7 bit ASCII.  Printable ASCII portions of
body parts encoded in this way should be recognizable by humans, if
necessary, without translation.

In this encoding, ASCII characters 9 (tab), 10 (nl), 13 (cr), 32 through
37, inclusive, 39 through 91, and 93 through 127, inclusive, are
unchanged.  All other characters, including characters 38 and 92, are to
be represented in either of the following quotation styles and special
cases:

    Style #1:  Any 8 bit value may be represented by a "\" followed by a
    two digit hexadecimal representation of the character's ASCII value.
    Thus, for example, character 12 (control-L, or formfeed) can be
    represented by "\0C", the ampersand character (38) can be
    represented by "\26", and the backslash character (92) itself can be
    represented by "\5C".

    Style #2:  An 8 bit value from 160 through 255 may, alternately, be
    represented by an ampersand character followed by the character
    obtained by the removal of the high order bit, i.e. by subtracting
    128 from the value.  Thus  the 8 bit value 193 may be represented as
    "&A".  

    Note that these two styles may be freely intermixed.  Style #1 is
    preferred for characters 128 through 159, because style #2 might
    include control characters (e.g. TAB) that are altered by some MTA
    (see NOTES TO IMPLEMENTERS, below).  Style #2 is provided for
    improved readability of some 8-bit character sets in which turning
    on the 8th bit produces a character similar to the corresponding 7
    bit character, e.g. the 8th bit simply adds an umlaut.  In such
    cases, style #2 is somewhat more readable, but should be used
    carefully, as explained in the NOTES TO IMPLEMENTERS.

    Additionally, there are two special cases that may be represented
    otherwise:

    Special case #1:  The literal ampersand and backslash characters may
    themselves be quoted by backslashes.  Thus, the backslash may be
    represented as "\\" and the ampersand as "\&".  Note that this is
    not ambiguous with regard to the first clause, because neither "\"
    nor "&" is part of the hexadecimal alphabet.

    Special case #2:  A backslash at the end of a line may be used to
    indicate a non-significant line break.  That is, if one needs to
    include a long line without line breaks, but is concerned that MTA's
    will break the line into multiple lines, a message encoded with the
    quoted-printable encoding may include "soft" line breaks by
    preceding the line break with a backslash.  Thus if the "raw" form
    of the line is a single line that says:

    Now's the time for all men to come to the aid of their country. 
    Now's the time for all men to come to the aid of their country. 
    Now's the time for all men to come to the aid of their country.

    This could be represented, in the quoted-printable encoding, as

    Now's the time for all men to come to the aid of their country.  \
    Now's the time for all men to come to the aid of their country.  \
    Now's the time for all men to come to the aid of their country.  

    This provides a mechanism with which long lines can be encoded in
    such a way as to be restored by the user agent.  
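
The two quotation styles and two special cases above can be sketched in
Python roughly as follows.  This is a minimal illustration, not a
reference implementation: the names qp_draft_encode and qp_draft_decode
are hypothetical, the encoder uses only style #1 and inserts no soft
line breaks, and the decoder assumes that line breaks are CR LF and
that lower case hexadecimal digits are acceptable.

```python
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def _is_literal(b):
    # Characters 9, 10, 13, 32-37, 39-91 and 93-127 pass through unchanged.
    return b in (9, 10, 13) or 32 <= b <= 37 or 39 <= b <= 91 or 93 <= b <= 127


def qp_draft_encode(data: bytes) -> bytes:
    """Encode using style #1 only: "\\" followed by two hex digits."""
    out = bytearray()
    for b in data:
        if _is_literal(b):
            out.append(b)
        else:
            out += b"\\%02X" % b
    return bytes(out)


def qp_draft_decode(data: bytes) -> bytes:
    """Decode style #1, style #2, both special cases, and soft breaks."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if c == 0x5C:  # backslash
            pair = data[i + 1:i + 3]
            if pair == b"\r\n":                      # soft line break
                i += 3
            elif data[i + 1:i + 2] in (b"\\", b"&"):  # special case #1
                out.append(data[i + 1])
                i += 2
            elif len(pair) == 2 and all(ch in HEX_DIGITS for ch in pair):
                out.append(int(pair, 16))            # style #1
                i += 3
            else:
                raise ValueError("malformed backslash sequence at %d" % i)
        elif c == 0x26:  # ampersand: style #2, restore the high-order bit
            if i + 1 >= n:
                raise ValueError("dangling ampersand at end of input")
            out.append(data[i + 1] | 0x80)
            i += 2
        else:
            out.append(c)
            i += 1
    return bytes(out)
```

For example, qp_draft_decode(b"&A") returns the single 8 bit value 193,
and a backslash immediately followed by CR LF disappears from the
decoded output along with the line break.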

NOTES TO IMPLEMENTERS of encoding agents:  for maximum portability
across MTA's, it is recommended that any long lines be represented using
"soft" line breaks which are inserted before any line reaches the 80th
character.  It is also recommended that trailing white space (white
space at the end of a line) not be relied upon, as some MTA's freely
delete such trailing white space.  (Such a line may be represented, if
necessary, using the above rules, by appending a backslash to the end of
the line, and following it with a blank line.)  It is also recommended
that the persistence of character codes less than 32 should not be
relied on, particularly the TAB, CR, and LF characters.  Where such
characters would be required for representation in style #2, it is
recommended that style #1 be used.  

NOTE ABOUT CR AND LF in encoded messages:  The use of CR or LF
characters that are not part of a CR/LF sequence is NOT PERMITTED in
messages that use the Quoted-Printable encoding.  (Their presence is not
an issue for the other encodings.)  Sequences such as CR LF LF are also
invalid; the correct sequence is CR LF CR LF.  The effect in an encoded
message of a CR without a following LF, or an LF without a preceding CR,
is undefined.  Although RFC-822 defines these as ordinary characters
when used outside of the CR/LF sequence, some implementations treat one
(or both) as equivalent to newline or as error characters that are
discarded.  Messages which contain embedded bare CR or LF characters
should use encoding style #1 to encode these characters "safely". 
(Discussion: Some environments use a bare CR or bare LF as the local
newline convention.  If a message contains embedded bare CR or LF
characters, it is impossible to transform it from Internet to local
conventions without interfering with this local convention.)

Since the hyphen character ("-") is represented as itself in the
Quoted-Printable encoding, care must be taken, when encapsulating a
quoted-printable encoded message in a multipart message, to ensure that
the encapsulation boundary does not appear anywhere in the message.  See
the definition of multipart messages, later in this document.

3.2     Hexadecimal Content-Encoding

The Hexadecimal Content-Encoding is intended to represent arbitrary data
that is not humanly-readable in a printable 7-bit form that can be
passed through 7 bit mail transport agents.  It transforms a byte stream
into a series of two-digit hexadecimal values.  Thus, the sequence of
the five 8-bit values "ABC control-L newline" would be represented by
"4142430C0A".  Since newlines are themselves encoded as 0A, non-data
newlines may be scattered freely to break the stream into multiple
lines.  In fact, it is recommended that newlines be included at least
every 60 characters (30 encoded bytes).  Such newlines will be
discarded by the decoder.

The hexadecimal encoding is a simple way to represent arbitrary 8 bit
data in 7 bit mail, but not a very efficient one, as it doubles the size
of the data.  The Base64 encoding, to be described below, is a
reasonably simple alternative that only increases the size of the data
by 33 percent.  The hexadecimal encoding is permitted explicitly because
there are widespread utilities for converting binary files to
hexadecimal.

Since the hyphen character ("-") is not used in hexadecimal encodings,
there is no need to worry about quoting apparent encapsulation
boundaries within hexadecimal-encoded body parts.

When encoding a bit stream via the hexadecimal encoding, the bit stream
should be presumed to be ordered with the most-significant-bit first. 
That is, the first bit in the stream will be the high-order bit in the
first byte, and the eighth bit will be the low-order bit in the first
byte, and so on.

The Hexadecimal alphabet is defined as  "0123456789ABCDEF".  Upper case
letters A-F should be used by encoders, though it is acceptable if a
decoder ignores case.
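
A minimal Python sketch of the hexadecimal encoding described in this
section follows.  The names hex_encode and hex_decode are hypothetical,
and the use of CR LF as the inserted line break is an assumption of the
sketch.  The encoder emits upper case digits and breaks the output
every 60 characters; the decoder discards line breaks and ignores case.

```python
HEX_ALPHABET = "0123456789ABCDEF"


def hex_encode(data: bytes, width: int = 60) -> str:
    """Encode bytes as upper case hex, breaking lines every `width` chars."""
    digits = "".join(
        HEX_ALPHABET[b >> 4] + HEX_ALPHABET[b & 0x0F]  # high nibble first
        for b in data
    )
    lines = [digits[i:i + width] for i in range(0, len(digits), width)]
    return "\r\n".join(lines)


def hex_decode(text: str) -> bytes:
    """Decode hex text, discarding the non-data line breaks."""
    digits = text.replace("\r", "").replace("\n", "")
    return bytes.fromhex(digits)  # accepts upper and lower case digits
```

Applied to the five 8-bit values "ABC control-L newline", hex_encode
produces "4142430C0A", matching the example in the text.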

3.3     Base64 Content-Encoding

The Base64 Content-Encoding is designed to represent arbitrary 8 bit
data in a form that is not humanly readable.  The encoding and decoding
algorithms are simple, but the encoded data is only about 33 percent
larger than the unencoded data.  This encoding is also used in Privacy
Enhanced Mail applications; it is described in RFC 1113. The ability in
RFC 1113 to imbed clear text within such an encoding is not allowed in
this context, however. The following description of the encoding is
adapted from RFC 1113; apart from the exclusion of the "*" mechanism for
imbedded clear text there are no significant technical changes.

A 64-character subset of International Alphabet IA5 is used, enabling 6
bits to be represented per printable character.  (The proposed subset of
characters is represented identically in IA5 and ASCII.) One additional
character, "=", is used to signify special processing functions.  The
character "=" is used for padding within the printable encoding
procedure. The encoding function's output is delimited into text lines
(using local conventions), with each line except the last containing
exactly 64 printable characters and the final line containing 64 or
fewer printable characters.  (This line length is easily printable and
is guaranteed to satisfy SMTP's 1000 character transmitted line length
limit.)

The encoding process represents 24-bit groups of input bits as output
strings of 4 encoded characters. Proceeding from left to right, a
24-bit input group is formed by concatenating 3 8-bit input groups; this
is then treated as 4 concatenated 6-bit groups.  When encoding a bit
stream via the base64 encoding, the bit stream should be presumed to be
ordered with the most-significant-bit first.  That is, the first bit in
the stream will be the high-order bit in the first byte, and the eighth
bit will be the low-order bit in the first byte, and so on.

Each 6-bit group is used as an index into an array of 64 printable
characters. The character referenced by the index is placed in the
output string. These characters, identified in Table 1 below, are
selected so as to be universally representable, and the set excludes
characters with particular significance to SMTP (e.g., ".", "<CR>",
"<LF>").

                                 Table 1

   Value Encoding  Value Encoding  Value Encoding  Value Encoding
       0 A            17 R            34 i            51 z
       1 B            18 S            35 j            52 0
       2 C            19 T            36 k            53 1
       3 D            20 U            37 l            54 2
       4 E            21 V            38 m            55 3
       5 F            22 W            39 n            56 4
       6 G            23 X            40 o            57 5
       7 H            24 Y            41 p            58 6
       8 I            25 Z            42 q            59 7
       9 J            26 a            43 r            60 8
      10 K            27 b            44 s            61 9
      11 L            28 c            45 t            62 +
      12 M            29 d            46 u            63 /
      13 N            30 e            47 v
      14 O            31 f            48 w         (pad) =
      15 P            32 g            49 x
      16 Q            33 h            50 y

Special processing is performed if fewer than 24 bits are available
at the end of a message or encapsulated part of a message.  A full
encoding quantum is always completed at the end of a message. When fewer
than 24 input bits are available in an input group, zero bits are added
(on the right) to form an integral number of 6-bit groups.  Output
character positions which are not required to represent actual input
data are set to the character "=".  Since all canonically encoded output
is an integral number of octets, only the following cases can arise: (1)
the final quantum of encoding input is an integral multiple of 24 bits;
here, the final unit of encoded output will be an integral multiple of 4
characters with no "=" padding, (2) the final quantum of encoding input
is exactly 8 bits; here, the final unit of encoded output will be two
characters followed by two "=" padding characters, or (3) the final
quantum of encoding input is exactly 16 bits; here, the final unit of
encoded output will be three characters followed by one "=" padding
character.
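
The grouping, table lookup, and padding rules above can be sketched in
Python as follows.  The name b64_encode is hypothetical, and a bare
newline stands in for the "local conventions" used to delimit output
lines.  The alphabet string is simply Table 1 read in value order.

```python
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_encode(data: bytes, width: int = 64) -> str:
    """Encode bytes per Table 1, padding with "=" and wrapping at `width`."""
    quads = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Form a 24-bit group, adding zero bits on the right if the final
        # group holds only 8 or 16 input bits.
        group = int.from_bytes(chunk + bytes(3 - len(chunk)), "big")
        # Treat the group as four 6-bit indexes, most significant first.
        chars = [B64_ALPHABET[(group >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 input byte -> 2 characters + "==", 2 bytes -> 3 characters + "=".
        keep = len(chunk) + 1
        quads.append("".join(chars[:keep]) + "=" * (4 - keep))
    encoded = "".join(quads)
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))
```

The three padding cases can be seen with short inputs: b64_encode(b"A")
gives "QQ==", b64_encode(b"AB") gives "QUI=", and b64_encode(b"ABC")
gives "QUJD".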

Since the hyphen character ("-") is not used, there is no need to worry
about quoting apparent encapsulation boundaries within base64-encoded
body parts.

4       The "Multipart" Content-Type

In the case of multiple part messages, a "multipart" Content-type field
should appear in the RFC 822 message header. The message body is then
assumed to contain multiple parts separated by encapsulation boundaries.
 Each of the parts is defined, in essence, as a complete RFC 822 message
in miniature.  That is, what is found between the encapsulation
boundaries is a header area, a blank line, and a body area, in
accordance with the RFC 822 syntax for a message.  However, it should be
noted that NO header fields are actually required in these encapsulated
messages.  An encapsulation that starts with a blank line, therefore, is
a legitimate encapsulation of a message with no header fields.  In such
a case, of course, the absence of a Content-type header field implies
that the encapsulation is MAILASCII text.
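
As a rough illustration of the structure just described, the following
Python sketch splits one encapsulated part into its header fields and
body, and applies the MAILASCII default.  The name split_part is
hypothetical, and the sketch assumes CR LF line endings and header
fields that are not folded across lines.

```python
def split_part(part: str):
    """Return (fields, content_type, body) for one encapsulated part."""
    if part.startswith("\r\n"):
        # The part begins with a blank line: no header fields at all.
        header_text, body = "", part[2:]
    else:
        # The header area ends at the first blank line.
        header_text, _, body = part.partition("\r\n\r\n")

    fields = {}
    for line in header_text.split("\r\n"):
        if ":" in line:
            name, _, value = line.partition(":")
            fields[name.strip().lower()] = value.strip()

    # With no Content-type field, the part is taken to be MAILASCII text.
    content_type = fields.get("content-type", "MAILASCII")
    return fields, content_type, body
```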

Important to note is that the encapsulation boundary MUST NOT appear
inside any of the encapsulated parts.  Thus, it is crucial that the
composing agent be able to choose and specify the boundary that will
separate the parts.  This is done using the resource specification in
the Content-type header field.

The Content-type header field, as defined earlier in this document, has
two important optional fields that may follow the type name. These
fields are for a version number and a resource specification.  In the
case of the "multipart" content-type, this document defines version
numbers 1-S and 1-P; if the version number is omitted or "null", it is
to be assumed to be version 1-S.  The two versions have identical
syntax, but the "-P" is intended as a hint, to receivers, that the parts
are intended to be viewed in parallel rather than sequentially.  
Implementations that can not show the parts in parallel, or that choose
not to do so, are free to treat all multipart messages of version "1-P"
as if they were version "1-S".  However, all implementations should check
the version number, to ensure graceful behavior in the event that an
incompatible future version of multipart messages is defined later.

The resource specification, which is always required for multipart
messages, is used to specify the format of the encapsulation boundary. 
The encapsulation boundary is defined as two hyphen characters ("-",
decimal code 45) followed by the resource-specification portion of the
Content-type header field with any leading or trailing white space
removed.  (DISCUSSION:  The specification that white space be removed is
intended to eliminate the possible introduction of ambiguity caused by
the addition or deletion of white space by message transport agents. 
The hyphens are for rough compatibility with the earlier RFC 934 method
of message encapsulation, and for ease of searching for the boundaries
in some implementations.  However, it should be noted that multipart
messages are NOT completely compatible with RFC 934 encapsulations; in
particular, they do not obey RFC 934 quoting conventions for embedded
lines that begin with hyphens.)

Thus, a typical multipart content-type header field might look like this:

Content-type: multipart; 1-S; gc0p4Jq0M2Yt08jU534c0p

This indicates that the message consists of several parts, each itself
structured as an RFC 822 message, which are intended to be viewed
one-at-a-time, and that the parts are separated by the line

--gc0p4Jq0M2Yt08jU534c0p

The encapsulation boundaries must not appear within the encapsulations,
and should be no longer than 70 characters, not counting the two leading
hyphens.
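
The derivation of the boundary line can be sketched as follows.  This
is a minimal illustration, assuming the "type; version; resource"
layout shown in the example header above and assuming that no
semicolon occurs inside the resource specification.  The function name
is hypothetical, and treating an over-long boundary as an error is a
choice of this sketch, since the text only says "should".

```python
def encapsulation_boundary(content_type_value):
    """Derive the boundary line from a multipart Content-type value.

    Assumes three semicolon-separated fields: type, version, resource.
    """
    fields = content_type_value.split(";", 2)
    if len(fields) < 3 or fields[0].strip().lower() != "multipart":
        raise ValueError("not a multipart Content-type with a resource")
    # Leading and trailing white space is removed from the resource.
    resource = fields[2].strip()
    if not resource:
        raise ValueError("multipart requires a resource specification")
    # The 70-character limit does not count the two leading hyphens.
    if len(resource) > 70:
        raise ValueError("boundary is longer than 70 characters")
    return "--" + resource
```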

It should be noted that no interpretation is specified for any lines
preceding the first encapsulation boundary or following the last one. 
In general, these "prefix" and "postfix" areas of multipart messages
should be regarded as comments, and implementations are free to discard
them.  However, it is recommended that composing agents use the prefix
area to include a short textual message, in MAILASCII, explaining that
what follows is an encapsulated multipart message, intended to be
interpreted by software rather than by human eyes.  This message is for
the benefit of people who might read the message with older user agents
that do not properly interpret multipart messages.

The use of "Content-Type: Multipart" as a message part within another
"Content-Type: Multipart" is explicitly allowed.  In such cases, care
must be taken to ensure that each nested multipart message uses a
different boundary delimiter; otherwise a boundary of the inner message
would be mistaken for one of the outer message.  See the example in
the following section.

Overall, the body of a multipart message may be specified as follows:

body := delimiter 1*encapsulation

encapsulation := message CRLF delimiter

delimiter := "--" <delimiter from Content-type resource> CRLF

message := <as defined in RFC 822, with all header fields
           optional, containing no lines matching "delimiter">
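
A composing agent might assemble a body according to this grammar
roughly as follows.  This is only a sketch: the function name is
hypothetical, each part is assumed to be an already-formatted message
(header area, blank line, body) using CRLF line endings, and the
prefix and postfix comment areas are omitted.

```python
def compose_multipart_body(parts, resource):
    """Build a multipart body from already-formatted message parts.

    `parts` is a non-empty list of strings; `resource` is the boundary
    text without the two leading hyphens.
    """
    if not parts:
        raise ValueError("the grammar requires at least one encapsulation")
    delimiter = "--" + resource
    for part in parts:
        # No line of an encapsulated message may match the delimiter.
        if delimiter in part.split("\r\n"):
            raise ValueError("boundary appears inside a part; choose another")
    # body := delimiter 1*encapsulation
    body = delimiter + "\r\n"
    for part in parts:
        # encapsulation := message CRLF delimiter
        body += part + "\r\n" + delimiter + "\r\n"
    return body
```

Rejecting a colliding boundary, rather than quoting the offending
line, reflects the rule that the boundary must not appear inside any
encapsulated part.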

5       A Complex Multipart Example

What follows is the outline of a complex multipart message.  This
message has three parts to be displayed serially:  an introductory plain
text (MAILASCII) part, an embedded multipart message, and a closing
"rich text" part in SGML, which includes additional header fields to
indicate that it originally came from a different sender.  The embedded
multipart message has two parts to be displayed in parallel, a picture
and an audio fragment.

    From: ...
    Subject: ...
    Content-type: multipart; 1-s; tweedledum

    This is a multipart message.  
    If you are reading this text, you might want to 
    consider changing to a user agent that understands 
    how to properly display multipart messages.
    --tweedledum

    ...Introductory text goes here...  
    [Note that the preceding blank line means 
    no header fields were given and this is MAILASCII.]
    --tweedledum
    Content-type: multipart; 1-p; tweedledee

    This is a multipart message.  
    If you are reading this text, you might want to 
    consider changing to a user agent that understands 
    how to properly display multipart messages.
    --tweedledee
    Content-type: u-law; 8000 HZ; X-NEXT
    Content-Encoding: Hexadecimal

    ... hex-encoded NeXT-format audio data goes here....
    --tweedledee
    Content-type: G3FAX
    Content-Encoding: Base64

    ... base64-encoded FAX data goes here....
    --tweedledee
    --tweedledum
    From: ...
    Subject: ...
    Content-type: SGML; null
    Content-Encoding: Quoted-printable

    ... Closing text goes here ...
    --tweedledum
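
A receiving agent could walk a nested structure like the one outlined
above with a small recursive parser.  The following Python sketch is
illustrative only.  Its function names are hypothetical, and it
assumes LF-separated text without the indentation used in the outline,
no folded header lines, and the "type; version; resource" layout for
the Content-type field.

```python
def split_parts(body, boundary):
    """Return the text between successive boundary lines.

    Lines before the first boundary and after the last are discarded,
    since they are to be regarded as comments.
    """
    parts, current = [], None
    for line in body.split("\n"):
        if line.rstrip() == boundary:
            if current is not None:
                parts.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    return parts


def parse_message(text):
    """Return (headers, payload) for one message or encapsulated part.

    payload is a string, or a list of parsed parts for multipart.
    """
    if text.startswith("\n"):
        # A leading blank line means no header fields were given.
        head, body = "", text[1:]
    else:
        head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.split("\n"):
        name, colon, value = line.partition(":")
        if colon:
            headers[name.strip().lower()] = value.strip()
    fields = [f.strip() for f in headers.get("content-type", "").split(";")]
    if fields[0].lower() == "multipart" and len(fields) >= 3:
        boundary = "--" + fields[2]
        return headers, [parse_message(p) for p in split_parts(body, boundary)]
    return headers, body
```

Because each nesting level uses its own boundary, the inner
"tweedledee" lines are carried along as ordinary content while the
outer message is split on "tweedledum", and are only interpreted when
the embedded multipart part is parsed in turn.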

6       The Encoded-Variable Header Field

A particularly thorny problem, not addressed by the Content-Encoding
header field specified earlier in this memo, is the problem of including
data other than MAILASCII in a message header.  

It is tempting, to many, to simply declare that such inclusion is too
problematic, and that message headers should always be entirely
MAILASCII.  After all, most of the information in the header is not
intended for human consumption anyway.  However, there are certain parts
of the header that are intended entirely for human viewing, and these
are the parts where MAILASCII is deemed most unsatisfactory.  In
particular, there is widespread desire to have the contents of the
Subject field and the names of message senders and recipients appear in
languages that cannot be represented in MAILASCII.

The heart of the problem is the fact that RFC822 prescribes a great deal
of syntax and semantics for the message header area, all of it based on
MAILASCII.  Tampering with this, it would seem, could introduce a great
deal of complexity, as well as bugs involving backward compatibility.

Instead, this memo proposes a mechanism by which the header area remains
entirely MAILASCII, but encodes non-MAILASCII information in a manner
from which it can easily be restored by conforming user agents.

The basic idea is that, in certain parts of the headers which are never
machine-interpreted, the human-readable data might best be represented
in a content-type other than MAILASCII.  In such cases, the data are to
be represented, in the header field, by a "variable reference" -- a
placeholder for a value defined elsewhere in the message header area. 
The variables are defined by one or more "Encoded-Variable" headers,
with a syntax as specified below.

Thus, for example, if a user's name includes characters that cannot be
represented in MAILASCII,  it can be replaced by the name of a variable
that is defined elsewhere.  To improve readability by UA's that only
handle MAILASCII, it is recommended that the variable name itself be as
close an approximation as possible to the correct name.  Thus, for
example, one might have:

From: $Keld_JXrn_Simonsen <keld@dkuug.dk>
Encoded-Variable: Keld_JXrn_Simonsen = quoted-printable, iso646, 
        Keld_J&0Crn_Simonsen

*** NOTE:  It would be nice to get the character set & hex code right
for the above example.

Where multiple variables need to be defined, multiple Encoded-Variable
header fields may be used.

It is important to constrain the use of encoded-variables to places
where they will not interfere with the established syntax or semantics
of header fields.  For that reason, their use is explicitly restricted
to the Subject and Comments header fields, and to the "phrase" portion
of RFC 822 addresses.  This implies a small redefinition of RFC 822's
"optional-field", "mailbox", and "group" syntax:

optional-field =
                 /  "Message-ID"        ":"   msg-id
                 /  "Resent-Message-ID" ":"   msg-id
                 /  "In-Reply-To"       ":"  *(phrase / msg-id)
                 /  "References"        ":"  *(phrase / msg-id)
                 /  "Keywords"          ":"  #phrase
                 /  "Subject"           ":"  var-text
                 /  "Comments"          ":"  var-text
                 /  "Encrypted"         ":" 1#2word
                 /  extension-field           ; To be defined
                 /  user-defined-field        ; May be pre-empted

mailbox     =  addr-spec                    ; simple address
                 /  var-phrase route-addr   ; name & addr-spec

group       =  var-phrase ":" [#mailbox] ";"

The two new syntactic entities, "var-text" and "var-phrase", are defined
as follows:

var-text =  *text / var-ref

var-phrase =  phrase / var-ref

var-ref =  "$" var-name

var-name = atom

NOTE that the definition of "atom" permits underscores, but not spaces
or any other "specials" as defined by RFC 822.  Note also that this does
not actually change the legal syntax defined by RFC 822, because a
"var-ref" is itself a valid instance of "phrase" or "*text".  Thus, no
correct existing parsers should be broken by the new definitions. 
However, the old parsers will not recognize a difference between a
var-ref and any other instance of *text or phrase, and will therefore
not do any variable substitution.

The syntax of the Encoded-Variable field is defined as follows:

Encoded-variable = var-name "=" Content-Encoding 
                   "," Content-Type "," var-contents

var-contents = *text

Here the var-contents is the encoded value of the variable, of a type
given by Content-Type and encoded with the encoding given in
Content-Encoding.  Both a Content-Type and a Content-Encoding are
required for each Encoded-Variable header field.
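
The parsing and substitution steps might be sketched in Python as
follows.  This is a hedged illustration rather than a reference
implementation: the function names are hypothetical, the field value
is assumed to be already unfolded, and the Content-Encoding and
Content-Type values are assumed to contain no "=" or ",".  Decoding of
var-contents is deliberately left to the caller, so the `variables`
mapping below is assumed to hold already-decoded display strings.

```python
import re

# "$" followed by an RFC 822 atom: no specials, spaces, or controls.
VAR_REF = re.compile(r'\$([^\s()<>@,;:\\".\[\]]+)')


def parse_encoded_variable(value):
    """Parse the value of one Encoded-Variable header field.

    Returns (var_name, content_encoding, content_type, var_contents).
    """
    name, eq, rest = value.partition("=")
    if not eq:
        raise ValueError("missing '=' after the variable name")
    pieces = rest.split(",", 2)
    if len(pieces) != 3:
        raise ValueError("encoding, type, and contents are all required")
    encoding, ctype, contents = (p.strip() for p in pieces)
    return name.strip(), encoding, ctype, contents


def substitute(field_text, variables):
    """Replace each $name whose name is defined in `variables`.

    Undefined references are left untouched, as an old parser would
    leave them.
    """
    def replace(match):
        return variables.get(match.group(1), match.group(0))
    return VAR_REF.sub(replace, field_text)
```

A caller would apply substitute() only to the Subject and Comments
fields and to the phrase portion of addresses, in keeping with the
restriction stated above.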

7       Cross-references Between Encapsulated Parts

Within a multipart message, as defined above, there is essentially no
cross-encapsulation structure.  However, multimedia mail systems such as
Andrew [REF-ATK] have demonstrated the value of inter-part reference. 
All that is necessary, in order to make a multipart scheme work, is a
mechanism to allow one encapsulated part to make reference to another. 
Some have proposed the use of a new "Content-Label" header field within
the encapsulated parts, in order to give each part a name by which it
can be referenced.  However, this is not necessary, as the established
Message-ID header field can in fact be used for precisely this purpose. 
Each encapsulated part can include a Message-ID header field, which can
then be used for reference purposes by related body parts.

8       Optional Content-size Header Field

In the discussions of earlier drafts of this memo, some people indicated
a strong preference for using a size-counting scheme to delimit the
boundaries between encapsulated parts of multipart messages.  This was
rejected because such schemes are not, in general, sufficiently robust
across the SMTP transport layer.  For example, line counts can be
altered by line-wrapping MTA's, and byte counts can be altered in any
number of ways.  However, there are restricted environments in which
either or both of these counts can be relied upon, and in such
environments it may be desirable to implement a count-based approach to
delimiters.  Therefore this memo specifies a conventional way to do
this, in order to promote interoperability among systems that are able
to take this approach.

In such cases, boundary delimiters, as defined above, are still
required.  However, the header area of an encapsulated part may include
an optional Content-Size header which indicates where the encapsulated
part ends, if its size has not been altered.  The size may be measured
in either bytes or lines.  Those who use the Content-Size header field
should still preserve the encapsulation boundaries, and should recognize
that other agents are free to ignore it in favor of complete reliance on
encapsulation boundaries.

The Content-Size header field is defined as follows:

Content-Size = 1*DIGIT "lines"
        / 1*DIGIT "bytes"

Note that each encapsulated part should still end with a newline
followed by an encapsulation boundary.  However, a message store that
wishes, for example, to use a storage format that is largely RFC
822-compliant, but includes binary storage of binary objects, can use
the Content-Size header field to indicate whether or not the final
newline is to be interpreted as part of the binary object.  If the
newline follows the number of bytes specified for the encapsulation,
then it is not part of the encapsulation.

The size given by the Content-Size header field is the size of the
encapsulation's body only, not counting the blank line that separates
the header from the body.  In other words, the four bytes CRLF CRLF,
which separate header from body, are NOT counted as part of the
content-size.
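
A store that relies on Content-Size might verify it along these lines.
The sketch below is an assumption-laden illustration: the function
names are hypothetical, the part is assumed to be held as bytes with
CRLF line endings, and a "line" is assumed to mean a CRLF-separated
segment of the body, which this memo does not define precisely.

```python
import re


def parse_content_size(value):
    """Parse a Content-Size value into (count, unit)."""
    match = re.fullmatch(r"\s*(\d+)\s*(lines|bytes)\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError("malformed Content-Size: %r" % value)
    return int(match.group(1)), match.group(2).lower()


def size_is_intact(part, content_size):
    """Check the body of an encapsulated part against its declared size.

    The blank line separating header from body is not counted.
    """
    count, unit = parse_content_size(content_size)
    if part.startswith(b"\r\n"):
        # No header fields: the body starts right after the blank line.
        body = part[2:]
    else:
        _, sep, body = part.partition(b"\r\n\r\n")
        if not sep:
            body = b""
    if unit == "bytes":
        return len(body) == count
    # Assumption: count CRLF-separated segments as lines.
    lines = len(body.split(b"\r\n")) if body else 0
    return lines == count
```

When the check fails, an agent can fall back on the encapsulation
boundaries, which remain required in any case.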

9       Summary

Using the Content-Type and Content-Encoding header fields, it is
possible to include, in a standardized way, arbitrary types of data
objects in RFC 822 mail messages, without breaking any of the existing
restrictions imposed by RFC 821 and RFC 822.  Using the "multipart"
content-type, it is possible to mix multiple objects of different types
in a single message.  The additional optional header field, Content-Size,
provides a conventional mechanism for an extension deemed desirable by
many implementors.  Finally, a limited mechanism is provided for
including non-MAILASCII data in certain RFC 822 header fields.

For more information, the authors of this document may be contacted via
Internet mail:

             Nathaniel Borenstein <nsb@thumper.bellcore.com>
                  Ned Freed <ned@hmcvax.claremont.edu>

10      Acknowledgements

This RFC is the result of the collective effort of a large number of
people, at several IETF meetings and on the IETF-SMTP and IETF-822
mailing lists.  Although any enumeration seems doomed to suffer from
egregious omissions, the following are among the many contributors to
this effort:  Harald Alvestrand, Kevin Carosso, Mark Crispin, Dave
Crocker, Walt Daniels, Kevin Donnelly, Johnny Eriksson, Craig Everhart,
Bruce Howard, Risto Kankkunen, Neil Katin, Steve Kille, Anders Klemets,
John Klensin, Vincent Lau, Timo Lehtinen, Rick McGowan, Mark Needleman,
John Noerenberg, David Robinson, Jonathan Rosenberg, Jan Rynning, Mark
Sherman, Keld Simonsen, Einar Stefferud, Michael Stein, Robert Ullman,
Stuart Vance,  Erik van der Poel, Greg Vaudreuil, Brian Wideen, Glenn
Wright, and David Zimmerman.  The authors apologize for any omissions
from this list, which were certainly unintentional.

11      References

[REF-PS]  Adobe Systems, Inc.  Postscript Language Reference Manual. 
Addison-Wesley, Reading, Mass., 1985.

[REF-SGML]  ISO TC97/SC18.  Standard Generalized Markup Language. Tech.
Rept. DIS 8879, ISO, 1986.

[REF-TEX]  Knuth, Donald E.  The TEXbook.  Addison-Wesley, Reading,
Mass., 1984.

[REF-TROFF]  Ossanna, Joseph F. NROFF/TROFF User's Manual.  Bell
Laboratories, Murray Hill, New Jersey, 1976.  Computing Science
Technical Report No.54.

[REF-SCRIBE]  Unilogic.  SCRIBE Document Production Software.  Unilogic,
1985. Fourth Edition.

[REF-ISO646] International Standard--Information Processing--ISO 7-bit
coded  character set for information interchange, ISO 646:1983.

[REF-7BIT] International Standard--Information Processing--ISO 7-bit and
 8-bit coded character sets--Code extension techniques, ISO 2022:1986.

[REF-ANSI] Coded Character Set--7-Bit American National Standard Code
for  Information Interchange, ANSI X3.4-1986.

[REF-X400]  Schicker, Pietro, "Message Handling Systems, X.400", Message
Handling Systems and Distributed Applications, E. Stefferud, O-j.
Jacobsen, and P. Schicker, eds., North-Holland, 1989, pp. 3-41.

[RFC-821] Postel, J.B.  Simple Mail Transfer Protocol.  August, 1982,
Network Information Center, RFC-821. 

[RFC-822]   Crocker, D.  Standard for the format of ARPA Internet text
messages.   August, 1982, Network Information Center, RFC-822.

[RFC-934]   Rose, M.T.; Stefferud, E.A.  Proposed standard for message 
encapsulation.  January, 1985, Network Information Center, RFC-934.

[RFC-1049]  Sirbu, M.A.  Content-type header field for Internet
messages.  March, 1988, Network Information Center, RFC-1049. 

[RFC-1113]  Linn, J.  Privacy enhancement for Internet electronic mail:
Part I -  message encipherment and authentication procedures [Draft]. 
August, 1989, Network Information Center, RFC-1113.

[RFC-1148]  Kille, S.E.  Mapping between X.400(1988) / ISO 10021 and RFC
822.  March, 1990, Network Information Center, RFC-1148.

[RFC-1154]  Robinson, D.; Ullmann, R.  Encoding header field for
internet messages. April, 1990, Network Information Center, RFC-1154.

[REF-ATK] Borenstein, Nathaniel S., Multimedia Applications Development
with the Andrew Toolkit, Prentice Hall, 1990.

[REF-CCITT84c]  CCITT SG 5/VII, "Recommendations X.420," Message
Handling Systems: Interpersonal Messaging User Agent Layer, October 1984.

[REF-CCITT/ISO88b]  CCITT/ISO, "CCITT Recommendations X.420/ ISO IS
10021-7", Message Handling Systems: Interpersonal Messaging System,

[REF-ODA] **************

[REF-ULAW] ***************

[REF-ALAW] ***************

[REF-DES] ****************

[REF-PBM] ****************

Appendix A -- The Character Set for the MAILASCII Content-Type

As stated in this document, the MAILASCII content-type is based on a
series of standards and on the historical standard practice in the
Internet mail community.  However, the precise meaning of this
content-type has been the subject of some debate.  In this appendix,
therefore, we define the MAILASCII content-type.  It is our belief that
this definition corresponds with the default assumptions made for
messages without Content-type headers as defined by RFC 822.

The message body is coded in the character set of American National
Standard Code for Information Interchange, sometimes known as "7-bit
ASCII" [REF-7BIT]. This is not an arbitrary seven-bit character code,
but indicates that the message body uses character coding that uses the
exact correspondence of codes to characters specified in ASCII. 
National use variations on ISO646 [REF-ISO646] are not ASCII, and
neither an explicit "ASCII" content type, nor "MAILASCII", nor the
default (omission of a content-type) should be used when characters are
coded using them.   (Discussion: RFC821 very explicitly specifies
"ASCII", and references  an earlier version of the American National
Standard cited in [REF-ANSI].  Whether that specification, rather than a
reference to an International Standard, was done deliberately or out of
convenience or ignorance, is no longer interesting: insofar as one of
the purposes of specifying a content-type is to permit the receiver to
unambiguously determine how the sender intended the coded message to be
interpreted, assuming anything other than "strict ASCII" as the default
would risk unintentional and incompatible changes to the semantics of
messages now being transmitted.    This also implies that messages
containing characters coded according  to national variations on ISO646,
or using code-switching procedures (e.g., those of ISO2022), as well as
8-bit or multiple  octet character encodings MUST use an appropriate
content-type to be consistent with this specification.)    

Because of the restriction imposed on message bodies by RFC 822 and, in
practice, by Message Transport Agents that are more-or-less compliant
with RFC 821, implementors should be careful in several ways regarding
MAILASCII text:  

    (1) Delimiters other than CR-LF pairs may be used in the local
    representation of a message on some systems.  The persistence of
    CR-LF pairs should not be relied on.

    (2) Isolated CR and LF characters are not well tolerated in
    general; they may be lost or converted to delimiters on some
    systems, and hence should not be relied on.

    (3) TAB characters may be misinterpreted or may be automatically
    converted to variable numbers of spaces.  This is unavoidable in
    some environments, notably those not based on the ASCII
    character set. Such conversion is STRONGLY DISCOURAGED, but it
    may occur, and users of MAILASCII format should not rely on the
    persistence of TAB characters.

    (4) Lines longer than 80 characters may be wrapped in some
    environments. Line wrapping is STRONGLY DISCOURAGED, but
    unavoidable in some cases. Applications which depend on lines
    not being wrapped should use mechanisms other than unencoded
    MAILASCII bodyparts to transmit messages.

    (5)  Trailing "white space" characters (SPACE, TAB, etc.) on a
    line may be discarded by some transport agents, and hence should
    not be relied on.
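
The cautions above could be turned into a simple pre-send check, as in
the following Python sketch.  The function name and warning strings
are hypothetical, and the text is assumed to be in canonical form with
CR-LF line delimiters.  The check only reports features that may not
survive transport; it does not reject anything.

```python
def mailascii_warnings(text):
    """List features of `text` that may not survive mail transport."""
    warnings = []
    if any(ord(ch) > 127 for ch in text):
        warnings.append("character outside 7-bit ASCII")
    # After removing CR-LF pairs, any CR or LF left over is isolated.
    remainder = text.replace("\r\n", "")
    if "\r" in remainder or "\n" in remainder:
        warnings.append("isolated CR or LF")
    if "\t" in text:
        warnings.append("TAB character")
    lines = text.split("\r\n")
    if any(len(line) > 80 for line in lines):
        warnings.append("line longer than 80 characters")
    if any(line.endswith((" ", "\t")) for line in lines):
        warnings.append("trailing white space")
    return warnings
```

A composing agent might use a non-empty result as a cue to apply one
of the content-encodings rather than send the text unencoded.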

See RFC 821, RFC 822, and RFC 1113 for additional information about
canonical SMTP formats.  Authors of software which composes "MAILASCII"
in compliance with this RFC should be well-acquainted with SMTP formats.

The complete MAILASCII character set is listed below: ***** CONTROL CHARS????

 0 nul  16 dle  32 sp   48  0   64  @   80  P    96  `   112  p 
 1 soh  17 dc1  33  !   49  1   65  A   81  Q    97  a   113  q 
 2 stx  18 dc2  34  "   50  2   66  B   82  R    98  b   114  r 
 3 etx  19 dc3  35  #   51  3   67  C   83  S    99  c   115  s 
 4 eot  20 dc4  36  $   52  4   68  D   84  T   100  d   116  t 
 5 enq  21 nak  37  %   53  5   69  E   85  U   101  e   117  u 
 6 ack  22 syn  38  &   54  6   70  F   86  V   102  f   118  v 
 7 bel  23 etb  39  '   55  7   71  G   87  W   103  g   119  w 
 8 bs   24 can  40  (   56  8   72  H   88  X   104  h   120  x 
 9 ht   25 em   41  )   57  9   73  I   89  Y   105  i   121  y 
10 nl   26 sub  42  *   58  :   74  J   90  Z   106  j   122  z 
11 vt   27 esc  43  +   59  ;   75  K   91  [   107  k   123  { 
12 np   28 fs   44  ,   60  <   76  L   92  \   108  l   124  |
13 cr   29 gs   45  -   61  =   77  M   93  ]   109  m   125  } 
14 so   30 rs   46  .   62  >   78  N   94  ^   110  n   126  ~ 
15 si   31 us   47  /   63  ?   79  O   95  _   111  o   127 del
