Re: RFC 2152 - UTF-7 clarification

On Wed, Oct 07, 2015 at 11:21:49AM +0300, A. Rothman wrote:

I'd like to raise an issue with the UTF-7 decoding process as described
in the RFC with respect to trailing padding bits:

"Next, the octet stream is encoded by applying the Base64 content
transfer encoding algorithm as defined in RFC 2045, modified to
omit the "=" pad character. Instead, when encoding, zero bits are
added to pad to a Base64 character boundary. When decoding, any
bits at the end of the Modified Base64 sequence that do not
constitute a complete 16-bit Unicode character are discarded. If
such discarded bits are non-zero the sequence is ill-formed."


It seems to me that the encoder's behaviour is specified clearly
enough.  Namely, the encoder outputs an unpadded base64 encoding
of the input octet-stream that is zero padded with 0, 2 or 4 bits
(an odd padding length can't happen) to ensure that the total number
of bits is a multiple of 6, allowing each group of 6 bits to be
encoded as one base64 output character.
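
For illustration, the encoder side as described above could be sketched
in Python roughly as follows.  The function name modified_b64_encode is
hypothetical (it is not taken from the RFC), and the sketch assumes the
input is already a big-endian UTF-16 octet stream.

```python
import string

# Base64 alphabet from RFC 2045: A-Z, a-z, 0-9, '+', '/'
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def modified_b64_encode(octets: bytes) -> str:
    """Unpadded base64: zero-pad to the next 6-bit boundary, never emit '='."""
    nbits = 8 * len(octets)
    pad = -nbits % 6                      # 0, 2 or 4 zero bits for whole octets
    value = int.from_bytes(octets, "big") << pad
    nchars = (nbits + pad) // 6
    # Emit the 6-bit groups from most significant to least significant.
    return "".join(
        B64[(value >> (6 * (nchars - 1 - i))) & 0x3F] for i in range(nchars)
    )
```

With this sketch, a single U+0000 (16 bits) yields "AAA": 16 data bits
plus 2 padding bits make three 6-bit groups.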

The decoder's job is then to reverse this process.  The base64
input produces a stream of 6-bit blocks, which in total yields 16q
+ 2r bits where 0 <= r < 8.  The "q" groups of 16 bits are the
decoded text.  The "2r" extra bits must be zero and are discarded.
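
A minimal Python sketch of that decoding rule, under the reading given
above (the name modified_b64_decode is hypothetical, and the input is
assumed to be only the base64 characters between "+" and "-"):

```python
import string

# Base64 alphabet from RFC 2045: A-Z, a-z, 0-9, '+', '/'
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def modified_b64_decode(chars: str) -> list:
    """Return the 16-bit code units; raise ValueError if leftover bits are non-zero."""
    value = 0
    for c in chars:
        # str.index raises ValueError for a character outside the alphabet.
        value = (value << 6) | B64.index(c)
    nbits = 6 * len(chars)
    q, leftover = divmod(nbits, 16)       # leftover is the "2r" above: even, < 16
    if value & ((1 << leftover) - 1):
        raise ValueError("non-zero padding bits: ill-formed sequence")
    value >>= leftover                    # discard the zero padding
    return [(value >> (16 * (q - 1 - i))) & 0xFFFF for i in range(q)]
```

Under this sketch "AKM" decodes to the single code unit 0x00A3, while
"AKN" is rejected because its two leftover bits are not both zero.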

An encoder should never generate 6 <= 2r <= 14 extra bits, since
0, 2 or 4 is enough.  However, it seems that an encoder can get away
with up to 12 "extra" padding bits so long as the total count is
less than 16.

The way I understand this is that after decoding the modified-base64
data and grouping the resulting octets into 16-bit Unicode characters,
any remaining zero bits at the end (up to 15 bits, theoretically)


Well 14, due to an even bit count.

should simply be ignored.


Correct, though in practice that count should always be 0, 2 or 4.
Encoders that produce 6, 8, 10, 12 or 14 padding bits are appending
extraneous "A" output octets to the base64 stream.

I'm not sure why an encoder would want to add extra zero bits at the end
beyond the minimum necessary, but it is arguably allowed to pad 'to *a*
Base64 character boundary', not specifically *the next* boundary.


Correct.  With 6, 8 or 10 padding bits, it would have been simpler
for the encoder to save one output "A" and emit 0, 2 or 4 padding
bits.  With 12 or 14, save outputting "AA" and emit 0 or 2 padding
bits.  What the extra "A" or "AA" might allow the encoder to do is
to "round up" the base64 output to a multiple of 4 octets, which
simplifies decoding.  The only time the encoder can't do that is
when the input length in bits is 24q + 2|4|6|8 (1/3 of the time),
because this would require 16 or more padding bits.
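
The arithmetic can be checked with a short Python sketch (the helper
names are hypothetical).  It assumes the input consists of whole 16-bit
code units, in which case the only failing residue that actually arises
is 24q + 8.

```python
def quad_padding_bits(nunits: int) -> int:
    """Zero bits needed to pad nunits 16-bit units up to a multiple of 24 bits."""
    return -(16 * nunits) % 24

def can_round_up(nunits: int) -> bool:
    """True if the output can reach a multiple of 4 characters with < 16 pad bits."""
    return quad_padding_bits(nunits) < 16

for n in range(6):
    print(n, quad_padding_bits(n), can_round_up(n))
```

The padding requirement cycles through 0, 8 and 16 bits as the number
of code units grows, so one input length in three cannot be rounded up.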

I would not write an encoder that makes the base64 output an exact
multiple of 4 octets 2/3 of the time.  Too much trouble for incomplete
success, but it seems that the specification allows this.

Perhaps an encoder would use some version of a standard
Base64 routine and then replace the padding '=' characters with 'A'
characters (which are then decoded to all zero bits).


This does not work, because the total number of padding bits may
then equal or exceed 16.

Such encoding
would obviously be less space-efficient since it adds unnecessary octets
to the encoding - but it seems like there are valid reasons to do so.


It would also be wrong, because it would not be able to represent
trailing zeros correctly.
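
To make the failure concrete, here is a hedged Python sketch of such a
naive encoder.  The names naive_encode and decoded_units are
hypothetical; the sketch only uses the standard base64 module and
counts code units with the floor(6n/16) rule.

```python
import base64

def naive_encode(text: str) -> str:
    """Hypothetical naive encoder: standard base64, then '=' replaced by 'A'."""
    octets = text.encode("utf-16-be")
    return base64.b64encode(octets).decode("ascii").replace("=", "A")

def decoded_units(b64: str) -> int:
    """Number of 16-bit code units a decoder extracts from n base64 characters."""
    return (6 * len(b64)) // 16

two = naive_encode("ab")          # two code units in
three = naive_encode("ab\x00")    # three code units in, the last being U+0000
print(two, decoded_units(two))
print(three, decoded_units(three))
```

Both inputs produce the same output string, so a decoder cannot tell
"ab" from "ab" followed by U+0000: the "==" became 12 zero bits on top
of the 4 already present, which is a full 16-bit code unit.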

So, since there is such discrepancy in practice in how this is being
interpreted, I submit that the description is not clear enough to make
this unambiguous. Could someone please clarify what is officially
valid/invalid according to the RFC regarding trailing zero bits? Can we
add errata that clarifies it in either case?


Admittedly, I am just applying logic.  Perhaps the specification
is not as logical as I expect.  If it is logical, then it should
be as noted above.

Finally, if it helps, here are some concrete test cases to consider:


+A-             : empty + 6 (unnecessary) padding bits
+AA-            : empty + 12 (unnecessary) padding bits
+AAA-           : \U+0000, and 2 (required) padding bits
+AAAA-          : \U+0000, and 8 (6 extra) padding bits
+AAAAA-         : \U+0000, and 14 (12 extra) padding bits
+AAAAAA-        : \U+0000\U+0000, and 4 (required) padding bits
+AAAAAAA-       : \U+0000\U+0000, and 10 (6 extra) padding bits

Which of these are valid inputs? Which are invalid? How many 0x0000
16-bit characters should each one be decoded into?


They are all valid, because any padding bits are zero in all of
them.  They decode to floor(6n/16) == floor(3n/8) 16-bit Unicode
code units, where "n" is the length of the base64 input.
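
As a cross-check of the annotations on the test cases, a small Python
sketch (helper names hypothetical) that applies the formula to inputs
of 1 through 7 "A" characters:

```python
def unit_count(n: int) -> int:
    """16-bit units decoded from n base64 characters: floor(3n/8)."""
    return (3 * n) // 8

def padding_bits(n: int) -> int:
    """Bits left over after the complete 16-bit units are removed."""
    return 6 * n - 16 * unit_count(n)

for n in range(1, 8):
    print("+" + "A" * n + "-", unit_count(n), padding_bits(n))
```

The leftover counts come out as 6, 12, 2, 8, 14, 4 and 10 bits, matching
the list above.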

-- 
        Viktor.