Re: RFC 2152 - UTF-7 clarification


In message
<DB4PR06MB4573125043060E318DC6A30AD350(_at_)DB4PR06MB457(_dot_)eurprd06(_dot_)prod(_dot_)outlook.com>,
l(_dot_)wood(_at_)surrey(_dot_)ac(_dot_)uk writes:

The best place to raise this erratum for formal consideration would be
https://www.rfc-editor.org/errata.php

Lloyd Wood
http://about.me/lloydwood
________________________________________
From: ietf <ietf-bounces(_at_)ietf(_dot_)org> on behalf of A. Rothman
<amichai2@amichais.net>
Sent: Wednesday, 7 October 2015 7:21 PM
To: ietf(_at_)ietf(_dot_)org
Subject: RFC 2152 - UTF-7 clarification

Hi,

I hope this is the right place to discuss RFC 2152 - I couldn't find a
conclusive answer as to where and how one should comment on a specific
Request For Comments (which is a bit unsettling :-) )

I'd like to raise an issue with the UTF-7 decoding process as described
in the RFC with respect to trailing padding bits:

"Next, the octet stream is encoded by applying the Base64 content
transfer encoding algorithm as defined in RFC 2045, modified to
omit the "=" pad character. Instead, when encoding, zero bits are
added to pad to a Base64 character boundary. When decoding, any
bits at the end of the Modified Base64 sequence that do not
constitute a complete 16-bit Unicode character are discarded. If
such discarded bits are non-zero the sequence is ill-formed."

The way I understand this is that after decoding the modified-base64
data and grouping the resulting octets into 16-bit Unicode characters,
any remaining zero bits at the end (up to 15 bits, theoretically) should
simply be ignored. I'm not sure why an encoder would want to add extra
zero bits at the end beyond the minimum necessary, but it is arguably
allowed to pad 'to *a* Base64 character boundary', not specifically *the
next* boundary. Perhaps an encoder would use some version of a standard
Base64 routine and then replace the padding '=' characters with 'A'
characters (which are then decoded to all zero bits). Such encoding
would obviously be less space-efficient since it adds unnecessary octets
to the encoding - but it seems like there are valid reasons to do so.


It says the '=' pad is omitted, not replaced with 'A'.  In addition,
just replacing '=' with 'A' can add a 0x0000 to the end of a Unicode
string: two pad characters cover 12 bits, and together with the 4
spare bits from the second character of the last 4-character base64
word they make up a full 16 bits.

e.g.
        AAAAAA== 0x0000, 0x0000 (discard 4 bits)
        AAAAAAAA 0x0000, 0x0000, 0x0000
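
To make that concrete, here is a small Python sketch of the hypothetical
"replace '=' with 'A'" encoder, using the standard base64 module. The
function names are made up for illustration and this is not taken from
any real UTF-7 implementation.

```python
import base64


def naive_pad_with_a(text: str) -> str:
    """Hypothetical encoder: standard Base64 of UTF-16BE, '=' replaced by 'A'."""
    standard = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    return standard.replace("=", "A")


def utf16_units(data: bytes) -> list[int]:
    """Split octets into big-endian 16-bit units, dropping an odd trailing octet."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data) - 1, 2)]


two_nuls = "\u0000\u0000"
encoded = naive_pad_with_a(two_nuls)
print(encoded)                                  # AAAAAAAA
print(utf16_units(base64.b64decode(encoded)))   # [0, 0, 0]
```

Two characters go in and three come out, which is the extra 0x0000
shown above.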

Though I can see how you could think this was a valid strategy if
you only look at a single base64 word after encoding a single UTF-16
character.

        AAA=     0x0000 (discard 2 bits)
        AAAA     0x0000 (discard 8 bits)
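
As a rough check of the bit counts in these examples: each base64
character carries 6 bits, and whatever does not fill a 16-bit unit is
what the decoder has to discard. A minimal Python sketch (the function
name is invented here):

```python
def units_and_spare_bits(n_chars: int) -> tuple[int, int]:
    """Return (complete 16-bit units, leftover bits) for n_chars Base64 characters."""
    return divmod(6 * n_chars, 16)


for n in (3, 4, 6, 8):
    units, spare = units_and_spare_bits(n)
    print(f"{n} chars: {units} unit(s), discard {spare} bits")
# 3 chars: 1 unit(s), discard 2 bits
# 4 chars: 1 unit(s), discard 8 bits
# 6 chars: 2 unit(s), discard 4 bits
# 8 chars: 3 unit(s), discard 0 bits
```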

Now you could safely replace all the '=' pad characters with a
single 'A', but that would just be a perverse encoder, and if you
were to use such an encoder I wouldn't blame the decoder for rejecting
the input.

The issue is with the decoding though, and the reason it came up is that
I've checked various existing UTF-7 decoder implementations and
resources (e.g. iconv, uconv, icu, jutf7, jcharset, Wikipedia, etc.) and
they seem to disagree about this issue. Some generate an error after
a maximum number of trailing zero bits (the maximum changes between
implementations), and some agree with my interpretation above where any
leftover partial group of zero bits is discarded and it's always valid.

So, since there is such a discrepancy in practice in how this is being
interpreted, I submit that the description is not clear enough to make
this unambiguous. Could someone please clarify what is officially
valid/invalid according to the RFC regarding trailing zero bits? Can we
add an erratum that clarifies it in either case?

Finally, if it helps, here are some concrete test cases to consider:

+A-                   illegal (not modified base64)
+AA-                  illegal (not a multiple of 16 bits in modified base64)
+AAA-                 legal   0x0000 (last 2 bits zero)
+AAAA-                illegal (not a multiple of 16 bits in modified base64)
+AAAAA-               illegal (not modified base64)
+AAAAAA-              legal   0x0000, 0x0000 (last 4 bits zero)
+AAAAAAA-             illegal (not a multiple of 16 bits in modified base64)
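
The reasoning behind the legal/illegal annotations above can be
sketched as a strict decoder for the run between '+' and '-'. This is
only one reading of the RFC, not a reference implementation, and
decode_shifted_run and its error messages are names invented here. It
assumes the run must first be decodable as base64 into whole octets,
that the octets must pair up into 16-bit units, and that the few bits
left over must be zero.

```python
import string

# Base64 alphabet from RFC 2045; Modified Base64 only drops the '=' pad.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def decode_shifted_run(run: str) -> list[int]:
    """Strictly decode a Modified Base64 run into 16-bit code units."""
    value = 0
    for ch in run:
        idx = ALPHABET.find(ch)
        if idx < 0:
            raise ValueError("not modified base64")
        value = (value << 6) | idx
    if len(run) % 4 == 1:
        # A lone extra character carries only 6 bits, less than one octet.
        raise ValueError("not modified base64")
    n_bits = 6 * len(run)
    n_octets = n_bits // 8
    if n_octets % 2:
        raise ValueError("not a multiple of 16 bits")
    spare = n_bits - 8 * n_octets          # 0, 2 or 4 padding bits
    if value & ((1 << spare) - 1):
        raise ValueError("discarded bits are non-zero")
    value >>= spare
    n_units = n_octets // 2
    return [(value >> (16 * (n_units - 1 - i))) & 0xFFFF for i in range(n_units)]


for n in range(1, 8):
    run = "A" * n
    try:
        units = decode_shifted_run(run)
        print(f"+{run}-", "legal  ", [f"0x{u:04X}" for u in units])
    except ValueError as err:
        print(f"+{run}-", "illegal", err)
```

Under these assumptions only +AAA- and +AAAAAA- are accepted, and a
run such as +AKM- decodes to the single unit 0x00A3.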

Which of these are valid inputs? Which are invalid? How many 0x0000
16-bit characters should each one be decoded into?

Thanks!

Amichai

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka(_at_)isc(_dot_)org