
Re: RFC 2152 - UTF-7 clarification

2015-10-08 13:40:46

Just in case someone missed it (I almost did): Mark added his own
detailed comments on the test cases, but they got buried within a long
quote from my original email and so may have gone unnoticed. To recap, here
are the two interpretations:

+A-             empty + 6 (unnecessary) padding bits
+AA-            empty + 12 (unnecessary) padding bits
+AAA-           \U+0000, and 2 (required) padding bits
+AAAA-          \U+0000, and 8 (6 extra) padding bits
+AAAAA-         \U+0000, and 14 (12 extra) padding bits
+AAAAAA-        \U+0000\U+0000, and 4 (required) padding bits
+AAAAAAA-       \U+0000\U+0000, and 10 (6 extra) padding bits


+A-             illegal !modified base64
+AA-            illegal !a multiple of 16 bits in modified base64
+AAA-           legal   0x0000 (last 2 bits zero)
+AAAA-          illegal !a multiple of 16 bits in modified base64
+AAAAA-         illegal !modified base64
+AAAAAA-        legal   0x0000, 0x0000 (last 4 bits zero)
+AAAAAAA-       illegal !a multiple of 16 bits in modified base64
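
To make the bit arithmetic concrete, here is a small Python sketch
(purely illustrative, not a reference decoder) that applies both
readings to the test cases above:

    B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def analyze(case):
        body = case.strip("+-")          # base64 digits between '+' and '-'
        nbits = 6 * len(body)
        bits = 0
        for ch in body:
            bits = (bits << 6) | B64.index(ch)
        nchars, pad = divmod(nbits, 16)  # whole UTF-16 units, leftover bits
        zero_pad = (bits & ((1 << pad) - 1)) == 0
        lenient = zero_pad               # reading 1: discard any < 16 zero bits
        strict = nchars > 0 and pad < 6 and zero_pad  # reading 2: minimal pad only
        return nchars, pad, lenient, strict

    for case in ("+A-", "+AA-", "+AAA-", "+AAAA-",
                 "+AAAAA-", "+AAAAAA-", "+AAAAAAA-"):
        print(case, analyze(case))

The "lenient" flag comes out True for all seven cases, while the
"strict" flag is True only for +AAA- and +AAAAAA-, matching the two
tables.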


Does anyone else want to vote or comment on the two interpretations above?

On 10/08/2015 07:06 PM, Viktor Dukhovni wrote:
On Thu, Oct 08, 2015 at 06:22:51PM +1100, Mark Andrews wrote:

Though I can see how you could think this was a valid strategy if
you only look at a single base64 word after encoding a single utf-16
character.

     AAA=     0x0000 (discard 2 bits)
     AAAA     0x0000 (discard 8 bits)

Now you could safely replace all the '=' pad characters with a
single 'A', but that would just be a perverse encoder, and if you
were to use such an encoder I wouldn't blame the decoder for rejecting
the input.
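
For reference, the "AAA=" line is just what a stock base64 encoder
produces for the two zero bytes of U+0000, e.g. in Python:

    import base64
    print(base64.b64encode("\u0000".encode("utf-16-be")))   # b'AAA='

The "AAAA" form, with 8 bits to discard, is what a non-minimal
encoder might emit instead.
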
I don't read Mark's response as saying that non-minimal padding is
*invalid*.  He says the encoder is "perverse", and I agree that
the encoder would be better off not generating excess padding.  

He further says that he would not be surprised if some decoders
rejected non-minimally padded input, and frankly I would also not
be surprised, but that does not make the input invalid.  The
specification says that up to 14 (< 16) bits of zero padding are to
be discarded by decoders; it does not limit the discard bit count
to 4 (< 6).

There are lots of lazy and fragile implementations of standards
out there; encoders need to try to avoid generating non-mainstream
outputs if they want most decoders to handle the result.

On Thu, Oct 08, 2015 at 02:21:36PM +0300, A. Rothman wrote:

Everything else still stands. Specifically, the two replies beautifully
illustrate my point about ambiguity: in their interpretations of the
actual test cases I submitted, one says that all inputs are valid, and
the other says some of them are invalid. That's exactly the problem I
saw when comparing libraries.
Perhaps Mark really does consider 8 to 14 bits of padding as
"invalid" (not just "perverse").  If so, then indeed the specification
is open to multiple interpretations.  As I see it, so far Mark and I
are on the same page.

As a starting point, my suggestion would be that an encoder SHOULD add
the minimal amount of padding necessary (which is likely what encoders
already do), while a decoder MUST accept and discard any amount of zero
padding (less than 16 bits, of course).  That is in line with being more
lenient on inputs, and it simplifies the decoder by removing an extra
check and its documentation, in keeping with KISS.  It would also be nice
to add one of the test cases to the errata, to clarify the expected result.
The only thing "missing" from the specification is advice (or a
requirement) to make the padding "minimal".  That is, to pad only
to the *nearest* base64 boundary (i.e. a multiple of 6 bits).
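
A rough sketch of what "minimal" padding means in code (this only
builds the shifted sequence for a run of characters to be encoded;
it is not a complete UTF-7 encoder and ignores which characters need
shifting at all):

    B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def shifted_sequence(text):
        data = text.encode("utf-16-be")   # UTF-16 code units, big-endian
        nbits = 8 * len(data)
        pad = (-nbits) % 6                # zero bits to the next 6-bit boundary
        bits = int.from_bytes(data, "big") << pad
        nbits += pad
        digits = (B64[(bits >> (nbits - 6 * (i + 1))) & 0x3F]
                  for i in range(nbits // 6))
        return "+" + "".join(digits) + "-"

    # shifted_sequence("\u0000") -> "+AAA-"  (2 padding bits, the minimum)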