On Thu, 25 Mar 2021, Tony Finch wrote:
> Stuart Cheshire and Mary Baker have a super elegant binary escaping scheme
> called "consistent overhead byte stuffing" (COBS) that has a guaranteed
> overhead of less than 1%.
That's quite clever, although it only avoids a single value and it's not
immediately obvious how to extend it to multiple values. Applying it
iteratively for different values wouldn't work.
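For reference, a minimal Python sketch of COBS as described in the Cheshire/Baker paper, eliminating the single value 0x00:

```python
def cobs_encode(data: bytes) -> bytes:
    """COBS-encode 'data' so the output contains no 0x00 bytes.

    Each code byte gives the distance to the next zero (which is then
    implied rather than stored); a code byte of 0xFF means 254 non-zero
    bytes follow with no implied zero.  Worst-case overhead is one byte
    per 254 bytes of input, i.e. under 0.4%.
    """
    out = bytearray()
    code_idx = len(out)
    out.append(0)            # placeholder for the first code byte
    code = 1
    for b in data:
        if b == 0:
            out[code_idx] = code
            code_idx = len(out)
            out.append(0)    # placeholder for next code byte
            code = 1
        else:
            out.append(b)
            code += 1
            if code == 0xFF: # 254 non-zero bytes: flush, no implied zero
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code
    return bytes(out)

def cobs_decode(data: bytes) -> bytes:
    """Invert cobs_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = data[i]
        out += data[i + 1:i + code]
        i += code
        if code < 0xFF and i < len(data):
            out.append(0)    # the implied zero
    return bytes(out)
```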
If you really think this is a problem worth solving, the simplest way to do it
is to encrypt the data using a fast encryption scheme and a fixed key. Encryption
pretty much guarantees that there won't be too many of any particular byte value.
For a key I suggest the SHA-256 hash of "This may be a little overengineered."
If you're paranoid about some sort of chosen-plaintext attack you can
generate a random IV and put it at the start of the data.
As for the encryption scheme, given the hardware assists and numerous excellent
libraries that take advantage of it, AES-256 in OFB mode is probably a good
choice. (OFB eliminates the need to do anything special at the end of the data.)
However, if support on really low-end hardware is important, Speck would be a
reasonable alternative.
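To make the idea concrete, here is a stdlib-only Python sketch. SHA-256 stands in for the AES block transform so the example has no dependencies; a real deployment would use AES-256-OFB from a crypto library. The goal here is only a flat byte distribution, not confidentiality:

```python
import hashlib

def ofb_keystream_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """XOR 'data' with an OFB-style keystream derived from key and IV.

    SHA-256 plays the role of the block cipher purely for illustration;
    substitute AES for real use.  As with any OFB-style stream, the same
    call both encrypts and decrypts.
    """
    out = bytearray()
    block = iv
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + block).digest()  # next 32 keystream bytes
        out += bytes(c ^ k for c, k in zip(data[i:i + 32], block))
    return bytes(out)

# The fixed key suggested in the text: SHA-256 of the quoted sentence.
KEY = hashlib.sha256(b"This may be a little overengineered.").digest()

plaintext = b"\r\n.\r\n" * 1000   # pathological input: a few octets repeated
ciphertext = ofb_keystream_xor(KEY, b"\x00" * 32, plaintext)
# The ciphertext's byte values come out close to uniformly distributed,
# so no single octet value dominates the encoded output.
```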
The only potential problem I see is that using encryption completely trashes your
ability to apply compression to the encoded part and get anything useful. But
given the use of compression in most media formats plus the efficiency of the
resulting encoding, it isn't clear to me that there's reason to care.
But supposing you do care, or supposing that you're one of those people who
insisted on running their CSMA network way under capacity out of fear of
collision storms, or who, after generating their maximum-length MIME boundary
out of 256 bits of random data obtained through observation of radioactive decay,
still insist on checking for collisions, there are deterministic algorithms
that will work.
Let's first note that simple counting arguments suffice to demonstrate
that no input can defeat a combination of swapping and stuffing unless the set of
octets that have to be avoided is large - certainly much larger than
the set of 4-6 we're contemplating here.
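The counting argument can be made concrete: the k+1 least-frequent octet values in a block of N octets (one quote octet plus k swap partners) can together appear at most (k+1)*N/256 times, since the average of the k+1 smallest counts can't exceed the overall average N/256. A quick check:

```python
import os
from collections import Counter

def rare_octet_bound_holds(block: bytes, k: int) -> bool:
    """The k+1 least-frequent octet values (quote octet plus k swap
    partners) together appear at most (k+1) * len(block) / 256 times,
    so the stuffing overhead is bounded by about (k+1)/256 per block."""
    counts = Counter(block)
    smallest = sorted(counts.get(v, 0) for v in range(256))[:k + 1]
    return sum(smallest) <= (k + 1) * len(block) / 256

assert rare_octet_bound_holds(os.urandom(4096), 6)     # typical input
assert rare_octet_bound_holds(b"\x2e" * 4096, 6)       # worst case: one value
```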
So the goal is to find a set of swaps and a quoting character that will
deal with the pathological cases. This is easy: Pick a block size, read
in a (possibly partial) block, and then count the number of times each
octet appears. (The obvious way to do this is with a 256 element array.)
Now scan the array for values with the lowest counts, omitting ones
that are associated with octets you can't use. You'll need one
for the quoting octet plus one for each octet you can't use.
The octet that shows up the least is your quoting octet. The others you pair up
with the octets you can't use, say by associating the least used one with the
most used unusable octet. Encoding is then done by first writing the
quoting/pairing/blocking information (format left as an exercise for the
reader), then taking each octet from the block, swapping it if it's one that
needs swapping, quoting it if it's now one that needs quoting, and outputting the result.
I think that does it.
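A sketch of that per-block procedure in Python; the header layout here (quote octet first, then one partner per forbidden octet in ascending order of the forbidden octets) is just one possible answer to the "exercise for the reader" part:

```python
from collections import Counter

def choose_table(block: bytes, forbidden: set):
    """Pick the quote octet and one swap partner per forbidden octet,
    taking the least-frequent usable octet values in this block."""
    counts = Counter(block)
    usable = sorted((v for v in range(256) if v not in forbidden),
                    key=lambda v: counts.get(v, 0))
    quote = usable[0]
    # Pair the rarest partners with the most-used forbidden octets.
    by_freq = sorted(forbidden, key=lambda v: counts.get(v, 0), reverse=True)
    partners = dict(zip(by_freq, usable[1:1 + len(forbidden)]))
    return quote, partners

def encode_block(block: bytes, forbidden: set) -> bytes:
    quote, partners = choose_table(block, forbidden)
    specials = {quote} | set(partners.values())
    # Hypothetical header: quote octet, then the partner for each
    # forbidden octet, listed in ascending order of the forbidden octets.
    out = bytearray([quote] + [partners[f] for f in sorted(forbidden)])
    for b in block:
        if b in partners:
            out.append(partners[b])    # forbidden octet -> its partner
        elif b in specials:
            out += bytes([quote, b])   # literal partner/quote: escape it
        else:
            out.append(b)
    return bytes(out)

def decode_block(data: bytes, forbidden: set) -> bytes:
    quote = data[0]
    partners = dict(zip(data[1:1 + len(forbidden)], sorted(forbidden)))
    out, i = bytearray(), 1 + len(forbidden)
    while i < len(data):
        b = data[i]
        if b == quote:
            out.append(data[i + 1])    # escaped literal follows the quote
            i += 2
        elif b in partners:
            out.append(partners[b])    # lone partner stands for its forbidden octet
            i += 1
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Note that the swap itself costs nothing; the only overhead is one extra byte per escaped occurrence of the quote octet or a partner, which the frequency counting keeps rare.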
ietf-smtp mailing list