On Mon, Mar 13, 2017 at 09:14:16AM +0100, Julian Reschke wrote:
> So the changes in RFC 7159 allow top-level strings, so we can't rely on the
> first *two* characters being US-ASCII. But we *can* rely on the first one
> being US-ASCII, no?
Correct.
If one OR two bytes of the first four are NULs, then the encoding is
UTF-16 (or something else or invalid):
> So the following should still be correct:
>
>    Since the first character of a JSON text will always be an ASCII
>    character [RFC0020], it is possible to determine whether an octet
>    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>    at the pattern of nulls in the first four octets.
>
>            00 00 00 xx  UTF-32BE
>            00 xx xx xx  UTF-16BE
>            xx 00 00 00  UTF-32LE
>            xx 00 xx xx  UTF-16LE
>            xx xx xx xx  UTF-8
Count the number of NULs in the first four bytes:
- if zero -> UTF-8
- if one or two -> UTF-16
- if three -> UTF-32
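
A minimal Python sketch of that counting rule, for illustration only. The
function name guess_json_encoding is hypothetical. It assumes there is no BOM
and that the first character is ASCII. Endianness is taken from whether the
first octet is a NUL, as in the table. The return values are Python codec
names. It does not validate the input, so a pattern that fits none of the
cases (the "something else or invalid" case) may still get a guess that
fails on decode.

```python
def guess_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from the NULs in its first four octets.

    Assumptions (not checked here): no BOM, first character is ASCII.
    """
    head = data[:4]
    nuls = head.count(0)  # number of NUL octets among the first four

    if nuls == 0:
        return "utf-8"
    if nuls in (1, 2):
        # An ASCII first character is 00 xx in UTF-16BE and xx 00 in UTF-16LE.
        return "utf-16-be" if head[0] == 0 else "utf-16-le"
    if nuls == 3:
        # An ASCII first character is 00 00 00 xx in UTF-32BE, xx 00 00 00 in LE.
        return "utf-32-be" if head[0] == 0 else "utf-32-le"
    # Four NULs cannot start a JSON text whose first character is ASCII.
    raise ValueError("cannot determine encoding from first four octets")
```

For example, guess_json_encoding('"x"'.encode("utf-16-le")) sees 22 00 78 00,
counts two NULs with a non-NUL first octet, and returns "utf-16-le".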
Nico
--