Dan Kogai wrote:
...
Okay, here is my strategy.
decode("\x{8C00}-\x{8FFFF}") encode("\x{10000}-\x{10FFFF}")
The Unicode Consortium discusses this issue:
http://www.unicode.org/versions/corrigendum1.html
Corrigendum #1: UTF-8 Shortest Form
The conformance clause C12 in The Unicode Standard, Version
3.0 forbids the generation of "non-shortest form" UTF-8, and
forbids the interpretation of illegal sequences, but not the
interpretation of "non-shortest form". Where software does
interpret the non-shortest forms, security issues can arise.
For example:
Process A performs security checks, but does not check for
non-shortest forms.
Process B accepts the byte sequence from process A, and
transforms it into UTF-16 while interpreting non-shortest
forms.
The UTF-16 text may then contain characters that should
have been filtered out by process A.
You might want to consider adding a security override for this.
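
To make the corrigendum's scenario concrete, here is a minimal Python sketch. The names lenient_utf8_decode, process_a_is_safe, strict_rejects and PAYLOAD are hypothetical illustrations, not part of Encode or any real library. The lenient decoder deliberately omits the shortest-form check so that it stands in for process B (it produces a Python str rather than UTF-16, which does not change the point).

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Hypothetical 'process B': decodes UTF-8 without a shortest-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Pick the payload bits and the number of continuation bytes.
        if b < 0x80:
            cp, n = b, 0
        elif b & 0xE0 == 0xC0:
            cp, n = b & 0x1F, 1
        elif b & 0xF0 == 0xE0:
            cp, n = b & 0x0F, 2
        elif b & 0xF8 == 0xF0:
            cp, n = b & 0x07, 3
        else:
            raise ValueError("invalid lead byte")
        if i + n >= len(data):
            raise ValueError("truncated sequence")
        for c in data[i + 1 : i + 1 + n]:
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        # A conforming decoder would reject cp here if a shorter
        # encoding of the same value exists; this one does not.
        out.append(chr(cp))
        i += n + 1
    return "".join(out)


def process_a_is_safe(data: bytes) -> bool:
    """Hypothetical 'process A': a byte-level filter that looks only for '/'."""
    return b"/" not in data


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in (conforming) UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xC0 0xAF is a two-byte, non-shortest encoding of U+002F ('/').
PAYLOAD = b"..\xc0\xaf..\xc0\xafetc"

if __name__ == "__main__":
    print(process_a_is_safe(PAYLOAD))    # the filter sees no '/' byte
    print(lenient_utf8_decode(PAYLOAD))  # yet the decoded text contains '/'
    print(strict_rejects(PAYLOAD))       # a strict decoder refuses the input
```

The filter passes the payload because no 0x2F byte is present, while the lenient decoder turns it into "../../etc". Python's built-in decoder rejects the same bytes, which is the behaviour a strict mode would give.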
Brian Stell