perl-unicode

Re: [Encode] UCS/UTF mess and Surrogate Handlings

2002-04-05 10:53:32


Dan Kogai wrote:
...
Okay, here is my strategy.

                decode("\x{8C00}-\x{8FFFF}")   encode("\x{10000}-\x{10FFFF}")

The Unicode Consortium does discuss this:

http://www.unicode.org/versions/corrigendum1.html

    Corrigendum #1: UTF-8 Shortest Form

    The conformance clause C12 in The Unicode Standard, Version 
    3.0 forbids the generation of "non-shortest form" UTF-8, and 
    forbids the interpretation of illegal sequences, but not the 
    interpretation of "non-shortest form". Where software does 
    interpret the non-shortest forms, security issues can arise. 
    For example:

    Process A performs security checks, but does not check for
    non-shortest forms. 
    
    Process B accepts the byte sequence from process A, and 
    transforms it into UTF-16 while interpreting non-shortest 
    forms. 
    
    The UTF-16 text may then contain characters that should 
    have been filtered out by process A. 
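
To make the scenario concrete, here is a minimal Python sketch (not
Encode's implementation; the function names are hypothetical and exist
only for illustration). It assumes Process A does a naive byte-level
check for "/" and Process B uses a hand-written decoder that accepts
non-shortest forms. The two-byte sequence C0 AF is a non-shortest
encoding of U+002F ("/").

```python
def naive_filter_passes(data: bytes) -> bool:
    """Hypothetical Process A: rejects input containing a raw '/' byte,
    but does not check for non-shortest UTF-8 forms."""
    return b"/" not in data


def decode_lenient(data: bytes) -> str:
    """Hypothetical Process B: decodes UTF-8 but (unsafely) accepts
    non-shortest forms, because it never checks the minimum code point
    for each sequence length."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 0            # single byte (ASCII)
        elif 0xC0 <= b < 0xE0:
            cp, n = b & 0x1F, 1     # lead byte of a 2-byte sequence
        elif 0xE0 <= b < 0xF0:
            cp, n = b & 0x0F, 2     # lead byte of a 3-byte sequence
        elif 0xF0 <= b < 0xF8:
            cp, n = b & 0x07, 3     # lead byte of a 4-byte sequence
        else:
            raise ValueError("invalid lead byte")
        if i + n > len(data) - 1:
            raise ValueError("truncated sequence")
        for k in range(1, n + 1):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        out.append(chr(cp))
        i += n + 1
    return "".join(out)


def decode_strict(data: bytes) -> str:
    """Python's built-in UTF-8 codec rejects non-shortest forms."""
    return data.decode("utf-8")


# ".." followed by the non-shortest form of "/" and then "etc"
payload = b"..\xc0\xafetc"
```

With this payload, naive_filter_passes(payload) is true because no raw
"/" byte is present, decode_lenient(payload) yields "../etc", and
decode_strict(payload) raises UnicodeDecodeError. A strict decoder, or a
shortest-form check in Process A, closes the gap.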

You might want to consider adding a security override for this.

Brian Stell