perl-unicode

Re: Practical problems with custom .ucm based encoding

2002-04-25 02:06:29
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
On Wednesday, April 24, 2002, at 09:25 , Bart Schuller wrote:
Hello,

The cool Encoding support in the upcoming 5.8 enables me to properly solve a
very common task: making HTML entities out of UTF-8 data.

I generated a ucm file with entries like this:

    <U00A0> \x26\x6E\x62\x73\x70\x3B                 |0 # nbsp
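
For readers unfamiliar with the format, here is a small sketch of how such an
entry can be read back: the escaped bytes are simply the ASCII spelling of the
entity. The regex and the variable names are illustrative assumptions, not part
of enc2xs or of the original generator script.

```perl
use strict;
use warnings;

# One line of the generated .ucm file, exactly as quoted above.
my $line = '<U00A0> \x26\x6E\x62\x73\x70\x3B                 |0 # nbsp';

# Split into code point, escaped byte sequence and fallback flag.
my ($cp, $hex, $flag) =
    $line =~ /^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)/
    or die "not a ucm mapping line";

# Turn each \xHH escape into the byte it stands for.
my $bytes = join '', map { chr(hex($_)) } $hex =~ /\\x([0-9A-Fa-f]{2})/g;

printf "U+%s maps to %s (flag %s)\n", $cp, $bytes, $flag;
```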

The resulting Encode::HTMLEntities encoding works perfectly. However, I
want it to do more.

Not every Unicode character has a corresponding entity. Unknown ones can
be encoded like &#8364;, so I would like my Encoding to use a simple
function as a fallback. This proves hard. With CHECK == Encode::FB_WARN
it looks like the whole string is left untouched, so my plan to just
substr() off the first character, handle it by hand and repeat is not
going to work.

There is meant to be an option that leaves the source string starting with
the offending character. We are still tidying up the check stuff.
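
A rough sketch of the substr() loop described above, assuming a CHECK value
that stops on error and leaves the source starting at the offending character.
In Encode as later released, Encode::FB_QUIET is documented to behave this
way; encode_with_cref is a made-up helper name, and "ascii" merely stands in
for the custom encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode what the encoding can handle, and turn each
# character it cannot handle into a decimal character reference.
sub encode_with_cref {
    my ($enc, $src) = @_;
    my $out = '';
    while (length $src) {
        # FB_QUIET returns the part that worked and overwrites $src
        # with the unprocessed remainder.
        $out .= encode($enc, $src, Encode::FB_QUIET());
        last unless length $src;
        # The first remaining character is the offending one:
        # remove it and handle it by hand.
        my $bad = substr($src, 0, 1, '');
        $out .= sprintf '&#%d;', ord $bad;
    }
    return $out;
}

my $result = encode_with_cref('ascii', "abc\x{20AC}def");
print "$result\n";
```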

Big picture theory:

  CHECK not present or 0 - use the encoding's "best" fallback, only stop
                           at end of string or on a partial character;
                           possibly treat a partial char like any other
                           "bad" char.

  CHECK = XXX  - use the encoding's "best" fallback, only stop
                 at end of string or on a partial character;
                 return the part that works, leave src pointing at the
                 partial char if any. (This is the one PerlIO::Encoding wants.)

  CHECK = YYY  - stop on error, returning the part that worked,
                 leaving the source starting with the offending char (either
                 partial or un-mappable). This is the one Tk wants.

  CHECK = PPP  - use Perl's qq syntax rather than fallbacks.
     

We now have a bit mask - so in theory there can be bits for 
   A. Update src string
   B. Use fallbacks
   C. Partials as bad chars
   D. Use perl QQ
   E. Warn on error 
   F. Croak on error 

   H. ;-) Use HTML entities as fallbacks 
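
To make (H) concrete: Encode as later released does document a check value of
this kind, Encode::FB_HTMLCREF, which emits decimal character references for
unmappable characters. The sketch below assumes that constant is available in
the installed Encode; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Use a variable, not a literal: encode() may modify its source
# argument when a CHECK value is given.
my $text = "caf\x{E9} \x{20AC}";

# Characters with no ASCII mapping come out as &#NNN; references.
my $html = encode('ascii', $text, Encode::FB_HTMLCREF());

print "$html\n";
```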

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/