Practical problems with custom .ucm based encoding

2002-04-24 05:25:37

The cool encoding support in the upcoming 5.8 enables me to properly
solve a very common task: turning UTF-8 data into HTML entities.

I generated a ucm file with entries like this:

    <U00A0> \x26\x6E\x62\x73\x70\x3B                 |0 # nbsp
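
A minimal sketch of how such a line could be produced from a code point
and an entity name. ucm_line() is a hypothetical helper made up for this
illustration, not part of Encode, and the column spacing is arbitrary:

```perl
use strict;
use warnings;

# Build one .ucm mapping line for a code point and an entity name.
# ucm_line() is a hypothetical helper, not part of Encode.
sub ucm_line {
    my ($codepoint, $name) = @_;

    # Spell out "&name;" as \xHH escapes, one per byte.
    my $bytes = join '',
        map { sprintf('\\x%02X', ord($_)) } split //, "&$name;";

    return sprintf('<U%04X> %s |0 # %s', $codepoint, $bytes, $name);
}

print ucm_line(0xA0, 'nbsp'), "\n";
# prints: <U00A0> \x26\x6E\x62\x73\x70\x3B |0 # nbsp
```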

The resulting Encode::HTMLEntities encoding works perfectly. However, I
want it to do more.

Not every Unicode character has a corresponding entity. Unknown ones can
be encoded as numeric character references like &#8364;, so I would like
my encoding to use a simple function as a fallback. This proves hard.
With CHECK set to Encode::FB_WARN, the whole string appears to be left
untouched, so my plan to just substr() off the first character, handle
it by hand, and repeat is not going to work.
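
To make the desired result concrete, here is a rough sketch of the
behaviour in plain Perl, without Encode. It assumes the input is already
a decoded character string (not raw UTF-8 bytes). The %entity table and
to_entities() are hypothetical names for illustration; the table holds
only two sample entries:

```perl
use strict;
use warnings;

# Hypothetical subset of the entity table; a real one would mirror
# the entries in the .ucm file.
my %entity = (0xA0 => 'nbsp', 0xE9 => 'eacute');

# Replace every non-ASCII character with a named entity if one is
# known, otherwise with a decimal numeric character reference.
sub to_entities {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $entity{$cp} ? "&$entity{$cp};" : sprintf('&#%d;', $cp);
    }ge;
    return $text;
}

print to_entities("caf\x{E9}\x{A0}\x{20AC}"), "\n";
# prints: caf&eacute;&nbsp;&#8364;
```

This does per character what the fallback function would do, but it
bypasses the compiled encoding entirely, which is exactly what I would
like to avoid.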

I'd be very happy with a CHECK mode which would allow me to handle a
single problematic character in Perl. Having to find it in a longer
string is very hard in this case, because it could be any character
above 0x7f that is not in my .ucm file.