perl-unicode

Re: Practical problems with custom .ucm based encoding

2002-04-24 05:32:44
On Wednesday, April 24, 2002, at 09:25, Bart Schuller wrote:
Hello,

The cool encoding support in the upcoming 5.8 enables me to properly solve a
very common task: making HTML entities out of utf-8 data.

I generated a ucm file with entries like this:

    <U00A0> \x26\x6E\x62\x73\x70\x3B                 |0 # nbsp
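
For illustration, a line like the one above could be produced by a small
helper such as the following sketch. The name ucm_entity_line is
hypothetical and not part of Encode; it only assumes that each entity is
written out as the ASCII bytes of "&name;".

```perl
use strict;
use warnings;

# Build one .ucm mapping line for a code point and an entity name.
# ucm_entity_line is a hypothetical helper, not part of Encode.
sub ucm_entity_line {
    my ($codepoint, $name) = @_;

    # Spell out "&name;" as \xHH escapes, one per ASCII byte.
    my $bytes = join '', map { sprintf '\x%02X', ord($_) } split //, "&$name;";

    # |0 marks a mapping that is valid in both directions.
    return sprintf '<U%04X> %s |0 # %s', $codepoint, $bytes, $name;
}

print ucm_entity_line(0xA0, 'nbsp'), "\n";
```

The padding between the byte sequence and the |0 flag is cosmetic, so the
sketch uses a single space.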

The resulting Encode::HTMLEntities encoding works perfectly. However, I
want it to do more.

Not every Unicode character has a corresponding entity. Unknown ones can
be encoded like &#8364;, so I would like my Encoding to use a simple
function as a fallback. This proves hard. With CHECK == Encode::FB_WARN
it looks like the whole string is left untouched, so my plan to just
substr() off the first character, handle it by hand and repeat is not
going to work.

I'd be very happy with a CHECK mode which would allow me to handle a
single problematic character in perl. Having to find it in a longer
string is very hard in this case, because it's every character > 0x7f
which is not in my .ucm file.
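
To make the desired behaviour concrete, here is a minimal pure-Perl sketch
that does not go through Encode at all. The %entity_for table and the
to_entities function are hypothetical names, and the table holds only two
sample entries; a real one would mirror the .ucm file.

```perl
use strict;
use warnings;

# Hypothetical subset of a code point => entity name table; a real one
# would hold every entity listed in the .ucm file.
my %entity_for = (
    0xA0 => 'nbsp',
    0xE9 => 'eacute',
);

# Replace each character above 0x7f: named entity when known,
# decimal character reference otherwise.
sub to_entities {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $entity_for{$cp} ? "&$entity_for{$cp};" : "&#$cp;";
    }ge;
    return $text;
}

print to_entities("caf\x{E9}\x{A0}10\x{20AC}"), "\n";
```

This handles one character at a time in Perl, which is the per-character
hook asked for above, at the cost of bypassing the compiled encoding.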

As a matter of fact, I was thinking of adding FB_HTMLENT or something like
that. It seems trivial; unless jhi whips me for the sin of Feeping
Creaturism, I'll do so.

CAVEAT: This will be done via fallback, so the characters & < > " will
not turn into entities!
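
The caveat can be seen in a short sketch. It assumes an Encode release
recent enough to provide the FB_HTMLCREF fallback constant (a later name,
not the FB_HTMLENT floated in this message) and uses the stock ascii
encoding as a stand-in for a custom one. Only characters missing from the
target encoding reach the fallback.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the installed Encode provides FB_HTMLCREF.
# The markup characters & < > " exist in ASCII, so they are encoded
# normally and never reach the fallback; only the euro sign does.
my $text  = qq{<a title="\x{20AC}5">R&D</a>};
my $bytes = encode('ascii', $text, Encode::FB_HTMLCREF());

print "$bytes\n";
```

Escaping the markup characters themselves would therefore need a separate
pass before or after encoding.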

Dan the Encode Maintainer