perl-unicode

Re: utf8, japanese, web-pages: beginning to see the light...

2004-06-17 03:30:05
Marco Baroni <baroni(_at_)sslmit(_dot_)unibo(_dot_)it> writes:

> Now for a much less pressing issue: Does anybody know of something
> similar to the HTML::FormatText module that can take utf-8 input, and
> generate utf-8 output?

I doubt it.  But if you run it on Unicode characters (as indicated
above), it should just work unless it is doing something too clever.

> Could it be that the problem is with HTML::TreeBuilder (which is
> required for pre-processing by HTML::FormatText)? Does anybody know if
> this module has issues with Unicode?

It probably does, since it uses HTML::Parser underneath.  HTML::Parser
is not really Unicode-aware: the strings passed to the event callbacks
do not preserve the UTF8 flag of the strings it parses.

A workaround is to pass it UTF-8 encoded bytes and decode the strings
you get back.  I would also welcome patch suggestions that make
HTML::Parser propagate the UTF8 flag.
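
A minimal sketch of that workaround, assuming the input is already a
Perl character string: encode it to UTF-8 octets before parsing, collect
the octets the text handler receives, and decode them once at the end.
The sample markup and the variable names are illustrative only, and the
handler asks for 'text' rather than 'dtext' so that no entity decoding
is mixed into the raw octets.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Parser ();

# A character string (three kanji given as code points) in minimal markup.
my $html = "<p>\x{65E5}\x{672C}\x{8A9E}</p>";

# Step 1: turn the characters into UTF-8 octets before parsing.
my $octets = encode('UTF-8', $html);

# Step 2: collect the raw octets handed to the text handler.
# Nothing is decoded here, so a multi-byte character split across
# two text events cannot be damaged.
my $collected = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $collected .= $_[0] }, 'text' ],
);
$parser->parse($octets);
$parser->eof;

# Step 3: decode once, after parsing, to get a character string back.
my $text = decode('UTF-8', $collected);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Decoding after the parse, rather than inside each callback, keeps the
parser working purely on bytes; the result is an ordinary character
string that can be passed on to other modules.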

Regards,
Gisle