Re: losing utf8 flag on strings?

Nick Ing-Simmons wrote:

Paul Bijnens <paul(_dot_)bijnens(_at_)xplanation(_dot_)com> writes:

Can anyone explain what I'm doing wrong?


I was about to contact the author of HTML::Entities, when
I noticed HTML::Parser 3.45 was released on 6 Jan 2005.

Installed it -- and guess what?  Now it works as expected!

I guess Gisle is one of those wizards that solve problems
before you even ask them :-)   Thanks, Gisle!!!



But that does not yet solve my problem with Plucene, and others,
where I have similar problems with losing utf8 flags.
Debugging utf8-flags is difficult, partly because my
understanding of the issue is not clear either.
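
One way to make the flag visible while debugging is to print it next to the string's length and code points. This is only a minimal sketch: describe() is a hypothetical helper written for this thread, not part of any module, and the two test strings are assumed examples (the UTF-8 octets of the Euro sign versus the Euro sign as a single character).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# describe() is a hypothetical helper, not part of any module.
# It reports the internal utf8 flag, the length in characters,
# and the code point of every character in the string.
sub describe {
    my ($label, $str) = @_;
    my $flag  = utf8::is_utf8($str) ? "on" : "off";
    my $chars = join " ", map { sprintf("U+%04X", ord($_)) } split //, $str;
    return sprintf("%s: flag=%s length=%d chars=[%s]",
                   $label, $flag, length($str), $chars);
}

my $bytes = "\xe2\x82\xac";   # three octets: the UTF-8 encoding of U+20AC
my $char  = "\x{20ac}";       # one character: the Euro sign itself

print describe("bytes", $bytes), "\n";
print describe("char",  $char),  "\n";
```

A flag-off string of length 3 where you expected length 1 is the usual sign that you are holding encoded octets rather than characters. Devel::Peek's Dump() gives a lower-level view of the same information.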



As I recall HTML::Entities has a build-time option as to whether it handles
Unicode - do you know if yours has that turned on?
What locale are you in (i.e. is it one where € has a native
8-bit encoding, say Windows 1251 or iso-8859-15)?

I have this recurring problem of strings not being flagged
as utf8, when -- I believe -- they should be.

One of those cases is in decode_entities() from the module
HTML::Entities, but I have other occurrences too (e.g. in Plucene).

When I run this program:

########### cut here
#!/usr/bin/perl
use HTML::Entities;
use Encode;
print "This is perl ", $], "\n";

$s = "&euro;";
$t = decode_entities($s);
$u = decode("utf8", $t, Encode::FB_CROAK | Encode::LEAVE_SRC);  # LEAVE_SRC keeps $t intact

print "t: ", Encode::is_utf8($t) ? "is" : "not", " utf8", "\n";
print "u: ", Encode::is_utf8($u) ? "is" : "not", " utf8", "\n";
print "t: ", ($t eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
print "u: ", ($u eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
########### cut here

I get this output:

This is perl 5.008005
t: not utf8
u: is utf8
t: not Eurosign
u: is Eurosign

I would expect that $t does have the utf8 flag set,
as indicated in the manpage of HTML::Entities :

      decode_entities( $string )
          This routine replaces HTML entities found in the
          $string with the corresponding ISO-8859-1 character,
          and if possible (under perl 5.8 or later) will replace
          to Unicode characters.  Unrecognized entities are left
          alone.

Why do I have to force the utf8 flag using decode("utf8",..) ?
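
The output above is what you would see if $t held the UTF-8 octets of the Euro sign with the flag off, rather than the single character U+20AC. That is an assumption about what this build of decode_entities() returned, not something confirmed here. Under that assumption, a small sketch of why the comparison fails before decode() and succeeds after it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed content of $t: the three UTF-8 octets of U+20AC, flag off.
my $octets = "\xe2\x82\xac";
my $euro   = "\x{20ac}";

# eq compares character by character; three characters
# (U+00E2 U+0082 U+00AC) never equal the single character U+20AC.
print "before decode: ", ($octets eq $euro ? "is" : "not"), " Eurosign\n";

# decode() interprets the octets as UTF-8 and returns character data.
# LEAVE_SRC stops decode() from modifying $octets in place.
my $decoded = decode("UTF-8", $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
print "after decode:  ", ($decoded eq $euro ? "is" : "not"), " Eurosign\n";
```

So the decode() call is not really "forcing a flag"; it is converting octets into the characters they encode, and the flag is a side effect of that.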



Well, I agree that does suggest what you expect.

One of my guesses is that the problem lies in XS-processing of strings
where the utf8 flag is not set correctly.  True?



Certainly possible. I suggest you contact the author of HTML::Entities.
It is also possible that it is left encoded deliberately.
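
If the result is indeed left encoded, a defensive wrapper can cover both cases until the module is upgraded. A rough sketch only: ensure_characters() is a hypothetical name, and it assumes that any flag-off string which is valid UTF-8 was meant as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# ensure_characters() is a hypothetical helper, not part of Encode
# or HTML::Entities. It returns character data when the input looks
# like UTF-8 octets and leaves everything else untouched.
sub ensure_characters {
    my ($str) = @_;
    return $str if utf8::is_utf8($str);   # already character data

    # FB_CROAK may modify its source argument, so decode a copy.
    my $copy    = $str;
    my $decoded = eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK) };

    # Malformed UTF-8 makes decode() die; fall back to the original.
    return defined $decoded ? $decoded : $str;
}

my $fixed = ensure_characters("\xe2\x82\xac");   # UTF-8 octets of U+20AC
print "fixed: ", ($fixed eq "\x{20ac}" ? "is" : "not"), " Eurosign\n";
```

The caveat is that a Latin-1 string which happens to form valid UTF-8 would be misread, so this is a stopgap for a known-bad source, not a general cure.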





--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/  email:  Paul(_dot_)Bijnens(_at_)xplanation(_dot_)com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************