On Fri, 10 Jan 2003 20:39:10 +0200
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> wrote:
On Fri, Jan 10, 2003 at 07:28:00PM +0100, Merijn van den Kroonenberg wrote:
You might be looking for these:
# ISO 8859-1 to UTF-8
s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
# UTF-8 to ISO 8859-1
s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
I think that will work (they are not mine, so don't blame me if not ;-)
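For anyone who wants to check them before trusting them, here is a minimal round-trip sketch. The sub names latin1_to_utf8 and utf8_to_latin1 are made up for illustration; the substitutions are the two quoted above, unchanged. It assumes the strings are plain byte strings and that only U+0080..U+00FF matter.

```perl
use strict;
use warnings;

# Hypothetical wrapper names; the substitutions themselves are the ones quoted above.
sub latin1_to_utf8 {
    my ($s) = @_;
    # One Latin-1 byte 0x80..0xFF becomes a two-byte UTF-8 sequence.
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

sub utf8_to_latin1 {
    my ($s) = @_;
    # A lead byte C2 or C3 plus one continuation byte becomes one Latin-1 byte.
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                 # e-acute as a single Latin-1 byte
my $utf8   = latin1_to_utf8($latin1);   # e-acute as the two bytes C3 A9
my $back   = utf8_to_latin1($utf8);     # and back to the single byte E9

print join(" ", map { sprintf "%02X", ord } split //, $utf8), "\n";
# prints "63 61 66 C3 A9"
```

Note that the second substitution only touches sequences starting with C2 or C3, so anything above U+00FF passes through untouched rather than being mangled.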
They are mine :-) so I feel free to say that they don't do &#NNN;
conversion... but they certainly could be changed to do so.
(Answer)
$string = qq/ABC ÀÁÂÃÄÅÆ/;
$string =~ s/([\x80-\xff])/"&#".ord($1).";"/ge;
print "$string\n";
# gets "ABC &#192;&#193;&#194;&#195;&#196;&#197;&#198;"
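If the data is already UTF-8 rather than Latin-1, the same idea can probably be folded into the UTF-8 to ISO 8859-1 substitution quoted earlier in the thread. This is only a sketch: it assumes $string holds raw UTF-8 bytes (no decoding layer) and that only U+0080..U+00FF need converting. The variable name $latin1 and the guard on values below 256 are my own additions.

```perl
use strict;
use warnings;

# Assumption: raw UTF-8 bytes for A-grave, A-acute, A-circumflex.
my $string = "ABC \xC3\x80\xC3\x81\xC3\x82";

# Rebuild the code point from the two-byte sequence, then emit &#NNN;
$string =~ s/([\xC2\xC3])([\x80-\xBF])/"&#".((ord($1) << 6 & 0xC0) | (ord($2) & 0x3F)).";"/ge;

print "$string\n";
# prints "ABC &#192;&#193;&#194;"

# Going the other way: turn decimal references below 256 back into Latin-1
# bytes and leave anything larger as it was.
(my $latin1 = $string) =~ s/&#(\d+);/$1 < 256 ? chr($1) : "&#$1;"/ge;
```

The guard matters because chr() on a value of 256 or more would produce a wide character, which is not Latin-1 any more.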
(Another answer)
Gisle Aas's HTML::Entities may help.
It's aware of other types of character references too:
i.e. <&#234;>, <&#xEA;>, and <&ecirc;>.
distributed from:
http://search.cpan.org/author/GAAS/HTML-Parser-3.26/
use HTML::Entities;
$string = qq/ABC ÀÁÂÃÄÅÆ/;
print encode_entities($string, "\x80-\xff");
# gets "ABC &Agrave;&Aacute;&Acirc;&Atilde;&Auml;&Aring;&AElig;"
$encoded = qq/ABC &#192;&#xD6;&Yacute;&AElig;/;
print decode_entities($encoded), "\n";
# gets "ABC ÀÖÝÆ"
Greetings, Merijn
----- Original Message -----
From: "Narins, Josh" <josh(_dot_)narins(_at_)lehman(_dot_)com>
To: <perl-unicode(_at_)perl(_dot_)org>
Sent: Friday, January 10, 2003 6:54 PM
Subject: beginniner's 5.6.1 latin1<->utf8 question
At one point I had a regex which perfectly converts the string A below into
a series of &#234; strings.
This is nice for me, because I just sling them out on the web, and as
entities, they always seem to work.
I've lost the regex, can't seem to find it. I know it had chr or ord in it.
I've been reading the perl-unicode archives, and googling, but I just don't
see it.
This is for perl5.6.1 with Sun's (reputedly?) sick iconv.
If someone could tap me in the right direction...
Thx in advance
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
SADAHIRO Tomoyuki