Sorry for my late reply.
From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: Re: New iso2022jp.pl
Date: Fri, 26 Jul 2002 00:07:35 -0500
> Here is my shot using the Unicode::String module and can be included
> in the MHonArc::UTF8 module (which appears to work under Perl 5.6.1):
>
> sub clip {
>     use utf8;
>     ...
>     for ( $pos=0, $sublen=$len; $pos < $html_len; ) {
(*1)>     $subtext = $u->substr($pos, $sublen);
>         $pos += $sublen;
>         # strip tags
>         if ($has_tags) {
>             $subtext =~ s/\A[^<]*>//; # clipped tag
>             $subtext =~ s/<[^>]*>//g;
>             $subtext =~ s/<[^>]*\Z//; # clipped tag
(*2)>     }
>         ...
>         # compute entity reference lengths to determine "real" character
>         # count and not raw character count.
>         while ($subtext =~ /(\&[^;]+);/g) {
>             $er_len += length($1);
>         }
(*3)>     $text .= $subtext;
>         # done if we have enough
>         $real_len = $text->length - $er_len;
>         if ($real_len >= $len) {
>             last;
>         }
>         $sublen = $len - ($text->length - $er_len);
>     }
>     $text;
> }
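I think this function has a bug. Suppose you clip the string
'A<a href="XYZ">BCD</a>' to 4 characters, i.e.,
clip('A<a href="XYZ">BCD</a>', 4, 1, 1). The values of $subtext after
(*1) and (*2), and of $text after (*3), on each pass through the loop are:

    (*1)      (*2)      (*3)
    "A<a "    "A"       "A"       # OK
    "hre"     "hre"     "Ahre"    # NG!

The second chunk starts in the middle of the <a ...> tag, so none of
the tag-stripping patterns match it and "hre" leaks into the result.
The expected result is "ABCD", isn't it?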
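
To illustrate the behavior I expect, here is a rough sketch (not a patch
for MHonArc itself). It strips the tags from the whole string first and
only then clips, counting each entity reference as one character. The
name clip_strip_first is made up for this example, and the input is
assumed to be an already-decoded character string:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of MHonArc.
# Strip tags from the whole string first, then clip, so that a tag can
# never be cut in half by a chunk boundary.
sub clip_strip_first {
    my ($str, $len) = @_;

    # Remove complete tags up front.
    (my $text = $str) =~ s/<[^>]*>//g;

    # Tokenize into entity references or single characters
    # (same pattern as the one taken from mhrcvars.pl).
    my @chars = $text =~ /(\&[^;\s]*;|.)/gs;

    # Nothing to clip if the text is already short enough.
    return $text if @chars <= $len;

    return join('', @chars[0 .. $len - 1]);
}

print clip_strip_first('A<a href="XYZ">BCD</a>', 4), "\n";   # ABCD
```

Stripping before clipping costs one pass over the full string, but it
avoids having to recognize a tag that was cut at a chunk boundary.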
FYI: I attached a patch for usplit.pl, which I posted a few
days ago:
http://www.mhonarc.org/archive/html/mhonarc-dev/2002-07/msg00011.html
Anyway,
> One question I have
> with respect to iso-2022-jp is if the clip function you implemented
> can be expanded to handle the $has_tags flag as shown in the above
> function.
I think I can.
Please wait for a while.
--
Takashi P.KATOH
--- usplit.pl-	Mon Jul 29 17:54:57 2002
+++ usplit.pl	Mon Jul 29 17:22:53 2002
@@ -11,6 +11,8 @@
 foreach $len (0 .. 9) {
     $ret = $_;
 
+    $ret =~ s/<[^>]*>//g;
+
     # taken from MHonArc (mhrcvars.pl)
     my @chars = $ret =~ /(\&[^;\s]*;|.)/g;
     if (scalar(@chars) < $len) {