On July 24, 2002 at 17:18, "Takashi P.KATOH" wrote:
> I've not checked the UTF-8 related codes in MHonArc yet, but
> I think clipping UTF-8 string is very easy if we use perl
> 5.8 or later.
> I attached a sample code to this mail.
Here is my attempt, which uses the Unicode::String module and appears
to work under Perl 5.6.1. It could be included in the MHonArc::UTF8
module:
sub clip {
    use utf8;
    my $str      = \shift;  # Prevent unnecessary copy.
    my $len      = shift;   # Clip length
    my $is_html  = shift;   # If entity references should be considered
    my $has_tags = shift;   # If html tags should be stripped

    my $u = Unicode::String::utf8($$str);
    if (!$is_html) {
        return $u->substr(0, $len);
    }

    my $text = Unicode::String::utf8("");
    my $subtext;
    my $html_len = $u->length;
    my($pos, $sublen, $real_len);
    my $er_len = 0;

    for ( $pos=0, $sublen=$len; $pos < $html_len; ) {
        $subtext = $u->substr($pos, $sublen);
        $pos += $sublen;

        # strip tags
        if ($has_tags) {
            $subtext =~ s/\A[^<]*>//;   # clipped tag
            $subtext =~ s/<[^>]*>//g;
            $subtext =~ s/<[^>]*\Z//;   # clipped tag
        }

        # check for clipped entity reference
        if (($pos < $html_len) && ($subtext =~ /\&[^;]*\Z/)) {
            my $semi = $u->index(';', $pos);
            if ($semi < 0) {
                # malformed entity reference
                $subtext .= $u->substr($pos);
                $pos = $html_len;
            } else {
                $subtext .= $u->substr($pos, $semi-$pos+1)
                    if $semi > $pos;
                $pos = $semi+1;
            }
        }

        # compute entity reference lengths to determine "real" character
        # count and not raw character count.
        while ($subtext =~ /(\&[^;]+);/g) {
            $er_len += length($1);
        }

        $text .= $subtext;

        # done if we have enough
        $real_len = $text->length - $er_len;
        if ($real_len >= $len) {
            last;
        }
        # otherwise fetch only as many characters as are still missing
        $sublen = $len - $real_len;
    }
    $text;
}
This function is essentially an adaptation of the mhonarc::clip_text
function, and is meant to replace what mhonarc::replace_li_var
currently uses. The algorithm works on substrings rather than
splitting the string into individual characters, which can be an
expensive operation.
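To illustrate the "real" character count used above, here is a minimal
sketch in plain Perl, without Unicode::String. The name real_length is
made up for this example and is not part of MHonArc. It assumes the
text is already a character string and that every entity reference is
terminated by a semicolon.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MHonArc: count the characters a
# reader would see, treating each complete entity reference as one
# character.
sub real_length {
    my ($text) = @_;
    my $er_len = 0;
    # For "&amp;" the capture is "&amp" (4 characters), so the one
    # character left over (the ";") stands in for the entity itself.
    while ($text =~ /(&[^;]+);/g) {
        $er_len += length($1);
    }
    return length($text) - $er_len;
}

# Raw length is 8, but the reader sees "AT&T", so this prints 4.
print real_length("AT&amp;T"), "\n";
```

Subtracting the capture length in this way is what lets the loop decide
how many more characters to fetch without splitting the string.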
I am working on adding a TEXTCLIPFUNC resource. One question I have
about iso-2022-jp is whether the clip function you implemented
can be extended to handle the $has_tags flag, as shown in the
function above.
The reason for the flag is that I plan to add message body preview
capabilities similar to what the mha-preview program in the
examples/ directory of MHonArc does. The clip function will be
used to clip HTML body text.
If that is a problem for iso-2022-jp, then the preview feature
may simply not be usable for iso-2022-jp data.
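For a rough idea of what the preview clipping is meant to produce,
here is a simplified sketch. It is not the clip function above: the
name preview is hypothetical, it uses plain Perl strings rather than
Unicode::String, and it strips tags from the whole fragment up front
instead of working chunk by chunk.

```perl
use strict;
use warnings;

# Hypothetical helper, not MHonArc code: build a plain preview from an
# HTML fragment, keeping at most $len visible characters.
sub preview {
    my ($html, $len) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # strip complete tags
    $text =~ s/\s+/ /g;                   # collapse runs of whitespace
    my $out = '';
    # Take one token at a time: a whole entity reference or a single
    # character, so a reference is never cut in half.
    while ($len-- > 0 && $text =~ /\G(&[^;\s]+;|.)/gs) {
        $out .= $1;
    }
    return $out;
}

# prints: Fish &amp; chips
print preview('<p>Fish &amp; <b>chips</b> today</p>', 12), "\n";
```

A charset-specific clip function would need to give the same kind of
result for its own encoding in order to support the preview.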
--ewh