mhonarc-dev

Re: New iso2022jp.pl

2002-07-25 22:07:43
On July 24, 2002 at 17:18, "Takashi P.KATOH" wrote:

> I've not checked the UTF-8 related codes in MHonArc yet, but
> I think clipping UTF-8 string is very easy if we use perl
> 5.8 or later.
> I attached a sample code to this mail.

Here is my shot using the Unicode::String module.  It appears to work
under Perl 5.6.1 and could be included in the MHonArc::UTF8 module:

use Unicode::String ();

sub clip {
    use utf8;
    my $str      = \shift;  # Prevent unnecessary copy.
    my $len      = shift;   # Clip length
    my $is_html  = shift;   # If entity references should be considered
    my $has_tags = shift;   # If html tags should be stripped

    my $u = Unicode::String::utf8($$str);

    if (!$is_html) {
        return $u->substr(0, $len);
    }

    my $text = Unicode::String::utf8("");
    my $subtext;
    my $html_len = $u->length;
    my($pos, $sublen, $real_len);
    my $er_len = 0;  # Excess raw characters contributed by entity references

    for ( $pos=0, $sublen=$len; $pos < $html_len; ) {
        $subtext = $u->substr($pos, $sublen);
        $pos += $sublen;

        # strip tags
        if ($has_tags) {
            $subtext =~ s/\A[^<]*>//; # clipped tag
            $subtext =~ s/<[^>]*>//g;
            $subtext =~ s/<[^>]*\Z//; # clipped tag
        }

        # check for clipped entity reference
        if (($pos < $html_len) && ($subtext =~ /\&[^;]*\Z/)) {
            my $semi = $u->index(';', $pos);
            if ($semi < 0) {
                # malformed entity reference
                $subtext .= $u->substr($pos);
                $pos = $html_len;
            } else {
                $subtext .= $u->substr($pos, $semi-$pos+1)
                    if $semi > $pos;
                $pos = $semi+1;
            }
        }

        # compute entity reference lengths to determine "real" character
        # count and not raw character count.
        while ($subtext =~ /(\&[^;]+);/g) {
            $er_len += length($1);
        }

        $text .= $subtext;

        # done if we have enough
        $real_len = $text->length - $er_len;
        if ($real_len >= $len) {
            last;
        }
        # next chunk only needs to cover the characters still missing
        $sublen = $len - $real_len;
    }
    $text;
}

This function is basically an adaptation of the mhonarc::clip_text
function and is meant to replace what is used in
mhonarc::replace_li_var.  The algorithm avoids splitting the string
into individual characters, which can be an expensive operation.
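
To illustrate the "real" length bookkeeping on its own, here is a
minimal sketch in plain Perl, without Unicode::String.  The helper
name real_length is made up for this example and is not part of
MHonArc.  It assumes every "&" in the text starts a well-formed
entity reference that ends in ";".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MHonArc): the number of characters a
# reader would see, counting each entity reference as one character.
sub real_length {
    my ($text) = @_;
    my $er_len = 0;
    # "&amp;" is 5 raw characters but displays as 1, so the excess is
    # the length of "&amp" (everything except the trailing ';').
    while ($text =~ /(&[^;]+);/g) {
        $er_len += length($1);
    }
    return length($text) - $er_len;
}

print real_length("a &lt; b"), "\n";   # prints 5 (raw length is 8)
print real_length("AT&amp;T"), "\n";   # prints 4 (raw length is 8)
```

The clip loop applies the same idea incrementally: after each chunk
it compares the real length with the requested length and only then
decides whether more raw characters are needed.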

I am working on adding a TEXTCLIPFUNC resource.  One question I have
with respect to iso-2022-jp is whether the clip function you implemented
can be extended to handle the $has_tags flag, as shown in the above
function.

The reason for the flag is I plan to add message body preview
capabilities similar to what is done by the mha-preview program in
the examples/ directory of MHonArc.  The clip function will be
used to clip out HTML body text.
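
As a rough sketch of what the $has_tags handling has to cope with,
the three substitutions from clip can be tried in isolation on a
fragment that begins and ends in the middle of a tag.  The helper
name strip_tags_fragment is hypothetical and only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MHonArc): remove HTML tags from a
# fragment that may have been cut in the middle of a tag at either end.
sub strip_tags_fragment {
    my ($frag) = @_;
    $frag =~ s/\A[^<]*>//;   # tail of a tag clipped at the start
    $frag =~ s/<[^>]*>//g;   # complete tags
    $frag =~ s/<[^>]*\z//;   # head of a tag clipped at the end
    return $frag;
}

# Prints "[Hello world and ]"
print "[", strip_tags_fragment('ef="x">Hello <b>world</b> and <a hr'), "]\n";
```

Note that the first substitution also removes leading text when the
fragment contains a literal ">" before any "<", so this assumes that
">" in body text has been written as "&gt;".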

If this is a problem for iso-2022-jp, the preview feature may simply
not be usable for iso-2022-jp data.

--ewh

---------------------------------------------------------------------
To sign-off this list, send email to majordomo(_at_)mhonarc(_dot_)org with the
message text UNSUBSCRIBE MHONARC-DEV
