mhonarc-dev

Re: New iso2022jp.pl

2002-07-29 02:09:44
Sorry for my late reply.

From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: Re: New iso2022jp.pl
Date: Fri, 26 Jul 2002 00:07:35 -0500

Here is my shot using the Unicode::String module and can be included
in the MHonArc::UTF8 module (which appears to work under Perl 5.6.1):

sub clip {
    use utf8;
...
    
    for ( $pos=0, $sublen=$len; $pos < $html_len; ) {
(*1)>   $subtext = $u->substr($pos, $sublen);
      $pos += $sublen;

      # strip tags
      if ($has_tags) {
          $subtext =~ s/\A[^<]*>//; # clipped tag
          $subtext =~ s/<[^>]*>//g;
          $subtext =~ s/<[^>]*\Z//; # clipped tag
(*2)>   }
...
      # compute entity reference lengths to determine "real" character
      # count and not raw character count.
      while ($subtext =~ /(\&[^;]+);/g) {
          $er_len += length($1);
      }

(*3)>   $text .= $subtext;

      # done if we have enough
      $real_len = $text->length - $er_len;
      if ($real_len >= $len) {
          last;
      }
      $sublen = $len - ($text->length - $er_len);
    }
    $text;
}

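I think this function has a bug.
When you clip the string 'A<a href="XYZ">BCD</a>' to a length of
4 characters (i.e., clip('A<a href="XYZ">BCD</a>', 4, 1, 1);),
the first pass cuts the string in the middle of the <a> tag, and the
second pass then treats the rest of the tag as plain text:

Result of       (*1)    (*2)    (*3)
  1st pass      "A<a "  "A"     "A"             # OK
  2nd pass      "hre"   "hre"   "Ahre"          # NG!

The expected result is "ABCD", isn't it?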
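
For illustration, here is a minimal sketch of one way to get "ABCD":
strip the tags from the whole string first and only then clip, counting
each entity reference as one character. The name clip_stripped is
hypothetical (it is not an MHonArc function). The sketch assumes the
input is already a Perl character string, so it does not handle
ISO-2022-JP escape sequences, and it assumes no ">" appears inside an
attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MHonArc: strip tags from the whole
# string first, then clip on "real" characters.
sub clip_stripped {
    my ($str, $len) = @_;

    # Remove complete tags before any clipping happens, so a tag can
    # never be cut in half at a chunk boundary.
    (my $text = $str) =~ s/<[^>]*>//g;

    # Treat each entity reference (e.g. "&amp;") as one character.
    # Same pattern as in the usplit.pl patch, with /s added so that
    # "." also matches newlines.
    my @chars = $text =~ /(\&[^;\s]*;|.)/gs;

    return $text if @chars <= $len;
    return join '', @chars[0 .. $len - 1];
}

print clip_stripped('A<a href="XYZ">BCD</a>', 4), "\n";   # prints "ABCD"
print clip_stripped('A&amp;<b>BCD</b>', 3), "\n";         # prints "A&amp;B"
```

Stripping before clipping costs one pass over the whole string, but it
avoids having to track clipped tags across chunks.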

FYI: I have attached a patch for usplit.pl, the script I posted a few
days ago:
  http://www.mhonarc.org/archive/html/mhonarc-dev/2002-07/msg00011.html


Anyway, regarding your question:

One question I have with respect to iso-2022-jp is if the clip
function you implemented can be expanded to handle the $has_tags
flag as shown in the above function.

I think I can.
Please give me a little time.

-- 
Takashi P.KATOH

--- usplit.pl-  Mon Jul 29 17:54:57 2002
+++ usplit.pl   Mon Jul 29 17:22:53 2002
@@ -11,6 +11,8 @@
     foreach $len (0 .. 9) {
        $ret = $_;
        
+       $ret =~ s/<[^>]*>//g;
+
        # taken from MHonArc (mhrcvars.pl)
        my @chars = $ret =~ /(\&[^;\s]*;|.)/g;
        if (scalar(@chars) < $len) {