Re: MHonArc and multi-byte characters in HTML

[ Refer to 
http://www.xray.mpe.mpg.de/mailing-lists/mhonarc/1998-04/msg00129.html 
  for the history behind this message ]

I just upgraded to v2.5.0b2 and was surprised to find I'm still having the 
same problem (see link above) years later!  Is the "dirty workaround" shown 
below still the best way to solve it?


Koichi Nakatani <nakatani(_at_)konica(_dot_)co(_dot_)jp> writes:

Jason R Mastaler wrote:

>> > ESC$B<i2,ESC(B ESC$BCNI'ESC(B / MORIOKA Tomohiko ...
          ^
See the unescaped open bracket there?  I don't know enough about
encodings to say whether the bracket is specifically legal there,
but it doesn't look like legal HTML to me.  My understanding is that
the wilma_striphtml program requires legal HTML for correct operation.


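It is a complicated problem, and I cannot say what the right solution
should be.
However, I can show you a dirty workaround.
  The following patch sets the most significant bit (MSB) of every byte
of the Japanese characters, and strips the ESC$B and ESC(B escape
sequences.  The result is the EUC-JP character encoding.  Because every
byte of a Japanese character is then above 0x7F, none of them can be
mistaken for '<', and Japanese characters will not affect wilma.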
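
For readers who want to try the idea outside MHonArc, here is a minimal
standalone sketch of the same transformation.  It is not part of the
patch: the sub name iso2022jp_to_eucjp is hypothetical, and it assumes
only the ESC $ B / ESC $ @ (JIS X 0208) and ESC ( B / ESC ( J (single
byte) sequences occur, whereas the real jp2022 routine handles more.

```perl
use strict;
use warnings;

# Hypothetical helper, not MHonArc code: convert the JIS X 0208 segments
# of an ISO-2022-JP byte string to EUC-JP and drop the escape sequences.
sub iso2022jp_to_eucjp {
    my ($in) = @_;
    my $out = '';
    my $double_byte = 0;    # true while inside a double byte segment

    while (length $in) {
        if ($in =~ s/^\033\$[\@B]//) {        # ESC $ @ or ESC $ B
            $double_byte = 1;                 # escape itself is dropped
        } elsif ($in =~ s/^\033\([BJ]//) {    # ESC ( B or ESC ( J
            $double_byte = 0;                 # escape itself is dropped
        } elsif ($double_byte && $in =~ s/^([!-~]{2})//) {
            # One two-byte character: set the MSB of both bytes,
            # using the same tr/// as the patch.
            (my $pair = $1) =~ tr/\041-\176/\241-\376/;
            $out .= $pair;
        } else {
            $in =~ s/^(.)//s;                 # copy any other byte as-is
            $out .= $1;
        }
    }
    return $out;
}

# The first name from the quoted example line.
my $jis = "\033\$B<i2,\033(B / MORIOKA";
my $euc = iso2022jp_to_eucjp($jis);
print unpack("H*", substr($euc, 0, 4)), "\n";    # prints bce9b2ac
```

The '<' byte (0x3C) becomes 0xBC, so the converted text no longer
contains anything that looks like the start of an HTML tag.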

diff -urN MHonArc2.2.0/lib/mhtxtplain.pl MHonArc2.2.0-jp0/lib/mhtxtplain.pl
--- MHonArc2.2.0/lib/mhtxtplain.pl      Wed Mar  4 09:12:54 1998
+++ MHonArc2.2.0-euc-jp/lib/mhtxtplain.pl  Fri Mar 20 21:19:12 1998
@@ -174,7 +174,7 @@
 sub jp2022 {
     local(*body) = shift;
     local(@lines) = split(/\r?\n/,$body);
-    local($ret, $ascii_text);
+    local($ret, $ascii_text, $jp_text);
     local($_);

     $ret = "<PRE>\n";
@@ -205,7 +205,7 @@
        # Process Each Segment
        while(1) {
            if (s/^(\033\([BJ])//) { # Single Byte Segment
-               $ret .= $1;
+               # $ret .= $1;
                while(1) {
                    if (s/^([^\033]+)//) {      # ASCII plain text
                        $ascii_text = $1;
@@ -228,10 +228,12 @@
                    }
                }
            } elsif (s/^(\033\$[\@AB]|\033\$\([CD])//) { # Double Byte Segment
-               $ret .= $1;
+               # $ret .= $1;
                while (1) {
                    if (s/^([!-~][!-~]+)//) { # Double Char plain text
-                       $ret .= $1;
+                       $jp_text = $1;
+                       $jp_text =~ tr/\041-\176/\241-\376/;
+                       $ret .= $jp_text;
                    } elsif (s/(\033\.[A-F])//) { # G2 Designate Sequence
                        $ret .= $1;
                    } elsif (s/(\033N[ -^?])//) { # Single Shift Sequence
End of patch.
-- 
Koichi Nakatani
Graphic Arts Center, Konica Corporation