files embedded in text messages, extensions, etc.

Dear MHonArc people,

I recently started using MHonArc, one of the reasons being its ability
to handle mime-encoded mails. However, we still get quite a lot of
ordinary text mails with embedded files, mainly (compressed) uuencoded
files, PostScript files and LaTeX files. I have the following wish
list/suggestions:

1) It would be useful if MHonArc considered more than just the last
   filename extension. To reduce mail size, people often compress files
   before attaching them. Using only the last extension (e.g. the .gz of
   file.ps.gz) in the file name means that the browser does not know
   how to handle the file after it has been saved. If .ps.gz is used,
   the browser knows that it just has to uncompress the file and call a
   PostScript viewer.

2) On 12 Aug 1999 Earl Hood announced that he had added support for
   decoding uuencoded data within text messages to mhtxtplain.pl in
   MHonArc v2.4.2. I would like MHonArc to be able to handle other
   kinds of embedded files as well, preferably specified through
   resources.

3) It is possible to tell MHonArc whether it should gzip the files it
   creates. It would be nice if one could also instruct MHonArc to
   uncompress attached compressed files. Some of our users are on
   platforms where they do not necessarily have the tools needed to
   uncompress them.

4) Many attached files are not "web friendly", i.e. they are in a
   format one cannot expect the user to be able to view directly.
   However, our MHonArc server can turn many formats into PDF. If one
   could specify a post-processing hook or filter that MHonArc should
   call after saving an attachment, these converters could be activated.


Regarding 1) it is sufficient to replace the line:

       ($nameparm =~ /\.(\w+)$/)) {     # filename has an extention

in mhexternal.pl with the line:

       ($nameparm =~ /\.(\w+(\.\w+)*)$/)) { # filename has extension(s)
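
To illustrate the difference between the two patterns, here is a small
stand-alone sketch. The helper names last_ext and all_exts are made up
for this example and are not part of MHonArc; the inner group is written
as non-capturing, which does not change what ends up in $1.

```perl
use strict;
use warnings;

# Extension as captured by the original pattern: last component only.
sub last_ext {
    my ($name) = @_;
    return $name =~ /\.(\w+)$/ ? $1 : '';
}

# Extension as captured by the proposed pattern: everything after the
# first dot, as long as each component consists of word characters.
sub all_exts {
    my ($name) = @_;
    return $name =~ /\.(\w+(?:\.\w+)*)$/ ? $1 : '';
}

for my $name ('file.ps.gz', 'paper.tex', 'my.report.ps.gz', 'README') {
    printf "%-16s last=%-4s all=%s\n",
           $name, last_ext($name), all_exts($name);
}
```

Note that for a name like my.report.ps.gz the proposed pattern yields
report.ps.gz, since a regex match starts at the leftmost possible dot.
If that is a problem, the captured string could be matched against a
list of known compound extensions afterwards.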


Regarding 2) I have made a slight modification of mhtxtplain.pl (patch
enclosed) which not only extracts embedded uuencoded files but also
embedded PostScript and LaTeX files. The modification is activated via
an "extract" option.

Essentially the modification tries to interpret an ordinary text
message as a multipart message by splitting the text message according
to a given pattern corresponding to the file types in question. Based
on patterns, each part is assigned a "Content-Type" (and possibly a
name and encoding) and readmail::MAILread_body is called on each part.
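
The splitting and labelling step can be tried outside MHonArc with a
sketch like the following. It uses the same three alternatives as the
patch (with the brace escaped for newer Perl versions) but only reports
the assigned type, name and encoding; it does not call
readmail::MAILread_body. The helper name classify_parts and the sample
message are assumptions made for this example.

```perl
use strict;
use warnings;

# Same alternatives as in the patch: uuencoded data, PostScript, LaTeX.
my $split_pattern =
      'begin \d\d\d \S+\n[!-M].*?\nend\n|'
    . '%!PS.*?\n%%EOF|'
    . '\\\\document(?:style|class).*?\\\\end\{document\}';

# Hypothetical helper: split $text on embedded files and label each part.
sub classify_parts {
    my ($text) = @_;
    my @parts;
    my $i = 0;
    # split() with a capturing group returns the text between matches at
    # even indexes and the matched (embedded) data at odd indexes.
    for my $chunk (split /^($split_pattern)/sm, $text) {
        my $is_embedded = $i++ % 2;
        if (!$is_embedded) {
            push @parts, { type => 'text/plain', data => $chunk }
                if length $chunk;
            next;
        }
        my ($first) = $chunk =~ /\A(.*)/;    # first line of the part
        my %part = (type => 'application/octet-stream', data => $chunk);
        if ($first =~ /^begin \d\d\d (\S+)/) {            # uuencoded data
            @part{qw(name encoding)} = ($1, 'x-uuencode');
        } elsif ($first =~ /^%!PS/) {                     # PostScript
            $part{type} = 'application/postscript';
        } elsif ($first =~ /^\\document(?:style|class)/) { # LaTeX
            $part{type} = 'application/x-latex';
        }
        push @parts, \%part;
    }
    return @parts;
}

my $msg = "Hi,\nthe files follow.\n"
        . "begin 644 note.txt\n" . pack('u', "hello\n") . "end\n"
        . "And the figure:\n"
        . "%!PS-Adobe-3.0\nshowpage\n%%EOF\n"
        . "Regards\n";

for my $part (classify_parts($msg)) {
    printf "%-26s %s\n", $part->{type}, $part->{name} // '';
}
```

In the patch itself, each embedded part is then given a Content-Type
header built from these values and passed on to the filter registered
for that type.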

As it is now, these patterns are hard-wired, but ideally they should
be specified through dedicated resource types (MIMEARGS are not
suitable). MHonArc has the ability to hook in one's own custom message
filters, but unfortunately one cannot hook in one's own custom resource
types.

As for 3) and 4) I do not have immediate solutions. Maybe others have?

I hope some of this will make it into future versions of MHonArc.

Regards,

Uffe H. Engberg


Extract patch for mhtxtplain.pl 2.9 00/04/24 00:02:43 as included in
MHonArc 2.4.6

--- mhtxtplain.pl       Mon Apr 24 09:23:12 2000
+++ /users/engberg/mhtxtplain.pl        Tue Jun  6 13:38:27 2000
@@ -51,11 +51,6 @@
 ##
 ##     default=set     Default charset to use if not set.
 ##
-##      inlineexts="ext1,ext2,..."
-##                      A comma separated list of message specified filename
-##                      extensions to treat as inline data.
-##                      Applicable only when uudecode options specified.
-##
 ##     keepspace       Preserve whitespace if nonfixed
 ##
 ##     nourl           Do hyperlink URLs
@@ -70,7 +65,11 @@
 ##     target=name     Set TARGET attribute for links if converting URLs
 ##                     to links.  Defaults to _top.
 ##
-##     uudecode        Decoded any embedded uuencoded data.
+##     extract         Extract embedded data into separate files.
+##                     The extracted data is handled by calling
+##                     readmail::MAILread_body with appropriate content
+##                     type, file name (if deducible) and encoding type
+##                     ('x-uuencode' if uuencoded).
 ##
 ##     All arguments should be separated by at least one space
 ##
@@ -81,67 +80,101 @@
     ## Parse arguments
     $args      = ""  unless defined($args);
 
-    ## Check if decoding uuencoded data.  The implementation chosen here
-    ## for decoding uuencoded data was done so when uudecode is not
+    ## Check if extracting data.  The implementation chosen here
+    ## for extracting data was done so when extract is not
     ## specified, there is no extra overhead (besides the $args check for
-    ## uudecode).  However, when uudecode is specified, more overhead may
+    ## extract).  However, when extract is specified, more overhead may
     ## exist over other potential implementations.
-    ## I.e.  We only try to penalize performance when uudecode is specified.
-    if ($args =~ s/\buudecode\b//ig) {
-       # $args has uudecode stripped out for recursive calls
-
-       # Make sure we have needed routines
-       require 'base64.pl';
-       require 'mhmimetypes.pl';
-
-       # Grab any filename extensions that imply inlining
-       my $inlineexts = '';
-       if ($args =~ /\binlineexts=(\S+)/) {
-           $inlineexts = ',' . lc($1) . ',';
-           $inlineexts =~ s/['"]//g;
-       }
+    ## I.e.  We only try to penalize performance when extract is specified.
+    if ($args =~ s/\bextract\b//ig) {
+       # $args has extract stripped out for recursive calls
 
        local($pdata);  # have to use local() since typeglobs used
-       my($inext, $uddata, $file, $urlfile);
+       my($bpdata); # Beginning (first line) of part data, $pdata
+       my($pheader,$type,$subtype,$ctype,$name,$encoding,$filter);
        my @files = ( );
+       my @array = ( );
        my $ret = "";
        my $i = 0;
 
-       # Split on uuencoded data.  For text portions, recursively call
+       # Split on extract data.  For text portions, recursively call
        # filter to convert text data: makes it easier to handle all
        # the various formatting options.
+        my ($splitpattern) =
+              'begin \d\d\d \S+\n[!-M].*?\nend\n|' .
+              '%!PS.*?\n%%EOF|' .
+              '\\\\document(?:style|class).*?\\\\end{document}';
        foreach $pdata
-               (split(/^(begin \d\d\d \S+\n[!-M].*?\nend\n)/sm, $data)) {
-           if ($i % 2) {       # uuencoded data
-               # extract filename extension
-               ($file) = $pdata =~ /^begin \d\d\d (\S+)/;
-               if ($file =~ /\.(\w+)$/) { $inext = $1; } else { $inext = ""; }
-
-               # decode data
-               $uddata = base64::uudecode($pdata);
-
-               # save to file
-               push(@files,
-                    mhonarc::write_attachment(
-                       'application/octet-stream', \$uddata, '', '', $inext));
-               $urlfile = mhonarc::htmlize($files[$#files]);
-
-               # create link to file
-               if (index($inlineexts, ','.lc($inext).',') >= $[) {
-                   $ret .= qq|<A HREF="$urlfile"><IMG SRC="$urlfile">| .
-                           qq|</A><BR>\n|;
-               } else {
-                   $ret .= qq|<A HREF="$urlfile">| . mhonarc::htmlize($file) .
-                           qq|</A><BR>\n|;
-               }
-
+         (split(/^($splitpattern)/sm, $data)) { 
+         if ($i % 2) { # extract data
+           $ctype = '';
+           $name = '';
+           $encoding = '';
+           ($bpdata) = $pdata =~ /^(.*?)\n/;
+           CTYPESWITCH : {
+           if ($bpdata =~ /^begin \d\d\d (\S+)/) { # uuencoded data
+               $name = $1;
+               $ctype = 'application/octet-stream';
+               $encoding = 'x-uuencode';
+               last CTYPESWITCH; };
+           if ($bpdata =~ /^%!PS/) { # PostScript data
+               $ctype = 'application/postscript';
+               last CTYPESWITCH; };
+           if ($bpdata =~ /^\\document(style|class)/) { # LaTeX document
+               $ctype = 'application/x-latex';
+               last CTYPESWITCH; };
+           # Default if no handler
+               $ctype = 'application/octet-stream';
+               warn qq|Warning: No Content-type for extract data |,
+                    qq|starting with: "$bpdata", |,
+                    qq|assuming "$ctype"\n|;
+           }; # END OF CTYPESWITCH
+
+           # Following taken from readmail to know what filter loaded
+
+           ## Get type/subtype
+           $ctype = $ctype || 'text/plain';        # Default to text/plain 
+                                                   # if no content-type
+           ($ctype) = $ctype =~ m%^\s*([\w-\./]+)%;# Extract content-type
+           $ctype =~ tr/A-Z/a-z/;                  # Convert to lowercase
+           if ($ctype =~ m%/%) {                   # Extract base and sub type
+             ($type,$subtype) = split(/\//, $ctype, 2);
+           } elsif ($ctype =~ /text/i) {
+             $ctype = 'text/plain';
+             $type = 'text';  $subtype = 'plain';
+           } else {
+            $type = $subtype = '';
+           };
+
+           ## Load content-type filter
+           if (   (   !defined($filter = &readmail::load_filter($ctype))
+                   || !defined(&$filter))
+               && (   !defined($filter = &readmail::load_filter("$type/*"))
+                   || !defined(&$filter))
+               && (   !defined($filter = &readmail::load_filter("*/*"))
+                   || !defined(&$filter)) ) {
+               warn qq|Warning: Unrecognized content-type, "$ctype", |,
+                    qq|assuming "application/octet-stream"\n|;
+               $filter = &readmail::load_filter('application/octet-stream');
+           };
+           if ($filter ne 'm2h_text_plain::filter') {
+             $pheader = "Content-Type: $ctype";
+             $pheader .= qq|; name="$name"| if $name;
+             @array =
+               &readmail::MAILread_body($pheader,$pdata,$ctype,$encoding,'');
+             ## Setup return variables
+             $ret .= shift @array;                        # Return string
+             push(@files, @array);                        # Derived files
+           } else { # avoid infinite recursion, consider it plain text
+               $ret .= &filter($header, *fields, *pdata, $isdecode, $args);
+           }
            } else {            # plain text
                $ret .= &filter($header, *fields, *pdata, $isdecode, $args);
            }
            ++$i;
        }
 
-       ## Done with uudecode
+       ## Done with extracting
        return ($ret, @files);
     }