procmail

Re: Using procmail to organize PDF files

2009-11-16 16:25:54
* Professional Software Engineering 
<PSE-L(_at_)mail(_dot_)professional(_dot_)org> [2009-11-16 21:14]:

> Not to discourage you, but if you have a mechanism for extracting
> the PDF metadata, I'd think it'd make sense to build directly upon
> that to do your filing.  Like injecting the data directly into an
> SQL db (probably even storing the PDF itself).

I haven't looked at the source for 'pdfinfo' or 'pdftotext', but I
suspect that modifying and recompiling compiled code for every change
would not be ideal, and rewriting it so that the keywords and regular
expressions could be soft-coded in a rules file would take
considerable effort.

An interpreted rule language like procmail's seems ideal because of
its scoring- and matching-centric approach.  For each new type of
document, the procmailrc could readily be extended with a new rule.
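
To give a rough idea, here is an untested sketch of what one such rule
might look like.  The keywords and the DOCTYPE variable are made up
for illustration; the scoring syntax is the one described in
procmailsc(5).

```procmail
# Hypothetical rule for one document type (a utility bill).
# B = match against the body, i.e. the extracted text.
:0 B
# Empty condition: always matches, gives the score an offset of -1.
* -1^0
# Each keyword adds 1 the first time it is found (case-insensitive).
*  1^0 amount due
*  1^0 account number
*  1^0 billing period
{
  # Reached only if the final score is positive, i.e. at least two
  # of the three keywords were found.
  DOCTYPE=utility_bill
}
```

A new document type would then just mean another block like this one
with its own keywords and weights.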

> How is it you're planning on conducting the "search" part of your
> equation?

The knee-jerk thought is to use grepmail, but that's more of a garage
solution that probably wouldn't perform well.  I haven't given much
thought to the indexing.  Perhaps the mailbox with the index data
could be imported by a commercial tool like MailSteward, which would
get it into a MySQL database.

> How is the PDF file itself moved about, or is it residing on the
> local host and you're just generating an email message with summary
> info?

The PDF stays on the local host.  This is what the phony email looks
like after the first couple of rules:

  From document
  Filename: /home/user/data/heap_of_unorganized_documents/document.pdf
  Creator:        TeX
  Producer:       pdfTeX-0.14h
  CreationDate:   Tue May 13 07:00:00 2008
  Tagged:         no
  Pages:          1
  Encrypted:      no
  Page size:      612 x 792 pts (letter)
  File size:      19601 bytes
  Optimized:      no
  PDF version:    1.3

  <email body - extracted text here>


The filename is pulled out of the Filename: header with the \/ match
operator, which leaves the captured text in $MATCH.  From there, the
action would be something like:

  | mv "$MATCH" /some_meaningful_path/some_meaningful_filename.pdf
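
Spelled out as a complete recipe, that might look roughly like the
sketch below (untested).  The [^ ] right after \/ is there so that
MATCH doesn't pick up leading blanks, and the 'i' flag is my own
addition so procmail doesn't complain when mv exits without reading
the message from stdin.  The destination is still only a placeholder.

```procmail
# i = ignore write errors on the pipe (mv never reads the message).
:0 i
# Everything to the right of \/ ends up in MATCH; the part to the
# left is matched as short as possible, hence the explicit [^ ].
* ^Filename: *\/[^ ].*
| mv "$MATCH" /some_meaningful_path/some_meaningful_filename.pdf
```

Quoting "$MATCH" keeps the move from breaking on paths that contain
spaces.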

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail
