If you could convert the PDF to FO you could just as easily convert it
to some specific DTD--the problem is essentially the same and has the
same level of difficulty.

Yeah, but there is in fact one benefit to getting it into XSL-FO: you
could then build code to edit or merge FO instances. Getting the PDF
into a specific DTD is only useful in situations where the structure of
your PDFs implies some form of meaning. For example, a bunch of PDFs
generated by academics to print their white papers could perhaps be
processed in this way. But once you try to deal with PDFs of any format
whatsoever, I just don't think there's much meaningful information in
them other than their layout, at least not without a lot of
preparation. So I guess what I'm saying is: the more general the
solution to this kind of problem, the more it must focus on extracting
layout information only. Does anyone have arguments against that? It
could be that I haven't considered this issue very deeply and am just
going on gut instinct.
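
To make the "edit/merge FO instances" idea a bit more concrete, here is
a rough sketch in Python of what the simplest kind of merge might look
like once you do have FO. It just appends the fo:page-sequence elements
of one instance to another. The names merge_fo and sample_fo are made
up for illustration, and it assumes both instances use the same page
master names, which real converted PDFs probably would not.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def sample_fo(text: str) -> str:
    """Build a tiny FO instance with one page sequence (illustration only)."""
    return (
        f'<fo:root xmlns:fo="{FO_NS}">'
        '<fo:layout-master-set>'
        '<fo:simple-page-master master-name="page">'
        '<fo:region-body/>'
        '</fo:simple-page-master>'
        '</fo:layout-master-set>'
        '<fo:page-sequence master-reference="page">'
        '<fo:flow flow-name="xsl-region-body">'
        f'<fo:block>{text}</fo:block>'
        '</fo:flow>'
        '</fo:page-sequence>'
        '</fo:root>'
    )


def merge_fo(first_xml: str, second_xml: str) -> str:
    """Append the page sequences of the second FO instance to the first."""
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    # Assumption: both instances share page master names, so the
    # layout-master-set of the first instance is kept unchanged.
    for seq in second.findall(f"{{{FO_NS}}}page-sequence"):
        first.append(seq)
    return ET.tostring(first, encoding="unicode")


merged = merge_fo(sample_fo("one"), sample_fo("two"))
```

Note that this only works on layout-level structure (page sequences,
flows, blocks); it knows nothing about what the content means.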
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list