Ghostscript includes a ps2ascii utility to extract text: it does a
reasonable but not 100% accurate job (and it runs the full Ghostscript
PostScript interpreter).
If you turn off the ps2ascii simple mode (remove the "-dSIMPLE"
argument), Ghostscript outputs font and positioning information for
each string. You can use that information to eliminate headers and
footers, identify elements to tag, and so forth.
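As a rough illustration of the header/footer idea, here is a minimal
Python sketch. It assumes the non-simple output contains string records
shaped like "S <x> <y> (<string>) <width>", with y measured from the
bottom of the page; check that against what your Ghostscript version
actually prints before relying on it. The function name, the sample
records, and the cutoff values are all made up for the example.

```python
import re

# Assumed record shape (verify against your own ps2ascii output):
#   S <x> <y> (<string>) <width>
S_RECORD = re.compile(r"^S\s+(-?\d+)\s+(-?\d+)\s+\((.*)\)\s+(-?\d+)\s*$")


def body_strings(lines, footer_max_y, header_min_y):
    """Yield (x, y, text) for strings that sit between the footer and header bands."""
    for line in lines:
        m = S_RECORD.match(line)
        if m is None:
            continue  # font, page-break, or other records are ignored here
        x, y = int(m.group(1)), int(m.group(2))
        if footer_max_y < y < header_min_y:
            yield x, y, m.group(3)


# Hypothetical sample records, invented for illustration only.
sample = [
    "F 100 28 (Times-Roman)",
    "S 720 7500 (Chapter 1) 400",
    "S 720 6000 (Body text here.) 900",
    "S 720 300 (Page 12) 300",
    "P",
]

kept = list(body_strings(sample, footer_max_y=500, header_min_y=7200))
```

The cutoffs would have to be tuned per document, and the same loop
could be extended to watch the font records to decide which strings
become headings or other tagged elements.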
Exegenix (http://exegenix.com/) has a commercial solution for converting
PostScript or PDF to XML; it looks intriguing.
--
Larry Kollar k o l l a r @ a l l t e l . n e t
"The hardest part of all this is the part that requires thinking."
-- Paul Tyson, on xml-doc
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list