On Fri, 13 Aug 2004, Wendell Piez wrote:
Your project sounds very ambitious. Up-conversion is a challenging and
fascinating business, which we're all going to learn much more about.
You have several conference papers' worth of material here, I bet.
I'm hoping so.
Quite frankly, I hadn't realized we were so cutting edge. :)
Ultimately, my goal is to provide an application that integrates the text
file (written using the user's text processor of choice) with the rest of
the submission workflow. The user wants to submit a manuscript; the
application then performs all the necessary generation of the document
(including the cover letter), using user-specific information about how
they want the document to appear, including any market- or genre-specific
styles. Press a button, out pops the PDF or RTF. For now, I'll settle for
PDF. :)
I'd already written the submission manager and am now trying to integrate
another person's work into the project. Thus my struggle to understand.
At 08:15 PM 8/12/2004, you wrote:
But I've been thinking, based on the comments from the list, that a
better process might be eliminating the perl script entirely.
Maybe: but you'll need something at least as good to do the work it's
doing, and Perl is really good at regular expressions and string processing
generally.
(Personally I might have tried it in Python, but that's mainly because I
can count the lines of Perl I've written in my life on one hand. Of course,
I can count in binary on my hands, which gets me higher than five.)
I didn't write the perl script, thus my frustration (as a Python person).
My partner-in-crime and I have come at the problem from entirely different
directions.
Now that it has some regexp support, XSLT 2.0 should be at least a credible
option here, but its features have yet to be stress-tested TMK, and tool
support is still somewhat up in the air. (I believe Mike Kay is
speaking on this very topic at XML 2004 this November in Washington
DC.)
OK, that's what I'd been beginning to understand based on the list comments. I
wasn't aware of the tool support problem.
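From what I've been reading, the regexp support would let me do the string
surgery right in the stylesheet, something like this (a sketch only; the
asterisk-for-emphasis convention and the emph element are just stand-ins
for whatever my plain-text conventions end up being):

    <!-- XSLT 2.0: wrap *asterisk-marked* spans in emphasis during upconversion -->
    <xsl:template match="text()">
      <xsl:analyze-string select="." regex="\*([^*]+)\*">
        <xsl:matching-substring>
          <emph><xsl:value-of select="regex-group(1)"/></emph>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>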
A split-down-the-middle option could be to write a little function
library in the language of your choice to do the upconversion
string-processing, and call out to it from your XSLT using extension
functions. (This is what I kind of imagined would happen five years
ago, but it turns out processor-dependent extension functions are
unfashionable these days.)
This is an intriguing option.
99% of the problem comes from documents saved in the word processor's
native format that aren't correctly tagged. I'm not quite certain yet what
to do about this so that the editing stays transparent.
I feel moderately confident that this might make it a more cohesive
process, which would also require fewer installed pieces in order to work.
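If I understand the extension-function idea, the stylesheet side would be
roughly this (a sketch, assuming Saxon's java: namespace convention and a
made-up Upconvert class with a static smartenQuotes method on the
classpath; Xalan spells the binding differently):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:up="java:net.example.Upconvert"
        exclude-result-prefixes="up">
      <!-- hand the fiddly string work off to a small function library -->
      <xsl:template match="para/text()">
        <xsl:value-of select="up:smartenQuotes(string(.))"/>
      </xsl:template>
    </xsl:stylesheet>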
I'm not sure I'd
want to eliminate the intermediate XML file, though.
I think having the intermediate format will prove to be good design in
any case.
OK.
Option 3 seems to be ruled out based on my current toolchain
(apache-FOP), which probably eliminates #2 as well. (I could easily be
wrong on this)
Apache Xalan-J has support for a node-set function, so you could use
option 2 if you wanted. It will even recognize it in the exslt.org
namespace, which is nice.
Neat.
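So within a single stylesheet, the two passes would presumably look
something like this (a sketch; the mode names are mine):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:exsl="http://exslt.org/common"
        exclude-result-prefixes="exsl">

      <xsl:template match="/">
        <!-- pass 1: build an intermediate tree in a variable -->
        <xsl:variable name="pass1">
          <xsl:apply-templates select="/" mode="structure"/>
        </xsl:variable>
        <!-- pass 2: treat that result-tree fragment as a real node-set -->
        <xsl:apply-templates select="exsl:node-set($pass1)/*" mode="format"/>
      </xsl:template>

      <!-- ... templates for the structure and format modes ... -->

    </xsl:stylesheet>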
So, my question (you knew there was one): can someone give me a
description of how to accomplish #4, given the workflow I've got, using
something like Saxon? I see that it's an XSLT processor, but I don't get
the map of how all the pieces fit together. Right now, I know (after
having looked) that I'm using Xalan for the simple reason that it came
with my Apache FOP install.
Saxon is well-liked by developers (it runs well, it's conformant, and
it has good error messages), and can be switched in for Xalan in your
toolchain if you prefer it. Saxon also supports exslt:node-set, so you
can use option #2 with it as well.
Well, I can see if it offers me more options. I know enough to figure out
how to wrest it into the toolchain.
As I mentioned, it has an extension attribute, saxon:next-in-chain, that
can be invoked for pipelining. IIRC it passes SAX events between processor
invocations (Mike?), so it's much faster than writing a file and reparsing,
though perhaps not quite as fast as passing unserialized trees, as options
2 and 3 would do.
Right now, I'm running a script daily that re-generates XML files from any
changed text files in a given directory tree. PDF generation happens on
request, with re-generation of the XML if it's needed. So part A
(txt->xml) doesn't necessarily happen when part B (xml->pdf) does.
Nevertheless, you've given me another idea, which I'll try over this
weekend.
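If I've read the Saxon documentation right, the chaining is just an
extension attribute on xsl:output in the first stage, along these lines
(the stylesheet name is invented; the saxon namespace is
http://icl.com/saxon for Saxon 6.5 and http://saxon.sf.net/ for Saxon 8):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:saxon="http://icl.com/saxon">
      <!-- the output of this stage is fed straight into the next stylesheet -->
      <xsl:output method="xml" saxon:next-in-chain="manuscript-to-fo.xsl"/>
      <!-- ... upconversion templates ... -->
    </xsl:stylesheet>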
I am reasonably sure Xalan offers similar features, however, or the Cocoon
framework does.
Cocoon seems very interesting, but I don't quite get where it fits into
the overall picture of things, though I am reading up on it.
I'd also eventually like to get a decent RTF output. Standard manuscript
prose is not terribly complex, so something that supported basic features
should suffice for that. Unfortunately, the commercial options are too
expensive for the intended audience. Is jfor likely to be my best
available option?
I'd be interested myself to hear from the list on this question. I haven't
yet seen a really nice route to RTF. I think two passes at this
(analogous to the way IBM deployed a "TeXML" which could be targeted as a
route to TeX) might be the best way to do it: have yet another tag set that
describes only the formatting primitives supported by RTF and a utility
stylesheet to make RTF out of that. Or use XSL-FO, if any of the formatters
can make decent RTF yet.
jfor hasn't been updated at all in over a year, so it seems like a dead
project. And jfor.org is down.
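For what it's worth, the two-pass idea sounds doable even with plain XSLT
text output; something like this bare-bones sketch (the tag names and the
handful of RTF control words are just to show the shape of it):

    <!-- utility stylesheet: tiny RTF-primitive tag set in, raw RTF out -->
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>

      <xsl:template match="/doc">
        <xsl:text>{\rtf1\ansi\deff0{\fonttbl{\f0 Courier New;}}&#10;</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:template>

      <xsl:template match="para">
        <xsl:text>\pard </xsl:text>
        <xsl:if test="@align = 'center'">
          <xsl:text>\qc </xsl:text>
        </xsl:if>
        <xsl:apply-templates/>
        <xsl:text>\par&#10;</xsl:text>
      </xsl:template>

      <xsl:template match="emph">
        <xsl:text>{\i </xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:template>
    </xsl:stylesheet>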
I should add that I *do* need API access rather than a standalone
application.
--
_Deirdre web: http://deirdre.net blog: http://deirdre.org/blog/
yarn: http://fuzzyorange.com cat's blog: http://fuzzyorange.com/vsd/
"Memes are a hoax! Pass it on!"