At 11:32 AM 10/20/2006, Andrew wrote:
If you think it's not really feasible to parse a plain text citation
into a marked up version then that's good feedback - it could well be
that a percentage needs to be done by hand.
Scale is a real issue here. Real-world citation formats include
variations like "use 'pp.' on page ranges for articles in books, but
not for articles in journals." Even if your process does the correct
thing with 85 of 100 citations (a very optimistic rate), the other 15
percent add up to scores of incorrect ones at scale. And if your upconversion
can't recognize where it's failing, you have to find the errors
before you can fix them.
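
To make the point about recognizing failures concrete, here is a minimal
sketch in Python (not XSLT, and not anyone's actual converter). It assumes
one narrow, hypothetical journal-article layout, "Author (Year). Title.
Journal, Volume, Pages.", and the names JOURNAL, parse_citation and triage
are invented for illustration. The idea is only that a strict pattern
rejects what it doesn't understand instead of guessing, so the failures
arrive as a list rather than hiding in the output.

```python
import re

# Hypothetical pattern for ONE narrow journal-article layout:
#   Author. (Year). Title. Journal, Volume, Pages.
# Real citation styles have many more variants than this covers.
JOURNAL = re.compile(
    r"^(?P<author>[^()]+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>[^.]+)\.\s*"
    r"(?P<journal>[^,]+),\s*(?P<volume>\d+),\s*"
    r"(?P<pages>\d+-\d+)\.$"
)


def parse_citation(text):
    """Return a dict of fields, or None when the text does not match."""
    match = JOURNAL.match(text.strip())
    return match.groupdict() if match else None


def triage(citations):
    """Split citations into parsed field dicts and rejected raw strings."""
    parsed, rejected = [], []
    for citation in citations:
        fields = parse_citation(citation)
        if fields is None:
            # Not recognized: queue for a person instead of guessing.
            rejected.append(citation)
        else:
            parsed.append(fields)
    return parsed, rejected


if __name__ == "__main__":
    samples = [
        "Smith, J. (2001). Parsing citations. Journal of Markup, 12, 34-56.",
        "Jones, A. (1999). A chapter. In B. Editor (Ed.), Some Book "
        "(pp. 10-20). Publisher.",
    ]
    parsed, rejected = triage(samples)
    print(len(parsed), "parsed;", len(rejected), "need review")
```

The book chapter with "pp." lands in the rejected list because the pattern
was never taught that form. Note what this does not solve: a citation that
matches the pattern but is assigned the wrong fields still goes through
silently, and those are the errors you have to hunt for.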
David is right: it's ultimately an NLP problem (though a very
interesting subset of NLP). As he also says, success depends both on
handling the rules properly, and on the input actually following
those rules. (There are dozens of citation formats around, too.)
"Never say never" is good to keep in mind, but when I'm asked to look
at citations I immediately start asking questions about the scope of
the input, its validation, and acceptable strategies for exception
handling. When I'm told there won't be any exceptions, it's usually
pretty easy to find a bunch.
Cheers,
Wendell
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
--~--