On Mon, 2018-04-09 at 20:52 +0000, David Sewell dsewell(_at_)virginia(_dot_)edu
wrote:
Wondering if anyone has a serviceable function (preferably in XSLT 2/3
but v1 is fine if it works) that takes a string as input and returns it
with title capitalization according to English-language editorial
practice (for example, Chicago Manual of Style).
I'd use replace() probably, rather than tokenizing, so as to change as
little as possible & facilitate regression tests.
Some test cases should include
* words that do and don't change at the start and at the end of input;
* words like o'clock and don't that include apostrophes, both as '
and as ’ (it doesn't matter whether they are input as entities
or literally or numeric character references though, as they all
end up the same after XML parsing)
* hyphenated proper names like Rees-Mogg
* exceptions like Ladies-in-Waiting
* punctuation such as em dashes, quotes, commas, semicolons
Unfortunately XSLT doesn't give us Perl's wonderful e modifier on
substitution, and neither does XQuery (where it'd be more useful), but
XSLT does give us xsl:analyze-string. I'd start with David Carlisle's
approach and add a lot of test cases and fix the regexp to be something
more like
(\w)(\w*(?:'\w+)?)
maybe.
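
A rough sketch of what that regexp does, written in Python only because
it is easy to run; the callback passed to sub() plays the role of Perl's
e modifier or of xsl:matching-substring. The SMALL_WORDS list, the
function name, the first-word/last-word rule and the extra ’ in the
character class are assumptions for illustration, not part of the
suggestion above. Note also that (?:...) needs XPath 3.0 regular
expressions; in XSLT 2.0 an ordinary capturing group would have to do.

```python
import re

# Hypothetical, deliberately short list of words left alone mid-title.
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in",
               "of", "on", "or", "the", "to"}

# First letter, then the rest of the word, optionally including one
# apostrophe (straight or curly) followed by more word characters.
WORD = re.compile(r"(\w)(\w*(?:['\u2019]\w+)?)")


def title_case(text):
    matches = list(WORD.finditer(text))
    if not matches:
        return text
    # The first and last words are always capitalised.
    edges = (matches[0].start(), matches[-1].start())

    def fix(m):
        if m.start() not in edges and m.group(0).lower() in SMALL_WORDS:
            return m.group(0)  # leave small words exactly as they were
        return m.group(1).upper() + m.group(2)

    # Everything between matches (hyphens, dashes, quotes) is untouched.
    return WORD.sub(fix, text)
```

Because only the matched first letters are rewritten, punctuation and
spacing pass through unchanged, which keeps regression diffs small.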
An alternative is to replace (\w)'(\w) with $1Œ$2 everywhere, where Œ
is some Unicode upper-case letter or sequence of letters that
definitely doesn't occur in your input, and change it back at the end.
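
The placeholder trick might look something like this, again sketched in
Python rather than XSLT. The function names are made up, the
capitaliser is a deliberately naive stand-in, and only the straight
apostrophe is handled; Python lookarounds are used in place of the
$1...$2 captures so that adjacent cases like rock'n'roll are all caught.

```python
import re

PLACEHOLDER = "\u0152"  # Œ, assumed never to occur in the input


def capitalise_words(text):
    # Naive stand-in: upper-case the first character after any word
    # boundary, which on its own would turn "don't" into "Don'T".
    return re.sub(r"\b(\w)", lambda m: m.group(1).upper(), text)


def title_case_with_placeholder(text):
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the input")
    # Hide apostrophes that sit between two word characters.
    hidden = re.sub(r"(?<=\w)'(?=\w)", PLACEHOLDER, text)
    # Capitalise, then change the placeholder back at the end.
    return capitalise_words(hidden).replace(PLACEHOLDER, "'")
```

Since the placeholder is itself a letter, the word is never split at the
apostrophe, so the simple capitaliser no longer sees "t" as a new word.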
In XSLT 1 I'd cry for a while and then write something recursive that
splits its input using translate() and substring-before() to find where
to split.
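
The shape of that recursion, sketched in Python with str.translate and
str.partition standing in for translate() and substring-before(). The
separator list and function name are assumptions, and there is no
small-word handling here; it only shows how the splitting could go.

```python
# Assumed separators: space, hyphen, em dash, comma, semicolon, colon,
# straight and curly double quotes. Apostrophes are left out on purpose
# so that words like don't stay in one piece.
SEPARATORS = " -\u2014,;:\"\u201c\u201d"
FLATTEN = {ord(c): " " for c in SEPARATORS}


def title_case_recursive(text):
    if text == "":
        return ""
    # translate(): map every separator to a space so a single search
    # finds whichever one comes first.
    flat = text.translate(FLATTEN)
    # substring-before(): the first word is whatever precedes that space.
    head, sep, _ = flat.partition(" ")
    if sep == "":
        # No separator left: the remaining text is the last word.
        return text[:1].upper() + text[1:]
    cut = len(head)
    word = text[:cut]
    # Capitalise the word, copy the original separator, recurse on the rest.
    return (word[:1].upper() + word[1:] + text[cut]
            + title_case_recursive(text[cut + 1:]))
```

The flattened copy is only used to find the split point; the output is
always built from the original string, so separators survive as typed.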
For https://words.fromoldbooks.org/Chalmers-Biography/ I use Perl, as
the input isn't well-formed XML at first, with a table of manual
overrides, but there are fewer than 10,000 entries I think. Once it's
in XML my script/Makefile for conversion does use XSLT, taking 46
seconds to process 43 MB of XML into 9771 separate XML files with
Saxon.
Liam
--
Liam Quin, W3C, http://www.w3.org/People/Quin/
Staff contact for Verifiable Claims WG, SVG WG, XQuery WG
Improving Web Advertising: https://www.w3.org/community/web-adv/
Personal: awesome vintage art: http://www.fromoldbooks.org/