xsl-list

Re: [xsl] XSLT function for title capitalization?

2018-04-10 01:19:37
On Mon, 2018-04-09 at 20:52 +0000, David Sewell dsewell(_at_)virginia(_dot_)edu wrote:
> Wondering if anyone has a serviceable function (preferably in XSLT 2/3
> but v1 is fine if it works) that takes a string as input and returns
> it with title capitalization according to English-language editorial
> practice (for example, Chicago Manual of Style).

I'd use replace() probably, rather than tokenizing, so as to change as
little as possible & facilitate regression tests.

Some test cases should include
* words that do and don't change at the start and at the end of input;
* words like o'clock and don't that include apostrophes, both as '
  and as ’ (it doesn't matter whether they are input as entities
  or literally or numeric character references though, as they all
  end up the same after XML parsing)
* hyphenated proper names like Rees-Mogg
* exceptions like Ladies-in-Waiting
* punctuation such as em dashes, quotes, commas, semicolons

Unfortunately XSLT doesn't give us Perl's wonderful e modifier on
substitution, and neither does XQuery (where it'd be more useful), but
XSLT does give us xsl:analyze-string. I'd start with David Carlisle's
approach and add a lot of test cases and fix the regexp to be something
more like
   (\w)(\w*(?:'\w+)?)
maybe.
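
A minimal sketch of that regex idea, written in Python rather than XSLT only
because a replacement callback there plays the role of Perl's e modifier. The
names title_case and SMALL_WORDS are hypothetical, the small-word list is an
assumption and not a full Chicago rule set, and the pattern is extended to
accept the curly apostrophe as well as the straight one.

```python
import re

# Hypothetical list of words kept lower-case inside a title (an assumption,
# not taken from the post and not a complete style-guide list).
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in",
               "of", "on", "or", "the", "to"}

# First letter, then the rest of the word, optionally followed by an
# apostrophe (straight or curly) and more letters, as in don't or o'clock.
WORD = re.compile(r"(\w)(\w*(?:['\u2019]\w+)?)")


def title_case(text: str) -> str:
    """Capitalize words in place, leaving all other characters untouched."""
    matches = list(WORD.finditer(text))
    if not matches:
        return text
    first_start = matches[0].start()
    last_start = matches[-1].start()

    def fix(m: re.Match) -> str:
        word = m.group(0)
        is_edge = m.start() in (first_start, last_start)
        # Small words stay lower-case unless they begin or end the input.
        if word.lower() in SMALL_WORDS and not is_edge:
            return word.lower()
        # Only the first letter changes; the rest of the word is preserved.
        return m.group(1).upper() + m.group(2)

    return WORD.sub(fix, text)


if __name__ == "__main__":
    for sample in ("the cat in the hat", "ladies-in-waiting", "Rees-Mogg"):
        print(title_case(sample))
```

Because only matched words are rewritten, punctuation and spacing pass through
unchanged, which keeps regression diffs small. Proper names and other
exceptions would still need an override table.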

An alternative is to replace (\w)'(\w) with $1Œ$2 everywhere, where Œ
is some Unicode upper-case letter or sequence of letters that
definitely doesn't occur in your input, and change it back at the end.
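
The placeholder trick can be sketched like this, again in Python as a stand-in
for XSLT. The function name capitalize_words is hypothetical, only the straight
apostrophe is handled, and lookarounds are used instead of the two capture
groups so that inputs like a'b'c are fully protected; that is a variation on
the pattern above, not the pattern itself.

```python
import re

PLACEHOLDER = "\u0152"  # Œ, assumed never to occur in the input


def capitalize_words(text: str) -> str:
    """Capitalize every word, treating word-internal apostrophes as letters."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder character already occurs in the input")
    # Step 1: hide apostrophes that sit between two word characters.
    protected = re.sub(r"(?<=\w)'(?=\w)", PLACEHOLDER, text)
    # Step 2: a plain word regex now sees don't as a single word.
    capped = re.sub(r"(\w)(\w*)",
                    lambda m: m.group(1).upper() + m.group(2),
                    protected)
    # Step 3: change the placeholder back at the end.
    return capped.replace(PLACEHOLDER, "'")


if __name__ == "__main__":
    print(capitalize_words("don't stop at nine o'clock"))
```

This sketch capitalizes every word; a small-word exception list would slot
into step 2.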

In XSLT 1 I'd cry for a while and then write something recursive that
splits its input, using translate() and substring-before() to find where
to split.
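
The shape of that recursion, sketched in Python for illustration only: the
translation tables stand in for XSLT 1's translate(), and finding the first
space in the normalized copy stands in for measuring substring-before(). The
delimiter set and the function names are assumptions, and only ASCII letters
are upper-cased, as they would be with a hand-written translate() call.

```python
# Delimiters to split on (an assumed set; \u2014 is the em dash).
DELIMS = " -,;:.!?\u2014\"()"
# Like translate($s, $delims, '            '): map every delimiter to a space.
TO_SPACE = str.maketrans(DELIMS, " " * len(DELIMS))
# Like translate($c, 'abc...', 'ABC...'): upper-case ASCII letters only.
TO_UPPER = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def cap_first(word: str) -> str:
    """Upper-case the first character of word, if there is one."""
    return word[:1].translate(TO_UPPER) + word[1:]


def capitalize_recursive(text: str) -> str:
    """Capitalize the first word, then recurse on what follows the delimiter."""
    normalized = text.translate(TO_SPACE)
    if " " not in normalized:
        return cap_first(text)  # last (or only) word
    # Length of substring-before(normalized, ' ') gives the split point.
    cut = normalized.index(" ")
    # Keep the original delimiter character from text, not the space.
    return cap_first(text[:cut]) + text[cut] + capitalize_recursive(text[cut + 1:])


if __name__ == "__main__":
    print(capitalize_recursive("rees-mogg, don't panic"))
```

Apostrophes are deliberately left out of the delimiter set, so don't is treated
as one word. In Python the recursion depth limits this to inputs of about a
thousand words, which is fine for titles.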

For https://words.fromoldbooks.org/Chalmers-Biography/ I use Perl with
a table of manual overrides, as the input isn't well-formed XML at
first; there are fewer than 10,000 entries, I think. Once it's in XML,
my script/Makefile for conversion does use XSLT, taking 46 seconds to
process 43 MB of XML into 9771 separate XML files with Saxon.

Liam


-- 
Liam Quin, W3C, http://www.w3.org/People/Quin/
Staff contact for Verifiable Claims WG, SVG WG, XQuery WG
Improving Web Advertising: https://www.w3.org/community/web-adv/
Personal: awesome vintage art: http://www.fromoldbooks.org/
--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list