xsl-list

RE: [xsl] Using analyze-string to catch roman numerals?

2008-10-09 18:06:27
The two things wrong with your solution are:

(a) you're matching any sequence of letters that could be a roman numeral,
without looking at the context, hence matching the IX in APPENDIX.

(b) you're only matching the first thing in each element that looks like a
roman numeral, because the anchored regex consumes the whole string in a
single match.

The second is easily fixed: don't use an anchored regex in analyze-string,
like this:

regex="^(.*?)([IVXL]+)(.*?)$"

Instead, use an unanchored regex:

regex="([IVXL]+)"

and add an xsl:non-matching-substring element that copies unmatched
substrings across unchanged (or case-converted if you want).
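
As a minimal sketch of that fix (the stylesheet wrapper and the choice to
lower-case the unmatched text are assumptions, not part of the original
template), the whole xsl:choose can be dropped, because analyze-string
already handles strings that contain no match at all:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="head">
    <!-- Unanchored: the regex is applied repeatedly along the string -->
    <xsl:analyze-string select="." regex="([IVXL]+)">
      <xsl:matching-substring>
        <!-- Every run of I, V, X, L is copied as-is -->
        <xsl:value-of select="."/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Everything between the matches is lower-cased -->
        <xsl:value-of select="lower-case(.)"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

This only addresses (b): every candidate in a heading is now found, but
"APPENDIX VII" still comes out as "appendIX VII", since the IX inside the
word is matched too.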

Problem (a) is much harder. You can get a fair way by requiring the sequence
of IVXL to have non-letters before and after it. But you'll still be
matching the word "ILL" as a roman numeral when it clearly isn't. Like all
up-conversion tasks, though, it's very much up to you how much time you want
to spend fine-tuning the patterns and rules that you define.
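
One way to sketch the "non-letters before and after" rule, under some
assumptions: XPath regexes have no \b or lookaround, so instead of testing
the neighbours of each IVXL run, this version matches whole words and then
checks whether the word consists only of I, V, X and L. The [A-Za-z] class
assumes ASCII-only headings.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="head">
    <!-- Match each maximal run of letters, i.e. each word -->
    <xsl:analyze-string select="." regex="[A-Za-z]+">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- A word made up entirely of I, V, X, L is kept as a numeral -->
          <xsl:when test="matches(., '^[IVXL]+$')">
            <xsl:value-of select="."/>
          </xsl:when>
          <!-- Any other word is lower-cased -->
          <xsl:otherwise>
            <xsl:value-of select="lower-case(.)"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Spaces, digits and punctuation are copied unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this rule APPENDIX, CALVIN and ILLUSTRATION are lower-cased as whole
words, while II, IV, VII and XXIX are left alone. It does not solve the
"ILL" case, and a capital "I" used as a pronoun would also be kept as a
numeral; those need extra rules of your own.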

Michael Kay
http://www.saxonica.com/ 

-----Original Message-----
From: Tony Zanella [mailto:tony(_dot_)zanella(_at_)gmail(_dot_)com] 
Sent: 09 October 2008 20:18
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] Using analyze-string to catch roman numerals?

Hello all,

Given the following input:

<root>
    <head>CHAPTER II. THE WRECKED FOUNDATIONS OF DOMESTICITY</head>
    <head>PROBLEMA. HELOISE XXIX.</head>
    <head>Selected Letters</head>
    <head>The Second Part of Henry IV.</head>
    <head>VIII</head>
    <head>APPENDIX VII</head>
    <head>Appendix VII</head>
    <head>APPENDIX</head>
    <head>CALVIN XVII</head>
    <head>ILLUSTRATION</head>
</root>

and the following template:

<xsl:template match="head">
    <xsl:choose>
        <xsl:when test="not(matches(.,'^(.*?)([IVXL]+)(.*?)$'))">
            <xsl:value-of select="lower-case(.)"/>
        </xsl:when>
        <xsl:when test="matches(.,'^(.*?)([IVXL]+)(.*?)$')">
            <xsl:analyze-string select="." regex="^(.*?)([IVXL]+)(.*?)$">
                <xsl:matching-substring>
                    <xsl:value-of select="lower-case(regex-group(1))"/>
                    <xsl:value-of select="upper-case(regex-group(2))"/>
                    <xsl:value-of select="lower-case(regex-group(3))"/>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:when>
        <xsl:otherwise/>
    </xsl:choose>
</xsl:template>

I'm trying to use analyze-string to do the following:
Test for a roman numeral. If there isn't one, lower-case(.). 
If there is one, break (.) into its roman numeral and 
non-roman numeral parts, lower-case()ing the latter.

The output I get is:

    chapter II. the wrecked foundations of domesticity
    probLema. heloise xxix.
    selected Letters
    the second part of henry IV.
    VIII
    appendIX vii
    appendix VII
    appendIX
    caLVIn xvii
    ILLustration

When what I want is this:

      chapter II. the wrecked foundations of domesticity
      problema. heloise XXIX.
      selected letters
      the second part of henry IV.
      VIII
      appendix VII
      appendix VII
      appendix
      calvin XVII
      illustration

I'm relatively inexperienced with both regexes and XSLT, so
thanks for any help!
Tony

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

