xsl-list

Re: [xsl] two <xsl:analyze-string> questions

2011-10-22 11:43:31
The following might work for part 2.

  <xsl:variable name="regex" select="'(\p{L})6(\p{L}?)|(\p{L}?)6(\p{L})'"/>
  <xsl:analyze-string select="." regex="{$regex}">
    <xsl:matching-substring>
      <xsl:value-of select="concat(regex-group(1), regex-group(3),
                                   'b', regex-group(2), regex-group(4))"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
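
One edge case to watch for, as far as I can tell: the pattern above consumes the letter that follows the 6, so that letter is no longer available as the "preceding letter" for a second 6. A string like a6b6 would come out as abb6. A possible workaround, sketched here on the assumption that XPath 2.0 replace() is acceptable in place of <xsl:analyze-string>, is to make two passes that each look at only one neighbor. The template priority and the variable name pass1 are my own choices, and this has not been run against the real OCR data.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy elements, attributes, etc. unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- priority="1" avoids a conflict with node() in the identity template. -->
  <xsl:template match="text()" priority="1">
    <!-- Pass 1: a 6 directly after a letter becomes b. -->
    <xsl:variable name="pass1" select="replace(., '(\p{L})6', '$1b')"/>
    <!-- Pass 2: any remaining 6 directly before a letter becomes b. -->
    <xsl:value-of select="replace($pass1, '6(\p{L})', 'b$1')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the regexes sit in XPath string literals inside select, not in the regex attribute (which is an attribute value template), the braces in \p{L} need no doubling and no helper variable. A 6 whose only neighbors are other 6s (the middle digit in a666b, say) is left alone by both passes.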

-Brandon :)


On Sat, Oct 22, 2011 at 10:55 AM, Birnbaum, David J 
<djbpitt(_at_)pitt(_dot_)edu> wrote:
Dear XSLT-List,

I'd be grateful for advice about a two-part <xsl:analyze-string> problem. I'm 
post-processing messy OCR output, and the situation I'm trying to address 
involves patterns and patterned errors that can be identified through regex 
matching. Some of the patterns are traditional up-conversion (e.g., find a 
certain pattern of digits and punctuation and wrap markup around it); some of 
them are corrections (e.g., the digit "6" and the letter "b" are confused, 
but a digit "6" adjacent to a letter is probably an error and should be 
corrected automatically, while a digit "6" not adjacent to a letter probably 
isn't and should be left alone).

1. The first part of my problem involves general program logic. I'm currently 
using a strategy like the following:

   <xsl:template match="text()">
       <xsl:call-template name="editionLineNo">
           <xsl:with-param name="current" select="."/>
       </xsl:call-template>
   </xsl:template>
   <xsl:template name="editionLineNo">
       <!-- 1. check for digits plus period, \d+\., edition line no -->
       <xsl:param name="current"/>
       <xsl:analyze-string select="$current" regex="(\d+)\.">
           <xsl:matching-substring>
               <editionLineNo>
                   <xsl:value-of select="regex-group(1)"/>
               </editionLineNo>
           </xsl:matching-substring>
           <xsl:non-matching-substring>
               <xsl:call-template name="msFolioNo">
                   <xsl:with-param name="current" select="."/>
               </xsl:call-template>
           </xsl:non-matching-substring>
       </xsl:analyze-string>
   </xsl:template>

That is, at the beginning I grab a pristine text node and look for a pattern. 
If it's there, I'm done; if not, I pass the non-matching substring to the 
next template to look for a different pattern. One template calls another, 
passing the unmatched substrings, until the end, when I just output the text.

This works, but is it the best approach? Should I instead, for example, use a 
single callable template and pass it both the haystack string and the needle 
regex? My highest priorities are legibility and ease of development and 
maintenance; efficiency of operation is less important. In case this is 
important, the order in which the patterns are matched matters, at least in a 
few instances. For example, digits followed by a period get one kind of 
markup and digits not followed by a period get another, so I want to capture 
the first type first and get them out of the way before looking for the 
second.
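
For concreteness, the single-template alternative could be sketched roughly as follows. This is only an illustration under several assumptions: the names applyRules, rule, regex and wrap are hypothetical, every rule is assumed to wrap regex-group(1) in one element, and the second rule (bare digits wrapped in otherNumber) is a placeholder for whatever markup is really wanted. The rules are tried in document order, which preserves the "digits plus period first" requirement.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Ordered rule list (hypothetical). These attributes are attribute
       value templates, so any { or } in a regex would need doubling. -->
  <xsl:variable name="rules" as="element(rule)*">
    <rule regex="(\d+)\." wrap="editionLineNo"/>
    <rule regex="(\d+)" wrap="otherNumber"/>
  </xsl:variable>

  <xsl:template match="text()">
    <xsl:call-template name="applyRules">
      <xsl:with-param name="text" select="."/>
      <xsl:with-param name="rules" select="$rules"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Apply the first rule; hand each unmatched piece to the remaining rules. -->
  <xsl:template name="applyRules">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="rules" as="element(rule)*"/>
    <xsl:choose>
      <!-- No rules left: just output the text. -->
      <xsl:when test="empty($rules)">
        <xsl:value-of select="$text"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="rule" select="$rules[1]"/>
        <xsl:analyze-string select="$text" regex="{$rule/@regex}">
          <xsl:matching-substring>
            <xsl:element name="{$rule/@wrap}">
              <xsl:value-of select="regex-group(1)"/>
            </xsl:element>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <!-- "." is the current unmatched piece, not the whole input. -->
            <xsl:call-template name="applyRules">
              <xsl:with-param name="text" select="."/>
              <xsl:with-param name="rules" select="$rules[position() gt 1]"/>
            </xsl:call-template>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Adding or reordering a pattern would then mean editing the rule list only, at the cost that rules needing something other than "wrap group 1 in an element" (a correction, for instance) would need a separate mechanism.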

2. The second part of my problem involves a particular type of regex, one 
that will, for example, identify a digit "6" that is adjacent to a letter and 
replace it with a letter "b". The adjacent letter could precede or follow the 
digit or both. If I make the preceding and following letter(s) optional in 
the pattern, I've made both optional, and I'll erroneously catch an isolated 
digit "6". If I use a disjunct pattern, it becomes harder to capture the 
pieces and output the ones I want to retain with regex-group(). I suspect 
that this is a common problem with a standard solution, but I haven't run 
into it before and no single, elegant but legible regex leaps to mind. Is 
there one?

Thanks for any advice,

David
djbpitt(_at_)gmail(_dot_)com

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


