xsl-list

[xsl] Hanging regex

2012-11-17 06:05:31
First, let me dissect the regex:

           <xsl:analyze-string select="." flags="x"
                               regex="(.+?)
                                      ((-?\d*\s*)+$)"

is aimed at lines of balance sheet text such as the one below, where we do
not know in advance how many amounts will occur:

  1. Total Quick Assets                              1,511          2,829          1,694          4,429

(.+?) lazily matches the non-financial half of the line. In this case
it picks up "1. Total Quick Assets".

((-?\d*\s*)+$) captures the financial half, allowing for a leading
minus sign on each amount. The inner brackets are there only for
grouping; their capture (group 3) is not used.
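
To make the dissection concrete, here is a minimal sketch that applies the same regex to one literal line and outputs the two groups. The template name main, the variable sample and the groups/group element names are made up for illustration, and the sample line assumes the commas have already been removed. It needs no source document if the processor can start from a named template (for example Saxon's -it:main option).

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical sample line; commas already stripped from the amounts -->
  <xsl:variable name="sample"
      select="'  1. Total Quick Assets      1511    2829    1694    4429'"/>

  <!-- Hypothetical entry point, used only for this sketch -->
  <xsl:template name="main">
    <groups>
      <!-- flags="x" lets the regex be split over two lines -->
      <xsl:analyze-string select="$sample" flags="x"
                          regex="(.+?)
                                 ((-?\d*\s*)+$)">
        <xsl:matching-substring>
          <!-- group 1: the descriptive text before the amounts -->
          <group n="1"><xsl:value-of select="normalize-space(regex-group(1))"/></group>
          <!-- group 2: every amount up to the end of the line -->
          <group n="2"><xsl:value-of select="normalize-space(regex-group(2))"/></group>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </groups>
  </xsl:template>
</xsl:stylesheet>
```

With this input, group 1 should come out as "1. Total Quick Assets" and group 2 as "1511 2829 1694 4429" once normalize-space() has been applied.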

Here is some test data, a file containing the following:


 I. Current Assets                                   1,871          2,829          1,694          4,429
  1. Total Quick Assets                              1,511          2,829          1,694          4,429
   Short-term financial instrument                      31             16             45              -
  2. Total Inventories                                 359              -              -              -
 II. Leased Housing Assets                               -              -              -              -
 III. Deferred Liabilities                               -              -              -              -
 III.Capital Adjustments                                 -              -            -28            -30
 V. Retained Earnings                               -2,840         -4,664         -4,363         -4,383



**********************************************************************************************************




FINANCIAL INFORMATION                                                1. Financial Statements

Income Statement
------------------
                                                                            (Unit : KRW million)
**********************************************************************************************************

Here is the stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input, 'iso-8859-1'), ',(\d{3}\D)', '$1'), '\r?\n')">

       <!-- Keep only lines that contain a word character and at least one digit or minus sign -->
       <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
         <line>
           <xsl:analyze-string select="." flags="x"
                               regex="(.+?)
                                      ((-?\d*\s*)+$)">

            <xsl:matching-substring>
              <lineItem><xsl:value-of select="normalize-space(regex-group(1))"/></lineItem>

              <yearlyFigures>
                <xsl:for-each select="tokenize(normalize-space(regex-group(2)),'\s+')">

                  <figure year='{position()}'>
                    <xsl:value-of select="."/>                  
                  </figure>
                </xsl:for-each> 

              </yearlyFigures>
            </xsl:matching-substring>

            <xsl:non-matching-substring>        
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
         </line>
       </xsl:if>
    </xsl:for-each>
  </xsl:template>
        
</xsl:stylesheet>

This works very well on the data above.

However, if I add the following line to the data:

                                                Jan.1,2005     Jan.1,2006     Jan.1,2007     Jan.1,200

the transformation hangs.
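
In case it is relevant, the comma-stripping step can be checked on its own. The replace() call only removes a comma that is followed by exactly three digits and then a non-digit character, so the commas in amounts go but the commas in the date line above stay. A minimal sketch, in which the template name main, the check/amounts/date element names and the literal strings are all made up for illustration (again meant to be started from the named template, for example with Saxon's -it:main option):

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical entry point, used only for this sketch -->
  <xsl:template name="main">
    <check>
      <!-- Each comma is followed by three digits and a space, so both are removed -->
      <amounts>
        <xsl:value-of select="replace('1,511 2,829 ', ',(\d{3}\D)', '$1')"/>
      </amounts>
      <!-- The comma is followed by four digits, so \D cannot match and the comma stays -->
      <date>
        <xsl:value-of select="replace('Jan.1,2005 ', ',(\d{3}\D)', '$1')"/>
      </date>
    </check>
  </xsl:template>
</xsl:stylesheet>
```

The first value should come back as "1511 2829 " and the second unchanged as "Jan.1,2005 ", which means the new line reaches xsl:analyze-string with its dots and commas intact.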

