xsl-list

Re: [xsl] Hanging regex

2012-11-17 06:40:45
On 17/11/2012 12:05, Ihe Onwuka wrote:
First let me dissect the regex

           <xsl:analyze-string select="." flags="x"
                               regex="(.+?)
                                      ((-?\d*\s*)+$)"

is targeted at lines of balance sheet text such as the one below, where
we do not know how many amounts will occur:

   1. Total Quick Assets                              1,511          2,829          1,694          4,429

(.+?) lazily matches the non-financial half of the line - in this
case it will gobble up "1. Total Quick Assets".

((-?\d*\s*)+$) captures the financial half, allowing for a leading
minus sign - the inner brackets are meant for grouping, not capture.
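
As a rough illustration of the dissection above, here is a minimal sketch that applies the same pattern to one sample row. It assumes Python's re module as a stand-in for the XPath regex engine, and the helper name dissect is made up for the example. The row has its thousands commas already stripped, as the stylesheet does before matching. The sketch only feeds the pattern a line that does end in amounts.

```python
import re

# Assumption: Python's re module stands in for the XPath regex engine.
# The pattern is the one dissected above, written on a single line
# (the stylesheet spreads it over two lines with flags="x").
ORIGINAL = re.compile(r"(.+?)((-?\d*\s*)+$)")

def dissect(line):
    """Split a line that ends in amounts into (label, amounts).

    Hypothetical helper for illustration only.
    """
    m = ORIGINAL.search(line)
    # group(1) is the lazily matched label, group(2) the trailing amounts.
    return m.group(1).strip(), m.group(2).split()

# Sample row with thousands commas already removed.
sample = "   1. Total Quick Assets          1511          2829          1694          4429"

label, amounts = dissect(sample)
print(label)
print(amounts)
```

The lazy first group stops as soon as the rest of the line consists only of minus signs, digits and whitespace, which is why it ends at "Assets".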

Here is some test data - a file containing the following


  I. Current Assets                                   1,871          2,829          1,694          4,429
   1. Total Quick Assets                              1,511          2,829          1,694          4,429
    Short-term financial instrument                      31             16             45              -
   2. Total Inventories                                 359              -              -              -
  II. Leased Housing Assets                               -              -              -              -
  III. Deferred Liabilities                               -              -              -              -
  III.Capital Adjustments                                 -              -            -28            -30
  V. Retained Earnings                               -2,840         -4,664         -4,363         -4,383



**********************************************************************************************************




FINANCIAL INFORMATION                                                1. Financial Statements

Income Statement
------------------
                                                                            (Unit : KRW million)
**********************************************************************************************************

Here is the stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        exclude-result-prefixes="xs" version="2.0">
   <xsl:output indent="yes"/>
   <xsl:param name="input" as="xs:string" required="yes"/>

   <xsl:template match="/">
     <!-- read in text whilst removing comma punctuation from monetary
fields -->
     <xsl:for-each select="tokenize(replace(unparsed-text($input,
'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">

        <!-- Delete lines that don't contain alphanumeric text -->        
        <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">        
          <line>
           <xsl:analyze-string select="." flags="x"
                               regex="(.+?)
                                      ((-?\d*\s*)+$)">

That's a wildly expensive regex. If I change it to

           <xsl:analyze-string select="." flags="x"
                               regex="(.+?)
                                      ((-|\d|\s)+$)">


I get identical output for your input, and adding the extra line doesn't make it take appreciably longer.
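
For what it's worth, the likely reason for the difference: in (-?\d*\s*)+ every piece inside the repeated group is optional, so a run of spaces or digits can be divided among the iterations in a great many ways, and a backtracking engine tries them all when the $ cannot be reached. In (-|\d|\s)+ each iteration consumes exactly one character, so there is only one way to cover a given run. Below is a minimal sketch of the revised pattern. It assumes Python's re module as a stand-in for the XPath regex engine; the helper name split_line and the spacing of the sample lines are made up for the example, and the header line is an assumed reconstruction of the wrapped line in the test data.

```python
import re

# Assumption: Python's re module stands in for the XPath regex engine.
# Revised pattern: each iteration of the inner group consumes exactly one
# character, so a run of spaces or digits can only be covered one way.
REVISED = re.compile(r"(.+?)((-|\d|\s)+$)")

def split_line(line):
    """Return (label, amounts), or None if the line has no trailing amounts.

    Hypothetical helper for illustration only.
    """
    m = REVISED.search(line)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).split()

# Sample rows with thousands commas already removed (spacing is illustrative).
rows = [
    "  V. Retained Earnings          -2840     -4664     -4363     -4383",
    "    Short-term financial instrument     31     16     45      -",
    "  II. Leased Housing Assets        -      -      -      -",
]

# Assumed reconstruction of a header line: a long run of spaces followed by
# text that does not end in amounts. Such a line simply fails to match.
header = "FINANCIAL INFORMATION" + " " * 48 + "1. Financial Statements"

for row in rows:
    print(split_line(row))
print(split_line(header))
```

A line without trailing amounts returns None here rather than a partial match, so it produces no matching-substring at all.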


David


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--
