On 17/11/2012 12:05, Ihe Onwuka wrote:
First let me dissect the regex
<xsl:analyze-string select="." flags="x"
regex="(.+?)
((-?\d*\s*)+$)"
It is targeted at lines of balance sheet text such as the one below, where we do
not know how many amounts will occur
1. Total Quick Assets 1,511 2,829 1,694 4,429
(.+?) lazily matches the non-financial half of the line - in this
case it will gobble up "1. Total Quick Assets".
((-?\d*\s*)+$) captures the financial half, allowing for a leading
minus sign. The inner brackets are there for grouping, not for their capture.
Here is some test data - a file containing the following
I. Current Assets 1,871 2,829 1,694 4,429
1. Total Quick Assets 1,511 2,829 1,694 4,429
Short-term financial instrument 31 16 45 -
2. Total Inventories 359 - - -
II. Leased Housing Assets - - - -
III. Deferred Liabilities - - - -
III.Capital Adjustments - - -28 -30
V. Retained Earnings -2,840 -4,664 -4,363 -4,383
**********************************************************************************************************
FINANCIAL INFORMATION 1.
Financial Statements
Income Statement
------------------
(Unit : KRW million)
**********************************************************************************************************
Here is the stylesheet
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>
  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary
         fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input,
        'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">
      <!-- Delete lines that don't contain alphanumeric text -->
      <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
        <line>
          <xsl:analyze-string select="." flags="x"
              regex="(.+?)
                     ((-?\d*\s*)+$)">
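That's a wildly expensive regex: the nested quantifiers in (-?\d*\s*)+
let the engine split the same run of digits and spaces in many different
ways when it backtracks. If I change it to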
<xsl:analyze-string select="." flags="x"
regex="(.+?)
((-|\d|\s)+$)">
I get identical output for your input, and adding the extra line doesn't
make it take appreciably longer.
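
For anyone who wants to try the cheaper regex in isolation, here is a minimal sketch. It is not the original stylesheet, whose xsl:analyze-string body is not shown above. It assumes an XSLT 2.0 processor started at the named template "main" (for example with Saxon's -it:main option). The sample lines are taken from the test data with the commas already stripped, and the element names label, amount and unparsed are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">

  <xsl:output indent="yes"/>

  <!-- Assumption: sample lines from the test data, commas already removed -->
  <xsl:variable name="lines" as="xs:string*" select="
      '1. Total Quick Assets 1511 2829 1694 4429',
      'V. Retained Earnings -2840 -4664 -4363 -4383',
      '2. Total Inventories 359 - - -'"/>

  <!-- Assumption: invoked as a named initial template, no source document -->
  <xsl:template name="main">
    <lines>
      <xsl:for-each select="$lines">
        <line>
          <!-- flags="x" makes the whitespace inside the regex insignificant -->
          <xsl:analyze-string select="." flags="x"
              regex="(.+?) ((-|\d|\s)+$)">
            <xsl:matching-substring>
              <!-- group 1: the text half of the line (hypothetical element name) -->
              <label>
                <xsl:value-of select="normalize-space(regex-group(1))"/>
              </label>
              <!-- group 2: the run of amounts, split on single spaces -->
              <xsl:for-each
                  select="tokenize(normalize-space(regex-group(2)), ' ')">
                <amount><xsl:value-of select="."/></amount>
              </xsl:for-each>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <!-- lines with no trailing digits, hyphens or spaces end up here -->
              <unparsed><xsl:value-of select="."/></unparsed>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </line>
      </xsl:for-each>
    </lines>
  </xsl:template>
</xsl:stylesheet>
```

One difference to be aware of: the second group of the original regex can match the empty string, whereas (-|\d|\s)+ needs at least one character. A line that passes the xsl:if test but does not end in a digit, hyphen or space (such as "FINANCIAL INFORMATION 1.") therefore falls through to xsl:non-matching-substring in this sketch instead of matching with an empty second group. Depending on what the rest of the stylesheet does with unmatched text, the output can still be the same.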
David