First, let me dissect the regex:
<xsl:analyze-string select="." flags="x"
regex="(.+?)
((-?\d*\s*)+$)"
It is targeted at lines of balance sheet text such as the one below,
where we do not know how many amounts will occur:
1. Total Quick Assets 1,511
2,829 1,694 4,429
(.+?) lazily matches the non-financial half of the line; in this
case it will gobble up "1. Total Quick Assets".
((-?\d*\s*)+$) captures the financial half, allowing for a leading
minus sign on each amount. The inner brackets are there for grouping
only; the group they form is not used.
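To see the two groups in isolation, here is a minimal sketch that applies the same regex to a single hard-coded line. The template name "main", the variable $line and the <split>/<amounts> element names are made up for this illustration, and the commas have already been stripped from the amounts, as the full stylesheet does. It assumes an XSLT 2.0 processor that can be started at a named template, so no source document is needed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>
  <!-- Hypothetical entry point for the illustration -->
  <xsl:template name="main">
    <!-- Sample line with the thousands commas already removed -->
    <xsl:variable name="line"
        select="'1. Total Quick Assets 1511 2829 1694 4429'"/>
    <split>
      <!-- flags="x" lets the regex contain white space for readability -->
      <xsl:analyze-string select="$line" flags="x"
          regex="(.+?) ((-?\d*\s*)+$)">
        <xsl:matching-substring>
          <!-- group 1: the descriptive text before the amounts -->
          <lineItem><xsl:value-of select="regex-group(1)"/></lineItem>
          <!-- group 2: the run of amounts up to the end of the line -->
          <amounts><xsl:value-of select="regex-group(2)"/></amounts>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </split>
  </xsl:template>
</xsl:stylesheet>
```

For this line, group 1 should come out as "1. Total Quick Assets" and group 2 as the four amounts with their surrounding white space, which is why the full stylesheet wraps both groups in normalize-space().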
Here is some test data, a file containing the following:
I. Current Assets 1,871
2,829 1,694 4,429
1. Total Quick Assets 1,511
2,829 1,694 4,429
Short-term financial instrument 31
16 45 -
2. Total Inventories 359
- - -
II. Leased Housing Assets -
- - -
III. Deferred Liabilities -
- - -
III.Capital Adjustments -
- -28 -30
V. Retained Earnings -2,840
-4,664 -4,363 -4,383
**********************************************************************************************************
FINANCIAL INFORMATION 1.
Financial Statements
Income Statement
------------------
(Unit : KRW million)
**********************************************************************************************************
Here is the stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>
  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary
         fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input,
        'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">
      <!-- Delete lines that don't contain alphanumeric text -->
      <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
        <line>
          <xsl:analyze-string select="." flags="x"
              regex="(.+?)
                     ((-?\d*\s*)+$)">
            <xsl:matching-substring>
              <lineItem><xsl:value-of
                  select="normalize-space(regex-group(1))"/></lineItem>
              <yearlyFigures>
                <xsl:for-each
                    select="tokenize(normalize-space(regex-group(2)),'\s+')">
                  <figure year='{position()}'>
                    <xsl:value-of select="."/>
                  </figure>
                </xsl:for-each>
              </yearlyFigures>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </line>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
This works very well on the data above.
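The read-in step of the stylesheet does two things before any line is analysed: it strips the thousands commas and it splits the text into lines. A minimal sketch of just that step follows, using an inline string as a stand-in for unparsed-text($input, 'iso-8859-1'). The template name "main", the variable $text and the <lines> wrapper are assumptions made for this illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>
  <!-- Hypothetical entry point for the illustration -->
  <xsl:template name="main">
    <!-- Stand-in for the file contents; &#10; is a line feed -->
    <xsl:variable name="text"
        select="'1. Total Quick Assets 1,511 2,829&#10;2. Total Inventories 359 -&#10;'"/>
    <lines>
      <!-- Same replace() and tokenize() calls as the full stylesheet;
           the predicate drops the empty token after the final line feed -->
      <xsl:for-each select="tokenize(replace($text, ',(\d{3}\D)', '$1'),
                                     '\r?\n')[matches(., '\w')]">
        <line><xsl:value-of select="."/></line>
      </xsl:for-each>
    </lines>
  </xsl:template>
</xsl:stylesheet>
```

The \D in the pattern means a comma is only dropped when its three digits are followed by a non-digit character, which is why the sample string ends with a line feed.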
However, if I add the following text to the data:
Jan.1,2005
Jan.1,2006 Jan.1,2007 Jan.1,200
it hangs.