My CSV has some commas in some cells - in those cases the
entire cell value is itself enclosed in quotes. So a simple
tokenize that splits at comma boundaries would not work - so
I replaced the tokenize for the cells with a regex that took
care of the quotes (is there any alternative here other than
using regex?). I had to specify the quotes in the regex as
&quot;. After this, it started taking 45 minutes to transform
a 20-column, 35-row CSV.
Are you using Saxon? Performance information is only interesting if we know
what processor you are using.
The next problem I found was that for columns that contain commas
in the value, not all cells in that column are enclosed in
quotes - only those cells that actually have commas are
enclosed in quotes. So I changed the regex to account for
zero or more quotes. Now it transformed in 45 secs - surprise?
But even now, I see that the zero-or-more-quotes regex throws it
off and the CSV gets incorrectly parsed, resulting in the
wrong XML content.
So I made some changes and the current xsl has the regex as:
<xsl:analyze-string select="."
regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">
There's a lot of potential backtracking here: it might be better to replace
each "(.*)," with "([^,]*)," or with "(.*?),".
My own instinct would be to use something like:
([^"]*,|"[^"]*",)*
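A minimal sketch of that idea, assuming XSLT 2.0: rather than matching the whole record with one regex, match one comma-terminated field at a time, so the engine never has to backtrack across columns. The parameter name "line", the template name "main" and the <row>/<cell> element names are made up for illustration, and the unquoted branch here uses [^,"]* (a variant of the pattern above) so that it cannot run across a comma.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical input: a single CSV record. The second field is
       quoted because it contains a comma. -->
  <xsl:param name="line" select="'a,&quot;b,c&quot;,d'"/>

  <xsl:template name="main">
    <row>
      <!-- Append a comma so that every field, including the last, is
           terminated by one. Because the regex always consumes that
           comma, it can never match a zero-length string, which
           xsl:analyze-string does not allow. -->
      <xsl:analyze-string select="concat($line, ',')"
                          regex='("([^"]*)"|([^,"]*)),'>
        <xsl:matching-substring>
          <cell>
            <!-- Group 2 holds the content of a quoted field, group 3
                 the content of an unquoted one; the group that did not
                 take part in the match is the empty string. -->
            <xsl:value-of
                select="concat(regex-group(2), regex-group(3))"/>
          </cell>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </row>
  </xsl:template>

</xsl:stylesheet>
```

Run with "main" as the initial template (in Saxon, the -it:main option), this should produce three cells for the sample record: a, then b,c, then d. It does not handle doubled quotes ("") inside a quoted field; anything the regex cannot match is simply dropped, since there is no xsl:non-matching-substring.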
Michael Kay
Saxonica