xsl-list

RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

2006-06-23 01:39:47
My CSV has some commas in some cells - in those cases the 
entire cell value is itself enclosed in quotes. So a simple 
tokenize that splits at comma boundaries would not work - so 
I replaced the tokenize for the cells with a regex that took 
care of the quotes (is there any alternative here other than 
using regex?). I had to specify the quotes in the regex as 
" After this, it started taking 45 minutes to transform 
a 20 columns-35 rows CSV.

Are you using Saxon? Performance information is only interesting if we know
what processor you are using.

The next problem I found was that for columns that contain commas 
in the value, not all cells in that column are enclosed in 
quotes - only those cells that actually have commas are 
enclosed in quotes. So I changed the regex to account for 
zero or more quotes. Now it transformed in 45 secs - surprise?
But even now, I see that the zero-or-more-quotes regex throws it 
off and the CSV gets incorrectly parsed, resulting in the 
wrong XML content.

So I made some changes and the current xsl has the regex as:
<xsl:analyze-string select="."
regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">

There's a lot of potential backtracking here: it might be better to replace
each "(.*)," with "([^,]*)," or with "(.*?),".
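
To illustrate the negated-character-class form, here is a minimal sketch for a three-column line rather than the full twenty. The parameter name "line", the template name "main" and the element names "row", "cell" and "bad-row" are all made up for the example, and it deliberately ignores quoted cells.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: one CSV line with three unquoted cells. -->
  <xsl:param name="line" select="'alpha,beta,gamma'"/>

  <!-- Hypothetical entry point, invoked as the initial named template. -->
  <xsl:template name="main">
    <!-- [^,]* cannot run past a comma, so each group has only one
         possible match and the engine has nothing to backtrack over. -->
    <xsl:analyze-string select="$line"
        regex="^([^,]*),([^,]*),([^,]*)$">
      <xsl:matching-substring>
        <row>
          <cell><xsl:value-of select="regex-group(1)"/></cell>
          <cell><xsl:value-of select="regex-group(2)"/></cell>
          <cell><xsl:value-of select="regex-group(3)"/></cell>
        </row>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- The line did not have exactly three cells. -->
        <bad-row><xsl:value-of select="."/></bad-row>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With "(.*)," in place of "([^,]*)," each group first swallows the rest of the line and then gives characters back one at a time, and the number of combinations to try grows quickly with the number of columns.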

My own instinct would be to use something like:

([^"]*,|"[^"]*",)*
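
One way to turn a pattern along these lines into a cell-by-cell split is sketched below. It is a variant, not the expression above: it matches one cell per iteration, and the unquoted branch uses [^",]* so that an unquoted cell cannot span a comma. The parameter name "line", the template name "main" and the element names "row" and "cell" are assumptions for the example. It does not handle doubled quotes inside a quoted cell, and malformed input is silently skipped.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: the second cell is quoted and contains a comma. -->
  <xsl:param name="line" select="'alpha,&quot;beta,gamma&quot;,delta'"/>

  <!-- Hypothetical entry point, invoked as the initial named template. -->
  <xsl:template name="main">
    <row>
      <!-- Appending a comma means every cell, including the last, ends
           with one, so the regex can never match a zero-length string
           (which xsl:analyze-string does not allow). -->
      <xsl:analyze-string select="concat($line, ',')"
          regex='"([^"]*)",|([^",]*),'>
        <xsl:matching-substring>
          <!-- Group 1 holds a quoted cell, group 2 an unquoted one;
               at most one of them is non-empty for any given match. -->
          <cell>
            <xsl:value-of select="concat(regex-group(1), regex-group(2))"/>
          </cell>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </row>
  </xsl:template>

</xsl:stylesheet>
```

For the sample line this should give three cells (alpha; beta,gamma; delta), with the surrounding quotes dropped from the second. Because each cell becomes its own element, the number of columns does not need to be hardcoded in the regex.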

Michael Kay
Saxonica

