xsl-list

RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

2006-06-23 15:25:35


-----Original Message-----
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com]

My CSV has some commas in some cells - in those cases the
entire cell value is itself enclosed in quotes. So a simple
tokenize that splits at comma boundaries would not work - so
I replaced the tokenize for the cells with a regex that took
care of the quotes (is there any alternative here other than
using regex?). I had to specify the quotes in the regex as
&quot;. After this, it started taking 45 minutes to transform
a CSV of 20 columns and 35 rows.

Are you using Saxon? Performance information is only interesting if we know
what processor you are using.
[Pantvaidya, Vishwajit] Yes, I am using oXygen as the editor, which uses
Saxon 8B.


The next problem I found was that for columns that contain commas
in the value, not all cells in that column are enclosed in
quotes - only those cells that actually have commas are
enclosed in quotes. So I changed the regex to account for
zero or more quotes. Now it transformed in 45 secs - surprise?
But even now, I see that the zero-or-more-quotes regex throws it
off and the CSV gets incorrectly parsed, resulting in the
wrong XML content.

So I made some changes and the current xsl has the regex as:
<xsl:analyze-string select="."
regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">

There's a lot of potential backtracking here: it might be better to replace
each "(.*)," with "([^,]*)," or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*)," - I understand
that ^ is start of line metachar. How does the former match the alphabet
chars?
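A small sketch may help with the question above. Inside square brackets a leading `^` negates the character class, so `[^,]*` means "any run of characters other than a comma", which covers letters. It is only outside brackets that `^` anchors to the start. The snippet uses Java's `java.util.regex` as a stand-in for the XSLT regex engine, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // Returns capture group 1 when the regex matches at the start of the input, else null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "abc,def,ghi";
        // Greedy .* runs to the end of the row, then backtracks to the LAST comma.
        System.out.println(firstGroup("(.*),", row));     // abc,def
        // [^,]* cannot cross a comma, so it stops at the FIRST one without backtracking.
        System.out.println(firstGroup("([^,]*),", row));  // abc
        // Reluctant .*? also stops at the first comma.
        System.out.println(firstGroup("(.*?),", row));    // abc
    }
}
```

With twenty greedy groups in a row, each `(.*)` first swallows the rest of the line and then gives characters back one at a time, which is where the backtracking cost comes from.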


My own instinct would be to use something like:

([^"]*,|"[^"]*",)*


[Pantvaidya, Vishwajit] Oxygen would not accept this regex as "it matches a
zero-length string".
Anyway, how does this regex work - it does not seem to have anything that
matches the alphabet chars.
And does the ,|" match a comma or double quotes - because actually some fields
will have both.
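To make the suggested regex easier to read: the `|` separates two whole alternatives, `[^"]*,` (characters that are not a double quote, then a comma) and `"[^"]*",` (a quoted run, then a comma). It does not separate `,` from `"`. Letters are matched by `[^"]`. Because of the trailing `*`, the pattern as a whole also matches the empty string, and XSLT 2.0 treats a regex that matches a zero-length string as an error in xsl:analyze-string. The sketch below checks these points with Java's regex engine, which is assumed here to behave like the XSLT one for this pattern. The class name is hypothetical.

```java
import java.util.regex.Pattern;

final class AlternationDemo {
    // The regex from the message above, written as a Java string literal.
    static final Pattern ROW = Pattern.compile("([^\"]*,|\"[^\"]*\",)*");

    // True when the whole row, start to end, is matched by the pattern.
    static boolean matchesWholeRow(String row) {
        return ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeRow("abc,def,"));         // true
        System.out.println(matchesWholeRow("abc,\"d,e\",f,"));    // true: quoted field keeps its comma
        System.out.println(matchesWholeRow(""));                 // true: zero repetitions
        System.out.println(matchesWholeRow("abc,def"));          // false: last field has no comma
    }
}
```

Note that every alternative ends in a comma, so a row would need a trailing comma appended (or the last field handled separately) before this pattern can cover it.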

Generally, it seems that the problems with transforming such CSVs, where the
fields may themselves contain commas, may be due to there being no way to:
- remember the current state (e.g. an opening double quote) and match the
remaining string based on knowledge of that state, i.e. something like "if
an opening double quote is encountered, then continue matching chars till the
closing double quote, else match till the next comma", or
- assign priority to specific matches over others, e.g. give preference to
matching quotes, if found, over commas.
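On the second bullet: in Java's regex engine, at least, alternatives are tried left to right, so listing the quoted form first does give it priority over the plain one. The following is a minimal sketch of a per-field tokenizer built that way. It assumes quoted fields never contain an escaped (doubled) quote, and whether the engine inside Saxon 8 orders alternatives the same way has not been checked here. All names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFieldTokenizer {
    // One field per match: a quoted field (group 1) is tried before an
    // unquoted one (group 2); group 3 is the terminator, "," or end of line.
    // \G forces each match to start where the previous one ended.
    private static final Pattern FIELD =
        Pattern.compile("\\G(?:\"([^\"]*)\"|([^,\"]*))(,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // terminator was end of line, so this was the last field
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,\"b,c\",d")); // [a, b,c, d]
        System.out.println(fields("a,,"));         // [a, , ]
    }
}
```

If a line is malformed (for example a stray quote in the middle of an unquoted field), the loop simply stops early, so a real converter would want to compare the field count against the header.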

Maybe this conversion is easier done with some Java code.
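For comparison, here is roughly what the Java route could look like: a hand-written scanner that carries the "inside quotes" state explicitly instead of encoding it in a regex. This is only a sketch under simple assumptions (one record per line, no escaped quotes inside a quoted field), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLineScanner {
    // Splits one CSV line into fields, treating commas inside double quotes as data.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the remembered state
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);              // ordinary character, or a comma inside quotes
            }
        }
        fields.add(current.toString());         // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("x,\"Smith, John\",42")); // [x, Smith, John, 42]
    }
}
```

The scan is a single pass over the line with no backtracking, so its cost grows linearly with the input length.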


Thanks a lot Michael for all your help...


Vish.

