xsl-list

RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

2006-06-27 15:06:33
Hi Nathan,

Thanks a lot for your help. I was just trying to exhaust the XSL route. It
seemed to be a really good one for us, considering that we may need to
transform CSV files (with differing schemas) into corresponding XML files,
i.e. the schema of one XML file would differ from that of another.

From Michael's email, it seems like that may be doable - but may need more
reading on the regex stuff.

I will also look into what you have suggested.

Thanks,

Vish.

-----Original Message-----
From: Nathan Young -X (natyoung - Artizen at Cisco)
[mailto:natyoung(_at_)cisco(_dot_)com]
Sent: Monday, June 26, 2006 11:02 AM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details
in xsl

Hi.

I don't know how much performance matters to you, but regular expressions
are going to be a lot slower than the low-level CSV parsing routines you
can get by using a Perl, Java or C library someone wrote to parse CSV.
These are cleverly written and perform very well. A quick web search for
your language will turn up useful links if you go this route.
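
To make the library route concrete, here is a minimal sketch in Python
(my choice for illustration; the thread only names Perl, Java and C). It
uses the standard csv and xml.etree.ElementTree modules and takes the
element names from the header row, so no schema is hardcoded. The
function name csv_to_xml and the wrapper names "rows" and "row" are
made up for this example, and it assumes every header cell is a valid
XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text, root_name="rows", row_name="row"):
    """Convert CSV text to an XML string; element names come from the header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first record supplies the element names
    root = ET.Element(root_name)
    for record in reader:
        row = ET.SubElement(root, row_name)
        for name, value in zip(header, record):
            # Assumption: each header cell is a legal XML element name.
            ET.SubElement(row, name).text = value
    return ET.tostring(root, encoding="unicode")

sample = 'id,name\n1,"Smith, John"\n'
xml_text = csv_to_xml(sample)
print(xml_text)
```

The quoted field with an embedded comma is handled by the csv module
itself, which is the main thing you would otherwise have to get right in
a regex.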

If "good enough" is good enough for you performance-wise, regular
expressions probably can work for you.  If you do pursue this I strongly
recommend an application called "regex coach" for troubleshooting and
learning regular expressions.  It really makes the effects of your
expression visible to you and lets you quickly adjust and try
variations.

----->Nathan



There's a lot of potential backtracking here: it might be better to
replace each "(.*)," with "[^,]*" or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"
- I understand that ^ is start of line metachar. How does the
former match the alphabet chars?

No, within square brackets, ^ means "not". So [^,]* matches a sequence of
any characters except comma.
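
A small sketch of the two meanings of ^, using Python's re module purely
for illustration (the XSLT 2.0 regex flavour treats ^ inside brackets the
same way). The sample line is made up, and [^,]+ is used instead of
[^,]* only so that findall does not also report empty matches.

```python
import re

line = "abc,def,ghi"

# Inside square brackets, ^ negates the class:
# [^,]+ is a run of one or more characters that are not commas.
fields = re.findall(r"[^,]+", line)

# Outside brackets, ^ anchors the match to the start of the string.
first = re.match(r"^[^,]*", line).group(0)

print(fields, first)
```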

The problem with your expression is that (.*) matches as many characters
as it can. Then it sees ",", so it backtracks to find the last comma. Then
it sees the next (.*), and has to backtrack again; and so on.
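
The effect of the greedy (.*) is easy to see on a line that has more
commas than the pattern expects. This is a sketch in Python's re module
with a made-up input line; it shows which text each group ends up with,
not timings.

```python
import re

line = "a,b,c,d"

# Greedy: the first (.*) grabs everything, then backs off just far
# enough to leave two commas for the rest of the pattern.
greedy = re.match(r"(.*),(.*),(.*)", line).groups()

# Lazy: each (.*?) stops at the first comma it can.
lazy = re.match(r"(.*?),(.*?),(.*)", line).groups()

# Negated class: [^,]* can never run past a comma, so there is
# nothing to back off from.
negated = re.match(r"([^,]*),([^,]*),([^,]*)", line).groups()

print(greedy, lazy, negated)
```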


My own instinct would be to use something like:

([^"]*,|"[^"]*",)*


[Pantvaidya, Vishwajit] Oxygen would not accept this regex as
"it matches a zero-length string".

Perhaps then you want to change the final "*" to a "+".

Anyway, how does this regex work - it does not seem to have
anything that matches the alphabet chars.

See above: [^"] matches everything except quotes.

And does the ,|" match comma or double quotes - because
actually some field will have both.

The first alternative, [^"]* followed by a comma, matches any field that
ends with a comma and doesn't contain a quotation mark. The second
alternative, "[^"]*" followed by a comma, matches any field that begins
and ends with quotes (followed by a comma), and might contain a comma
between the quotes.
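
Here is a sketch of that expression in Python's re module, with the final
"*" changed to "+" as suggested above. It assumes every field, including
the last, is followed by a comma, because that is what the expression
requires. Note that [^"] also matches commas, so the first alternative
can swallow several unquoted fields at once; the second pattern below
excludes commas as well so each match is one field. That second pattern
is an adjustment made for this example, not something from the thread.

```python
import re

# The expression from the message, with + instead of the final *.
LINE = re.compile(r'([^"]*,|"[^"]*",)+')

ok = LINE.fullmatch('a,b,"c,d",') is not None
bad = LINE.fullmatch('a,"b') is not None  # unterminated quote

# Variant (an assumption for this sketch): also exclude commas from the
# unquoted alternative, so each match is exactly one field.
FIELD = re.compile(r'[^",]*,|"[^"]*",')
pieces = FIELD.findall('a,b,"c,d",')

print(ok, bad, pieces)
```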

It's very hard to find out what the exact rules for CSV files used by a
particular product are: for example, how it represents a field that
contains quotation marks as well as commas. (That's one of the great
advantages of XML: you can find a specification!) If you know the exact
rules for your particular flavour of CSV, you can adapt the regex to match
(well, you can if you study a bit more about regular expressions).
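
As one example of adapting the regex to a known flavour, the sketch below
assumes a flavour in which a quotation mark inside a quoted field is
written as two quotation marks. That rule is an assumption for this
example only; your product may do something different. The function name
split_line is made up, and Python's re module is used for illustration.

```python
import re

# Assumed flavour: a quoted field may contain commas, and a quote inside
# it is doubled (""). Each field ends at a comma or at end of line.
FIELD = re.compile(r'"((?:[^"]|"")*)"(?:,|$)|([^",]*)(?:,|$)')

def split_line(line):
    """Split one CSV line into fields under the assumed quoting rule."""
    fields = []
    pos = 0
    while pos < len(line):
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError("malformed field at offset %d" % pos)
        if m.group(1) is not None:
            # Quoted field: collapse doubled quotes back to single ones.
            fields.append(m.group(1).replace('""', '"'))
        else:
            fields.append(m.group(2))
        pos = m.end()
    if line.endswith(","):
        # A trailing comma means one last, empty field.
        fields.append("")
    return fields

print(split_line('1,"Smith, John","said ""hi"""'))
```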


Maybe this conversion is easier done with some Java code.

I'm sure it can be done using regular expressions but it looks as if you
need to do some learning in this area.

Michael Kay
http://www.saxonica.com/


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

