xsl-list

RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

2006-06-27 14:48:42
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com]
Sent: Saturday, June 24, 2006 12:41 AM

There's a lot of potential backtracking here: it might be better to
replace each "(.*)," with "([^,]*)," or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"? I understand
that ^ is the start-of-line metacharacter. How does the former match the
alphabetic characters?

No, as the first character within square brackets, ^ means "not". So [^,]*
matches a sequence of any characters except comma.
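
To make the two meanings of ^ concrete, here is a minimal sketch in Python.
Python's re module is only a stand-in for the XSLT 2.0 regex flavour here
(the two differ in other details), and the sample line is invented for
illustration.

```python
import re

line = "alpha,beta,gamma"  # invented sample data

# Outside square brackets, ^ anchors the match at the start of the string.
anchored = re.match(r"^alpha", line)

# As the first character inside square brackets, ^ negates the class:
# [^,]* is "a run of zero or more characters that are not commas".
first_field = re.match(r"[^,]*", line).group(0)

# [^,]+ (one or more non-commas) picks out every non-empty field.
all_fields = re.findall(r"[^,]+", line)

print(first_field)  # alpha
print(all_fields)   # ['alpha', 'beta', 'gamma']
```

So the letters are matched simply because they are "not a comma".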

The problem with your expression is that (.*) matches as many characters as
it can. Then it sees ",", so it backtracks to find the last comma. Then it
sees the next (.*), and has to backtrack again; and so on.
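
The backtracking described above can be observed directly. This is a small
sketch in Python (used as a stand-in for the XSLT regex engine; the sample
line is invented) comparing the greedy, reluctant and negated-class forms.

```python
import re

line = "a,b,c,d"  # invented sample data

# Greedy: .* runs to the end of the line, then backs up to the LAST comma.
greedy = re.match(r"(.*),", line).group(1)

# Reluctant: .*? grows one character at a time and stops at the FIRST comma.
reluctant = re.match(r"(.*?),", line).group(1)

# Negated class: [^,]* cannot cross a comma, so there is nothing to back up over.
negated = re.match(r"([^,]*),", line).group(1)

# With several groups, the greedy form puts the extra commas in the first group,
# while the negated-class form leaves them for the last one.
greedy_groups = re.match(r"(.*),(.*),(.*)", line).groups()
negated_groups = re.match(r"([^,]*),([^,]*),(.*)", line).groups()

print(greedy, reluctant, negated)  # a,b,c a a
print(greedy_groups)               # ('a,b', 'c', 'd')
print(negated_groups)              # ('a', 'b', 'c,d')
```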


My own instinct would be to use something like:

([^"]*,|"[^"]*",)*


[Pantvaidya, Vishwajit] Oxygen would not accept this regex as
"it matches a zero-length string".

Perhaps then you want to change the final "*" to a "+".
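
The difference between the two quantifiers on an empty string can be checked
with a short sketch. Python's re module is an assumption here, standing in for
the processor that Oxygen invokes; only the regex itself comes from the thread.

```python
import re

# The regex from the thread, once with a final * and once with a final +.
star = re.compile(r'([^"]*,|"[^"]*",)*')
plus = re.compile(r'([^"]*,|"[^"]*",)+')

# With *, zero repetitions are allowed, so the empty string is a match.
star_matches_empty = star.fullmatch("") is not None

# With +, at least one repetition is needed, and every alternative
# consumes a comma, so the empty string can no longer match.
plus_matches_empty = plus.fullmatch("") is not None

print(star_matches_empty, plus_matches_empty)  # True False
```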

[Pantvaidya, Vishwajit] That is the first thing I tried when the * did not
work - but even then it does not seem to be working.

Anyway, how does this regex work - it does not seem to have
anything that matches the alphabet chars.

See above: [^"] matches everything except quotes.

And does the ,|" match comma or double quotes - because
actually some fields will have both.

The first alternative, [^"]*, (including its trailing comma) matches any
field that ends with a comma and doesn't contain a quotation mark. The second
alternative, "[^"]*", (again including the trailing comma) matches any field
that begins and ends with quotes, and might contain a comma between the
quotes.
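
As a rough illustration of the two alternatives at work, here is a sketch in
Python that splits one line into fields. It is a variation, not the exact
regex from the thread: the unquoted alternative is written [^",]* so that each
match consumes exactly one field, and the quoted alternative is tried first.
The function name and the sample line are invented.

```python
import re

# Quoted field followed by a comma, or unquoted field followed by a comma.
# Variation on the thread's regex: the unquoted branch also excludes commas.
FIELD = re.compile(r'"([^"]*)",|([^",]*),')

def split_fields(line):
    """Split one CSV line into fields, dropping the quotes around quoted fields."""
    fields = []
    # Append a comma so the last field is terminated like all the others.
    for m in FIELD.finditer(line + ","):
        quoted, plain = m.group(1), m.group(2)
        fields.append(quoted if quoted is not None else plain)
    return fields

row = split_fields('Smith,"12 High St, Reading",UK')
print(row)  # ['Smith', '12 High St, Reading', 'UK']
```

This makes no attempt to report malformed input (for example a stray quote in
the middle of an unquoted field); such characters are simply skipped.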

It's very hard to find out what the exact rules for CSV files used by a
particular product are: for example, how it represents a field that
contains quotation marks as well as commas. (That's one of the great
advantages of XML: you can find a specification!) If you know the exact
rules for your particular flavour of CSV, you can adapt the regex to match
(well, you can if you study a bit more about regular expressions).
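
As one example of adapting the regex to a particular flavour, suppose (this is
an assumption, not something stated in the thread) that the product writes a
quotation mark inside a quoted field by doubling it. A sketch in Python, with
invented names and data, might then look like this. Note that the (?:...)
non-capturing group is Python syntax and may not be available in every regex
flavour.

```python
import re

# Assumed dialect: inside a quoted field, a literal quote is written as "".
# Quoted branch: any mix of non-quote characters and doubled quotes.
FIELD = re.compile(r'"((?:[^"]|"")*)",|([^",]*),')

def split_fields_doubled(line):
    """Split one CSV line, un-doubling quotes inside quoted fields."""
    fields = []
    # Append a comma so the last field is terminated like all the others.
    for m in FIELD.finditer(line + ","):
        if m.group(1) is not None:
            fields.append(m.group(1).replace('""', '"'))
        else:
            fields.append(m.group(2))
    return fields

row = split_fields_doubled('1,"He said ""hi"", then left",x')
print(row)  # ['1', 'He said "hi", then left', 'x']
```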


Maybe this conversion is easier done with some Java code.

I'm sure it can be done using regular expressions but it looks as if you
need to do some learning in this area.

[Pantvaidya, Vishwajit] Thanks a lot for all the clarifications and help.
Actually I did look at the regex documentation in the XSLT2 spec, but not
very exhaustively - the info on back-references I found there made me feel
that they could potentially be useful here, e.g. to tell the regex that if a
starting quote is found, look for an ending one. But the more I look into
it, the more it seems like I may not be able to use them.

Thanks and regards,

Vish.

