RE: [xsl] regex in csv2xml

2006-03-27 01:57:37
I would do something like this:

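<xsl:variable name="regex1">".*?"</xsl:variable>
<xsl:variable name="s1" as="xs:string*">
  <!-- flags="s" lets "." match newlines, so quoted strings spanning lines are found -->
  <xsl:analyze-string select="$in" regex="{$regex1}" flags="s">
    <xsl:matching-substring>
      <!-- Protect newlines inside quotes with a private-use character -->
      <xsl:sequence select="replace(., '\n', '&#xE000;')"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<xsl:variable name="s2" select="string-join($s1, '')"/>

<xsl:for-each select="tokenize($s2, '\n')">
  <xsl:for-each select="tokenize(., ',')">
    <!-- Restore the protected newlines within each field -->
    <xsl:sequence select="replace(., '&#xE000;', '&#10;')"/>
  </xsl:for-each>
</xsl:for-each>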

That is: first take the total string and identify substrings in quotes. The
fact that this treats "He said ""don't""" as three strings ("He said",
"don't", "") doesn't matter. Replace a newline appearing between quotes by a
private-use-area character (or any other 'spare' character). Then put the
strings back together again.

Now take the reassembled string and split it first at newlines, then at
commas, and within each identified token, convert the private character back
to a newline.
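
The steps above could be put together into a complete stylesheet roughly as
follows. This is only a sketch under some assumptions that are not in the
message above: the CSV text is read with unparsed-text() from a hypothetical
parameter named csv-uri, processing starts at a named template called main,
and the output element names (records, record, field) are placeholders. It
protects newlines only; commas inside quotes and the surrounding or doubled
quote characters would still need their own handling.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: location of the CSV file to convert -->
  <xsl:param name="csv-uri" as="xs:string" select="'input.csv'"/>

  <!-- Stand-in for a newline inside quotes; any character unused in the data works -->
  <xsl:variable name="pua" as="xs:string" select="'&#xE000;'"/>

  <xsl:template name="main">
    <!-- Read the file and reduce CRLF or CR line endings to a plain newline -->
    <xsl:variable name="in" as="xs:string"
        select="replace(unparsed-text($csv-uri), '\r\n?', '&#10;')"/>

    <!-- Step 1: inside each quoted substring, swap newlines for the stand-in -->
    <xsl:variable name="s1" as="xs:string*">
      <xsl:analyze-string select="$in" regex='".*?"' flags="s">
        <xsl:matching-substring>
          <xsl:sequence select="replace(., '\n', $pua)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- Step 2: reassemble the text; every remaining newline now ends a record -->
    <xsl:variable name="s2" as="xs:string" select="string-join($s1, '')"/>

    <!-- Step 3: split into records and fields, then restore the newlines -->
    <records>
      <xsl:for-each select="tokenize($s2, '\n')[. ne '']">
        <record>
          <xsl:for-each select="tokenize(., ',')">
            <field>
              <xsl:value-of select="replace(., $pua, '&#10;')"/>
            </field>
          </xsl:for-each>
        </record>
      </xsl:for-each>
    </records>
  </xsl:template>

</xsl:stylesheet>
```

The predicate [. ne ''] only drops blank lines, such as the empty token left
by a trailing newline at the end of the file. Note that it would also drop a
record consisting of a single empty field.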

Michael Kay
http://www.saxonica.com/ 


-----Original Message-----
From: Jesper Tverskov [mailto:jesper(_at_)tverskov(_dot_)dk] 
Sent: 27 March 2006 08:51
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] regex in csv2xml

Hi list,

I am trying to write a csv2xml XSLT 2.0 stylesheet, using the Excel CSV
format as the example: if a delimiter, newline or quote is part of the data,
the field is quoted, and quotes are doubled.

My last problem is that the newline character can be part of the data. I
would like to detect these newline characters and replace them temporarily
with some unique code.
But how can I detect them in the first place?

Look at the sample below, we have 3 records and 3 fields:

34,"""yes"", I said",46
25,"I said:
""Hello"", and I added: ""nice day, stranger""
and, ""look at the sun"" , and: 
""bye for now.""",33
47,,35

Lines 1 and 6 are records. We have an empty field in line 6.
But lines 2, 3, 4 and 5 form one record, with three linefeeds and several
commas as part of the data.

How can I detect with a regex that the linefeeds at the end of lines 2, 3
and 4 are part of the data?
As I see it, lines 2 and 5 are the easy part: they will always have an odd
number of quotes.
But the linefeeds in lines 3 and 4 can only be detected as part of the data
if we compare all the lines belonging to a record?

Best regards,
Jesper Tverskov


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--




