RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

2006-06-23 09:53:40
Hi.

The rules you describe for handling cells with commas in them and cells
with quotes in them are widely used conventions for encoding data in
csv.  Unless you are able to prevent cells from ever containing commas
or quotes you will not be able to make the csv "uniform" in a way that
does not require these (or some other) irregularities.

There is another way of parsing csv files that typically works faster
than regular expressions: read the file character by character into a
buffer and apply a set of rules at each character to decide whether you
have reached the end of a cell. At that point you empty the buffer into
a cell variable (or do whatever you need to do with it) and continue.
I think this is best not done in XSL, though.
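
A minimal sketch of the character-by-character approach described above,
written in Python purely for illustration. The function name
parse_csv_line and the exact rules (comma separator, double quotes
around cells, a doubled quote meaning a literal quote) are assumptions,
not a complete csv implementation; it handles one line at a time and
does not cover quoted cells that span several lines.

```python
def parse_csv_line(line):
    """Split one csv line into cells, honouring double-quoted cells."""
    cells = []
    buffer = []        # characters of the cell currently being read
    in_quotes = False  # True while we are inside a quoted cell
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buffer.append('"')  # doubled quote: literal quote
                    i += 1
                else:
                    in_quotes = False   # closing quote of the cell
            else:
                buffer.append(ch)       # commas are plain data here
        elif ch == '"':
            in_quotes = True            # opening quote of a cell
        elif ch == ',':
            cells.append(''.join(buffer))  # end of cell: empty the buffer
            buffer = []
        else:
            buffer.append(ch)
        i += 1
    cells.append(''.join(buffer))       # last cell on the line
    return cells
```

Because each character is looked at once, the running time grows
linearly with the length of the line, whether or not a given cell is
quoted.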

If performance is indeed an issue, you are likely to be well served by
parsing the csv file into a very simple XML format using another
language.  Many existing programming languages already have very robust
and performant csv parsers, so you'd have that problem mostly solved
from the outset.
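
As one hedged example of that route, the sketch below uses Python's
standard csv and xml.etree.ElementTree modules to turn csv text into the
simple doc/line layout asked for earlier in this thread. The function
name csv_to_xml is hypothetical, and the sketch assumes the first row
holds the field names and that every field name is a valid XML element
name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Convert csv text (header row first) into a simple XML string."""
    rows = csv.reader(io.StringIO(csv_text))  # handles quoted cells
    field_names = next(rows)                  # first row: element names
    doc = ET.Element("doc")
    for row in rows:
        if not row:
            continue                          # skip blank lines
        line = ET.SubElement(doc, "line")
        for name, value in zip(field_names, row):
            # Assumption: each header value is a legal element name.
            ET.SubElement(line, name).text = value
    return ET.tostring(doc, encoding="unicode")
```

The resulting XML can then be fed to an ordinary stylesheet, so the XSL
never has to deal with quoting rules at all.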

------------>Nathan




Nathan Young
Cisco.com->Interface Development
A: ncy1717
E: natyoung(_at_)cisco(_dot_)com  

-----Original Message-----
From: Pantvaidya, Vishwajit [mailto:vpantvai(_at_)selectica(_dot_)com] 
Sent: Thursday, June 22, 2006 8:51 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Thanks a lot for the xsl, Michael.

My CSV has some commas in some cells - in those cases the entire cell
value is itself enclosed in quotes. So a simple tokenize that splits at
comma boundaries would not work - so I replaced the tokenize for the
cells with a regex that took care of the quotes (is there any
alternative here other than using regex?). I had to specify the quotes
in the regex as &quot;. After this, it started taking 45 minutes to
transform a 20-column, 35-row CSV.

The next problem I found was that for columns that contain commas in
the value, not all cells in that column are enclosed in quotes - only
those cells that actually have commas are enclosed in quotes. So I
changed the regex to account for 0 or more quotes. Now it transformed
in 45 secs - surprise? But even now, I see that the 0-or-more-quotes
regex throws it off and the csv gets incorrectly parsed, resulting in
the wrong xml content.

So I made some changes and the current xsl has the regex as:
<xsl:analyze-string select="."
regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">

(Now it is taking even more time - 1 hour+ and still not done. Let's
see if at least the xml comes out correctly.)

Any suggestions to mitigate this regex complexity caused by the
non-uniformity of the input CSV?

Or am I better off asking the provider of the CSV to keep it uniform,
so that all cells in a column are either with or without quotes?


Thanks,

Vish.

-----Original Message-----
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com]
Sent: Thursday, June 22, 2006 12:43 AM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Can anybody suggest how to convert CSV data in the format

Field1,Field2
Value11,Value12

to xml like

<Field1>Value11</Field1>
<Field2>Value12</Field2>

without hardcoding the fieldnames in the xsl?

<!-- Read the file and split it into lines -->
<xsl:variable name="lines" as="xs:string*"
             select="tokenize(unparsed-text($input-file), '\r?\n')"/>
<!-- The first line supplies the element names -->
<xsl:variable name="field-names" as="xs:string*"
             select="tokenize($lines[1], ',')"/>
<xsl:for-each select="subsequence($lines,2)">
<row>
 <xsl:variable name="cells" select="tokenize(., ',')"/>
 <xsl:for-each select="$cells">
   <xsl:variable name="p" as="xs:integer" select="position()"/>
   <!-- Name the element after the field at the same position -->
   <xsl:element name="{$field-names[$p]}">
     <xsl:value-of select="."/>
   </xsl:element>
 </xsl:for-each>
</row>
</xsl:for-each>

Michael Kay
http://www.saxonica.com/



I was thinking of something like

<xsl:for-each select="tokenize(., ',')"> &lt;<xsl:value-of
select="item-at($elementNames,index-of(?parent of current
node?,.))"/>&gt; <xsl:value-of select="."/>
&lt;/<xsl:value-of
select="item-at($elementNames,index-of(?parent of current
node?,.))"/>&gt; </xsl:for-each>

where elementNames is a tokenized list of the fieldnames -
but I am unable to get it to work.



-----Original Message-----
From: Pantvaidya, Vishwajit
Sent: Wednesday, June 21, 2006 12:17 AM
To: 'xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com'
Subject: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Hello,

I am trying to convert a CSV datafile into XML format.
The headers for the CSV data are in a file header.csv, e.g.
Field1,Field2
The data is in a file Data.csv, e.g.
Value11,Value12
Value21,Value22

I need to convert the CSV data into xml output by creating xml elements
using the names in the csv header and taking the corresponding values
from the data file, so that I get an xml as follows:

<doc>
<line>
<Field1>Value11</Field1>
<Field2>Value12</Field2>
</line>
<line>
<Field1>Value21</Field1>
<Field2>Value22</Field2>
</line>
</doc>

I was trying to see if I can do this without hardcoding the header
names in the xsl. I got as far as the point where my xsl looks as below:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:op="http://www.w3.org/2001/12/xquery-operators"
   xmlns:xf="http://www.w3.org/2001/12/xquery-functions"
   version="2.0">

   <xsl:output name="xmlFormat" method="xml" indent="yes"
      omit-xml-declaration="yes"/>

   <xsl:variable name="source1" select="'data.csv'"/>
   <xsl:variable name="elementNamesList" select="'Header.csv'"/>
   <xsl:variable name="encoding" select="'iso-8859-1'"/>

   <xsl:variable name="elementNames"
      select="tokenize(unparsed-text($elementNamesList,$encoding),',')"/>
   <xsl:variable name="src">
       <doc>
           <xsl:for-each
              select="tokenize(unparsed-text($source1,$encoding), '\r?\n')">
               <line>
                   <xsl:for-each select="tokenize(., ',')">
                       &lt;<xsl:value-of
                          select="op:item-at($elementNames,index-of(?parent of current node?,.))"/>&gt;
                           <xsl:value-of select="."/>
                           &lt;/<xsl:value-of
                              select="item-at($elementNames,3)"/>&gt;
                   </xsl:for-each>
               </line>
           </xsl:for-each>
       </doc>
   </xsl:variable>

   <xsl:template match="/">
       <xsl:result-document format="xmlFormat" href="src1.xml">
           <xsl:copy-of select="$src"/>
       </xsl:result-document>
   </xsl:template>

</xsl:stylesheet>

In the yet-incomplete statement
<xsl:value-of select="op:item-at($elementNames,index-of(?parent of current node?,.))"/>,
I am trying to generate an xml element with the Nth field name from the
headers name list for the Nth field value. Couple of issues/questions
here:

- I am getting the error "Cannot find a matching 2-argument function
named {http://www.w3.org/2001/12/xquery-operators}item-at()" when I try
to validate the xsl. What could be the reason?

- How can I get the ?parent of current node? needed to compute the
index of the current data in the data record?

- Is there any other better way to do it? Any way that I can do the
same using xsl:element?

In general, is this the only/best way or is there any other better way
to achieve the same goal?


Thanks and Regards,

Vish.


--~------------------------------------------------------------------
XSL-List info and archive:  
http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

