xsl-list

[xsl] RE: [xquery-talk] deduplicating information in XML files

2012-10-19 06:06:18
Hi,

I'm currently trying out a pure XSLT solution.  As an aside, I like where 
Zorba is going, although I have not had time to use it yet.  I need to take a 
closer look at the documentation once I have time.

Thx for the tip by the way.

Robby

-----Original Message-----
From: Helena Galhardas 
[mailto:helena(_dot_)galhardas(_at_)ist(_dot_)utl(_dot_)pt] 
Sent: Wednesday, October 17, 2012 12:19 PM
To: Robby Pelssers
Cc: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com; xquery-discuss; Bruno 
Martins; Daniela Florescu
Subject: Re: [xquery-talk] deduplicating information in XML files

Dear Robby,

The Zorba XQuery processor currently supports a data cleaning module (look for 
"data cleaning" at http://www.zorba-xquery.com/html/modules) that you may want 
to try.

In principle, the requirements associated with your data de-duplication 
problem can be addressed by writing an XQuery program that invokes some of the 
functions available in this data cleaning module.

We would appreciate your feedback if you decide to do so. Please let us know 
if we can help in any way.

Thanks.
Best Regards,
Helena Galhardas

On Oct 12, 2012, at 1:02 PM, Robby Pelssers wrote:

Hi all,

This time I have a rather challenging task at hand.  Let me first 
describe the use case.  We have lots of product information stored in 
XML.  Some of that information describes:
- Technical applications
- Features and benefits
- Technical summary

One of the problems is that a lot of products have, for example, the 
same features and benefits because they belong to the same product 
family or group.  Because we stored that info per product, it got 
duplicated.  Now we want to deduplicate it by generating DITA maps and 
topics (both are just XML).  For simplicity, let's assume we generate 
the following content for product1 and product2.  The goal is to get 
from INPUT to OUTPUT by checking whether the bodies of the linked 
topics are duplicates, then creating one generic topic and rewriting 
the links in the maps to point to that single topic.  I have XSLT / 
XQuery (XMLDB) and Java at my disposal, and I'm not sure which will be 
the easiest way to get the job done.  Keep in mind that my INPUT will 
contain a few thousand files (maps and linked topics) and I will need 
to deduplicate the whole set ;-)

Thx upfront for any input,
Robby

INPUT

Product1_map.xml:
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

Product1_FandB.xml:
<content>
  <meta>
    <id>product1</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Product2_map.xml:
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

Product2_FandB.xml:
<content>
  <meta>
    <id>product2</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Expected output:

Product1_map.xml:
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

Product2_map.xml:
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

FandB_1.xml:
<content>
  <meta>
    <id><!-- can become empty --></id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>
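
The INPUT to OUTPUT step described above could be sketched in Java roughly as follows. This is only an illustration under several assumptions, not a worked-out solution from the thread: the class name TopicDeduplicator and its methods are hypothetical, each topic's <body> is assumed to have been extracted already as a serialized string, duplicates are detected by a SHA-256 hash of the whitespace-normalized body, and shared topics are named FandB_1.xml, FandB_2.xml, and so on, following the naming in the expected output. Whitespace collapsing is a stand-in for proper XML canonicalization, so bodies that differ only in attribute order or whitespace between tags would not be matched.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: plans which topic hrefs can share one generic topic.
final class TopicDeduplicator {

    // Collapse whitespace runs so that line-wrapping differences
    // do not make otherwise identical bodies look different.
    static String normalize(String bodyXml) {
        return bodyXml.replaceAll("\\s+", " ").trim();
    }

    // Hex-encoded SHA-256 of the normalized body, used as the duplicate key.
    static String fingerprint(String bodyXml) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(bodyXml).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Input: original topic href -> serialized <body> content.
    // Output: original topic href -> href of the shared generic topic.
    static Map<String, String> planRewrites(Map<String, String> bodiesByHref) {
        Map<String, String> sharedByFingerprint = new LinkedHashMap<>();
        Map<String, String> rewrites = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : bodiesByHref.entrySet()) {
            String fp = fingerprint(entry.getValue());
            String shared = sharedByFingerprint.get(fp);
            if (shared == null) {
                // First time this body is seen: allocate the next generic topic name.
                shared = "features-benefits/FandB_" + (sharedByFingerprint.size() + 1) + ".xml";
                sharedByFingerprint.put(fp, shared);
            }
            rewrites.put(entry.getKey(), shared);
        }
        return rewrites;
    }

    // Naive textual rewrite of href attributes in a serialized map.
    // A real implementation would rather do this with XSLT or a DOM parser.
    static String rewriteMap(String mapXml, Map<String, String> rewrites) {
        String result = mapXml;
        for (Map.Entry<String, String> entry : rewrites.entrySet()) {
            result = result.replace(
                "href=\"" + entry.getKey() + "\"",
                "href=\"" + entry.getValue() + "\"");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> bodies = new LinkedHashMap<>();
        bodies.put("features-benefits/Product1_FandB.xml",
            "<body><p>Suitable for logic level gate drive sources</p></body>");
        bodies.put("features-benefits/Product2_FandB.xml",
            "<body><p>Suitable for logic level\n gate drive sources</p></body>");
        Map<String, String> rewrites = planRewrites(bodies);
        String map = "<map><features-benefits-ref "
            + "href=\"features-benefits/Product2_FandB.xml\"/></map>";
        System.out.println(rewrites);
        System.out.println(rewriteMap(map, rewrites));
    }
}
```

For a few thousand files, a single pass that hashes each body keeps the comparison linear in the number of topics instead of comparing every pair. Writing out the generic topic files themselves (with the emptied <id>) is left out of the sketch.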



