xsl-list

[xsl] deduplicating information in XML files

2012-10-12 07:03:04
Hi all,

This time I have a rather challenging task at hand.  Let me first describe the 
use case.  We have lots of product information stored in XML.  Some of that 
information describes:
- Technical applications
- Features and benefits
- Technical summary

One of the problems is that many products share e.g. the same features and 
benefits because they belong to the same product family or group.  But as we 
stored that info per product, it got duplicated.  Now we want to deduplicate 
that info by generating DITA maps and topics (both are just XML).  For 
simplicity, let's assume we generate the content below for product1 and 
product2.  The goal is to get from INPUT to OUTPUT in three steps: check 
whether the bodies of the linked topics are duplicates, create one generic 
topic for each set of duplicates, and rewrite the links in the maps to point 
to that single topic.  I have XSLT / XQuery (XMLDB) and Java at my disposal, 
but I'm not sure which will be the easiest way to get the job done.  Keep in 
mind also that my INPUT will contain a few thousand files (maps and linked 
topics) and that I will need to deduplicate the whole set ;-)

Thanks in advance for any input,
Robby

INPUT

Product1_map.xml:
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

Product1_FandB.xml:
<content>
  <meta>
    <id>product1</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Product2_map.xml:
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

Product2_FandB.xml:
<content>
  <meta>
    <id>product2</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Expected output:

Product1_map.xml:
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

Product2_map.xml:
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

FandB_1.xml:
<content>
  <meta>
    <id><!-- can become empty --></id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>
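
To make the intended steps concrete, here is a minimal Java sketch of the duplicate check and the link rewrite plan. It is only one possible approach, not a finished solution. Everything named here is hypothetical: the class TopicDeduplicator, the methods bodyKey and planRewrites, and the choice to compare bodies by element names plus whitespace-normalized text while ignoring attributes and comments. The generic names follow the FandB_1.xml pattern from the expected output. Topics are passed in as strings keyed by their href, so reading the files and writing the rewritten maps is left out.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TopicDeduplicator {

    // Builds a comparison key from the <body> of a topic: element names plus
    // whitespace-normalized text. Attributes and comments are ignored (assumption).
    static String bodyKey(String topicXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(topicXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse topic", e);
        }
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            throw new IllegalArgumentException("topic has no <body>");
        }
        body.normalize(); // merge adjacent text nodes before comparing
        StringBuilder key = new StringBuilder();
        appendCanonical(body, key);
        return key.toString();
    }

    // Appends a simplified canonical form of the node and its descendants.
    private static void appendCanonical(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue().trim().replaceAll("\\s+", " "));
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append('<').append(node.getNodeName()).append('>');
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                appendCanonical(c, out);
            }
            out.append("</").append(node.getNodeName()).append('>');
        }
    }

    // Maps each original topic href to the href of the generic topic that
    // should replace it. Topics with equal body keys share one generic topic.
    static Map<String, String> planRewrites(Map<String, String> topicXmlByHref) {
        Map<String, String> genericByKey = new HashMap<>();
        Map<String, String> rewrites = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : topicXmlByHref.entrySet()) {
            String key = bodyKey(e.getValue());
            String generic = genericByKey.get(key);
            if (generic == null) {
                generic = "features-benefits/FandB_" + (genericByKey.size() + 1) + ".xml";
                genericByKey.put(key, generic);
            }
            rewrites.put(e.getKey(), generic);
        }
        return rewrites;
    }

    public static void main(String[] args) {
        String body = "<body><p>Suitable for logic level gate drive sources</p></body>";
        Map<String, String> topics = new LinkedHashMap<>();
        topics.put("features-benefits/Product1_FandB.xml",
                "<content><meta><id>product1</id></meta>" + body + "</content>");
        topics.put("features-benefits/Product2_FandB.xml",
                "<content><meta><id>product2</id></meta>" + body + "</content>");
        // Each map's href would then be replaced by the value found here.
        planRewrites(topics).forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

The rewrite plan is a plain lookup table, so the same table could just as well drive an XSLT or XQuery pass over the maps. For a few thousand files the full key strings should fit in memory; hashing each key (for example with SHA-256) would be an option if they do not.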

