xsl-list

Re: [xsl] Advice on dictionary conversion

2011-01-18 17:16:23
The DITA for Publishers project (http://dita4publishers.sourceforge.net)
includes a "Word to DITA" framework that lets you define a mapping from
styled Word documents to (DITA) XML using a declarative configuration file,
plus custom XSLT if needed (that is, it provides defined extension points
and modes for plugging in your own code to handle whatever the declarative
mapping can't).

The target is DITA XML, which may or may not be suitable for your ultimate
purposes, but it's a good starting point in any case.

The framework is DITA-specific because it depends on some unique
characteristics of DITA to make it possible to have a declarative mapping to
arbitrary tag names (but not arbitrary structures, at least not without
custom code).
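
To make the idea of a declarative style-to-tag mapping concrete, here is a
minimal sketch in Python. It is not the DITA for Publishers configuration
format or its code; the style names, the STYLE_TO_TAG table, and the
map_paragraphs function are all hypothetical, chosen only for illustration.
It shows why mapping to arbitrary tag names is easy (a table lookup) while
producing arbitrary structures is not: the output here is a flat list, and
any nesting would need extra logic.

```python
from xml.sax.saxutils import escape

# Hypothetical style-to-tag table; the names are illustrative only.
STYLE_TO_TAG = {
    "Headword": "HEADWORD",
    "Explanation": "EXPLANATION",
    "Example": "EXAMPLE",
}


def map_paragraphs(paragraphs, style_to_tag=STYLE_TO_TAG, fallback="UNMAPPED"):
    """Turn (style_name, text) pairs into a flat list of tagged XML strings.

    Styles missing from the table get the fallback tag, so unmapped
    content stays visible instead of being silently dropped.
    """
    out = []
    for style, text in paragraphs:
        tag = style_to_tag.get(style, fallback)
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return out
```

Changing the target tag names only means editing the table, which is the
appeal of the declarative approach.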

The framework is documented in the DITA for Publishers User's Guide and
there's a video that shows the process of setting up a conversion
configuration.

Note that it requires consistently styled Word documents. If you already have
those, you're in good shape. If you don't, you may be able to use Word's
search and replace features to replace consistent formatting with consistent
styles.

It also requires that you use DOCX (the XML Word format) but you can convert
any .DOC file to DOCX using either Word itself or a separate batch facility
provided by Microsoft.
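
Because a DOCX file is a ZIP archive containing XML, you can check how
consistently a document is styled before attempting any conversion. The
following is a rough sketch using only the Python standard library, not part
of the framework. It assumes the main document part is at word/document.xml,
which is where Word normally puts it; paragraph_styles is a hypothetical
helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by DOCX documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_styles(docx_file):
    """Yield (style_id, text) for each paragraph in the main document part.

    style_id is None when the paragraph has no explicit paragraph style.
    Assumes the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    for p in root.iter(f"{{{W}}}p"):
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield style_id, text
```

Counting the distinct style_id values over a whole file gives a quick picture
of whether the styling is regular enough for a mapping-based conversion.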

Cheers,

Eliot


On 1/17/11 2:14 PM, "Ciarán Ó Duibhín"
<ciaran(_at_)oduibhin(_dot_)freeserve(_dot_)co(_dot_)uk> wrote:

I wish to convert a bilingual dictionary from MS-Word format to
"properly"-tagged XML, and I hope I may ask for some comment on the
feasibility of this, using XSLT or otherwise.

First I found several programs which automatically convert the Word files to
XSL-FO, either from .doc or .rtf.  Of those I examined, my preferred one is
the Novosoft converter (http://www.rtf-to-xml.com/).  I painlessly converted
the entire letter D using their online interface.

Now I have to replace the presentational tags by tags like <HEADWORD>,
<EXPLANATION>, <EXAMPLE> etc.  I tried doing this manually, but it is not
practical.  Besides, I have to start from scratch again for each new letter
of the alphabet.  I have zero experience of XSLT, but it seemed that an XSLT
program might be what was needed.  I started with XRay2 (really nice for a
beginner in some ways) and have now moved on to the Essential XML Editor
with Saxon.  But progress has been minimal.

The main problem is my ignorance of XSLT, although I am an experienced
general programmer.  A particular difficulty is that "italics" (for example)
might be used for more than one part of the dictionary entry.  However the
choice of which tag to replace it with might well be decided by the target
DTD (if I were to formulate it).  Is this an example of what people
sometimes refer to on this list as "schema-aware XSLT"?  If so, I have no
idea how to make my XSLT schema-aware.
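
One common way to handle formatting that is ambiguous on its own is to decide
by position within the entry rather than by the formatting alone. The sketch
below illustrates that idea in Python. It assumes a purely hypothetical entry
layout (bold headword, italic part of speech directly after it, later italics
for examples); the POS tag and the tag_runs function are inventions for this
example, and a real dictionary would need its own rules.

```python
def tag_runs(runs):
    """Assign a semantic tag to each (fmt, text) run using its position.

    Assumed layout (hypothetical): bold = headword; italic directly after
    the headword = part of speech; any later italic = example; everything
    else = explanation.
    """
    tagged = []
    prev_tag = None
    for fmt, text in runs:
        if fmt == "bold":
            tag = "HEADWORD"
        elif fmt == "italic":
            # Same formatting, different meaning depending on what came before.
            tag = "POS" if prev_tag == "HEADWORD" else "EXAMPLE"
        else:
            tag = "EXPLANATION"
        tagged.append((tag, text))
        prev_tag = tag
    return tagged
```

The same kind of rule can be written in XSLT by testing the preceding sibling
of an element in a template's match pattern.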

Another problem is that the dictionary contains quite a few "mistakes" which
are all but invisible in Word, e.g. a single space might be inadvertently
bolded in a non-bold field.  This sort of thing is faithfully copied by a
converter and complicates the starting XML unnecessarily, of course.
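
Stray formatting of this kind can often be cleaned up in a separate pass
before any tagging is attempted. The following Python sketch shows one
possible approach, assuming the converted text has already been reduced to a
list of (format, text) runs; that representation and the normalize_runs name
are assumptions made for illustration.

```python
def normalize_runs(runs):
    """Clean up formatting noise in a list of (fmt, text) runs.

    A whitespace-only run is absorbed into the preceding run regardless
    of its own format, and adjacent runs with the same format are merged.
    A whitespace-only run at the very start is kept as it is.
    """
    merged = []
    for fmt, text in runs:
        if merged and (fmt == merged[-1][0] or text.strip() == ""):
            prev_fmt, prev_text = merged[-1]
            merged[-1] = (prev_fmt, prev_text + text)
        else:
            merged.append((fmt, text))
    return merged
```

For example, a plain run, a bold single space, and another plain run collapse
into one plain run, which removes the spurious bold element from the input.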

I would be grateful for advice as to how best to proceed.  I took on this
job as a favour, hoping it would help me to learn something of these
technologies, but it seems now there is too much to learn on one's own in
any reasonably short space of time (XSLT is not for amateurs :-( ).  Perhaps I
should advise having the job done professionally.  Unless there is
something I am missing...

On a related matter, I have recently discovered LIFT as a particular XML
format for lexicographical work (http://code.google.com/p/lift-standard/).
Any experience of that as a target format for XSLT would also be of
interest.

Thanks,
Ciarán Ó Duibhín.




--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

