xsl-list

Re: [xsl] Advice on dictionary conversion

2011-01-18 11:55:26
[oops, forgot to send this!]

On Mon, 2011-01-17 at 20:14 +0000, Ciarán Ó Duibhín wrote:
I wish to convert a bilingual dictionary from MS-Word format to 
"properly"-tagged XML, and I hope I may ask for some comment on the 
feasibility of this, using XSLT or otherwise.

I've done a lot of this in the past, too, and still do sometimes.

As others have said, it can be very time-consuming.

I'd be tempted to try to go via the new XML Word format, although, since
I don't have access right now to a recent copy of MS Word, I don't know
how good it is in practice. I'm guessing it's at least as ugly as the
OpenOffice format, *but* the good thing is that all the information will
be there, whereas most conversion programs will occasionally make
mistakes and lose some formatting.

Again, as others have said, use a pipeline of small tasks.  Tie them
together with Make or Ant or a shell script that does a check at each
stage and quits on errors. Mine are usually (following a practice of
Kate Hamilton) called "runme" if they are shell scripts, and "makefile"
for use with make. I use,
    xmllint --noout somefile.xml || exit 1
in shell scripts, after each stage.
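
The same check-and-quit idea can be sketched in Python, for anyone who would rather not use make or a shell script. This is only an illustration under assumptions: the two stage functions are made-up placeholders, not real conversion steps, and parsing with the standard library is a rough stand-in for the xmllint check (it tests well-formedness only).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML (rough stand-in for xmllint --noout)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def run_pipeline(xml_text, stages):
    """Apply each (name, function) stage in turn; stop at the first bad output."""
    for name, stage in stages:
        xml_text = stage(xml_text)
        if not is_well_formed(xml_text):
            raise ValueError(f"stage {name!r} produced output that is not well-formed")
    return xml_text


# Hypothetical stages, for illustration only; real ones would do far more.
def drop_converter_comment(text):
    return text.replace("<!-- made by converter -->", "")


def rename_paragraphs(text):
    return text.replace("<p>", "<para>").replace("</p>", "</para>")


source = "<doc><!-- made by converter --><p>abair</p></doc>"
result = run_pipeline(source, [("drop-comment", drop_converter_comment),
                               ("rename", rename_paragraphs)])
print(result)  # <doc><para>abair</para></doc>
```

Raising an exception plays the role of "|| exit 1": nothing after the failing stage runs, and the message names the stage to look at.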

Typical tasks might be
* convert the Word file :-)
* normalise, e.g. to remove irrelevant output from the converter, and to
  make the next step as easy as possible...
* identify the start of each article or entry in the dictionary, and the
  primary word or phrase defined; the output of this should wrap each
  entry as <entry><head>Word being defined</head> more stuff here
  </entry>
* add an XML id attribute to identify each entry (I often end up doing
  this step in Perl, using a hash, although I expect streaming XSLT 3
  with grouping will make it easier in the future)
* identify any dictionary entries that are out of order; then either fix
  the script that went wrong (most likely), or add an exception to the
  checker, or, if it's an option, move the entry in the dictionary to
  the right place.
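
The last two steps might look roughly like this in Python rather than Perl. It is a sketch under assumptions: the <entry>/<head> structure is the one described above, the id scheme ("e-" plus the headword plus a counter) is invented for illustration, and plain case-insensitive string comparison stands in for whatever collation the dictionary really uses.

```python
import xml.etree.ElementTree as ET


def add_entry_ids(root):
    """Give every <entry> an id attribute built from its <head> text.

    A dict (the "hash") counts how often each headword has been seen,
    so that homographs still get distinct ids.
    """
    seen = {}
    for entry in root.iter("entry"):
        head = entry.findtext("head", default="").strip()
        base = "e-" + "".join(ch if ch.isalnum() else "-" for ch in head.lower())
        seen[base] = seen.get(base, 0) + 1
        entry.set("id", f"{base}-{seen[base]}")


def out_of_order(root, key=str.casefold):
    """Return the headwords that sort before the headword just before them."""
    problems = []
    previous = None
    for entry in root.iter("entry"):
        head = entry.findtext("head", default="").strip()
        if previous is not None and key(head) < key(previous):
            problems.append(head)
        previous = head
    return problems


sample = ("<dict><entry><head>bad</head></entry>"
          "<entry><head>abair</head></entry>"
          "<entry><head>bad</head></entry></dict>")
root = ET.fromstring(sample)
add_entry_ids(root)
print([e.get("id") for e in root.iter("entry")])  # ['e-bad-1', 'e-abair-1', 'e-bad-2']
print(out_of_order(root))  # ['abair']
```

The key argument is where a proper sort key for the dictionary's language would go; the default is only a placeholder.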

Important - if you for any reason change the Word file, keep the
original!!!!

The same applies if you use an online converter and then edit its output
by hand.


The main problem is my ignorance of XSLT, although I am an experienced 
general programmer.  A particular difficulty is that "italics" (for example) 
might be used for more than one part of the dictionary entry.  However the 
choice of which tag to replace it with might well be decided by the target 
DTD (if I were to formulate it).  Is this an example of what people 
sometimes refer to on this list as "schema-aware XSLT"?  If so, I have no 
idea how to make my XSLT schema-aware.

It's more likely an example of being context sensitive, and you might
end up with logic like,
    if we're in the body of an entry {
       if the word "Example:" or "Examples:" occurs in bold after
       this italic element {
           we have not yet reached the examples, so it's something else
       } else {
           it's probably an example
       }
    } else {
       it's a qualifier on the headword, or we're defining a phrase
    }

You may find it helpful to use markup like,
    <i role="example">...</i>
or
    <i role="example" why="script6:rule14 inExamples">...</i>
The second form can make it *much* easier to debug everything.
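
As a rough illustration of that logic together with the role/why markup, here is a Python sketch. Everything about the input is an assumption: it supposes each <entry> has a <head> and a <body>, that italic and bold runs are <i> and <b> elements directly inside them, and the rule names in the why attribute are invented.

```python
import xml.etree.ElementTree as ET

EXAMPLE_LABELS = {"Example:", "Examples:"}


def label_follows(siblings, index):
    """True if a bold "Example:" or "Examples:" label comes after position index."""
    return any(el.tag == "b" and (el.text or "").strip() in EXAMPLE_LABELS
               for el in siblings[index + 1:])


def tag_italics(entry):
    """Set role and why on each <i> that is a direct child of <head> or <body>."""
    # Outside the body: a qualifier on the headword (or a phrase being defined).
    for i_el in entry.findall("head/i"):
        i_el.set("role", "qualifier")
        i_el.set("why", "sketch:rule3 outsideBody")
    body = entry.find("body")
    if body is None:
        return
    siblings = list(body)
    for index, el in enumerate(siblings):
        if el.tag != "i":
            continue
        if label_follows(siblings, index):
            # The examples have not started yet, so this is something else.
            el.set("role", "other")
            el.set("why", "sketch:rule1 beforeExamples")
        else:
            el.set("role", "example")
            el.set("why", "sketch:rule2 inExamples")


entry = ET.fromstring(
    "<entry><head>abair <i>v.</i></head>"
    "<body><i>irreg.</i> say <b>Example:</b> <i>abair é</i></body></entry>")
tag_italics(entry)
print([i.get("role") for i in entry.iter("i")])  # ['qualifier', 'other', 'example']
```

When a classification looks wrong in the output, the why attribute says which rule to go and read.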

Another problem is that the dictionary contains quite a few "mistakes" which 
are all but invisible in Word, eg. a single space might be inadvertently 
bolded in an unbold field.  This sort of thing is faithfully copied by a 
converter and complicates the starting XML unnecessarily, of course.

One possibility is to fix some such errors in a COPY of the Word file --
I have "input-handedited.txt" for one of my conversion projects.  If
there are many such errors, maybe when you are more familiar with the
technologies you can write a script to fix most of them.
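
Such a script might, for instance, work on the converted XML rather than on the Word file itself. The sketch below assumes the converter's output has been normalised so that bold and italic are <b> and <i> elements (an assumption, not something any particular converter guarantees), and it only handles the stray-space case: a formatting element with nothing but whitespace in it is replaced by that whitespace.

```python
import xml.etree.ElementTree as ET


def unwrap_blank_formatting(root, tags=("b", "i")):
    """Replace formatting elements that hold only whitespace (or nothing)
    with their text, and return how many were unwrapped."""
    count = 0
    # Take a snapshot of the elements first, since the tree is modified below.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in tags or len(child) or (child.text or "").strip():
                continue
            # Keep the whitespace and whatever text followed the element.
            spliced = (child.text or "") + (child.tail or "")
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + spliced
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + spliced
            parent.remove(child)
            count += 1
    return count


para = ET.fromstring("<p>un<b> </b>bold <b>word</b></p>")
print(unwrap_blank_formatting(para))        # 1
print(ET.tostring(para, encoding="unicode"))  # <p>un bold <b>word</b></p>
```

Returning the count makes it easy to log how many fixes each run made, which helps when comparing against the hand-edited copy.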

I would be grateful for advice as to how best to proceed.  I took on this 
job as a favour, hoping it would help me to learn something of these 
technologies,

It can be a lot of work. Watch out! But it can be very rewarding, too.

Along with all the low-level advice, don't forget to be very clear about
the goal -- is it to make a semantically marked-up database for
querying, e.g. for linguists to use, or to make something that looks
more or less the same when you print it out?

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org

