Wendell,
I attended.
It was very well done: a great help for beginners, with good
insights for those with lots of battle scars.
Thanks,
Mike Ferrando
Library Technician
Library of Congress
Washington, DC
202-707-4454
--- Wendell Piez <wapiez(_at_)mulberrytech(_dot_)com> wrote:
Walter,
At Mulberry we recently gave a seminar on the topic of converting
HTML to XML, so the issues are fresh in my mind.
You're facing a fairly complex set of problems, but they can be
simplified (as you are discovering) by distinguishing between
A. The syntactic conversion of HTML to XML
B. The "semantic" conversion from HTML display-oriented tagging to
a stronger form of tagging in XML.
Other contributors have posted links to tools that help you with
job A -- Tidy and its ilk -- and it appears you've got a handle on
that. This work can be largely or entirely automated. Of course,
what you get out the other end is still HTML tagging, albeit in XML
syntax (it'll be either valid XHTML or a similar XML-compliant
HTML), so, as you're finding, it's not good to go for everything
you might do with well-designed XML markup. But to have it
syntactically XML is already a big step, because you can then use
more and better tools on it to take it the rest of the way --
including (which is why the question isn't totally off topic here)
XSLT.
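To illustrate why XML syntax alone is already a big step, here is a
minimal sketch in Python (standard library only). It assumes a
hand-written sample standing in for the output of job A; the sample
document and the helper names local_name and tag_inventory are
made up for illustration and are not output from Tidy or part of
any tool mentioned here. Once the input is well-formed, an ordinary
XML parser can take stock of the tagging, which also shows that the
vocabulary is still display-oriented HTML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample standing in for the result of job A:
# well-formed XML syntax, but still HTML tagging.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Release notes</title></head>
  <body>
    <p><b>Version 2.1</b></p>
    <p><font size="2">Fixed the <i>import</i> bug.</font></p>
  </body>
</html>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on element names."""
    return tag.split("}", 1)[1] if tag.startswith("{") else tag


def tag_inventory(xml_text):
    """Count the element names used in a well-formed document.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed, which is exactly what job A is meant to prevent.
    """
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


inventory = tag_inventory(SAMPLE)
for name, count in sorted(inventory.items()):
    print(name, count)
```

An inventory like this is only a starting point, but it is the kind
of thing that cannot be done reliably on tag soup.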
To do conversion B, however, is an entirely different kettle of
fish -- and it is beyond the scope of this list, I'm afraid.
As long as I'm already on it, however, I am willing to comment that
the scope and difficulty of conversion B are directly related both
to the quality of tagging in your source (HTML can be "clean" or
"dirty", consistent or messy, even after it's made XML-conformant
in its syntax) and, most dramatically, to the nature of your target
tag set and to the feasibility of mapping from the HTML you have to
this target.
Sometimes this conversion can be automated; sometimes it can be
mostly automated; often it requires a good measure of attention
from human beings to determine how things should be converted in
any given case.
The design of that target markup, however, is critical; this factor
alone can make or break your project. There is an infinity of
things potentially expressible in XML, which a machine, even one
programmed with very sophisticated heuristics, will not know how to
tag correctly, even when it's starting with some kind of HTML
tagging.
Accordingly, successful efforts at this kind of conversion
generally include both designing that format up front and
controlling its design carefully. Design it to concrete
requirements, not just to what you think might be useful or fun to
have some day, and don't be over-ambitious. You can't convert to a
target you can't see. But if you have a design, the places where
conversion is easy or difficult will fairly quickly come to light
and you can figure out how to deal with them.
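As a rough sketch of how a written-down design brings the easy and
hard spots to light, the following Python (standard library only)
checks a source document against a mapping table. Everything in it
is assumed for illustration: the recipe-style target vocabulary,
the MAPPING table, the sample input, and the survey function are
hypothetical, not taken from any real project. Elements are keyed
by name and class attribute; anything without an entry in the table
is reported as needing a design decision or human attention.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical target design: (HTML element, class attribute) -> target element.
MAPPING = {
    ("div", "recipe"): "recipe",
    ("h1", None): "title",
    ("li", "ingredient"): "ingredient",
    ("p", "step"): "step",
}

# Hypothetical source, already well-formed (job A is done).
SAMPLE = """<div class="recipe">
  <h1>Pancakes</h1>
  <ul>
    <li class="ingredient">2 eggs</li>
    <li class="ingredient">1 cup flour</li>
  </ul>
  <p class="step">Mix and fry.</p>
  <p><font color="red">Serves four.</font></p>
</div>"""


def survey(xml_text, mapping):
    """Count source elements that the mapping covers and those it does not."""
    root = ET.fromstring(xml_text)
    mapped, unmapped = Counter(), Counter()
    for el in root.iter():
        # el.get("class") is None when the attribute is absent.
        key = (el.tag, el.get("class"))
        if key in mapping:
            mapped[key] += 1
        else:
            unmapped[key] += 1
    return mapped, unmapped


mapped, unmapped = survey(SAMPLE, MAPPING)
print("covered by the design:", dict(mapped))
print("needs a decision:", dict(unmapped))
```

Run over a representative sample of your pages, a report like this
is a cheap form of the prototyping suggested earlier: a long
"needs a decision" list tells you where the source is messy or the
target design is incomplete.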
I think earlier someone suggested you prototype this first before
attempting it. That's very good advice.
There are also professionals who will gladly share their experience
in this area, if you are in a position to save money over the long
term by investing it intelligently in the near term.
Good luck,
Wendell
At 11:52 AM 3/9/2006, you wrote:
> On Wed, March 8, 2006 5:28 pm, Florent Georges wrote:
> > Walter Torres wrote:
> > > 1) convert HTML into well formed HTML (many are not)
> > > 2) convert well formed HTML into XHTML
> > Tidy HTML will give you XHTML from HTML.
> Yes, just found it late last night. Been playing with it all
> morning. Getting it to work in PHP5 is what I'm focusing on now.
> > > 3) convert XHTML into XML
> > An XHTML instance is already an XML instance.
> Yes, I understand that.
> But I'm trying to get this to a "pure" XML, with no display
> characteristics markup whatsoever!
> The idea here is to have as "raw/naked" a file as possible; that
> way any system can display this as they see fit.
> > If you want to translate the instance from XHTML to another XML
> > document type, XSLT may be of great help.
> Sure, that way I can create a look for website A which is different
> than website B, then create a text or RTF only version, or even
> email text or HTML, or even output via web-phone.
> This is why I was asking about how different folks handle this kind
> of content. What kind of markup it contains, etc.
> > > 4) create XSLTs to transpose XML back to HTML for page display
> > Here again, XSLT may be of great help.
> Right.
> Thanks
> Walter
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.
http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--