RE: [xsl] plea for help...
2006-03-09 11:17:19
Walter,
At Mulberry we recently gave a seminar on the topic of converting
HTML to XML, so the issues are fresh in my mind.
You're facing a fairly complex set of problems, but they can be
simplified (as you are discovering) by distinguishing between
A. The syntactic conversion of HTML to XML
B. The "semantic" conversion from HTML display-oriented tagging to a
stronger form of tagging in XML.
Other contributors have posted links to tools that help you with job
A -- Tidy and its ilk -- and it appears you've got a handle on that.
This work can be largely or entirely automated. Of course, what you
get out the other end is still HTML tagging, albeit in XML syntax
(it'll be either valid XHTML or a similar XML-compliant HTML), so as
you're finding, it's not yet good to go for everything you might do with
well-designed XML markup. But to have it in XML syntax is already
a big step, because you can then use more and better tools on it to
take it the rest of the way -- including (which is why the question isn't
totally off topic here) XSLT.
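
To make job A concrete: once Tidy or a similar tool has done its work,
any XML parser should accept the result. Here is a minimal sketch in
Python, standard library only. The function name is_well_formed and the
two sample strings are made up for illustration, and this checks
well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical "tag soup": <p> is never closed and <br> is not self-closed.
tag_soup = "<p>Unclosed paragraph<br>second line"

# The same content in XML syntax, as a cleanup tool might emit it.
xhtml = "<p>Closed paragraph<br/>second line</p>"

print(is_well_formed(tag_soup))
print(is_well_formed(xhtml))
```

Anything that passes a check like this can be fed to an XSLT processor
or any other XML tool.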
To do conversion B, however, is an entirely different kettle of fish
-- and it is beyond the scope of this list, I'm afraid.
As long as I'm already on it, however, I am willing to comment that
the scope and difficulty of conversion B are directly related both to
the quality of tagging in your source (HTML can be "clean" or
"dirty", consistent or messy, even after it's made XML-conformant in
its syntax) and, most dramatically, to the nature of your target tag
set and to the feasibility of mapping from the HTML you have to this target.
Sometimes this conversion can be automated; sometimes it can be
mostly automated; often it requires a good measure of attention from
human beings to determine how things should be converted in any given case.
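
As a rough illustration of the "mostly automated" case, the sketch below
renames elements according to a lookup table and reports every element
it has no rule for, so a person can decide what to do with it. It is in
Python rather than XSLT only to keep it self-contained. The target names
(title, para, emph, strong), the TAG_MAP table and the convert function
are all invented for this example, and the input is assumed to carry no
namespace, which real XHTML usually does.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from display-oriented names to a made-up target
# vocabulary. A real project would derive this from its own design.
TAG_MAP = {"h1": "title", "p": "para", "i": "emph", "b": "strong"}


def convert(elem):
    """Return (converted copy, names of elements that had no mapping)."""
    unmapped = []

    def walk(src):
        name = TAG_MAP.get(src.tag)
        if name is None:
            unmapped.append(src.tag)  # flag for human review
            name = src.tag            # keep the old name so nothing is lost
        out = ET.Element(name)        # attributes are dropped in this sketch
        out.text, out.tail = src.text, src.tail
        for child in src:
            out.append(walk(child))
        return out

    return walk(elem), unmapped


source = ET.fromstring(
    "<div><h1>Intro</h1><p>Some <i>important</i> text<br/></p></div>"
)
result, needs_review = convert(source)
print(ET.tostring(result, encoding="unicode"))
print("needs review:", needs_review)
```

The elements that land in needs_review are where the human attention
goes. With clean, consistent source that list is short; with messy
source it grows quickly.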
The design of that target markup, however, is critical; this factor
alone can make or break your project. There is an
infinity of things potentially expressible in XML, which a machine,
even one programmed with very sophisticated heuristics, will not know
how to tag correctly, even when it's starting with some kind of HTML tagging.
Accordingly, successful efforts at this kind of conversion generally
include both designing that format up front, and controlling its
design carefully. Design it to concrete requirements, not just to
what you think might be useful or fun to have some day, and don't be
over-ambitious. You can't convert to a target you can't see. But if
you have a design, the places where conversion is easy or difficult
will fairly quickly come to light and you can figure out how to deal with them.
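
One cheap way to bring those places to light is to take an inventory of
the elements actually used in the source and compare it against the
elements your design has an answer for. The following Python sketch does
that. The function names, the sample document and the "designed" set are
assumptions for illustration only, and again the input is assumed to
have no namespace.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_inventory(markup: str) -> Counter:
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(markup)
    return Counter(el.tag for el in root.iter())


def split_by_design(inventory: Counter, designed: set):
    """Split the inventory into names the design covers and names it does not."""
    covered = {t: n for t, n in inventory.items() if t in designed}
    uncovered = {t: n for t, n in inventory.items() if t not in designed}
    return covered, uncovered


sample = (
    "<body><h1>A</h1><p>x</p><p>y <font>z</font></p>"
    "<table><tr><td>q</td></tr></table></body>"
)
designed = {"body", "h1", "p"}  # hypothetical: what the target design handles

covered, uncovered = split_by_design(tag_inventory(sample), designed)
print("covered:", covered)
print("uncovered:", uncovered)
```

Run over a representative sample of pages, the uncovered counts give a
first estimate of how much of the job is routine and how much will need
design decisions or hand work.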
I think earlier someone suggested you prototype this first before
attempting it. That's very good advice.
There are also professionals who will gladly share their experience
in this area, if you are in a position to save money over the long
term by investing it intelligently in the near term.
Good luck,
Wendell
At 11:52 AM 3/9/2006, you wrote:
On Wed, March 8, 2006 5:28 pm, Florent Georges wrote:
> Walter Torres wrote:
>
>
>> 1) convert HTML into well formed HTML (many are not)
>> 2) convert well formed HTML into xHTML
>>
>
> Tidy HTML will give you XHTML from HTML.
Yes, just found it late last night. Been playing with it all morning.
Getting it to work in PHP5 is what I'm focusing on now.
>> 3) convert xHTML into XML
>>
>
> An XHTML instance is already an XML instance.
Yes, I understand that.
But I'm trying to get this to a "pure" XML, with no display characteristics
in the markup whatsoever!
The idea here is to have as "raw/naked" a file as possible, so that any
system can display it as it sees fit.
> If you want to translate the instance from XHTML to another XML document
> type, XSLT may be of great help.
Sure, that way I can create a look for website A which is different from
website B, then create a text-only or RTF version, or even email text or
HTML, or even output for a web-phone.
This is why I was asking about how different folks handle this kind of
content, what kind of markup it contains, etc.
>> 4) create XSLTs to transpose XML back to HTML for page display
>
> Here again, XSLT may be of great help.
Right.
Thanks
Walter
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================