xsl-list

RE: [xsl] plea for help...

2006-03-09 11:17:19
Walter,

At Mulberry we recently gave a seminar on the topic of converting HTML to XML, so the issues are fresh in my mind.

You're facing a fairly complex set of problems, but they can be simplified (as you are discovering) by distinguishing between

A. The syntactic conversion of HTML to XML
B. The "semantic" conversion from HTML display-oriented tagging to a stronger form of tagging in XML.

Other contributors have posted links to tools that help you with job A -- Tidy and its ilk -- and it appears you've got a handle on that. This work can be largely or entirely automated. Of course, what you get out the other end is still HTML tagging, albeit in XML syntax (it'll be either valid XHTML or a similar XML-compliant HTML), so, as you're finding, it isn't yet ready for everything you might do with well-designed XML markup. But having it in XML syntax is already a big step, because you can then use more and better tools on it to take it the rest of the way -- including (which is why the question isn't totally off topic here) XSLT.
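
As a rough illustration of what job A buys you, the sketch below checks whether a piece of markup is well-formed XML. It does not run Tidy; it only tests the kind of output Tidy is meant to produce. The function name and the two sample strings are made up for this example, and it uses only Python's standard library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: typical tag soup, and the same content in XML syntax.
tag_soup = "<p>First line<br>second line"
tidied = "<p>First line<br/>second line</p>"

print(is_well_formed(tag_soup))
print(is_well_formed(tidied))
```

Once the markup passes a check like this, any XML tool, XSLT included, can read it, even though the tagging itself is still display-oriented HTML.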

To do conversion B, however, is an entirely different kettle of fish -- and it is beyond the scope of this list, I'm afraid.

As long as I'm already on it, however, I am willing to comment that the scope and difficulty of conversion B are directly related both to the quality of tagging in your source (HTML can be "clean" or "dirty", consistent or messy, even after it's made XML-conformant in its syntax) and, most dramatically, to the nature of your target tag set and to the feasibility of mapping from the HTML you have to this target.

Sometimes this conversion can be automated; sometimes it can be mostly automated; often it requires a good measure of attention from human beings to determine how things should be converted in any given case.
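
A minimal sketch of the "mostly automated" case, under stated assumptions: a lookup table renames the HTML elements that have an obvious counterpart in the target, and anything else is collected for a person to look at. The target names (document, content, title, para, emphasis) are hypothetical, not a recommended tag set, and the sample input has no namespace; real XHTML carries the XHTML namespace, which ElementTree would expose as a prefix on every tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from HTML element names to an invented target tag set.
TAG_MAP = {
    "html": "document",
    "body": "content",
    "h1": "title",
    "p": "para",
    "em": "emphasis",
}


def convert(elem, unmapped):
    """Copy elem recursively, renaming mapped tags and recording the rest."""
    name = TAG_MAP.get(elem.tag)
    if name is None:
        unmapped.add(elem.tag)  # no rule: leave the tag as-is for human review
        name = elem.tag
    out = ET.Element(name)      # attributes are deliberately not copied
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(convert(child, unmapped))
    return out


source = ET.fromstring(
    '<body><h1>Report</h1><p>Some <em>key</em> point.</p>'
    '<font color="red">Note</font></body>'
)
unmapped = set()
result = convert(source, unmapped)

print(ET.tostring(result, encoding="unicode"))
print(sorted(unmapped))  # element names that still need a human decision
```

The list of unmapped names is the interesting output: it shows how much of a given source falls outside what the mapping can decide on its own.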

The design of that target markup, however, is critical; this factor alone can make or break your project. There is an infinity of things potentially expressible in XML that a machine, even one programmed with very sophisticated heuristics, will not know how to tag correctly, even when it's starting with some kind of HTML tagging.

Accordingly, successful efforts at this kind of conversion generally include both designing the target format up front and controlling its design carefully. Design it to concrete requirements, not just to what you think might be useful or fun to have some day, and don't be over-ambitious. You can't convert to a target you can't see. But if you have a design, the places where conversion is easy or difficult will fairly quickly come to light and you can figure out how to deal with them.

I think earlier someone suggested you prototype this first before attempting it. That's very good advice.

There are also professionals who will gladly share their experience in this area, if you are in a position to save money over the long term by investing it intelligently in the near term.

Good luck,
Wendell

At 11:52 AM 3/9/2006, you wrote:

On Wed, March 8, 2006 5:28 pm, Florent Georges wrote:
> Walter Torres wrote:
>
>
>> 1) convert HTML into well formed HTML (many are not)
>> 2) convert well formed HTML into xHTML
>>
>
> Tidy HTML will give you XHTML from HTML.

Yes, just found it late last night. Been playing with it all morning.

Getting it to work in PHP5 is what I'm focusing on now.


>> 3) convert xHTML into XML
>>
>
> An XHTML instance is already an XML instance.

Yes, I understand that.

But I'm trying to get this to a "pure" XML, with no display characteristics
markup whatsoever!

The idea here is to have as "raw/naked" a file as possible, so that any
system can display it as it sees fit.


> If you want to translate the instance from XHTML to an other XML document
> type, XSLT may be of great help.

Sure, that way I can create a look for website A which is different than
website B, then create a text or RTF only or even email text or HTML or
even via web-phone.

This is why I was asking about how different folks handle this kind of
content. What kind of markup it contains, etc.


>> 4) create XSLTs to transpose XML back to HTML for page display
>
> Here again, XSLT may be of great help.

Right.

Thanks

Walter


======================================================================
Wendell Piez                            
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--
