xsl-list

Re: identify sections in an xhtml document

2005-02-11 05:33:31
At 01:28:30 on 02/11/2005, Dean Maslic <dean(_dot_)maslic(_at_)gmail(_dot_)com> wrote to xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com:

I'm thinking of a generic approach that works with any site.
Some ideas I had were, e.g., calculate the total number of nodes, then go
through the block-level nodes (div, table, tr, ol, etc.) and calculate the
ratio between their number of nodes and the total number of nodes. If the
numbers are roughly the same (say, ratio > 0.9), don't label, go to the child
nodes and apply the same. If they are different, look for collections
of links (e.g. count(descendant::html:a) > 5) or the size of text nodes,
etc.
I'm sure there would be a way to do it for a generic 'standard' site
(i.e. a page that contains a top link bar, a left/right sidebar, and some
text/image content).

Hi,

The algorithms you have in mind can be implemented in XSLT, but I doubt they'll ever produce anything usable. They will only work on well-structured, consistently designed sites written in XHTML. But such sites typically already have a decent structure and/or well-chosen class attributes, from which you can easily derive the sections directly.

I don't think it will ever work with a "standard" website, which tends to be a messy, bloated tag soup.
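
For illustration only, here is a minimal sketch of the ratio heuristic quoted above, written in Python with the standard-library ElementTree parser rather than XSLT. The function names, the sample page, the label names and the min_text threshold are assumptions made up for this example; only the 0.9 ratio and the "more than 5 links" test come from the quoted message.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
# Block-level elements named in the quoted message, plus ul (an assumption).
BLOCK = {XHTML + name for name in ("div", "table", "tr", "ol", "ul")}


def node_count(elem):
    """Count the element itself plus all of its descendant elements."""
    return sum(1 for _ in elem.iter())


def walk(elem, total, labels, cutoff=0.9, min_links=5, min_text=20):
    """Label block-level children of elem, descending through wrappers."""
    for child in elem:
        is_block = child.tag in BLOCK
        if is_block and node_count(child) / total <= cutoff:
            # The block is clearly smaller than the page: try to label it.
            links = sum(1 for _ in child.iter(XHTML + "a"))
            text_len = len("".join(child.itertext()).strip())
            if links > min_links:
                kind = "links"
            elif text_len > min_text:  # hypothetical text-size threshold
                kind = "content"
            else:
                kind = "unknown"
            labels.append((child.get("id"), kind))
        else:
            # Non-block element, or a wrapper holding nearly the whole
            # page: do not label it, apply the same test to its children.
            walk(child, total, labels, cutoff, min_links, min_text)


def find_sections(xml_text):
    """Return (id, label) pairs for a well-formed XHTML document."""
    root = ET.fromstring(xml_text)
    labels = []
    walk(root, node_count(root), labels)
    return labels


# Hypothetical sample page: a wrapper div around a link bar and some text.
LINKS = "".join('<a href="/p%d">Page %d</a>' % (i, i) for i in range(16))
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="wrap">'
    '<div id="nav">' + LINKS + "</div>"
    '<div id="main">'
    "<p>Some article text here.</p>"
    "<p>Some article text here.</p>"
    "<p>Some article text here.</p>"
    "</div>"
    "</div>"
    "</body></html>"
)

if __name__ == "__main__":
    print(find_sections(SAMPLE))
```

On this tidy sample the wrapper div is skipped and the two inner divs are labelled as a link collection and as content. Note that ET.fromstring already raises an error on anything that is not well-formed XML, so the sketch never gets as far as the heuristic on tag soup.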


regards,
--
Joris Gillis (http://www.ticalc.org/cgi-bin/acct-view.cgi?userid=38041)
Veni, vidi, wiki (http://www.wikipedia.org)
