Phil,
You are entirely correct that this is an emotional issue. Paradoxically,
people get most upset when what seems "correct" and "common sense"
to them is ignored or defied in favor of some other method whose logic
is obscure. (For whatever reason, it seems people have less patience
with unknowns they believe to be knowable and controvertible than they
do with the apparently unknowable and incontrovertible. And machines are
supposed to be knowable.)
So the first rule is, make it possible to turn off the behavior, and if
other features depend on it, make those dependencies very clear.
On the issue of whitespace in XML, this is one of the most vexing areas,
largely because many people don't know what the rules are -- but (and,
or) do have their own notions of what's right. The rules you enumerate
are a start, except where they blur (such as #3 -- in the XSLT
namespace, for example, the 'text' element is sacrosanct, but in the TEI
namespace it follows rule #1). Trouble will start with the blurry cases
if it hasn't already.
Accordingly, I think the second rule is to be very conservative.
In XSLT, I think this means you can follow the specs regarding where
whitespace is not significant (i.e. it is significant except for
whitespace-only text nodes outside xsl:text).
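
To make the XSLT case concrete, here is a minimal sketch in Python (standard library only) of that conservative stripping rule. The function names are mine, not from any tool, and I have assumed that whitespace under an xml:space="preserve" in scope should also be left alone. Treat it as an illustration, not a reference implementation.

```python
from xml.dom import minidom, Node

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
XML_WHITESPACE = " \t\r\n"  # the four XML whitespace characters


def preserve_in_scope(element):
    """True if the nearest xml:space attribute, walking upward, says 'preserve'."""
    node = element
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        value = node.getAttribute("xml:space")
        if value:
            return value == "preserve"
        node = node.parentNode
    return False


def strip_stylesheet_whitespace(element):
    """Remove whitespace-only text nodes, except inside xsl:text or
    where xml:space="preserve" is in scope (hypothetical helper)."""
    is_xsl_text = element.namespaceURI == XSL_NS and element.localName == "text"
    keep = is_xsl_text or preserve_in_scope(element)
    for child in list(element.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            # Only whitespace-only nodes are candidates; anything else is data.
            if not keep and not child.data.strip(XML_WHITESPACE):
                element.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_stylesheet_whitespace(child)
```

Called on the document element of a parsed stylesheet, this drops the indentation between instructions but leaves the content of every xsl:text untouched.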
In XML (and SGML, including SGML-conformant HTML), I think this means
you can follow a schema -- significant whitespace is anywhere character
data is permitted. Regrettably, this means that all whitespace (outside
tags) is significant when there is no schema. (Whether you can take a
schema to be implicit when it is not given is another problem.)
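
A rough sketch of what "follow a schema" could look like, again in Python. The element_only set is a stand-in I made up for whatever a real schema or DTD would tell you about which elements have element-only content. With no schema the set is empty, and nothing counts as insignificant.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"  # the four XML whitespace characters


def insignificant_whitespace_nodes(element, element_only):
    """Collect whitespace-only text nodes whose parent is declared element-only.

    element_only is a hypothetical set of element names; in practice it
    would be derived from a schema rather than written by hand.
    """
    found = []
    for child in element.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            if element.tagName in element_only and not child.data.strip(XML_WHITESPACE):
                found.append(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            found.extend(insignificant_whitespace_nodes(child, element_only))
    return found


doc = minidom.parseString("<doc>\n  <p>Some <b>bold</b> <i>text</i></p>\n</doc>")

# "Schema" says doc is element-only; p is mixed content and is never touched.
with_schema = insignificant_whitespace_nodes(doc.documentElement, {"doc"})

# No schema: character data could be allowed anywhere, so nothing is safe to drop.
without_schema = insignificant_whitespace_nodes(doc.documentElement, set())
```

Note that the single space between the b and i elements is whitespace-only too, but since p is not declared element-only it stays significant.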
Whether XML (or HTML) fragments embedded in XSLT can be taken to
reference a schema depends, I'm afraid, on the XSLT: it won't always be
true. Conservatively, we might say it's never definitively true except
when a schema is specifically assigned using xsl:import-schema and
xsl:result-document/@validation='strict'. But I suppose an application
might also let a user declare such a binding by other means.
In plain text, I think all bets are off, in the general case. Variants
of plain text that conform to particular specifications may constitute
exceptions, and maybe you could define such a spec for a "smart" plain
text format. But as you say, it wouldn't be perfect.
Finally, I think it's important to distinguish whitespace
handling in tag-formatting applications from the way whitespace may, or
may not, be collapsed, re-flowed or munged for display in a receiving
application. These are two different issues that are frequently
confused. The fact that some tag-formatting applications may (usefully)
reformat whitespace in some places where it is not entirely stripped --
perhaps on the grounds that receiving applications will be doing
likewise, so it doesn't matter -- makes for another set of troublesome
blurry cases.
My $0.02.
Cheers,
Wendell
On 6/7/2011 10:26 AM, Philip Fearon wrote:
This wouldn't ever be perfect, but there's a large set of rules that
could be used to determine formatting-only whitespace. The following
set is a distilled version:
From the XML context:
1. Outside mixed-content
2. Outside where xml:space="preserve" is in scope
3. Outside defined elements such as 'pre' and 'text'
4. If it precedes an attribute name beginning a new line
5. If it precedes an attribute value on a new line
From text context:
4. Where the number of characters found is (within a defined margin)
consistent with the current nesting-level - a pattern can normally
be established
5. Where irregular leading whitespace is found on consecutive lines in
a node value
The approach would be progressive where, on first load a minimal set
of 'obvious' formatting is removed, then further options provided to
the developer that steadily impose a more rigorous filtering
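
The nesting-level rule from the text context might be sketched like this in Python. The function name, the indent_unit of two spaces and the margin of one character are all assumptions for illustration, and tabs are counted as a single character, which a real tool would need to handle more carefully.

```python
XML_WHITESPACE = " \t\r\n"  # the four XML whitespace characters


def looks_like_indentation(text, depth, indent_unit=2, margin=1):
    """Guess whether a text node is pretty-printing indentation.

    depth is the nesting level of the element that follows the text node.
    indent_unit and margin are hypothetical tuning knobs, not fixed values.
    """
    if text.strip(XML_WHITESPACE):
        return False  # contains real character data
    if "\n" not in text:
        return False  # no line break, so it cannot be indentation
    # Only the whitespace after the last line break indents the next tag.
    last_line = text.rsplit("\n", 1)[1]
    expected = depth * indent_unit
    return abs(len(last_line) - expected) <= margin
```

So a line break followed by four spaces at depth 2 would be treated as formatting, while eight spaces at the same depth would not, and would be left alone for the developer to decide.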
Phil Fearon
http://qutoric
--
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================