
RE: [XSLT2.0] xsl:analyze-string@regex syntax too limited

2004-12-15 12:48:21
> Hi, just FYI, I have made a petition to the XSLT and XPath 2.0
> public comments list to remove most of the artificial restrictions
> on the regex syntax in the match and replace functions and the
> analyze-string instruction.

It doesn't seem to be there yet...

Please note there's no need to comment separately on the two documents. XSLT
will automatically pick up any changes made to the XPath functions.

> Michael Kay had to add a pretty complex piece of code to his
> Saxon processor just to cripple the available regex syntax which
> was previously supported. That's ridiculous.


It's very unlikely that XPath will support the whole of the Java regex
syntax; for example, the POSIX character classes won't get past the I18N
scrutineers. Also, Java regexes match 16-bit UTF-16 values, not Unicode
characters: so given a character outside the BMP, it counts as two
characters in a Java regex but as one character in an XPath regex - a lot of
the regex translation code in Saxon is designed to handle such differences,
not to remove functionality. So any changes to the XPath syntax won't remove
the need for the regex translator. (The translator, incidentally, was
written by James Clark to implement the XML Schema regex syntax, and I
extended it to handle the XPath extensions.)
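
To make the code-unit point concrete, here is a minimal sketch. It uses Python's re module purely as a stand-in, not Java and not the Saxon translator. The helper name as_utf16_units is hypothetical: it rebuilds a string as one character per 16-bit UTF-16 unit, which is an assumption about how a 16-bit regex engine would see the input.

```python
import re
import struct


def as_utf16_units(s: str) -> str:
    """Return a string holding one character per 16-bit UTF-16 code unit of s."""
    data = s.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return "".join(chr(u) for u in units)


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = "\U0001D11E"
units = as_utf16_units(clef)

# One Unicode character, but two 16-bit values (a surrogate pair).
print(len(clef), len(units))  # prints: 1 2

# Matching by character: "." consumes the whole clef.
print(re.fullmatch(".", clef) is not None)    # prints: True

# Matching by 16-bit unit: one "." is not enough, two are needed.
print(re.fullmatch(".", units) is not None)   # prints: False
print(re.fullmatch("..", units) is not None)  # prints: True
```

The same pattern therefore means different things depending on whether the engine counts characters or 16-bit units, which is the kind of difference a translation layer has to paper over.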

As I've commented elsewhere, one of the main difficulties in "adding back"
further Perl regex features is the need to write an unambiguous
specification that is consistent with existing implementations. Writing a
spec that turns out to be inconsistent with existing implementations would
obviously be a disaster. This always turns out to be more difficult than you
think. To take just one example that you want to add, in Perl:

"A word boundary (`\b') is a spot between two characters
that has a `\w' on one side of it and a `\W' on the other
side of it (in either order), counting the imaginary
characters off the beginning and end of the string as matching
a `\W'. (Within character classes `\b' represents
backspace rather than a word boundary, just as it normally
does in any double-quoted string.)"

Firstly, that's too informal for the WGs to accept it as written (what is a
"spot"? what is an "imaginary character"?). Secondly, Perl classifies \b as a
"zero-width assertion" but it doesn't say clearly where in the overall
scheme of things a zero-width assertion can appear. Thirdly, the exception
doesn't apply, because backspace isn't a legal XML character. So getting an
agreed spec just for \b could easily take an hour of WG time, and the WG is
getting pretty impatient about proposals that consume time unless there is a
problem that absolutely must be solved.
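
As an illustration of what a more formal reading of the quoted \b definition might look like, here is a minimal sketch. It again uses Python's re module as a stand-in for Perl, on the assumption that its \b follows the same informal rule. The function names engine_boundaries and manual_boundaries are hypothetical, and padding the string with a space to model the "imaginary characters" is one possible interpretation, not agreed spec wording.

```python
import re


def engine_boundaries(text: str) -> list:
    """Offsets at which the regex engine reports a word boundary."""
    return [m.start() for m in re.finditer(r"\b", text)]


def is_word(ch: str) -> bool:
    """True if the single character ch matches the \\w class."""
    return re.fullmatch(r"\w", ch) is not None


def manual_boundaries(text: str) -> list:
    """Offsets with a word character on exactly one side.

    The ends of the string are modelled by padding with a space,
    i.e. an "imaginary" character that matches \\W.
    """
    padded = " " + text + " "
    return [
        i
        for i in range(len(text) + 1)
        if is_word(padded[i]) != is_word(padded[i + 1])
    ]


print(engine_boundaries("ab cd"))  # prints: [0, 2, 3, 5]
print(manual_boundaries("ab cd"))  # prints: [0, 2, 3, 5]

# Inside a character class, \b means backspace (U+0008) instead.
print(re.fullmatch(r"[\b]", "\x08") is not None)  # prints: True
```

Even this small sketch has to settle questions the prose leaves open, such as what counts as a position and how the ends of the string behave.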

Just warning you...

Michael Kay
http://www.saxonica.com/

