
Re: [XSLT2.0] xsl:analyze-string@regex syntax too limited

2004-12-15 15:25:48
Thanks, Michael, for the "warning".

It doesn't seem to be there yet...

... was held up by the confirmation system they now have in place. Should be
there now.
 
Please note there's no need to comment separately on the two documents. XSLT
will automatically pick up any changes made to the XPath functions.

O.K. too late now. I figured: make more noise so you will be heard :-)

Michael Kay had to add a pretty complex piece of code to his 
Saxon processor just to cripple the available regex syntax which
was previously supported. That's ridiculous.

It's very unlikely that XPath will support the whole of the Java regex
syntax; for example, the POSIX character classes won't get past the I18N
scrutineers.

I never use these character classes; they are unnecessary syntactic
sugar. I always use generic character classes [...] instead and never
saw the point of remembering all those \w \W \s \p{Quark} things.

Also, Java regexes match 16-bit UTF16 values, not Unicode
characters: so given a character outside the BMP, it counts as two
characters in a Java regex but as one character in an XPath regex - a lot of
the regex translation code in Saxon is designed to handle such differences,
not to remove functionality.
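
To make the counting difference concrete, here is a minimal Python sketch.
Python is used only as a neutral illustration: its re module works on code
points, which is assumed here to correspond to how an XPath regex counts
characters. The UTF-16 length shows what an engine working on 16-bit units
would see for the same character.

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a character outside the BMP

# Python strings are indexed by code point, so this is one character.
code_points = len(clef)

# In UTF-16 the same character is a surrogate pair: two 16-bit units.
utf16_units = len(clef.encode("utf-16-be")) // 2

# A code-point based engine lets a single "." consume the whole character.
one_char_match = re.fullmatch(r".", clef) is not None

print(code_points, utf16_units, one_char_match)
```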

O.K. can you actually overcome that? Sounds to me that that's an extension
request that needs to go to Java, because I bet that many of the present
Java XML processing gizmos would fail on Unicode above the BMP range.

Regarding the specifications I see the problem. All I am asking for
is to put back \b, (?:...), (?=...) and (?!...). There seems to be
no formal regex specification (but there isn't a formal specification
for many other things either). So, all that needs to be done is to
add specifications of these 4 elements that are as formal as the current
XPath F&O specification for regex.
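
For readers who don't have the last two constructs in mind, here is a small
sketch of what (?:...) and (?!...) do. It uses Python's re module purely as
a stand-in for an engine that supports them; the sample strings are made up
for illustration.

```python
import re

# (?:...) groups without capturing: only the explicit (c) is reported.
groups = re.match(r"(?:ab)+(c)", "ababc").groups()

# (?!...) negative look-ahead: match "foo" only where "bar" does not follow.
hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(groups, hits)
```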

It doesn't seem too hard to meet that standard though. See the specification
on the reluctant quantifiers. All it really says is "matches the shortest
possible substring consistent with the match as a whole succeeding".
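
As an illustration of that one-sentence definition, a minimal sketch
(Python's re module as a stand-in; the sample text is hypothetical) showing
a greedy and a reluctant quantifier on the same input:

```python
import re

text = "<a><b>"

# Greedy: ".+" takes as much as it can while the whole match still succeeds.
greedy = re.search(r"<.+>", text).group()

# Reluctant: ".+?" takes the shortest substring that lets the match succeed.
reluctant = re.search(r"<.+?>", text).group()

print(greedy, reluctant)
```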

So, for boundary we can just say:

<quote>
Boundary matcher is supported. This is indicated by a "\b". 

The boundary matcher matches a zero-width substring between a character
matching the character class [A-Za-z_0-9] and a character matching the
character class [^A-Za-z_0-9] or vice versa. 
</quote>
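
The wording above can be checked mechanically. The sketch below (Python as
a stand-in; the helper name boundaries is made up for illustration) lists
the positions that the quoted definition calls a boundary and compares them
with Python's own ASCII-only \b at positions inside the text.

```python
import re
import string

# The character class [A-Za-z_0-9] from the proposed wording.
WORD = set(string.ascii_letters + string.digits + "_")

def boundaries(text):
    """Positions between two characters where exactly one is a word character."""
    return [i for i in range(1, len(text))
            if (text[i - 1] in WORD) != (text[i] in WORD)]

text = "ab cd-9"
by_definition = boundaries(text)

# Python's \b in ASCII mode, keeping only positions strictly inside the text.
by_engine = [m.start() for m in re.finditer(r"\b", text, re.ASCII)
             if 0 < m.start() < len(text)]

print(by_definition, by_engine)
```

Note that the wording as quoted only talks about positions between two
characters. Engines usually also treat the start and end of the input as a
boundary when a word character sits there, so the final text would probably
need a sentence for that case.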

This is pretty clear. It may not make the internationalization people 
very happy because I can't do word-boundary matches on Hindi text. That's
a true concern. Again, something that needs to be taken up with the 
Java specification as well.

As a fallback, positive lookahead and look-behind may help that situation. 
So, let's address that:

<quote>
Positive look-ahead is supported. This is indicated by a parenthesis
beginning with "(?=" and ending with the matching ")".

Positive look-ahead matches if the present matching substring M is 
followed by a substring L matching the positive lookahead but without 
L being part of M.
</quote>
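
A minimal sketch of that behaviour, again with Python's re module as a
stand-in and a made-up sample string: the look-ahead has to succeed, but
what it matched is not part of the result.

```python
import re

text = "foo,bar baz,"

# Words that are immediately followed by a comma; the comma itself is
# only looked at, not consumed.
matches = re.findall(r"[A-Za-z_0-9]+(?=,)", text)

# The first match ends before the comma at index 3.
first_end = re.search(r"[A-Za-z_0-9]+(?=,)", text).end()

print(matches, first_end)
```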

That way a \b could be emulated as

"[A-Za-z_0-9]+\b"  -> "[A-Za-z_0-9]+(?=[^A-Za-z_0-9])"

"\b[A-Za-z_0-9]+"  -> "(?<=[^A-Za-z_0-9])[A-Za-z_0-9]+"

and I could now use Devanagari (or Thai for James Clark :-) instead of the 
US ASCII word characters.
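
Here is a sketch of the two emulations, with Python's re module as a
stand-in engine. The Devanagari range U+0900 to U+097F and the sample
words are assumptions chosen for illustration, not a claim about what a
proper word class for Hindi should contain.

```python
import re

ascii_text = "foo bar"

# Real \b (ASCII mode) versus the two emulations given above.
with_b = re.findall(r"[A-Za-z_0-9]+\b", ascii_text, re.ASCII)
with_ahead = re.findall(r"[A-Za-z_0-9]+(?=[^A-Za-z_0-9])", ascii_text)
with_behind = re.findall(r"(?<=[^A-Za-z_0-9])[A-Za-z_0-9]+", ascii_text)

# Same look-ahead idea with a Devanagari block range as the word class.
words = ["नमस्ते", "दुनिया"]
hindi_text = " ".join(words) + "!"
hindi = re.findall("[\u0900-\u097F]+(?=[^\u0900-\u097F])", hindi_text)

print(with_b, with_ahead, with_behind, hindi)
```

One caveat the sketch makes visible: the emulations require an actual
character on the other side, so the look-ahead form misses a word at the
very end of the input and the look-behind form misses one at the very start.
A full replacement for \b would also have to allow start and end of input.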

As far as WG time goes, this could be prepared offline by email before the 
meeting so that it doesn't chew up meeting time.

regards,
-Gunther

-- 
Gunther Schadow, M.D., Ph.D.                  
gschadow(_at_)regenstrief(_dot_)org
Associate Professor           Indiana University School of Informatics
Regenstrief Institute, Inc.      Indiana University School of Medicine
tel:1(317)630-7960                       http://aurora.regenstrief.org
