David Carlisle wrote:
I'm looking to parse sentences out of paras.
To be more exact, you are trying to parse a sentence with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)
Highly likely.
You need to define a sentence.
I tried with the worst examples in the source text.
So perhaps a sentence is terminated by "." followed by end of string or
whitespace:
([^.]|\.[^ \n\r\t])*\.(\s+|$)
but this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...
If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slower).
<para>Sentence containing Dr. Michael Kay and D.P. Carlisle</para>
<grin/> I'd expect that to break most regexen :-)
<xsl:template match="para">
  <para>
    <xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
      <xsl:matching-substring>
        <s><xsl:value-of select="normalize-space(.)"/></s>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <error><xsl:value-of select="normalize-space(.)"/></error>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </para>
</xsl:template>
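
For anyone who wants to poke at the regex outside a stylesheet, here is a rough
sketch in Python rather than XSLT. It assumes Python's re module treats this
particular pattern the same way the XPath regex flavour does, which I believe
holds here but have not checked against a processor. The function name analyze
is made up for illustration; it only imitates the matching / non-matching split
of xsl:analyze-string, with " ".join(x.split()) standing in for
normalize-space().

```python
import re

# Same pattern as in the template: anything up to a "." that is
# followed by whitespace or the end of the string.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")


def analyze(text):
    """Hypothetical helper imitating xsl:analyze-string.

    Returns a list of (element_name, content) pairs: "s" for each
    match, "error" for any text between or after matches.
    """
    pieces = []
    pos = 0
    for m in SENTENCE.finditer(text):
        if m.start() > pos:
            # Text skipped before this match: the non-matching branch.
            pieces.append(("error", " ".join(text[pos:m.start()].split())))
        pieces.append(("s", " ".join(m.group(0).split())))
        pos = m.end()
    if pos < len(text):
        # Trailing text with no terminating "." also fails to match.
        pieces.append(("error", " ".join(text[pos:].split())))
    return pieces


if __name__ == "__main__":
    for name, content in analyze(
        "Sentence containing Dr. Michael Kay and D.P. Carlisle."
    ):
        print(f"<{name}>{content}</{name}>")
```

With the trailing full stop the sample comes out as three s elements and no
error, split after "Dr." and "D.P.", so an absence of error elements shows that
every piece ended in a ".", not that every piece is a real sentence.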
Thanks David. That's better than my improvement.
No 'error' elements in 12000 lines.
Much appreciated.
regards
--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk