I'm looking to parse sentences out of paras.
To be more exact, you are trying to parse sentences with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)
"((.+).)
"." is a metacharacter matching any character, so that regex is a sequence
of one or more characters followed by a character, i.e. it matches any
sequence of 2 or more characters.
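A quick way to check that reading is a small sketch in Python's re module
(an assumption made purely for illustration; the thread itself is about
XSLT 2.0 regexes, but "." and "+" behave the same way here). The helper
name matches_whole is hypothetical.

```python
import re

# The "(.+)." part of the quoted pattern: one or more characters,
# followed by one more character.
pattern = re.compile(r"(.+).")

def matches_whole(text):
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None

print(matches_whole("a"))         # one character is too short
print(matches_whole("ab"))        # any two characters will do
print(matches_whole("xyz. abc"))  # so will anything longer, "." or not
```

Nothing in the pattern says anything about where a sentence ends, which is
why it cannot be used to split a paragraph.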
You need to define a sentence. If a sentence cannot contain a ".", but
always ends with a ".", then you could use [^.]*\.
but then
it cost $2.00.
is two sentences.
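The split can be reproduced with a short sketch, again using Python's re
module as a stand-in for the XSLT regex engine (an assumption; the
variable names are illustrative only):

```python
import re

# Naive definition: any run of non-"." characters, then a ".".
naive = re.compile(r"[^.]*\.")

pieces = naive.findall("it cost $2.00.")
print(pieces)  # ['it cost $2.', '00.']
```

The decimal point is treated as a terminator, so the price is cut in half.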
So perhaps a sentence is terminated by a "." followed by whitespace or
the end of the string:
([^.]|\.[^ \n\r\t])*\.(\s+|$)
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="para">
new para
<xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
<xsl:matching-substring>
sentence: <xsl:value-of select="normalize-space(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
oops: <xsl:value-of select="normalize-space(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
saxon9 para.xml para.xsl
new para
sentence: It is sometimes desired to have a specific heading which should not
be numbered.
sentence: This corresponds to unnumbered list headers in lists (see sections
4.3).
sentence: To facilitate this, an optional attribute text:is-list-header can be
used.
sentence: If true, the given header will not be numbered, even if an explicit
list-style is given.
new para
sentence: A text:style-name attribute references a paragraph style, while a
text:cond-style-name attribute references a conditional-style, that is, a style
that contains conditions and maps to other styles (see section 14.1.1).
sentence: If a conditional style is applied to a paragraph, the
text:style-name attribute contains the name of the style that was the result of
the conditional style evaluation, while the conditional style name itself is
the value of the text:cond-style-name attribute.
sentence: This XML structure simplifies [XSLT] transformations because XSLT
only has to acknowledge the conditional style if the formatting attributes are
relevant.
sentence: The referenced style can be a common style or an automatic style.
new para
sentence: A text:class-names attribute takes a whitespace separated list of
paragraph style names.
sentence: The referenced styles are applied in the order they are contained in
the list.
sentence: If both, text:style-name and text:class-names are present, the style
referenced by the text:style-name attribute is as the first style in the list
in text:class-names.
sentence: If a conditional style is specified together with a
style:class-names attribute, but without the text:style-name attribute, then
the first style in the style list is used as the value of the missing
text:style-name attribute.
new para
sentence: A page sequence element <text:page-sequence> specifies a sequence of
master pages that are instantiated in exactly the same order as they are
referenced in the page sequence.
sentence: If a text document contains a page sequence, it will consist of
exactly as many pages as specified.
sentence: Documents with page sequences do not have a main text flow
consisting of headings and paragraphs as is the case for documents that do not
contain a page sequence.
sentence: Text content is included within text boxes for documents with page
sequences.
sentence: The only other content that is permitted are drawing objects.
But this would of course still fail if a sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...
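To see both the improvement and the remaining failure, here is a sketch
that runs the regex from the stylesheet through Python's re module (an
assumption for illustration; Python's "$" and "\s" are close enough to the
XSLT ones for these inputs). The function name split_sentences is
hypothetical.

```python
import re

# Same regex as in the stylesheet: a "." only ends a sentence when it is
# followed by whitespace or the end of the string.
sentence = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

def split_sentences(text):
    """Return each matched sentence with surrounding whitespace removed."""
    return [m.group(0).strip() for m in sentence.finditer(text)]

print(split_sentences("It cost $2.00. It was cheap."))
# ['It cost $2.00.', 'It was cheap.']

print(split_sentences("This was written by D. P. Carlisle."))
# ['This was written by D.', 'P.', 'Carlisle.']
```

The price survives, but the initials are each reported as a sentence.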
If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slow).
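As one illustration of that trade-off, the sketch below extends the
stylesheet's regex so that a "." after a single capital letter or after a
known abbreviation does not end the sentence. It is written for Python's
re module, and the abbreviation list, the variable names and the function
split_sentences are all hypothetical choices made for this example, not
something from the original post.

```python
import re

# Hypothetical allowlist; a real one would be much longer.
ABBREVIATIONS = ["dr", "Dr", "Mr", "Mrs", "Prof"]
abbrev = "|".join(ABBREVIATIONS)

pattern = (
    r"(?:"
    + r"\b(?:[A-Z]|" + abbrev + r")\.\s+"  # initial or abbreviation, then ". "
    + r"|[^.]"                             # any character that is not "."
    + r"|\.[^ \n\r\t]"                     # "." followed by non-whitespace
    + r")*"
    + r"\.(?:\s+|$)"                       # terminating "."
)
sentence = re.compile(pattern)

def split_sentences(text):
    """Return each matched sentence with surrounding whitespace removed."""
    return [m.group(0).strip() for m in sentence.finditer(text)]

print(split_sentences("This was written by D. P. Carlisle."))
# ['This was written by D. P. Carlisle.']

print(split_sentences("Plan B. Then more."))
# ['Plan B. Then more.']
```

The initials are now handled, but the second call shows the price: a
sentence that genuinely ends in a single capital letter is merged with the
one after it. Each such patch covers a larger subset and introduces a new
way to be wrong.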
David