
Re: [xsl] Katakana substitution regex

2010-08-07 02:34:32
I assume that the text can't contain a sequence of two or more ー
characters. If so, I'd just go ahead and replace every search substring
with the substring + &#12540; and then, in a second call, replace every
&#12540;&#12540; with a single &#12540;.

Sometimes it is simpler not to try to avoid doing something that can
easily be undone.

Below is the stylesheet. The substrings are sorted by descending length,
in case there are substrings similar to 'abcd' and 'bc', where the
suffix must be appended to 'abcd' but not to the 'bc' within it. I don't
know whether your data contains such cases.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:wl="w.l">

<xsl:function name="wl:make-pattern" as="xs:string">
  <xsl:param name="reps" as="xs:string*"/>
  <xsl:variable name="sorted" as="xs:string*">
    <xsl:perform-sort select="$reps" >
      <xsl:sort select="string-length(.)" order="descending"/>
    </xsl:perform-sort>
  </xsl:variable>
  <xsl:sequence select="concat('(',string-join($sorted,'|'),')')"/>
</xsl:function>

<xsl:function name="wl:rep-subs" as="xs:string">
  <xsl:param name="text"    as="xs:string"/>
  <xsl:param name="pattern" as="xs:string"/>
  <xsl:sequence select="replace(replace($text, $pattern,
'$1&#12540;'), '&#12540;&#12540;', '&#12540;')"/>
</xsl:function>

<xsl:variable name="pattern"
              select="wl:make-pattern(('ab', 'abcd', 'cd', 'bc'))"/>

<xsl:template match="/">
   <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text">
  <xsl:copy>
    <xsl:value-of select="wl:rep-subs(text(),$pattern)"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
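
If collapsing every doubled &#12540; in a second call is a concern, a one-pass variant may also be worth trying. This is only a sketch, not tested against real Katakana data: the function name wl:rep-subs-once is made up for illustration, and the pattern is written out literally in the same shape that wl:make-pattern would produce. The idea is to let the regex consume an optional trailing &#12540; together with the search substring, so the replacement always ends in exactly one.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:wl="w.l"
    exclude-result-prefixes="xs wl">

<!-- Hypothetical one-pass variant of wl:rep-subs. An optional trailing
     U+30FC is matched along with the search substring and written back
     exactly once, so no second replace() call is needed. -->
<xsl:function name="wl:rep-subs-once" as="xs:string">
  <xsl:param name="text"    as="xs:string"/>
  <xsl:param name="pattern" as="xs:string"/>
  <xsl:sequence select="replace($text,
                                concat($pattern, '&#12540;?'),
                                '$1&#12540;')"/>
</xsl:function>

<!-- Assumed sample pattern, longest alternative first. -->
<xsl:variable name="pattern" select="'(abcd|ab|cd|bc)'"/>

<!-- string(.) is used instead of text() so that an element with several
     text nodes does not break the xs:string parameter. -->
<xsl:template match="text">
  <xsl:copy>
    <xsl:value-of select="wl:rep-subs-once(string(.), $pattern)"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

Unlike the two-call version, this leaves any &#12540;&#12540; that occurs elsewhere in the text untouched, which may or may not be what is wanted.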


On 6 August 2010 22:14, Hoskins & Gretton 
<hoskgret(_at_)rochester(_dot_)rr(_dot_)com> wrote:

Hi, I have to convert some Katakana strings from "original" to "new" by 
adding &#12540; (&#x30fc;), a pronunciation character (see 
http://www.fileformat.info/info/unicode/char/30fc/index.htm).
In Japanese, there aren't any word boundaries, so essentially all of my 
search strings are substrings of the text of the current element.
When substring "a" is followed by the character &#12540; I do not want to 
make the replacement.

example:        &#12502;&#12521;&#12454;&#12470; is a search string but it is 
followed by &#12540; already -- do nothing

When substring "a" is not followed by the character &#12540; I want to make 
the replacement to create "a" followed by &#12540;.

example:        &#12502;&#12521;&#12454;&#12470; is a search string but it is 
not followed by &#x30fc; already
               add &#x30fc; to the end to make it
               &#12502;&#12521;&#12454;&#12470;&#12540;

When I just wanted to add the &#12540; unconditionally, I was able to do that 
with analyze-string and a regex that contained the strings I wanted to find, 
where $regexSearch contains all of my search Katakana strings:

               <xsl:analyze-string select="." regex="({$regexSearch})">
                   <xsl:matching-substring>
                       <xsl:value-of select="regex-group(1)"/>
                       <xsl:text>&#12540;</xsl:text>
                   </xsl:matching-substring>
                   <xsl:non-matching-substring>
                       <xsl:value-of select="."/>
                   </xsl:non-matching-substring>
               </xsl:analyze-string>
However, I can't figure out how to fit this into an overall XSLT, where 
I need to look ahead in the element text before I decide to make the 
substitution. Currently, if there is a string:
                &#12502;&#12521;&#12454;&#12470;&#12540;
it becomes:     &#12502;&#12521;&#12454;&#12470;&#12540;&#12540; (doubling 
the last character).

If someone has some experience with this type of search and replace problem, 
I would appreciate some guidance.
Regards, Dorothy

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

