Re: japanese sorting

On Mon, Apr 28, 2003 at 07:08:38PM -0400, Paul Hermans 
<paul_hermans(_at_)protext(_dot_)be> wrote:


Anyone having an idea how to sort Japanese the Hiragana way versus the
Katakana way ?
Regards,

Paul
Paul Hermans
Pro Text
www.protext.be
phermans(_at_)protext(_dot_)be


Hi, Paul, I don't completely understand your question, but having
taken a first-year Japanese course (twice -- funny how much you
forget if you don't use it for 7 years), I can give you some help,
since no one else has posted to the list.

First, Hiragana and Katakana are orthogonal orthographies[1].  A word would
be spelled in either one form or the other.  Hiragana is used to
spell Japanese words phonetically instead of with so-called Chinese
characters ("Kanji").  Katakana is used to write out words borrowed
from other languages, like san-do-wi-chi.  

Both syllabaries follow more or less the same order, following the
a-i-u-e-o (ah-ee-oo-eh-oh) form horizontally, and the
a-ka-sa-ta-na ha-ma-ya-ra-wa-n order vertically.

Hiragana characters occupy the Unicode block U+3040 - U+309F.

Katakana characters occupy the Unicode block U+30A0 - U+30FF.

So you can see that, in a sort by code point, every Hiragana
character sorts before every Katakana character.  From what I know,
within each block the Unicode tables follow dictionary-order
sorting.
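
To make the block ranges concrete, here is a minimal Python sketch that
classifies a few kana by block and sorts them by code point.  The six
sample characters and the helper name script_of are my own choices for
illustration, not the items used later in this post.

```python
# Block ranges as published in the Unicode code charts.
HIRAGANA_BLOCK = range(0x3040, 0x30A0)   # U+3040..U+309F
KATAKANA_BLOCK = range(0x30A0, 0x3100)   # U+30A0..U+30FF


def script_of(ch):
    """Return which kana block a single character falls in (hypothetical helper)."""
    cp = ord(ch)
    if cp in HIRAGANA_BLOCK:
        return "hiragana"
    if cp in KATAKANA_BLOCK:
        return "katakana"
    return "other"


# Illustrative sample: katakana KA, hiragana SA, katakana A,
# hiragana KA, katakana SA, hiragana A (written as escapes to stay ASCII).
sample = ["\u30ab", "\u3055", "\u30a2", "\u304b", "\u30b5", "\u3042"]

# Python compares strings code point by code point, so this is a plain
# code-point sort: all Hiragana first, then all Katakana.
by_code_point = sorted(sample)

for ch in by_code_point:
    print(f"U+{ord(ch):04X} {script_of(ch)}")
```

Within each script the a, ka, sa order is preserved, which matches the
table order described above.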

I constructed this simple input, where each item contains a
single syllable.  The attribute indicates which Kana it's
from, and how I would expect it to sort in ascending order
in its group.  I've used a UTF-8 encoding, which I don't think
will cause too many readers problems these days.  If your reader
doesn't decode it, you might see something like
<item a='k5'>[a with tilde][~]C[accent acute]
[a with tilde][~][B][upside down !]</item>.  Those six bytes
are the UTF-8 encoding of the two characters U+30F4 U+30A1
(together "va", used only in Katakana).

<?xml version="1.0" encoding="utf-8"?>
<items>
<item a='k5'>ã?´ã?¡</item>
<item a='h3'>ã??</item>
<item a='h4'>ã??</item>
<item a='k2'>ã?</item>
<item a='h1'>ã??</item>
<item a='k4'>ã?¢</item>
<item a='k1'>ã?«</item>
<item a='h2'>ã??</item>
<item a='k3'>ã?¹</item>
</items>
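
The garbled rendering described above can be reproduced with a short
Python sketch.  It assumes the k5 item holds the two characters U+30F4
U+30A1, which is what the six described bytes decode to, and that the
misbehaving reader treats the bytes as Latin-1.  The variable names are
mine.

```python
# Assumed content of the k5 item: KATAKANA LETTER VU + KATAKANA LETTER SMALL A.
va = "\u30f4\u30a1"

# UTF-8 uses three bytes for each of these characters.
raw = va.encode("utf-8")
print(" ".join(f"{b:02x}" for b in raw))

# A reader that assumes Latin-1 shows one character per byte: a with
# tilde, an invisible C1 control, acute accent, a with tilde, another
# control, inverted exclamation mark.
mojibake = raw.decode("latin-1")
print(repr(mojibake))

# The damage is reversible as long as no bytes were dropped on the way.
restored = mojibake.encode("latin-1").decode("utf-8")
```

The two control bytes (0x83 and 0x82) are the ones most likely to be
lost or replaced by "?" when a message passes through an archive.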

Here's the XSLT -- the sort is as simple as it gets:

<?xml version="1.0"?> 
<xslt:stylesheet xmlns:xslt="http://www.w3.org/1999/XSL/Transform"
version="1.0" >

<xslt:output indent='yes' method='xml' encoding='utf-8' />
  
<xslt:template match='items'>
    <outitems what='Starting sorting'>
        <xslt:apply-templates select='item'>
            <xslt:sort select='.'/>
        </xslt:apply-templates>
    </outitems>
</xslt:template>

<xslt:template match='item'>
    <outitem>
        <xslt:attribute name='ord'><xslt:value-of select='@a'/></xslt:attribute>
        <xslt:value-of select='.'/>
    </outitem>
</xslt:template>

</xslt:stylesheet>



And the output (from both Xalan and Saxon):

<?xml version="1.0" encoding="utf-8"?>
<outitems what="Starting sorting">
<outitem ord="h1">ã??</outitem>
<outitem ord="h2">ã??</outitem>
<outitem ord="h3">ã??</outitem>
<outitem ord="h4">ã??</outitem>
<outitem ord="k1">ã?«</outitem>
<outitem ord="k2">ã?</outitem>
<outitem ord="k3">ã?¹</outitem>
<outitem ord="k4">ã?¢</outitem>
<outitem ord="k5">ã?´ã?¡</outitem>
</outitems>
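
As a cross-check that doesn't depend on any XSLT processor, here is a
rough Python equivalent of the stylesheet using only the standard
library.  It sorts by code point, which is consistent with the output
above but is not necessarily how Xalan or Saxon compare strings
internally.  The four kana in SOURCE are stand-ins I picked, and
sort_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Stand-in input (assumed characters, written as numeric references):
# k2 = katakana KA, h2 = hiragana KA, k1 = katakana A, h1 = hiragana A.
SOURCE = """<items>
<item a='k2'>&#x30AB;</item>
<item a='h2'>&#x304B;</item>
<item a='k1'>&#x30A2;</item>
<item a='h1'>&#x3042;</item>
</items>"""


def sort_items(xml_text):
    """Mimic the stylesheet: sort <item> by string value, emit <outitem ord=...>."""
    root = ET.fromstring(xml_text)
    out = ET.Element("outitems", {"what": "Starting sorting"})
    # sorted() on the text content compares code points.
    for item in sorted(root.findall("item"), key=lambda e: e.text or ""):
        child = ET.SubElement(out, "outitem", {"ord": item.get("a")})
        child.text = item.text
    return out


result = sort_items(SOURCE)
order = [e.get("ord") for e in result]
print(order)
```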


However (and if I've been pedantic, this problem may be why),
the .NET XSLT transform fails to sort the input, and ends up
echoing it in its original order.

I think Japanese dictionary-order sorting folds Hiragana and
Katakana together (at least the Kodansha Busy People books do),
but this isn't what you asked.
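
If folding is what's wanted, one rough way to get it is to map Katakana
onto Hiragana before comparing.  The sketch below is in Python and
relies on the fact that Katakana U+30A1 - U+30F6 sit exactly 0x60 above
their Hiragana counterparts.  The function name and the sample
characters are my own, and real Japanese collation rules are more
involved than this.

```python
def fold_to_hiragana(text):
    """Map Katakana letters that have a Hiragana twin onto that twin."""
    out = []
    for ch in text:
        cp = ord(ch)
        # U+30A1..U+30F6 correspond to U+3041..U+3096, offset 0x60.
        if 0x30A1 <= cp <= 0x30F6:
            cp -= 0x60
        out.append(chr(cp))
    return "".join(out)


# Illustrative sample: katakana KA, hiragana A, katakana SA,
# hiragana KA, katakana A, hiragana SA.
sample = ["\u30ab", "\u3042", "\u30b5", "\u304b", "\u30a2", "\u3055"]

# Sort on the folded form first; the original string breaks ties, so
# the Hiragana spelling comes just before its Katakana twin.
folded_order = sorted(sample, key=lambda w: (fold_to_hiragana(w), w))

print(" ".join(f"U+{ord(ch):04X}" for ch in folded_order))
```

With this key the two scripts interleave (a, A, ka, KA, sa, SA) instead
of forming two separate runs.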

Hope this helps.  If you're spending a lot of time on East Asian
inputs I highly recommend Ken Lunde's "CJKV Information Processing"
(ISBN 1565922247).

- Eric

------------------------------------------------
Eric Promislow
Visual Studio .NET Plugins Development Lead
EricP(_at_)ActiveState(_dot_)com
--

[1] -- Couldn't resist.  This will be a googlewhack one day, and
a lexically ordered one taboot.

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list