On Mon, Apr 28, 2003 at 07:08:38PM -0400, Paul Hermans
<paul_hermans(_at_)protext(_dot_)be> wrote:
Anyone having an idea how to sort Japanese the Hiragana way versus the
Katakana way ?
Regards,
Paul
Paul Hermans
Pro Text
www.protext.be
phermans(_at_)protext(_dot_)be
Hi, Paul, I don't completely understand your question, but at least
after a first-year Japanese course (which I've done twice -- funny
how much you forget if you don't use it for 7 years), I can give
you some help, and no one else has posted to the list.
First, Hiragana and Katakana are orthogonal orthographies[1]. A word would
be spelled in either one form or the other. Hiragana is used to
spell Japanese words phonetically instead of with so-called Chinese
characters ("Kanji"). Katakana is used to write out words borrowed
from other languages, like san-do-wi-chi.
Both syllabaries follow more or less the same order, following the
ah-ee-oo-eh-oh form horizontally, and the a-ka-sa-ta-na ha-ma-ya-ra-wa-n
order vertically.
Hiragana characters occupy the Unicode block U+3040 - U+309F.
Katakana characters occupy the Unicode block U+30A0 - U+30FF.
So you can see that Hiragana and Katakana characters sort into separate,
non-overlapping ranges: compared by code point, every Hiragana character
comes before every Katakana character. From what I know, within each
block the code points follow dictionary-order sorting.
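To make the ranges concrete, here is a minimal Python sketch (not part of
the XSLT test; the sample characters and the in_block helper are my own
choices for illustration). It compares a few kana by code point, which is
what a plain string sort does when no collation rules are applied.

```python
# Kana written as escapes so the source stays pure ASCII.
hiragana_a = "\u3042"   # HIRAGANA LETTER A
hiragana_n = "\u3093"   # HIRAGANA LETTER N
katakana_a = "\u30a2"   # KATAKANA LETTER A
katakana_ka = "\u30ab"  # KATAKANA LETTER KA


def in_block(ch, first, last):
    """Return True if the character's code point lies in [first, last]."""
    return first <= ord(ch) <= last


# Deliberately shuffled input mixing both syllabaries.
mixed = [katakana_ka, hiragana_n, katakana_a, hiragana_a]

# Python compares strings by code point, so this is a pure code-point sort.
by_code_point = sorted(mixed)

print([f"U+{ord(c):04X}" for c in by_code_point])
```

Both Hiragana samples come out ahead of both Katakana samples, even
though "a" and "ka" would interleave in a folded dictionary order.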
I constructed this simple input, where each item contains a
single kana (the k5 item is a two-character combination). The
attribute indicates which Kana it's from, and where I would expect
it to sort in ascending order within its group. I've used a UTF-8
encoding, which I don't think will cause too many readers problems
these days. You might see something like
<item a='k5'>[a with tilde][~]C[accent acute]
[a with tilde][~][B][upside down !]</item>. That sequence
is the UTF-8 encoding of U+30F4 followed by U+30A1 (together "va",
used only in Katakana).
<?xml version="1.0" encoding="utf-8"?>
<items>
<item a='k5'>ã?´ã?¡</item>
<item a='h3'>ã??</item>
<item a='h4'>ã??</item>
<item a='k2'>ã?</item>
<item a='h1'>ã??</item>
<item a='k4'>ã?¢</item>
<item a='k1'>ã?«</item>
<item a='h2'>ã??</item>
<item a='k3'>ã?¹</item>
</items>
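As an aside on the garbled display described before the sample: the
following Python sketch shows where those odd characters come from. It
assumes the "~C" and "~B" in that description stand for the bytes 0x83
and 0x82, and that the viewer is treating the UTF-8 bytes as Latin-1;
both are my guesses, not something I have checked in any particular mail
reader.

```python
# "va" as written in the k5 item: KATAKANA LETTER VU + KATAKANA LETTER SMALL A.
va = "\u30f4\u30a1"

# Each of the two characters becomes three bytes in UTF-8.
utf8 = va.encode("utf-8")
hex_bytes = " ".join(f"{b:02x}" for b in utf8)
print(hex_bytes)

# A viewer that assumes Latin-1 shows one character per byte instead.
misread = utf8.decode("latin-1")
print([f"U+{ord(c):04X}" for c in misread])

# For comparison, the single character U+30F7 (KATAKANA LETTER VA)
# encodes to a different byte sequence.
single_va_hex = " ".join(f"{b:02x}" for b in "\u30f7".encode("utf-8"))
print(single_va_hex)
```

The misread string starts with a-tilde, then an unprintable control
character, then an acute accent, which matches what a non-UTF-8 viewer
would show for that item.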
Here's the XSLT -- the sort is as simple as it gets:
<?xml version="1.0"?>
<xslt:stylesheet xmlns:xslt="http://www.w3.org/1999/XSL/Transform"
version="1.0" >
<xslt:output indent='yes' method='xml' encoding='utf-8' />
<xslt:template match='items'>
<outitems what='Starting sorting'>
<xslt:apply-templates select='item'>
<xslt:sort select='.'/>
</xslt:apply-templates>
</outitems>
</xslt:template>
<xslt:template match='item'>
<outitem><xslt:attribute name='ord'><xslt:value-of
select='@a'/></xslt:attribute>
<xslt:value-of select='.'/>
</outitem>
</xslt:template>
</xslt:stylesheet>
And the output (from both Xalan and Saxon):
<?xml version="1.0" encoding="utf-8"?>
<outitems what="Starting sorting">
<outitem ord="h1">ã??</outitem>
<outitem ord="h2">ã??</outitem>
<outitem ord="h3">ã??</outitem>
<outitem ord="h4">ã??</outitem>
<outitem ord="k1">ã?«</outitem>
<outitem ord="k2">ã?</outitem>
<outitem ord="k3">ã?¹</outitem>
<outitem ord="k4">ã?¢</outitem>
<outitem ord="k5">ã?´ã?¡</outitem>
</outitems>
However, and if I've been pedantic it might be due to this problem,
the .NET XSLT transform fails to sort the input, and ends up
echoing it in its original order.
I think Japanese dictionary-order sorting folds Hiragana and Katakana
together (at least the Kodansha Busy People books do), but this isn't
what you asked.
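If folding is what you want, here is a rough Python sketch of the idea,
not a real Japanese collation. It relies on the fact that the Katakana
letters U+30A1 - U+30F6 sit exactly 0x60 above their Hiragana
counterparts U+3041 - U+3096. The function name fold_to_hiragana and the
sample list are made up for illustration, and it ignores things like the
long-vowel mark and voiced-sound ordering.

```python
KATAKANA_FIRST = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_LAST = 0x30F6   # KATAKANA LETTER SMALL KE
OFFSET = 0x60            # distance from a Katakana letter to its Hiragana twin


def fold_to_hiragana(text):
    """Map each Katakana letter to the matching Hiragana letter.

    Characters outside U+30A1..U+30F6 are passed through unchanged.
    """
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST else ch
        for ch in text
    )


# ka (Katakana), a (Hiragana), a (Katakana), ka (Hiragana)
words = ["\u30ab", "\u3042", "\u30a2", "\u304b"]

# Sort on the folded key; the original spelling of each word is kept.
folded_order = sorted(words, key=fold_to_hiragana)

print([f"U+{ord(w):04X}" for w in folded_order])
```

With the folded key, both spellings of "a" come before both spellings
of "ka". In XSLT 1.0 a similar effect might be had by passing the item
through translate() in the sort's select expression, with the Katakana
and Hiragana letters listed in matching order, though I haven't tried
that here.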
Hope this helps. If you're spending a lot of time on East Asian
inputs I highly recommend Ken Lunde's CJKV Information Processing
(ISBN 1565922247).
- Eric
------------------------------------------------
Eric Promislow
Visual Studio .NET Plugins Development Lead
EricP(_at_)ActiveState(_dot_)com
--
[1] -- Couldn't resist. This will be a googlewhack one day, and
a lexically ordered one to boot.
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list