xsl-list

Re: [xsl] Tokenization - Thai language

2011-06-15 06:22:15
On Wed, June 15, 2011 11:24 am, Jan Pour wrote:
> I would like to tokenize Thai text at every place where it can be
> broken onto a new line.
> How could I do it in XSLT? Using extensions in Java?

My first thought would be to build an extension based on the International
Components for Unicode [1].  See, e.g., the documentation on boundary
analysis [2].
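
As a rough sketch of what the Java side of such an extension might look
like: the code below uses java.text.BreakIterator from the JDK as a
stand-in, since ICU4J's com.ibm.icu.text.BreakIterator offers a very
similar API. The class name LineBreakTokenizer and the method tokenize
are made up for illustration, and the quality of the Thai breaks depends
on the dictionary data available to the runtime, so treat it as a
starting point rather than a finished extension.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the name and shape are assumptions, not an existing API.
final class LineBreakTokenizer {

    /** Splits text at every boundary reported by a line-break iterator. */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = it.first();
        // next() returns BreakIterator.DONE once the end of the text is passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String thai = "ภาษาไทยไม่มีช่องว่างระหว่างคำ";
        // Where the breaks fall depends on the break data shipped with the runtime.
        for (String token : tokenize(thai, Locale.forLanguageTag("th"))) {
            System.out.println("[" + token + "]");
        }
    }
}
```

How a method like this gets exposed to a stylesheet depends on the XSLT
processor's extension mechanism, which is not shown here.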

You wouldn't get very far tokenizing using regular expressions based on
'\w' or '\W' since, as you probably know, Thai ordinarily doesn't have
separators between words.
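
To illustrate the point about '\w' and '\W', here is a small Java sketch
(Java rather than XSLT, so the regex flavour is not identical to the one
XSLT 2.0 uses). The class name RegexSplitDemo and the method
splitOnNonWord are made up for the example. With Unicode-aware character
classes, a run of Thai letters counts as one long "word", so splitting
on '\W+' finds nothing to split on.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: names are assumptions for illustration only.
final class RegexSplitDemo {

    // UNICODE_CHARACTER_CLASS makes \w and \W follow Unicode properties
    // instead of matching ASCII letters, digits and underscore only.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Splits the input on runs of non-word characters. */
    static String[] splitOnNonWord(String text) {
        return NON_WORD.split(text);
    }

    public static void main(String[] args) {
        // English has spaces between words, so the split finds two pieces.
        System.out.println(splitOnNonWord("hello world").length);
        // The Thai letters are all word characters, so there is one piece.
        System.out.println(splitOnNonWord("ภาษาไทย").length);
    }
}
```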

Regards,


Tony Graham                                   tgraham(_at_)mentea(_dot_)net
Consultant                                 http://www.mentea.net
Mentea       13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
 --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
    XML, XSL FO and XSLT consulting, training and programming

[1] http://site.icu-project.org/
[2] http://userguide.icu-project.org/boundaryanalysis

