On Wed, June 15, 2011 11:24 am, Jan Pour wrote:
> I would like to tokenize Thai text on all places, where it can be
> broken to new line.
> How could I do it in XSLT? Using extensions in java??
My first thought would be to build a Java extension function based on the
International Components for Unicode (ICU) [1]. See, e.g., the
documentation on boundary analysis [2], which covers finding line-break
positions.
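As a rough sketch of what such an extension function might wrap, the code
below walks the line-break boundaries of a string and returns the pieces
between them. To stay self-contained it uses the JDK's own
java.text.BreakIterator rather than ICU; that is a stand-in, not the ICU
library itself. ICU4J's com.ibm.icu.text.BreakIterator offers the same
getLineInstance/setText/first/next calls, so the loop should carry over
with only the import changed. The class and method names are made up for
illustration, and whether the JDK version actually finds break positions
inside Thai text depends on the locale data shipped with your runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the class and method names are illustrative only.
final class LineBreakTokenizer {

    // Split text at every position where the break iterator allows a line break.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample Thai sentence with no spaces between words.
        String thai = "ภาษาไทยไม่มีช่องว่าง";
        for (String token : tokenize(thai, Locale.forLanguageTag("th"))) {
            System.out.println("[" + token + "]");
        }
    }
}
```

How you bind a static method like this into a stylesheet depends on your
XSLT processor, so check its documentation on Java extension functions.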
You wouldn't get very far tokenizing using regular expressions based on
'\w' or '\W' since, as you probably know, Thai ordinarily doesn't have
separators between words.
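To illustrate the point, here is a small Java sketch (names are again
made up) that splits on runs of non-word characters with Unicode-aware
character classes. Thai letters and their vowel and tone marks all count
as word characters, so a Thai sentence without spaces comes back as a
single token, while a space-separated English one is split as expected.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the class and method names are illustrative only.
final class RegexSplitDemo {

    // Split on runs of non-word characters, with Unicode-aware \w and \W.
    static String[] splitOnNonWord(String text) {
        return Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS).split(text);
    }

    public static void main(String[] args) {
        // Several Thai words, but no separators for the regex to find.
        System.out.println(splitOnNonWord("ภาษาไทยไม่มีช่องว่าง").length);
        // Space-separated text splits at each space.
        System.out.println(splitOnNonWord("Thai has no spaces").length);
    }
}
```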
Regards,
Tony Graham tgraham(_at_)mentea(_dot_)net
Consultant http://www.mentea.net
Mentea 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
XML, XSL FO and XSLT consulting, training and programming
[1] http://site.icu-project.org/
[2] http://userguide.icu-project.org/boundaryanalysis