perl-unicode

RE: DBD:mysql and UNICODE

2001-08-05 22:31:38
On Thu, 2 Aug 2001, Vuillemot, Ward W wrote:

Date: Thu, 2 Aug 2001 07:58:28 -0700 
From: "Vuillemot, Ward W" 
<Ward(_dot_)Vuillemot(_at_)PSS(_dot_)Boeing(_dot_)com>
To: 'Andrew McNaughton' <andrew(_at_)squiz(_dot_)co(_dot_)nz>,
    "'perl-unicode(_at_)perl(_dot_)org'" <perl-unicode(_at_)perl(_dot_)org>
Subject: RE: DBD:mysql and UNICODE

Just so I understand. . .and I think I understood UNICODE BEFORE I started 
reading all the literature that seemed to confuse the matter. :)

UNICODE is a character encoding ...

Wrong.  Unicode is not a character encoding.  There are several different
character encodings which are used to encode Unicode, notably UTF-8,
UTF-16 and perl's own UTF-8-like encoding.

Unicode is firstly a character set, relating characters to numerical codes
(code points), and it also embodies many formal rules for problems like
sorting text, text flow order, combining characters and so forth.

Unicode defines the meaning of a sequence of numbers representing the
text.  The character encoding defines how those numbers are represented as
a sequence of bytes.
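
To make the distinction concrete, here is a minimal sketch, assuming a
perl with the Encode module available (it is a core module from 5.8
onwards).  The helper name hex_bytes is made up for this example.  One
character, with one Unicode number, comes out as different bytes
depending on the encoding chosen:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E9 (LATIN SMALL LETTER E WITH ACUTE), as a Perl string.
my $char = "\x{e9}";

# Hypothetical helper: show the bytes of an encoded string as hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Same character number, three different byte representations.
my $utf8    = hex_bytes(encode('UTF-8',      $char));
my $utf16be = hex_bytes(encode('UTF-16BE',   $char));
my $latin1  = hex_bytes(encode('iso-8859-1', $char));

print "UTF-8:      $utf8\n";
print "UTF-16BE:   $utf16be\n";
print "ISO-8859-1: $latin1\n";
```

The character number never changes; only the byte sequence does.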

... that can handle any character irrespective of language.
When I output to the web I will need to convert UNICODE to some appropriate 
character-set based upon the language selection. 

You first need to handle any characters in the Unicode text which cannot
be represented in your output character encoding.  Typically you would
either replace such characters with some sort of placeholder character,
or fall back to something you can represent.  Depending on the software
you are using, the text might be in any of a number of representations at
this stage, with UTF-8 or perl's approximation to it being the most
likely, but some CPAN code uses UTF-16.  Conversion to the target
encoding is a separate but related step.
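
As a rough sketch of the two options, again assuming the Encode module
(the sample text and variable names are invented for illustration), the
placeholder and the fallback approaches might look like this when the
output encoding is iso-8859-1:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "cafe" with e-acute, a space, then two kanji (U+65E5 U+672C).
# The kanji cannot be represented in iso-8859-1.
my $text = "caf\x{e9} \x{65e5}\x{672c}";

# Option 1: placeholder.  With no CHECK argument, Encode substitutes
# the encoding's substitution character ("?" for iso-8859-1).
my $with_placeholder = encode('iso-8859-1', $text);

# Option 2: fall back to something representable.  Here that is an
# HTML numeric character reference, which a browser can still render.
my $with_fallback = encode('iso-8859-1', $text, Encode::FB_HTMLCREF);

print "$with_placeholder\n$with_fallback\n";
```

Which option is right depends on the output: character references only
make sense if the result is going to be read as HTML.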
 
Is this correct?  Or can this be done automatically. . .or at least, can I 
just avoid it and send the UNICODE data directly to a web-browser and let the 
browser do whatever is necessary?  As I intend to develop a system that can 
handle an arbitrary number of languages, I want to let the code handle any 
language without me necessarily having to add more and more code to support 
it -- I would love it if I could just choose one flavor -- UNICODE -- and 
have that be it.  But hey, I know I do not live in an ideal world. . .  ;)

Take a look in CPAN.

perl -MCPAN -e 'i /Unicode::/'


Andrew McNaughton



I do appreciate your help.

Thanks,
Ward

-----Original Message-----
From: Andrew McNaughton [mailto:andrew(_at_)aniwa(_dot_)wallace(_dot_)lan]
Sent: Wednesday, August 01, 2001 9:27 PM
To: Vuillemot, Ward W
Subject: Re: DBD:mysql and UNICODE




On Wed, 1 Aug 2001, Vuillemot, Ward W wrote:

Date: Wed, 1 Aug 2001 15:57:16 -0700
From: "Vuillemot, Ward W" 
<Ward(_dot_)Vuillemot(_at_)PSS(_dot_)Boeing(_dot_)com>
To: "'perl-unicode(_at_)perl(_dot_)org'" <perl-unicode(_at_)perl(_dot_)org>
Subject: DBD:mysql and UNICODE

I am looking to develop a set of databases that can handle
international character sets.  For example, I want to have menu items
that can be changed on the fly from, say, English to Japanese to
German to Chinese.

Should I create a table that correlates each language with a UNICODE
set?  And then create a table where each row is for a specific
language and the columns being the individual entries?  After that,
can I use a lookup into the first table based on the key of the second
table to determine what type of UNICODE character-set it is?  (sorry,
I am typing out loud as it were ;) ).

Your character set in the database *is* Unicode.  There's only one Unicode
character set.  All other common to medium-rare character sets are subsets
of that one big set.  Keep things simple and store nothing in your
database that's not in Unicode.

You could store your strings as you say, but I'd be inclined to have every
string in its own row, and have a column which identifies the language.
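
A minimal sketch of that layout, using a Perl array as a stand-in for
the table.  The column names (msg_key, lang, text), the sample rows and
the fall-back-to-English rule are all assumptions made for illustration,
not anything DBD::mysql requires:

```perl
use strict;
use warnings;

# Stand-in for a table: one row per string, with a language column.
my @strings = (
    { msg_key => 'menu.file', lang => 'en', text => 'File' },
    { msg_key => 'menu.file', lang => 'de', text => 'Datei' },
    { msg_key => 'menu.file', lang => 'ja',
      text    => "\x{30d5}\x{30a1}\x{30a4}\x{30eb}" },
    { msg_key => 'menu.open', lang => 'en', text => 'Open' },
);

# Look up one string by key and language; if there is no row for
# that language, try English instead (an assumed policy).
sub lookup {
    my ($key, $lang) = @_;
    for my $want ($lang, 'en') {
        for my $row (@strings) {
            return $row->{text}
                if $row->{msg_key} eq $key && $row->{lang} eq $want;
        }
    }
    return;
}

print lookup('menu.file', 'de'), "\n";
print lookup('menu.open', 'de'), "\n";
```

Adding a language then means adding rows, not columns or tables.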

For a given language (e.g. English), there might be multiple possible
character encodings (e.g. iso-8859-1, cp1252, UTF-8), and you might choose
to support more than one in your web output.  You might store
language/character encoding combinations in your database, but character
encoding and character set are not to be confused.
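
One way this could look, sketched with the Encode module.  The hash
%encodings_for, the function output_encoding and the UTF-8 default are
hypothetical; only the English list of encodings comes from the
paragraph above:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical language => offered output encodings combinations.
my %encodings_for = (
    en => [ 'iso-8859-1', 'cp1252', 'UTF-8' ],
    ja => [ 'UTF-8' ],
);

# Use the requested encoding if it is offered for the language,
# otherwise default to UTF-8 (an assumed policy).
sub output_encoding {
    my ($lang, $requested) = @_;
    for my $name (@{ $encodings_for{$lang} || [] }) {
        return $name if lc $name eq lc $requested;
    }
    return 'UTF-8';
}

# The stored text is always Unicode; only the output bytes differ.
my $text  = "\x{20ac}5";    # EURO SIGN followed by "5"
my $bytes = encode(output_encoding('en', 'cp1252'), $text);

print output_encoding('ja', 'cp1252'), "\n";
```

The stored string stays the same whichever encoding is picked for output.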



--
Andrew McNaughton
Scoop Media Ltd
andrew(_at_)scoop(_dot_)co(_dot_)nz

"Every year the international financial system kills more people than the
second world war. But at least Hitler was mad ... "
        -- Ken Livingstone 
