Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

The text below basically argues that Unicode (and UTF-8) is big enough for the foreseeable future, so that no change to the UTF-8 specification will be necessary.


If you don't care about Unicode, just stop reading here.

Bruce Lilly writes:

Arnt Gulbrandsen wrote:

Why would you expect Unicode to change substantively?


The 3.0->3.1 experience. A.k.a. "once burned, twice shy".

That change was basically an admission that 64k wasn't enough. It is still possible that some bigger number is enough. The Unicode Consortium believes that 17*64k is enough, and I agree.
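
For readers who want the 17*64k figure spelled out, here is a rough Python sketch of the arithmetic. It is an illustration added for clarity, not part of the original exchange; the variable names are made up, and the 21-bit payload of a four-byte UTF-8 sequence is taken from the UTF-8 definition rather than from this thread.

```python
# 17 planes of 64k code points each, as mentioned above.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 64k = 65,536

code_space = PLANES * CODE_POINTS_PER_PLANE
max_code_point = code_space - 1

# A four-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload bits,
# so it can represent more values than the 17 planes need.
four_byte_capacity = 1 << 21

print(code_space)                                   # 1114112
print(hex(max_code_point))                          # 0x10ffff
print(four_byte_capacity > code_space)              # True
print(len(chr(max_code_point).encode("utf-8")))     # 4
```

So the highest code point in the 17 planes still fits in a four-byte UTF-8 sequence, which is the sense in which no change to UTF-8 is needed as long as 17*64k holds.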

The number of characters used for human communication doesn't seem to be rising much, and there's plenty of space left in the current specification. IIRC Unicode still uses less than 200,000 of the million-odd possible code points.

Famous last words. From my handy dead-tree copy of Unicode 2.0, page 2-4, under the "Full Encoding" heading:
"There are over 18,000 unassigned code positions that are available for future allocation. This number far exceeds anticipated character encoding requirements for all world characters and symbols."

Yep. I have that too. The fact that 18,000 isn't enough doesn't mean that about a million isn't enough.
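
To put the two headroom figures side by side, a small Python sketch follows. It is only a back-of-the-envelope comparison added for illustration: the 200,000 figure is the "IIRC" upper bound quoted earlier in this thread, not an exact count, and the constant names are hypothetical.

```python
# Total code space: 17 planes of 64k code points.
CODE_SPACE = 17 * 0x10000

# Assumption: upper bound on code points in use, as recalled in the thread.
ASSIGNED_UPPER_BOUND = 200_000

# Unassigned positions cited from Unicode 2.0 ("over 18,000").
UNICODE_2_0_UNASSIGNED = 18_000

remaining = CODE_SPACE - ASSIGNED_UPPER_BOUND
fraction_used = ASSIGNED_UPPER_BOUND / CODE_SPACE
headroom_ratio = remaining / UNICODE_2_0_UNASSIGNED

print(remaining)                   # 914112
print(f"{fraction_used:.1%}")      # 18.0%
print(f"{headroom_ratio:.0f}x")    # 51x
```

On those assumed numbers, the free space today is roughly fifty times the 18,000 positions that Unicode 2.0 had left, which says nothing about whether it will be enough, only that the two situations differ in scale.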

Cough, cough. It is nearly a universal truth that things tend to expand to fill the available space (and/or time). Why do you (apparently) think that Unicode is exempt?

I don't. I do think that people's ability/willingness to learn characters is a (much) stricter limitation than the number of available code points in Unicode.

Some people will invent new scripts for some languages, but I doubt _many_ characters will be added in this way. The costs of teaching kids big alphabets are too high, for a start.

Some people will take books which mix a "dance notation" font with English, write up a proposal adding those characters, and submit it. Or the chess notation used in the newspaper's chess column. That doesn't add up to much either - the number of characters added in that way is limited to what the font vendors and publishers use, and what the audience(s) will learn.

I suppose you could argue that Unicode adds alphabets. But do you think Unicode still hasn't reached the 20% mark?

They add more than "alphabets", and that's part of the problem. Again quoting Unicode 2.0 (page 1-3 this time):
"Graphologies unrelated to text, such as musical and dance notations, are outside the scope of the Unicode Standard."

"Unrelated to text". If something like that is used in books intermingled with English text, it's hard to say that it's unrelated to English text.


--Arnt