[no subject]

this is just a quick note to correct some errors by Ohta-san.

Untrue. 16 bit or 32 bit are not implementation details.

With 16 bit wchar_t, you can write

      array[(unsigned)char_code]

and your program should work on most modern machines.

With 32 bit wchar_t, it is often impossible to write:

      array[(unsigned)char_code]

because hardly any machine has >4GB virtual memory.


        first up, C arrays are only guaranteed to work if they're <32KB.
secondly, the type doesn't matter (short or int or long); just the value.
thirdly, stepping back a little and not being Unix/C biased, many if not
most of the platforms being addressed by this group have memory references
well in excess of 16 bits (4-16MB is more the norm). so unless 10646 goes
hog wild and allocates many more planes than the one they have, we can still
index by our character type. it is a separate issue that this is often
not the best technique for large character sets.
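
        to make the second point concrete, here is a minimal sketch, assuming
a hosted platform that allows a 64KB object (more than C guarantees). the
names TABLE_SIZE, char_class, set_class and get_class are made up for
illustration. the index is deliberately typed unsigned long; what keeps the
lookup safe is the range of the value, not the width of the type.

```c
/* Hypothetical names throughout; this is a sketch, not anyone's real code. */
#define TABLE_SIZE 65536UL              /* one entry per 16-bit code value */

/* 64KB table: larger than the 32767-byte object size C guarantees,
 * so this assumes a platform that permits it. */
static unsigned char char_class[TABLE_SIZE];

/* Store a class for a code value. Returns 0 on success, -1 if the
 * value falls outside the table, whatever type the caller used. */
int set_class(unsigned long code, unsigned char cls)
{
    if (code >= TABLE_SIZE)
        return -1;
    char_class[code] = cls;
    return 0;
}

/* Look up the class for a code value. Returns -1 if out of range. */
int get_class(unsigned long code)
{
    if (code >= TABLE_SIZE)
        return -1;
    return char_class[code];
}
```

a caller holding the code in an unsigned short, an int or a long gets the
same entry, as long as the value is below 65536.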

They changed from the old 10646-appendix UTF to UTF-2 in an afternoon:


They both use 16 bits.


        this is more interesting. the assertion is that it would take much
longer to convert if the type of Rune were to change to a signed long from
an unsigned short. I checked on this; the answer is it would take about the
same time (a slight concern is that Rune would change from unsigned to signed
which may screw up shifts). However, we would resist doing it for both
resource reasons (memory structures for text doubling in size) and for
performance reasons (same reason; memcpy for twice as many bytes).
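
        a rough sketch of both concerns, assuming 8-bit bytes. Rune16 and
Rune32 are hypothetical stand-ins (fixed-width types rather than the actual
unsigned short and signed long), and the function names are invented for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Rune16;   /* stand-in for the current unsigned 16-bit Rune */
typedef int32_t  Rune32;   /* stand-in for a signed 32-bit Rune */

/* Bytes needed to hold (or memcpy) nrunes characters of each width. */
size_t text_bytes16(size_t nrunes) { return nrunes * sizeof(Rune16); }
size_t text_bytes32(size_t nrunes) { return nrunes * sizeof(Rune32); }

/* Extract bits 8..15 of a signed Rune. Right-shifting a negative signed
 * value is implementation-defined in C, so convert to unsigned first;
 * the result is then the same whether or not r is negative. */
unsigned high_byte(Rune32 r)
{
    return (unsigned)(((uint32_t)r >> 8) & 0xFFu);
}
```

the same text takes exactly twice the bytes in the wider type, and any shift
written against the unsigned type needs a cast like the one above once the
type becomes signed.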

        I think Henry is still right; going to 32 bits is straightforward to
implement but will have system performance/resource effects. of course,
this has precious little to do with e-mailer science.

                        andrew hume