Peter N Lewis wrote:
There may not be any point in supporting Unicode any further. From the trend I've seen over the last decade, UTF-8 appears to be winning the battle: it is compact in most normal use, 8-bit, and yet supports the full Unicode range. UTF-8 therefore allows the compiler to continue to ignore the entire issue, except perhaps for adding a few support routines (e.g., LengthInCharacters) and/or enhancing the RTS routines to support UTF-8 (not much is really needed).
I don't agree. It's not only Length. (Length is defined by the standard and, AFAIK, by all known dialects, so saying "use another Length function to get the real length" seems more than a kludge, i.e. Length itself must count characters.) I mean, this is Pascal, not C. In Pascal, `Char' is a character, strings are made out of characters, and the predefined string operations operate on strings. Unlike C, where `char' is just some integer (sic!) type, `char *' is some pointer type, everything else is library routines, and the user is told to just use another set of types and routines for Unicode chars/strings.
And it also affects all functions that search for characters, take substrings etc. I agree with what Scott Moore wrote in that NG discussion: For internal usage, a type with fixed character width seems far preferable. (Of course, if we provide the option of building with 8 bit chars, like now, nobody will stop you from (mis-)using them as UTF-8 bytes, but then you're on your own if Length, SubStr/Copy, Index/Pos etc. behave strangely.)
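To make "behave strangely" concrete, here is a minimal sketch, written in Python only for brevity (it is not GPC code, and the helper names are made up for this illustration). It treats a UTF-8 encoded string as a sequence of 8-bit units and compares the results with the same operations on whole characters:

```python
# Illustration only: the helper names below are hypothetical.
# The sample text has 10 characters, two of which are non-ASCII
# (written as escapes so the source itself stays plain ASCII).
text = "na\u00efve caf\u00e9"

def length_in_bytes(s):
    """Length as seen by a routine that counts 8-bit units of UTF-8."""
    return len(s.encode("utf-8"))

def index_in_bytes(s, ch):
    """0-based position of ch when searching the UTF-8 bytes."""
    return s.encode("utf-8").find(ch.encode("utf-8"))

def first_bytes_are_valid(s, n):
    """True if the first n UTF-8 bytes of s still decode as valid text."""
    try:
        s.encode("utf-8")[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(len(text), length_in_bytes(text))         # character count vs byte count
print(text.find("v"), index_in_bytes(text, "v"))  # character index vs byte index
print(first_bytes_are_valid(text, 3))           # a 3-byte "substring" cuts a character in half
```

With a fixed-width character type, the length, the index and the substring boundaries all refer to characters, so none of these discrepancies arise.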
Also, it appears that Unicode as a 16-bit standard is winning, so 32-bit chars would probably be extreme too.
I heard the Chinese and Japanese were not too happy about it ...
Anyway, if we provide the option of 8 or 32 bits chars, we can probably have 16 bits with no additional effort as well, so everyone can then build what he prefers.
(BTW, the 32 bit type will actually only need a range of 2^20+2^16. This matters for sets, unless/until we have sparse sets, but probably not for chars and strings, as dealing with a 21 or 24 bit type is not very efficient.)
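The arithmetic behind that range can be checked quickly. The sketch below (again Python, purely as a calculator) assumes that a set is stored as a plain bitmap with one bit per possible element; that storage model is an assumption made for this illustration, not a statement about how any compiler implements sets:

```python
# Unicode code points run from 0 to 0x10FFFF inclusive.
code_points = 0x10FFFF + 1
assert code_points == 2**20 + 2**16

# Bits needed to hold the largest code point.
bits_needed = (code_points - 1).bit_length()

# Assumed bitmap representation: one bit per possible element.
set_bytes_actual_range = code_points // 8   # set over 0..0x10FFFF
set_bytes_full_32_bit = 2**32 // 8          # set over a full 32 bit range

print(bits_needed)               # 21
print(set_bytes_actual_range)    # 139264 (136 KiB)
print(set_bytes_full_32_bit)     # 536870912 (512 MiB)
```

So restricting the range makes a non-sparse set of chars merely large rather than impractical, while the chars themselves would still be stored in 32 bits.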
Frank