Scott Moore wrote:
Frank Heckenbach wrote:
BTW, Scott, you argued partly as if Pascal were based on ASCII. As you certainly know, this is not required, and in fact, many non-English speaking countries use an extended charset (e.g. ISO-8859-n). I do this myself with GPC today (of course, not for program identifiers, but for `Char' data, just to avoid confusion). While Latin1 (ISO-8859-1), and only this one, is upward compatible with Unicode, none of them is compatible with UTF-8 (except for the ASCII subset), so your compatibility arguments already fall down here. And in your I/O list, you'd definitely have to add I/O in an 8 bit charset. How to select the charset to convert Unicode to and from is another question, but there must be an (easy) way, because that's what a large part of the world uses today.
Didn't mean to sound Amerocentric :-)
My understanding of the ISO pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would handle I/O in other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
That's a valid decision, according to ISO Pascal, but not the one we've made, or which I personally like. Ignoring control characters is not exactly my idea, and characters > 127 (usually interpreted in ISO-8859-n) have been in use for a long time ...
UTF-8/Unicode is certainly not compatible with the ISO code page idea, but rather replaces it. So certainly, UTF-8 is designed to be compatible with the ASCII code set and no other.
Exactly.
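To make the ASCII-only compatibility concrete, here is a minimal Python sketch (Python is used purely for illustration; the sample strings and variable names are made up and do not come from GPC or IP Pascal). It checks that pure ASCII text is byte-identical in ASCII, Latin1 and UTF-8, while text containing Latin1 characters above 127 is encoded differently and cannot be decoded as UTF-8.

```python
# Illustration only: sample strings and names are assumptions for this sketch.
ascii_text = "Pascal"
latin1_text = "Gr\u00fc\u00dfe"  # "Grüße": Latin1 characters, not pure ASCII

# Pure ASCII text yields the same bytes in all three encodings.
ascii_identical = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

# Characters above 127 take one byte in Latin1 but two bytes in UTF-8.
latin1_bytes = latin1_text.encode("latin-1")
utf8_bytes = latin1_text.encode("utf-8")

# Latin1 bytes above 127 are generally not valid UTF-8 on their own.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(ascii_identical, latin1_bytes, utf8_bytes, decode_failed)
```

So a reader that expects UTF-8 can take plain ASCII files unchanged, but not files in an 8 bit ISO code page.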
My take on it is that I support ISO code pages in the 8 bit mode, and Unicode replaces ISO code pages in the 16 bit mode. Does the upward compatibility suck for Europe?
Not too much in Western Europe, since Latin1 is (intentionally, of course) a proper subset of Unicode. But still, of course, files in 8 bit Latin1 and UTF-8 are not compatible. There will be both kinds of files to deal with, apart from 16 bit (perhaps 20 bit, stored as 32 bit) Unicode files, so a full solution will probably have to support them all.
Certainly. I see the resolution of that being Unicode internal processing, i.e., "world centered" code. The beauty of UTF-8 (and other forms) is that nobody has to know or care that my programs are Unicode internally.
Mostly yes. But when, e.g., storing data (even consisting of only Latin1, but not only ASCII characters) in a file, there is a difference between UTF-8 and 8 bit coding.
Frank
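The file-storage difference mentioned in the last reply can be sketched as follows. This is a hedged Python illustration, not anything from GPC or IP Pascal: the helper `stored_bytes` and the sample text are hypothetical. The same string, consisting only of Latin1 characters, is written once as 8 bit Latin1 and once as UTF-8, and the raw file contents are compared.

```python
import os
import tempfile

# Hypothetical sample: only Latin1 characters, but not only ASCII.
text = "caf\u00e9"  # "café"

def stored_bytes(s, encoding):
    """Write s to a temporary file in the given encoding; return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(s)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

latin1_file = stored_bytes(text, "latin-1")  # 4 bytes: one per character
utf8_file = stored_bytes(text, "utf-8")      # 5 bytes: the accented letter takes two

# Reading the UTF-8 file as if it were Latin1 silently gives wrong characters.
misread = utf8_file.decode("latin-1")

print(latin1_file, utf8_file, misread)
```

The program itself can stay Unicode internally, but whoever reads the file still has to know which of the two encodings was used to write it.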