Waldek Hebisch wrote:
Scott Moore wrote:
Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, ability to read and write wide characters and probably also utf-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe Unicode support makes the most sense for Pascal by embracing it. In Pascal "char" is a distinct type. No assumptions are made about its size or characteristics other than ordering. Char is not used to mean "byte like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible, meaning a minimum set is 65536 elements (and implementing sparse sets becomes more important), and string constants and data are 16 bit. This makes sense because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
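To put a number on the cost of a dense "set of char" with a 16-bit char, here is a minimal C sketch of a one-bit-per-element set. The names `WideCharSet`, `wcs_clear`, `wcs_include` and `wcs_contains` are made up for this illustration and are not part of GPC or any existing library. A 16-bit char has 65536 values, so with 8-bit bytes every such set occupies 8192 bytes, which is why sparse sets come up.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define WIDE_CHAR_COUNT 65536u  /* number of values of a 16-bit char */

/* Hypothetical dense "set of char" for a 16-bit char: one bit per element. */
typedef struct {
    unsigned char bits[WIDE_CHAR_COUNT / CHAR_BIT];
} WideCharSet;

/* Make the set empty. */
void wcs_clear(WideCharSet *s)
{
    assert(s != NULL);
    memset(s->bits, 0, sizeof s->bits);
}

/* Add element c to the set. */
void wcs_include(WideCharSet *s, unsigned c)
{
    assert(s != NULL && c < WIDE_CHAR_COUNT);
    s->bits[c / CHAR_BIT] |= (unsigned char)(1u << (c % CHAR_BIT));
}

/* Return nonzero if c is a member of the set. */
int wcs_contains(const WideCharSet *s, unsigned c)
{
    assert(s != NULL && c < WIDE_CHAR_COUNT);
    return (s->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}
```

A set variable of this kind costs `sizeof(WideCharSet)` bytes no matter how few members it has; a sparse representation would trade that fixed cost for slower membership tests.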
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but so will they using 8-bit bytes and UTF-8.
I agree. All units/libraries would have to exist in both compilation modes (and would need to be tested twice), etc. And, of course, once other issues appear that "want" to have compilation modes, we get to have 2^n copies of libraries which is even less practical.
Note also that full Unicode needs 21 bits (code points run up to U+10FFFF), and because of combining chars you still cannot identify characters with Unicode code points. Glibc normally uses a 32-bit 'wchar_t', which is another argument for 32-bit chars.
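The combining-character point can be seen with a small C sketch that counts code points in a UTF-8 string. The function name `utf8_count_code_points` is hypothetical, and the sketch assumes well-formed, NUL-terminated input. The same visible character "é" counts as one code point when written as U+00E9 and as two when written as "e" followed by the combining acute accent U+0301.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated,
   well-formed UTF-8 string. Continuation bytes have the form 10xxxxxx,
   so every byte that does not match that pattern starts a code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, `utf8_count_code_points("\xC3\xA9")` gives 1 and `utf8_count_code_points("e\xCC\x81")` gives 2, although both strings display as a single character. Widening the char type does not remove this distinction.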
I also agree. If anything, then 32 bits immediately (or just match C's wchar_t, which the backend seems to provide us -- I still have to check in detail).
It is "good Wirth/Pascal" tradition to offer only one size. But IMHO it is _not_ GNU Pascal tradition.
Of course, even Wirth's Pascal has integer subranges, so if we could define ASCII char as a subrange of Unicode char, we might be fine. But this would mean (a) `Char' would be Unicode (the standard type must be the biggest if we want to follow the standard's "spirit" here) which is too big a change I fear, (b) it would not cater for a UTF-8 type (in particular such a string type) which is neither an array of ASCII chars nor of Unicode chars.
So I think we should (must) leave `Char' as it is. Besides the usual suspects -- binary files and other protocols which depend on data type layout -- changing `Char' would mean breaking most programs that handle text(!) files with 7 bit (i.e., ASCII) and 8 bit charsets. We can't realistically do that.
BTW, Scott, you argued partly as if Pascal was based on ASCII. As you certainly know, this is not required, and in fact, many non-English speaking countries use an extended charset (e.g. ISO-8859-n). I do this myself with GPC today (of course, not for program identifiers, but for `Char' data, just to avoid confusion). While Latin1 (ISO-8859-1), and only this one, is upward compatible to Unicode, none of them are compatible to UTF-8 (except for the ASCII subset), so your compatibility arguments already fall down here. And in your I/O list, you'd definitely have to add I/O in an 8 bit charset -- where how to select the charset to convert Unicode to and from is another question, but there must be an (easy) way, because that's what a large part of the world uses today.
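A minimal C sketch of the Latin1 case, assuming a hypothetical helper named `latin1_to_utf8` (not an existing library routine): every Latin1 byte value equals its Unicode code point, but only bytes below 0x80 keep the same encoding in UTF-8; the rest become two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts len Latin1 (ISO-8859-1) bytes to UTF-8.
   The caller must provide room for 2 * len bytes in out.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, n = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII subset: identical in Latin1 and UTF-8. */
            out[n++] = c;
        } else {
            /* Code points 0x80..0xFF need a two-byte UTF-8 sequence. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

Converting the two Latin1 bytes 'A', 0xE9 ("Aé") yields the three UTF-8 bytes 0x41, 0xC3, 0xA9, so an 8 bit text file cannot simply be reinterpreted as UTF-8. The other ISO-8859-n charsets would additionally need a per-charset mapping table, which this sketch does not cover.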
One of GPC's selling points is interfacing to GNU C, so we should provide a type matching 'wchar_t'.
Agreed.
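For reference, the C side that such a type would have to match can be inspected with a short sketch like the following. The function names are made up for this example; `WCHAR_MAX` and `CHAR_BIT` are standard C macros. The width of `wchar_t` is platform-dependent, so a matching Pascal type would have to take its size from the C backend instead of fixing it.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* Width of this platform's wchar_t in bits (commonly 32 with glibc). */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: stores a code point in one wchar_t. The assert
   documents the assumption that the value fits, which does not hold
   for code points above 0xFFFF where wchar_t is only 16 bits wide. */
wchar_t code_point_to_wchar(unsigned long cp)
{
    assert(cp <= 0x10FFFFul && cp <= (unsigned long)WCHAR_MAX);
    return (wchar_t)cp;
}
```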
We try to be compatible with all significant Pascal dialects (including Delphi), and Delphi has for a long time had two (maybe more) character types.
Of course, as often, badly named -- if you mean `AnsiChar'. Which ANSI standard does it refer to? (I actually don't know American standards too well. I suppose Unicode has an ANSI number, but so does ASCII, I suppose, and perhaps Latin1 etc.) If read in a Pascal context, it would sound like the `Char' type of ANSI Pascal (which is equivalent to ISO Pascal). To old-time Unix users, "ansi" is better known as a terminal type, and to old-time Dos users it also refers to "ansi"-like terminal sequences, sometimes, for some contorted reasons, plus IBM PC/MS DOS specific 8 bit character sets (which are quite unlike ISO-8859-n or Unicode).
So at best, the name seems ambiguous and confusing, and as often, I'd rather build in another type name, and make (leave) `AnsiChar' an "only Borland compatibility" thing.
Also, we have a bunch of integer types. So I see no reason to insist on a single char type. There is a reason to insist on a single type for string literals. However, if a normal string can represent enough codes and if we provide builtin conversion functions (builtin, because we want to constant-fold them), then we can initialise other types via explicit conversion.
In principle I agree, though I shudder at the thought of implementing the string types this will need and their conversions. But it might be inevitable in the long run ...
Anyway, regardless of what solution we choose to support Unicode, we need a type for interfacing to C. If we think that the type is only for interfacing, then a name like 'WcharInt' (and making it an integer type) makes sense.
If it is *only* for interfacing, I'd even suggest something like `CWCharT' or `CWideChar', to make that clear, just like we have `CString', `CInteger' (in the next release) etc.
But if we foresee using it in Pascal later, I'd rather use a more readable name instead. Perhaps just `WideChar' (but what does "wide" mean, isn't an "m" wider than an "i"? ;-).
And perhaps we could even make it an "abstract ordinal type" for now, i.e., like an enum type (with a rather large range though) whose identifiers were forgotten (as in, not exported from a module). So one could use ordinal functions like `Ord', `Succ' etc., but not much more. This should avoid breaking too many things if we'll actually turn it into a char type sometime.
I personally think that we may use such a type as the Pascal wide character type, but if that is too controversial, then I propose to just add an interfacing type now and postpone the question of proper Unicode support.
I'd also postpone it (since we both probably have enough other things to do first), but implementing such a type (such as, layout compatible to C's `wchar_t', abstract ordinal type, meaningful name) might be reasonable now ...
Frank