Scott Moore wrote:
Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe the most sensible course for Pascal is to embrace Unicode fully. In Pascal, "char" is a distinct type: no assumptions are made about its size or characteristics other than ordering. Char is not used to mean "byte-like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible, meaning a minimum set is 65535 elements (and implementing sparse sets becomes more important), and string constants and data are 16 bit. This makes sense because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but so will programs using 8-bit bytes and UTF-8.
Note also that full Unicode needs 21 bits (and because of combining characters you still cannot identify characters with Unicode code points). Glibc normally uses a 32-bit 'wchar_t', which is another argument for 32-bit chars.
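For example (an untested sketch; 'WideCode' below is just a subrange I made up, not an existing GPC type), the same user-perceived character may occupy either one or two code points:

  program Combining(output);
  type
    WideCode = 0 .. 1114111;  { all of Unicode needs 21 bits }
  var
    precomposed: array [1 .. 1] of WideCode;
    decomposed: array [1 .. 2] of WideCode;
  begin
    precomposed[1] := 233;  { U+00E9 LATIN SMALL LETTER E WITH ACUTE }
    decomposed[1] := 101;   { U+0065 LATIN SMALL LETTER E }
    decomposed[2] := 769;   { U+0301 COMBINING ACUTE ACCENT }
    { both arrays render as the single character "e-acute" }
    writeln('code points: 1 vs. 2; characters: 1 in both cases')
  end.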
It is "good Wirth/Pascal" tradition to offer only one size. But IHMO it is _not_ GNU Pascal tradition. One of GPC selling points is interfacing to GNU C, so we should provide a type matching with 'wchar_t'. We try to be compatible with all significant Pascal dialects (including Delphi), and Delphi for long time had two (maybe more) charactes types. Also, we have a bunch of integer types. So I see no reason to insist on single char type. There is a reason to insist on single type for string literls. However if normal string can represent enough codes and if we provide builtin conversion function (builtin, because we want to constant-fold them), than can initialse other types via explicit conversion.
Issue #2, which from my point of view is a separate one, is what form the support libraries take for such a program. Obviously the 16-bit characters must be accommodated, but in addition, (at minimum) the following options should exist for input and output of characters (as text files or even "file of char"):
1. I/O as ASCII, i.e., reduce and expand characters from 8 bits in files.
2. I/O as Unicode 16-bit characters, in "big endian" format.
3. As 2, but in "little endian" format (note that Unicode has a byte-order marker to tell readers what endian format a file is in).
Now, putting my salesman hat on, I believe (1) should be replaced with UTF-8.
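Something along these lines could implement option (1) as UTF-8 output (an untested sketch; 'WriteUtf8' is a name I invented, and it assumes the file side uses 8-bit chars covering ord values 0..255):

  program Utf8Demo(output);

  procedure WriteUtf8(var f: text; code: integer);
  { encode one code point (0..65535, surrogates not checked) as 1..3 bytes }
  begin
    if code < 128 then
      write(f, chr(code))
    else if code < 2048 then
    begin
      write(f, chr(192 + code div 64));            { 110xxxxx }
      write(f, chr(128 + code mod 64))             { 10xxxxxx }
    end
    else
    begin
      write(f, chr(224 + code div 4096));          { 1110xxxx }
      write(f, chr(128 + (code div 64) mod 64));   { 10xxxxxx }
      write(f, chr(128 + code mod 64))             { 10xxxxxx }
    end
  end;

  begin
    WriteUtf8(output, 233);  { U+00E9 comes out as the bytes C3 A9 }
    writeln
  end.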
I am already sold on UTF-8 :). But I have found that in practice Unicode processing can be 100 times slower than ASCII processing (on the same pure-ASCII data). So to make Unicode practical one has to employ various tricks, and such tricks involve the choice of data representation. For some problems UTF-32 (or UTF-16) is the fastest. But for others UTF-8 is both fastest and simplest -- AFAICS, for example, in many cases regex matching (or parsing) of UTF-8 data can be as fast as matching pure 8-bit data (and much faster than UTF-16/32). As I wrote, I think it is best to give the programmer the choice.
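To make the parsing point concrete (untested sketch; the #-escapes in the string literal are a Borland/GPC extension, not ISO): every byte of a multi-byte UTF-8 sequence has its high bit set, so scanning for an ASCII delimiter needs no decoding at all:

  program ScanDemo(output);
  const
    Max = 11;
  var
    s: packed array [1 .. Max] of char;
    i, commas: integer;
  begin
    s := 'a,b,caf'#195#169',d';  { 'a,b,cafe-acute,d' encoded as UTF-8 }
    commas := 0;
    for i := 1 to Max do
      if s[i] = ',' then  { cannot falsely match a continuation byte }
        commas := commas + 1;
    writeln('fields: ', commas + 1)  { prints 4 }
  end.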
Anyway, regardless of what solution we choose to support Unicode, we need a type for interfacing to C. If we think that the type is only for interfacing, then a name like 'WcharInt' (making it an integer type) makes sense.
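For example, something like this (purely hypothetical declarations; I am assuming GPC's size-attribute and external-name syntax, and 32 bits to match glibc):

  type
    { 'WcharInt' is of course the type under discussion, not an
      existing one; 32 bits to match glibc's wchar_t }
    WcharInt = Integer attribute (Size = 32);
    PWcharInt = ^WcharInt;

  { example use: binding a libc routine that takes a wchar_t pointer }
  function wcslen(s: PWcharInt): SizeType; external name 'wcslen';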
I personally think that we may use such a type as the Pascal wide character type, but if that is too controversial, then I propose to just add an interfacing type now and postpone the question of proper Unicode support.