Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, ability to read and write wide characters and probably also utf-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe the approach that makes the most sense for Pascal is to embrace Unicode fully. In Pascal, "char" is a distinct type: no assumptions are made about its size or other characteristics beyond ordering. Char is not used to mean "byte-like entity" as it is in C.
So in a "wide mode" program, char is 16 bits, set of char is still possible (meaning a set of char must now cover a minimum of 65536 elements, so implementing sparse sets becomes more important), and string constants and string data are 16 bits wide. This works because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
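To make that "recompilable without changes" point concrete, here is a minimal sketch of a conforming ISO 7185 fragment. It relies only on ordering, set membership and text I/O, never on the width of char, so the same source should compile whether char is 8 bits or 16 bits wide:

program markvowels(input, output);
var
  vowels: set of char;
  c: char;
begin
  vowels := ['a', 'e', 'i', 'o', 'u'];
  while not eof(input) do
    if eoln(input) then begin readln; writeln end
    else begin
      read(c);
      { only ordering and membership are used; the size of char never matters }
      if c in vowels then write('*') else write(c)
    end
end.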
Issue #2, which to my mind is a separate matter, is what form the support libraries take for such a program. Obviously the 16-bit characters must be accommodated, but in addition, at minimum, the following options should exist for input and output of characters (as text files or even "file of char"):
1. I/O as ASCII, i.e., characters are narrowed to 8 bits on output and widened back on input.
2. I/O as Unicode 16-bit characters, in "big endian" format.
3. As 2, but in "little endian" format (note that Unicode has an "endian" marker, the byte order mark, that tells readers which endian format a file uses; a sketch of checking it follows this list).
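For (3), a reader only has to look at the first two bytes of the file: the byte order mark U+FEFF appears as FE FF in a big-endian file and FF FE in a little-endian one. A rough sketch, assuming some low-level routine has already delivered those two bytes as integers (the names here are purely illustrative):

type
  byteorder = (bigendian, littleendian, nomarker);

function detectorder(b1, b2: integer): byteorder;
begin
  if (b1 = 254) and (b2 = 255) then detectorder := bigendian         { FE FF }
  else if (b1 = 255) and (b2 = 254) then detectorder := littleendian { FF FE }
  else detectorder := nomarker
end;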
Now, putting my salesman hat on, I believe (1) should be replaced with UTF-8. UTF-8 is compatible with both ASCII and Unicode: every ASCII file is already valid UTF-8, and UTF-8 can encode all of Unicode, so it can replace the ASCII format in a Unicode-enabled program. If the program deals in ASCII only (even though it manipulates it internally as 16-bit Unicode), its files come out as plain ASCII. If any character above 127 is used, it is written as a UTF-8 multibyte sequence.
In addition, UTF-8 is gaining in popularity. UTF-8 manipulation inside a program is ungainly, since you have to compensate for multibyte characters, but UTF-8 as an I/O format with full 16 bit Unicode processing internally is, IMHO, an ideal solution.
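To show why this works out so neatly, here is a sketch of how one 16-bit character code would be written out as UTF-8. ASCII values (0..127) become the single identical byte, which is why a Unicode-enabled program that happens to handle only ASCII still produces plain ASCII files; anything above 127 takes two or three bytes. This ignores surrogate pairs and characters beyond 16 bits to keep it short, and "writebyte" merely stands in for whatever low-level output routine a library would provide; it is not a real, agreed-upon name:

procedure writeutf8(c: integer);
begin
  if c < 128 then
    writebyte(c)                         { 0xxxxxxx: plain ASCII byte }
  else if c < 2048 then begin
    writebyte(192 + c div 64);           { 110xxxxx }
    writebyte(128 + c mod 64)            { 10xxxxxx }
  end else begin
    writebyte(224 + c div 4096);         { 1110xxxx }
    writebyte(128 + (c div 64) mod 64);  { 10xxxxxx }
    writebyte(128 + c mod 64)            { 10xxxxxx }
  end
end;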