Consider the following program:
program fstr(output);
type
  sat = packed array [1..16] of 'K'..'O';
var
  sa : sat;
begin
  sa := 'OK';
  writeln('Sizeof(sa) = ', Sizeof(sa));
  writeln(sa)
end.
ATM it "works". However, it seems that this program is illegal. Namely ISO says:
: Each value of a string-type shall be structured as a one-to-one
: mapping from an index-domain to a set of components possessing
: the char-type
And later it again speaks about `the char-type'. So it seems that subranges of char type are _not_ allowed as component types of strings.
From an implementation point of view, subranges can be packed more tightly than the char type, so such a restriction makes some sense.
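To make the packing point concrete, here is a back-of-the-envelope calculation (Python used purely as executable arithmetic; the actual layout a compiler chooses may of course differ):

```python
import math

# 'K'..'O' spans 5 values, so a tightly packed component needs
# ceil(log2(5)) = 3 bits, versus 8 bits for a full (8-bit) char.
values = ord('O') - ord('K') + 1                   # 5
bits_per_component = math.ceil(math.log2(values))  # 3
components = 16
packed_bits = components * bits_per_component      # 48 bits packed
char_bits = components * 8                         # 128 bits as plain chars
print(packed_bits, char_bits)
```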
Now, the question is what shall we do? And do you agree with my reading of the standard? Shall we disallow the program above? Or maybe accept it as an extension, but report an error in standard mode?
Waldek Hebisch wrote:
Consider the following program:
program fstr(output); type sat = packed array [1..16] of 'K'..'O'; var sa : sat; begin sa := 'OK'; writeln('Sizeof(sa) = ', Sizeof(sa)); writeln(sa) end .
ATM it "works". However, it seems that this program is illegal. Namely ISO says:
: Each value of a string-type shall be structured as a one-to-one : mapping from an index-domain to a set of components possessing : the char-type
And later it again speaks about `the char-type'. So it seems that subranges of char type are _not_ allowed as component types of strings.
There is a (maybe somewhat curious) definition of strings in the ISO 7185 Pascal Report (third edition) in section 6.2:
"An array type is called a *string type* if it is packed, if it has as its component type the predefined type *Char* and if it has as its index type a subrange of *Integer* from 1 to n, for n greater than 1"
The report clearly speaks of "the predefined type Char". But, if I am correct, the definition seems to imply that we also have to reject (in ISO modes) the following
{$standard-pascal} program teststr( Output); var s: packed array[ 1..1] of char; begin s:= '?'; writeln( s) end.
GPC correctly rejects (in iso modes) an array of char as a string type if the array is not "packed".
From an implementation point of view, subranges can be packed more tightly than the char type, so such a restriction makes some sense.
Now, the question is what shall we do? And do you agree with my reading of the standard? Shall we disallow the program above? Or maybe accept it as an extension, but report an error in standard mode?
The latter, I would say. Will there be any consequences when GPC starts to support Unicode character sets ?
Regards,
Adriaan van Os
Adriaan van Os wrote:
Waldek Hebisch wrote:
Consider the following program:
program fstr(output); type sat = packed array [1..16] of 'K'..'O'; var sa : sat; begin sa := 'OK'; writeln('Sizeof(sa) = ', Sizeof(sa)); writeln(sa) end .
ATM it "works". However, it seems that this program is illegal. Namely ISO says:
: Each value of a string-type shall be structured as a one-to-one : mapping from an index-domain to a set of components possessing : the char-type
And later it again speaks about `the char-type'. So it seems that subranges of char type are _not_ allowed as component types of strings.
There is a (maybe somewhat curious) definition of strings in the ISO 7185 Pascal Report (third edition) in section 6.2:
"An array type is called a *string type* if it is packed, if it has as its component type the predefined type *Char* and if it has as its index type a subrange of *Integer* from 1 to n, for n greater than 1"
The report clearly speaks of "the predefined type Char". But, if I am correct, the definition seems to imply that we also have to reject (in ISO modes) the following
{$standard-pascal} program teststr( Output); var s: packed array[ 1..1] of char; begin s:= '?'; writeln( s) end.
Yes, thanks for the test program. The restriction is unnatural and EP allows the program above, but AFAICS it is illegal in ISO 7185.
From an implementation point of view, subranges can be packed more tightly than the char type, so such a restriction makes some sense.
Now, the question is what shall we do? And do you agree with my reading of the standard? Shall we disallow the program above? Or maybe accept it as an extension, but report an error in standard mode?
The latter, I would say. Will there be any consequences when GPC starts to support Unicode character sets ?
I have now disallowed such things. If one wants to accept them, there is some real work to do. Namely, we also accepted:
program fstr1;
type
  sat = packed array [1..16] of 'K'..'O' value [otherwise 'O'];
var
  sa : sat;
begin
  sa := '111';
end.
To handle this properly one would have to implement a special range-checking routine. Also, both programs crash with the 4.0 backend due to a type mismatch (one would have to add a conversion).
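A sketch of the kind of per-component range check that would be needed (Python, purely illustrative; `check_string_range` is a made-up name, not an actual GPC routine):

```python
def check_string_range(s, lo, hi):
    # Every component assigned into the target string type must lie
    # in lo..hi; a conforming compiler would signal a range-check
    # error at run time otherwise.
    return all(lo <= ch <= hi for ch in s)

print(check_string_range('OK', 'K', 'O'))   # True: both chars in 'K'..'O'
print(check_string_range('111', 'K', 'O'))  # False: '1' is out of range
```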
Concerning Unicode: I am not sure whether Unicode strings should be compatible with normal ones. Since normal chars are smaller, normal strings and Unicode strings cannot be compatible as var parameters. In a value context the compiler could generate a conversion, but then we are in the messy business of code pages.
Waldek Hebisch wrote:
Adriaan van Os wrote:
Waldek Hebisch wrote:
Consider the following program:
program fstr(output); type sat = packed array [1..16] of 'K'..'O'; var sa : sat; begin sa := 'OK'; writeln('Sizeof(sa) = ', Sizeof(sa)); writeln(sa) end .
ATM it "works". However, it seems that this program is illegal. Namely ISO says:
: Each value of a string-type shall be structured as a one-to-one : mapping from an index-domain to a set of components possessing : the char-type
And later it again speaks about `the char-type'. So it seems that subranges of char type are _not_ allowed as component types of strings.
There is a (maybe somewhat curious) definition of strings in the ISO 7185 Pascal Report (third edition) in section 6.2:
"An array type is called a *string type* if it is packed, if it has as its component type the predefined type *Char* and if it has as its index type a subrange of *Integer* from 1 to n, for n greater than 1"
The report clearly speaks of "the predefined type Char". But, if I am correct, the definition seems to imply that we also have to reject (in ISO modes) the following
{$standard-pascal} program teststr( Output); var s: packed array[ 1..1] of char; begin s:= '?'; writeln( s) end.
Yes, thanks for the test program. The restriction is unnatural and EP allows the program above, but AFAICS it is illegal in ISO 7185.
Indeed, seems so. I suppose the intention was that 'x' is a char literal and not a string literal, and excluding the case above was an accident. But for full compatibility I guess we should issue an error, though perhaps speaking of "an obscure ISO 7185 Pascal restriction" (cf. the message for `{ ... *)' comments).
From an implementation point of view, subranges can be packed more tightly than the char type, so such a restriction makes some sense.
Now, the question is what shall we do? And do you agree with my reading of the standard? Shall we disallow the program above? Or maybe accept it as an extension, but report an error in standard mode?
The latter, I would say. Will there be any consequences when GPC starts to support Unicode character sets ?
I have now disallowed such things. If one wants to accept them, there is some real work to do. Namely, we also accepted:
program fstr1; type sat = packed array [1..16] of 'K'..'O' value [otherwise 'O']; var sa : sat; begin sa := '111'; end .
To handle this properly one would have to implement a special range-checking routine. Also, both programs crash with the 4.0 backend due to a type mismatch (one would have to add a conversion).
Previously I was undecided, but this argument convinces me that we should forbid such "strings" (as you did). I can't see a useful purpose for them, let alone one that justifies the additional complications.
Concerning Unicode: I am not sure whether Unicode strings should be compatible with normal ones. Since normal chars are smaller, normal strings and Unicode strings cannot be compatible as var parameters. In a value context the compiler could generate a conversion, but then we are in the messy business of code pages.
There was a discussion last Jun/Jul in c.l.p.m "Sets and portability", with a large part about Pascal and Unicode, where I stated some of my views. In short, I think there should be a single `Char' type internally, and thus a single kind of string, in keeping with the standards, where `Char' is always a character (not a byte that may be part of a character representation, as `char' is in C).
Conversion should be done on I/O (so there should be ways to set the charset on files, probably by default using the standard locale environment variables, plus ways to explicitly set it per file), and by explicit conversion calls. We can have the possibility of building GPC with an 8-bit (as now) or a 32-bit (Unicode) `Char' type, but these might better be compile-time options, resulting, e.g., in two separate compiled RTS libraries (built from the same source code, of course), IMHO.
When `Char' is a Unicode type, one could, of course, declare a subrange `Chr (0) .. Chr (255)', and possibly make it 8-bit if we allow the `Size' attribute for chars. But it would only cover Latin1 chars (by the definition of Unicode), and given that Latin1 only works for a few (though major) languages, but e.g. not for the Euro symbol (which requires Latin9 instead), I don't think a special facility for such Latin1-8-bit-strings with internal conversions is justified. (That's referring to the original question -- for charsets other than Latin1 or plain ASCII (Chr (0) .. Chr (127)), a corresponding char type wouldn't be a subrange of Unicode `Char' at all, since the "high" characters are encoded differently.)
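The point about "high" characters being encoded differently can be seen concretely (Python used here just to show the encodings):

```python
# Latin1 code points 0..255 coincide with the first 256 Unicode code
# points, but in UTF-8 the points 128..255 take two bytes each:
ch = '\u00e9'                    # 'é', present in Latin1
print(ch.encode('latin-1'))      # b'\xe9'     -- one byte
print(ch.encode('utf-8'))        # b'\xc3\xa9' -- two bytes
# The Euro sign lies outside Latin1 entirely (it is in Latin9):
euro = '\u20ac'
print(ord(euro) > 255)           # True
```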
However, I also think more detailed discussions about Unicode implementation should be postponed until someone actually plans to do anything about it.
Frank
At 15:50 +0100 12/3/06, Frank Heckenbach wrote:
Concerning Unicode: I am not sure whether Unicode strings should be compatible with normal ones. Since normal chars are smaller, normal strings and Unicode strings cannot be compatible as var parameters. In a value context the compiler could generate a conversion, but then we are in the messy business of code pages.
There was a discussion last Jun/Jul in c.l.p.m "Sets and portability", with a large part about Pascal and Unicode, where I stated some of my views. In short, I think there should be a single `Char' type internally, and thus a single kind of string, in keeping with the standards, where `Char' is always a character (not a byte that may be part of a character representation, as `char' is in C).
Conversion should be done on I/O (so there should be ways to set the charset on files, probably by default using the standard locale environment variables, plus ways to explicitly set it per file), and by explicit conversion calls. We can have the possibility of building GPC with an 8-bit (as now) or a 32-bit (Unicode) `Char' type, but these might better be compile-time options, resulting, e.g., in two separate compiled RTS libraries (built from the same source code, of course), IMHO.
There may not be any point in supporting Unicode any further. From what I've seen as the trend over the last decade, UTF-8 appears to be winning the battle, being compact in most normal use, 8-bit, and yet supporting the full Unicode range. UTF-8 therefore allows the compiler to continue to ignore the entire issue, except perhaps adding a few support routines (e.g., LengthInCharacters) and/or enhancing the RTS runtime routines to support UTF-8 (not much is really needed).
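The byte-length/character-length split behind a routine like LengthInCharacters (a hypothetical name from the paragraph above) is easy to demonstrate (Python, since the point is about UTF-8 itself, not Pascal):

```python
s = 'na\u00efve'           # 'naïve' -- 5 characters
b = s.encode('utf-8')      # 'ï' takes two bytes in UTF-8
print(len(s))              # 5 -- length in characters
print(len(b))              # 6 -- length in bytes (what an 8-bit Length sees)
```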
It also appears that Unicode as a 16-bit standard is winning, so 32-bit chars would probably be extreme too.
However, I also think more detailed discussions about Unicode implementation should be postponed until someone actually plans to do anything about it.
Makes sense. Using UTF-8, there aren't really any current problems using GPC with any character set.
Enjoy, Peter.
Peter N Lewis wrote:
There may not be any point in supporting Unicode any further. From what I've seen as the trend over the last decade, UTF-8 appears to be winning the battle, being compact in most normal use, 8-bit, and yet supporting the full Unicode range. UTF-8 therefore allows the compiler to continue to ignore the entire issue, except perhaps adding a few support routines (e.g., LengthInCharacters) and/or enhancing the RTS runtime routines to support UTF-8 (not much is really needed).
I don't agree. It's not only Length (which is defined by the standard and all known dialects AFAIK, so saying "use another Length function to get the real length" seems more than a kludge, i.e. Length itself must count characters). I mean, this is Pascal, not C. In Pascal, `Char' is a character, and strings are made out of characters, and the predefined string operations operate on strings. Unlike C, where `char' is just some integer (sic!) type, `char *' is some pointer type, and everything else are library routines, and the user is told to just use another set of types and routines for Unicode chars/strings.
And it also affects all functions that search for characters, take substrings etc. I agree with what Scott Moore wrote in that NG discussion: For internal usage, a type with fixed character width seems far preferable. (Of course, if we provide the option of building with 8 bit chars, like now, nobody will stop you from (mis-)using them as UTF-8 bytes, but then you're on your own if Length, SubStr/Copy, Index/Pos etc. behave strangely.)
It also appears that Unicode as a 16-bit standard is winning, so 32-bit chars would probably be extreme too.
I heard the Chinese and Japanese were not too happy about it ...
Anyway, if we provide the option of 8 or 32 bits chars, we can probably have 16 bits with no additional effort as well, so everyone can then build what he prefers.
(BTW, the 32-bit type will actually only need a range of 2^20+2^16. This matters for sets, unless/until we have sparse sets; probably not for chars and strings, as dealing with a 21- or 24-bit type is not very efficient.)
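The 2^20+2^16 figure follows from the Unicode code point range, which can be checked directly (Python, as a quick sanity check):

```python
# Unicode code points run from 0 to U+10FFFF: 17 planes of 2**16
# code points each, i.e. exactly 2**20 + 2**16 values.
max_code_point = 0x10FFFF
print(max_code_point + 1 == 2**20 + 2**16)  # True
# 21 bits suffice to hold any single code point:
print(max_code_point.bit_length())          # 21
```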
Frank
At 13:16 +0100 13/3/06, Frank Heckenbach wrote:
Peter N Lewis wrote:
There may not be any point in supporting Unicode any further. From
I don't agree. It's not only Length (which is defined by the (mis-)using them as UTF-8 bytes, but then you're on your own if Length, SubStr/Copy, Index/Pos etc. behave strangely.)
Actually, with UTF-8, there are rarely any issues with Length, SubStr, Copy, Index, or Pos.
With UTF-8:
* Assuming valid UTF-8 strings, Pos will never mis-match.
* Length returns the "size" of the string. Given UTF-8, there must be two different functions, one returning the size in bytes and one the length in characters; which one you call "Length" is a matter of personal preference.
* Searching for an ASCII character will always work as expected.
* SubStr/Copy require valid indexes and lengths, but the result will be explicitly either correct or an invalid UTF-8 string.
For example, if you have a search string, a replace string, and a source string, the exact same code using Pos and Copy will work for ASCII and for UTF-8, assuming all the strings are valid ASCII or valid UTF-8 respectively.
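This search-and-replace claim can be checked at the byte level; Python bytes stand in for 8-bit Char strings here:

```python
# UTF-8 is self-synchronizing: lead bytes and continuation bytes come
# from disjoint ranges, so a valid needle can never match starting in
# the middle of another character.  Byte-level Pos/Copy-style
# search-and-replace therefore stays correct on valid UTF-8.
source = 'grüne Äpfel'.encode('utf-8')
needle = 'Äpfel'.encode('utf-8')
replacement = 'Birnen'.encode('utf-8')
i = source.find(needle)                   # byte index, like a byte-level Pos
result = source[:i] + replacement + source[i + len(needle):]
print(result.decode('utf-8'))             # grüne Birnen -- still valid UTF-8
```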
Handling case insensitively is more entertaining of course, but then it's already rarely handled well even with just ISO-8859-1.
Anyway, if someone thinks Unicode32 is worth implementing in the RTS, go for it; I'd just suggest that it's becoming less and less relevant.
Enjoy, Peter.
Peter N Lewis wrote:
At 13:16 +0100 13/3/06, Frank Heckenbach wrote:
Peter N Lewis wrote:
There may not be any point in supporting Unicode any further. From
I don't agree. It's not only Length (which is defined by the (mis-)using them as UTF-8 bytes, but then you're on your own if Length, SubStr/Copy, Index/Pos etc. behave strangely.)
Actually, with UTF-8, there are rarely any issues with Length, SubStr, Copy, Index, or Pos.
With UTF-8:
- Assuming valid UTF-8 strings, Pos will never mis-match.
- Length returns the "size" of the string. Given UTF-8, there must be two different functions, one returning the size in bytes and one the length in characters; which one you call "Length" is a matter of personal preference.
No, it isn't. It's clearly defined in the standards and in all dialects I'm aware of.
- Searching for an ASCII character will always work as expected.
- SubStr/Copy require valid indexes and length, but the result will
be explicitly either correct, or an invalid UTF-8 string.
For some definition of valid, but that's not the definition used in the standards, or BP etc.
Your definition seems to be something like: Pos returns some value that, when passed as the 2nd argument to SubStr/Copy yields a substring starting with the string searched for in Pos.
But that's not the standard definition (more precisely, it's a weaker condition that follows from the standard definition, but not vice versa). The standard/BP definitions require that Index/Pos return a character index, and that SubStr/Copy accept all (in-range) character indexes and return valid strings.
Just one other example of the standard requirements:
6.1.9 Character-strings

character-string = `'' { string-element } `'' .
string-element = apostrophe-image | string-character .
apostrophe-image = `''' .
string-character = one-of-a-set-of-implementation-defined-characters .
In particular this means that the set of valid characters (and their interpretation) is implementation-defined, but given this set, any sequence is a valid string. This doesn't fit with UTF-8 (where not every byte sequence is valid, of course).
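The mismatch is easy to demonstrate (Python; bytes stand in for 8-bit Char strings):

```python
# With an 8-bit Char, every byte sequence is a legal string value,
# but not every byte sequence is valid UTF-8:
valid = bytes([0x41, 0xC3, 0xA9])    # 'A' followed by UTF-8 for 'é'
print(valid.decode('utf-8'))         # Aé
broken = bytes([0x41, 0xC3])         # truncated multi-byte sequence
try:
    broken.decode('utf-8')
except UnicodeDecodeError:
    print('not a valid UTF-8 string')
```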
For example, if you have a search string, a replace string, and a source string, the exact same code using Pos and Copy will work for ASCII and for UTF-8, assuming all the strings are valid ASCII or valid UTF-8 respectively.
Well, that's one example where it works (i.e., it only requires the weaker axioms). Other examples don't work. E.g., you might iterate over the characters of a string (for i := 1 to Length (s) do DoSomethingWith (s[i])), and depending on what DoSomethingWith does, this may or may not work. Or you could implement a scrolling routine that outputs the substring Copy (s, i, Width) where i in-/decreases by 1. With UTF-8 byte strings, this would often produce invalid output. Or deleting a single character from a string (say, in an editor when the user presses delete or backspace) will generally leave an invalid UTF-8 string if done on bytes instead of characters.
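The deletion and scrolling examples, replayed at the byte level (Python; bytes stand in for 8-bit Char strings):

```python
s = 'Löwe'
b = s.encode('utf-8')            # b'L\xc3\xb6we' -- 4 characters, 5 bytes
# "Delete the second character" done on bytes removes only half of 'ö':
deleted = b[:1] + b[2:]
try:
    deleted.decode('utf-8')
except UnicodeDecodeError:
    print('byte-wise delete left an invalid UTF-8 string')
# A Copy(s, i, Width)-style byte window can split a character, too:
print(b[1:3].decode('utf-8'))    # 'ö' -- this window happens to be fine
try:
    b[2:4].decode('utf-8')       # ...but shifted by one byte it is not
except UnicodeDecodeError:
    print('byte window split a character')
```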
As I said, you can, of course, (mis)use the 8-bit Char type as UTF-8 bytes, if you're always aware that you don't get full Pascal semantics (so you have to work around the differences, e.g., checking UTF-8 sequences before deleting a character or scrolling). Might be fine for you, not for other users (including myself, most of the time).
Of course, Pascal doesn't require Unicode (or in fact any character set of more than 22 characters ;-), but if we're talking about supporting standard (or BP etc. compatible) Pascal *and* Unicode, we have to recognize that we currently don't.
Handling case insensitively is more entertaining of course, but then it's already rarely handled well even with just ISO-8859-1.
Isn't it? Works fine for me (provided my locale is set up correctly).
Anyway, if someone thinks Unicode32 is worth implementing in the RTS, go for it; I'd just suggest that it's becoming less and less relevant.
I disagree. Even forgetting about Pascal semantics, I'd rather work with fixed-width representations internally, unless data size is a *real* concern, as the processing usually gets simpler.
Frank